Intel® Core™ Microarchitecture Intel® Software College
Objectives After completion of this module you will be able to describe • Components of an IA processor • Working flow of the instruction pipeline • Notable features of the architecture
Intel® Processor Micro-architecture - Core® microarchitecture 2 Copyright © 2006, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Agenda Introduction Knowledge preparation Notable features Micro-architecture tour Coding considerations
Industrial Recognition
PC Format, May 2006
“Intel Strikes Back! Conroe is the name. Pistol-whipping Athlon 64s into burger meat is the game..”
Intel's Next Generation Microarchitecture Unveiled, Real World Tech
“Just as important as the technical innovations in Core MPUs, this microarchitecture will have a profound impact on the industry.”
Intel Dishes the Knockout Punch to AMD with Conroe, GD Hardware.com
“…the results were far more than we could hope for and it'll be amusing to see AMD's response to this beat-down session”
Intel Regains Performance Crown, Anandtech
“… At 2.8 or 3.0GHz, a Conroe EE would offer even stronger performance than what we’ve seen here.”
Intel Reveals Conroe Architecture, Extremetech
“… And not only was the Intel system running at 2.66GHz, a slower clock rate than the top Pentium 4, it was outpacing an overclocked Athlon 64 FX-60. Wrap your brain around that idea for a bit…”
Conroe Benchmarks - Intel Showing Big Strength, Hot Hardware.com
“… Intel is poised to change the face of the desktop computing landscape…”
Performance Summary
Intel® Core™ Microarchitecture dramatically boosts Intel platform performance
• Conroe & Woodcrest drive clear Desktop/Server performance leadership
• Merom extends Intel Mobile performance leadership
Intel® Core™ Microarchitecture-based platforms set the bar in Performance and Energy Efficiency for the Multi-Core era
• Intel’s 3rd generation dual-core (while competition stuck on 1st generation)
• New Intel high-performance ‘engine’: Wider, Smarter, Faster, More Efficient
Best Processor on the Planet: Energy-Efficient Performance (1)
20% (Merom), 40% (Conroe), 80% (Woodcrest) Performance Boosts (1)!
The “Core™ Effect”: Intel® Core™ Microarchitecture ramp fuels broad roadmap accelerations
(1) Based on SPECint*_rate_base2000
Agenda
Introduction
Knowledge preparation
• Architecture vs. Microarchitecture
• CISC vs. RISC
• Performance Measurements
• Pipeline Design
• Power and Energy
• Chip Multi-Processing
Notable features
Micro-architecture tour
Coding considerations
Architecture and Micro-architecture
What is Computer Architecture?
• Architecture is the set of features which are externally visible:
  • Instruction set
  • Registers
  • Addressing modes
  • Bus protocols
Intel Architectures (IA)
• IA32/X86 (8-bit, 16-bit and 32-bit Integer architecture)
  • X87 (Floating Point extension)
  • MMX (Multi-Media extension)
  • SSE, SSE2, SSE3 (Streaming SIMD Extensions)
• Intel® 64/EM64T (64-bit Integer extension of IA32)
• IA64 (Intel's new 64-bit architecture)
  • Itanium/Itanium 2 processor family
Architecture and Micro-architecture (cont.)
What is Micro-architecture?
• Same as µ-Architecture or u-Architecture
• “Invisible” features that provide meaningful value to the end user (whatever makes you buy a new compatible PC)
  • Programs run faster: Improved Performance
  • Reduced Power consumption: Extended Battery life
  • H/W fits into Smaller Form Factor
Intel® Architecture History
Architecture: Instruction set definition and compatibility
• Examples: EPIC* (Itanium®), IA-32, IXA* (XScale)
Microarchitecture: Hardware implementation maintaining instruction set compatibility with high-level architecture
• Examples: P5, P6, Intel NetBurst®, Banias
Processors: Productized implementation of Microarchitecture
• Examples: Pentium®; Pentium® Pro, Pentium® II/III; Pentium® 4, Pentium® D, Xeon®; Pentium® M
* EPIC – Explicitly Parallel Instruction Computing; IXA – Intel Internet Exchange Architecture
Intel® Core™ Microarchitecture Processors
Intel® NetBurst® + Mobile Microarchitecture + New Innovations
Intel® Core™ 2 Duo/Quad/Extreme processors
RISC Approach to CPU design
(RISC = Reduced Instruction Set Computers)
Optimize H/W for common basic operations
• Fixed instruction length
  • Shorter Execution Pipeline
  • Ease of Instruction Level Parallelism
• Large number of registers
  • Fewer memory accesses
• ‘Load/Store’ architecture
  • Shorter Execution Pipeline
  • Ease of advancing Loads
• Branch Hints
  • Reduce pipeline flush events
• ‘Exotic’ stuff to be implemented in S/W with minimal H/W support
  • No ‘complex’ H/W instructions
  • Handle exceptional conditions in S/W
Examples: MIPS, IBM Power and PowerPC, Sun Sparc
Achieve Maximum performance by right partitioning between H/W and S/W
CISC Approach to CPU design
(CISC = Complex Instruction Set Computers)
Rich architecture
• Variable length instructions
• Complex addressing modes
On-chip H/W / S/W partitioning required
• H/W keeps executing ‘simple’ stuff
• Complex instructions are ‘emulated’ using u-code routines from ROM
• More instructions treated as ‘simple’ as more H/W is available
COMPATIBILITY has some major advantages:
• Large (and forever increasing) software base
• Code development tools
• Expertise
• H/W - S/W spiral
Example: Intel IA32, Motorola 680X0
Maximize information passed to the H/W
Performance Measurement
Performance is the reciprocal of the “Time of execution”:

$$\text{Performance} \approx \frac{1}{\text{Time of Execution}} = \frac{1}{L \cdot CPI \cdot T_C}$$

Where:
• L = Code Length (# of machine instructions)
• CPI = Clock cycles Per Instruction
• T_C = Clock period (nSecs)
Substitute:
• IPC = Instructions Per Cycle = 1/CPI
• F = Frequency = 1/T_C

$$\text{Performance} \approx \frac{IPC \cdot F}{L}$$

• Higher IPC: Improve ILP
• Higher F: Improve Timing
• Smaller L: Arch Enhancements
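The two forms of the performance relation can be checked against each other numerically. The sketch below is illustrative only: the function names and the sample values (10^9 instructions, CPI of 0.5, 2 GHz) are assumptions, not figures from the slides.

```c
/* Execution time in seconds: L instructions, CPI cycles per instruction,
   Tc seconds per clock cycle (Time = L * CPI * Tc). */
double exec_time(double instr_count, double cpi, double clock_period_s) {
    return instr_count * cpi * clock_period_s;
}

/* Substituted form: Performance ~ IPC * F / L, with IPC = 1/CPI and F = 1/Tc. */
double performance(double instr_count, double ipc, double freq_hz) {
    return ipc * freq_hz / instr_count;
}
```

With the assumed values, `exec_time(1e9, 0.5, 0.5e-9)` is about 0.25 s and `performance(1e9, 2.0, 2.0e9)` is about 4, the reciprocal of that time. Doubling IPC, doubling F, or halving L each doubles the result.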
Performance Measurement (cont.)
Performance considerations:
• Which Code/Application to run?
• Which OS?
• Which other components in the platform?
• Under which thermal conditions?
• Multithreading? Multiprocessing?
Benchmarks examples
• Industry Standard
  • Spec (ISPEC, FSPEC)
  • TPC
• Commercial
  • SysMark
  • MobileMark
  • PCMark
  • Sandra
  • ScienceMark
• Applications
  • Video (Windows Media encoder, DivX)
  • Audio (Lame MP3)
  • Compression (RAR)
  • Content creation (3DSM, Photoshop, Premiere)
  • Latest Games (Doom III, FarCry, but changes fast)
• Specific industries use specific benchmarks
  • Linux compilation, POVRay, LinPack, lmbench
Design Considerations for Different Market Segments
Constraints:
• Desktop: thermally and area constrained
• Extreme: unconstrained
• Value: very area constrained
• Mobile: thermally, energy and area constrained
• Servers: thermally and energy constrained
Micro-architecture is the Art of Tradeoffs between:
• Schedule
• Requirements / Standards
• Performance
• Features
• Power / Energy
• Area / Cost
Design Metrics
IPC = Instructions per Cycle
• The more the better
Latency – same as Response Time
• The time interval between
  • when any request for data is made and
  • when the data transfer completes
• The less the better
Throughput
• The amount of work completed by the system per unit of time
• The more the better
• ops/sec
CPU Pipeline
Break the work into smaller pieces
• Four basic stages of instruction life
  • Fetch - bring instruction to core
  • Decode - read operands from register
  • Execute - perform the operation
  • Writeback - save result to register
• Execution timing of simple instructions (legend: “op src1,src2 → dst”)

  Cycle:                  1  2  3  4  5
  add eax, ebx → eax      F  D  E  W
  sub ecx, edx → ecx         F  D  E  W

Increased throughput
• increased number of completed instructions per cycle
Pipeline Design - Explore Parallelism
A new instruction does not always depend on the previous one
• Can start new instruction before previous one is finished
• ...if different stages use different H/W resources
Run instructions in parallel (pipeline)

  Cycle:                  1  2  3  4  5  6
  add eax, ebx → eax      F  D  E  W
  sub ecx, edx → ecx         F  D  E  W
  or  edi, esi → edi            F  D  E  W

Need to balance pipe stages
• Each stage should take same time for best throughput and utilization
• Clock cycle is determined by the longest path!
[Figure: Fetch, Decode, Exec and WB stages overlapped across successive instructions]
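The timing diagram and the "longest path" rule can be put into numbers. This is a minimal sketch assuming an ideal pipeline with no stalls; the function names and the stage delays used in the note are hypothetical.

```c
/* Cycles to complete n instructions on a k-stage pipeline with no stalls:
   the first instruction takes k cycles, every later one finishes one cycle
   after its predecessor. */
unsigned long pipelined_cycles(unsigned long n, unsigned long k) {
    return n == 0 ? 0 : k + (n - 1);
}

/* Same work without overlap: each instruction occupies all k stages alone. */
unsigned long unpipelined_cycles(unsigned long n, unsigned long k) {
    return n * k;
}

/* The clock period must cover the slowest stage, so it equals the
   maximum stage delay (any time unit). */
double clock_period(const double *stage_delay, int stages) {
    double worst = 0.0;
    for (int i = 0; i < stages; i++) {
        if (stage_delay[i] > worst) {
            worst = stage_delay[i];
        }
    }
    return worst;
}
```

For the three instructions and four stages shown above, `pipelined_cycles(3, 4)` gives 6 against 12 for `unpipelined_cycles(3, 4)`. With assumed stage delays of 1.0, 1.5, 1.2 and 0.8, the clock period is 1.5, which is why balanced stages give the best utilization.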
Pipeline Design – Fighting Stalls
Data flow dependency (instructions output/input)
• Solved by bypasses, renaming etc.
Control flow dependencies
• Solved by branch prediction
Others (cache misses, long latency instructions)
• Solved by other dynamic scheduling techniques
Race of CISC vs. RISC
In modern CPUs, advanced µ-Architecture techniques minimize the advantages of RISC over CISC
• Branch Prediction
  • Reduces the effect of extra pipeline stages
• Register Renaming
  • Effectively increases the number of registers
• Out Of Order
  • Reduces the number of stalls caused by shortage of registers
• Speculative Execution
  • Further reduces the number of stalls
• Power saving features
  • Reduce the overhead when not needed
µop – Intel’s Take on the CISC/RISC Race
(CISC) instructions are translated into one or more (RISC) uops (micro-operations)
• Fixed format
• Wide and simple
• Temp registers
Usually one uop per instruction
A complex instruction can be thousands of uops
Stores are divided into two uops (STA and STD)
Fusion plays games here
Power and Energy
Maximum power (TDP):
• Cooling requirements
• Cooling solution
• Computer form factor and acoustic noise
Average power
• Battery life
• Electricity bill
General calculation:
• P = frequency * voltage^2 * activity factor * capacitance + leakage
Reducing TDP
• Fewer transistors and wires
• Smaller transistors and wires
• Power features: less activity
• Low leakage transistors
Reducing average power
• Energy efficiency
• Power states
• Lower leakage
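The general calculation can be written out directly to see why voltage is the strongest lever. The following is a sketch with hypothetical function names; any numbers fed to it are illustrative, not measured values for a real processor.

```c
/* Dynamic (switching) part of the formula:
   frequency * voltage^2 * activity factor * capacitance. */
double dynamic_power(double freq_hz, double voltage_v,
                     double activity_factor, double capacitance_f) {
    return freq_hz * voltage_v * voltage_v * activity_factor * capacitance_f;
}

/* Total power adds the leakage term, which is paid even when nothing switches. */
double total_power(double freq_hz, double voltage_v, double activity_factor,
                   double capacitance_f, double leakage_w) {
    return dynamic_power(freq_hz, voltage_v, activity_factor, capacitance_f)
           + leakage_w;
}
```

Because voltage enters squared, dropping it by 10% at the same frequency cuts the dynamic term to about 81% of its previous value, while a 10% frequency drop alone only cuts it to 90%.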
Dual/Multi Core and SMT
Put more than one core per package
Architectural change:
• Software must be multi-threaded or multi-process
• …but backward compatible with multiprocessor systems (MP)
Several ways of implementing it
• All of them being used
[Figure: alternative package layouts built from Core, LLC and I/O blocks]
SMT: Run two (or more) threads on the same core, simultaneously
Intel Approach
[Figure: timeline of threads per package. Q4 2000: Intel® Pentium®, 1 thread. Q2 2003: Intel® Pentium® with HT, 2 threads. Q2 2005: Intel® Pentium® D Processor, 2 threads. Q3 2006: Intel® Core™ 2 Duo, 2 threads. Q4 2006: Intel® QX6700*, 4 threads. Future (?): 80 threads. Legend: State, Execution Units, Cache, Bus.]
While single core performance has increased due to clock speed, increased cache and improved ILP, the biggest performance increases have come from thread level parallelism.
An “Acronym Cheat Sheet” of Parallel Computing
CMP: Chip Multi Processor (two or more cores per package)
• Dual Core: two cores in same package
• Quad Core: four cores in same package
DP: Dual Processor (two packages)
MP: Multi Processor (four or more packages)
SMT: Simultaneous Multi-Threading (virtual multi core: HyperThreading)
Agenda Introduction Knowledge preparation Notable features • Wide Dynamic Execution • Smart Memory Access • Advanced Smart Cache • Advanced Digital Media Boost • Intelligent Power Capability Micro-architecture tour Coding considerations
Intel® Core® Micro-architecture Notable Features
Intel® Wide Dynamic Execution
• 14-stage efficient pipeline
• Wider execution path
• Advanced branch prediction
• Macro-fusion
  • Roughly 15% of all instructions are conditional branches
  • Macro-fusion fuses a comparison and jump to reduce micro-ops running down the pipeline
• Micro-fusion
  • Merges the load and operation micro-ops into one micro-op
• 64-Bit Support
  • Merom, Conroe, and Woodcrest support EM64T
[Figure: block diagram with Instruction Fetch and PreDecode, Instruction Queue, Decode with uCode ROM, Rename/Alloc, Schedulers, Retirement Unit (ReOrder Buffer), three execution ports (ALU, Branch, MMX/SSE, FPmove; ALU, FAdd, MMX/SSE, FPmove; ALU, FMul, MMX/SSE, FPmove), Load, Store, L1 D-Cache and D-TLB, 2M/4M shared L2 Cache, up to 10.4 Gb/s FSB; path widths labeled 5, 4 and 4]
Intel® Core® Micro-architecture Notable Features (cont.)
Intel® Advanced Memory Access
• Improved prefetching
• Memory disambiguation
  • Advance load before a possible data dependency (pointer conflict)
  • Earlier loads hide memory latencies
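The pointer-conflict case behind memory disambiguation looks like this in source code. The function below is a hypothetical illustration: C cannot show the hardware speculation itself, only the store-then-load pattern on which the hardware has to guess whether the two addresses overlap.

```c
/* A store through p followed by a load through q.
   If p and q point to different locations, the load does not depend on the
   store and could safely be started early. If they alias, the load must see
   the stored value. The result has to match program order either way. */
int store_then_load(int *p, const int *q) {
    *p = 1;          /* store: address known only once p is available */
    return *q + 1;   /* load: may or may not read what was just stored */
}
```

When `p` and `q` are distinct, the function returns the old value of `*q` plus one. When they alias, it returns 2. A core that advances the load ahead of the store must detect the second case and redo the load.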
Intel® Core® Micro-architecture Notable Features (cont.)
Intel® Advanced Smart Cache
• Multi-core optimization
  • Shared between the two cores
  • Advanced Transfer Cache architecture
  • Reduced bus traffic
  • Both cores have full access to the entire cache
  • Dynamic Cache sizing
Intel® Core® Micro-architecture Notable Features (cont.)
Advantages of Shared Cache
[Figure: CPU1 and CPU2, Cache Line, Front Side Bus (FSB), Memory. Shipping L2 Cache Line: ~half access to memory]
Intel® Core® Micro-architecture Notable Features (cont.)
Advantages of Shared Cache (cont.)
[Figure: CPU1 and CPU2, Cache Line, Front Side Bus (FSB), Memory. L2 is shared: no need to ship cache line]
Intel® Core® Micro-architecture Notable Features (cont.)
Intel® Advanced Digital Media Boost
• Single Cycle SIMD Operation
  • 8 Single Precision Flops/cycle
  • 4 Double Precision Flops/cycle
• Wide Operations
  • 128-bit packed Add
  • 128-bit packed Multiply
  • 128-bit packed Load
  • 128-bit packed Store
• Support for Intel® EM64T instructions
[Figure: SIMD Operation (SSE/SSE2/SSE3/SSSE3). SOURCE operands X4 X3 X2 X1 and Y4 Y3 Y2 Y1 (bits 127 to 0) go through an SSE/2/3 OP to DEST X4opY4 X3opY3 X2opY2 X1opY1. Core™ µarch: all four results in CLOCK CYCLE 1. Previous: X2opY2 X1opY1 in CLOCK CYCLE 1, X4opY4 X3opY3 in CLOCK CYCLE 2.]
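The single-cycle SIMD figure can be modeled in plain C. This is a portable scalar model, not SSE code: the `vec128` type and both function names are hypothetical, and the pass count is a simplification that ignores everything except execution-unit width.

```c
#define LANES 4

/* Model of one 128-bit register holding four single-precision lanes. */
typedef struct {
    float lane[LANES];
} vec128;

/* What a packed add computes: the same operation on every lane pair
   (X1+Y1 ... X4+Y4). */
vec128 packed_add(vec128 x, vec128 y) {
    vec128 dest;
    for (int i = 0; i < LANES; i++) {
        dest.lane[i] = x.lane[i] + y.lane[i];
    }
    return dest;
}

/* Passes through the execution unit needed for one 128-bit operation,
   given the width of the unit in bits (rounded up). */
int passes_for_128bit_op(int datapath_bits) {
    return (128 + datapath_bits - 1) / datapath_bits;
}
```

In this model a 128-bit unit needs one pass, matching the single clock cycle shown for the Core™ µarch, while a 64-bit unit needs two, matching the two clock cycles shown for the previous design.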
Intel® Core® Micro-architecture Notable Features
Intel® Advanced Digital Media Boost
• Additional Media Instructions - Supplemental Streaming SIMD Extensions 3 (SSSE3)
  • 16 new packed integer instructions
  • Targeting video encode/decode
• Significantly improved string operations
  • REP MOVS and REP STOS
  • ~8 bytes / cycle throughput
    • mileage may vary
Intel® Core® Micro-architecture Notable Features
Intel® Advanced Digital Media Boost
• Supplemental SSE-3 (SSSE-3)
  • Horizontal Addition/Subtraction: PHADDW, PHADDSW, PHADDD, PHSUBW, PHSUBSW, PHSUBD
  • Packed Absolute Values: PABSB, PABSW, PABSD
  • Multiply and Add Packed Signed/Unsigned bytes: PMADDUBSW
  • Packed multiply High with Round and Scale: PMULHRSW
  • Packed Shuffle Bytes: PSHUFB
  • Packed SIGN: PSIGNB/W/D
  • Packed Align Right: PALIGNR
Intel® Core® Micro-architecture Notable Features (cont.)
Intelligent Power Capability
• Advanced power gating & Dynamic power coordination
  • Multi-point demand-based switching
  • Voltage-Frequency switching separation
  • Supports transitions to deeper sleep modes
  • Event blocking
  • Clock partitioning and recovery
  • Dynamic Bus Parking
  • During periods of high performance execution, many parts of the chip core can be shut off
Agenda Introduction Knowledge preparation Notable features Micro-architecture tour • Front End • Out-Of-Order Execution Core • Memory Sub-system Coding considerations
Intel® Core® Micro-architecture Drill-down
[Figure: block diagram with icache, branch prediction unit, predecode, instruction queue, instruction decode, MS, register alias table, ALLOC, Reservation Station, Re-Order Buffer, integer / FP / SIMD execution (3x), load, store address, store data, memory order buffer, data cache unit, page miss handler]
Core® Micro-architecture Front End
Instruction preparation before execution
• Instruction Fetch Unit
• Instruction Queue
• Instruction Decode Unit
• Branch Prediction Unit
[Figure: front-end blocks: icache, branch prediction unit, predecode, instruction queue, instruction decode, MS]
Intel® Core™ Microarchitecture – Front End
Instruction Queue
Buffer between instruction pre-decode unit and decoder
• Up to six predecoded instructions written per cycle
• 18 instructions contained in IQ
• Up to 5 instructions read from IQ
Potential loop cache: Loop Stream Detector (LSD) support
• Re-use of decoded instruction
• Potential power saving
Intel® Core™ Microarchitecture – Front End
Macro-Fusion
Roughly 15% of all instructions are conditional branches.
Macro-fusion merges two instructions into a single micro-op, as if the two instructions were a single long instruction.
Enhanced Arithmetic Logic Unit (ALU) for macro-fusion. Each macro-fused instruction executes with a single dispatch.
Not supported in EM64T long mode
[Figure: the fused op “cmpjae eax, [mem], label” passes from Scheduler to Execution and Branch Eval, sending flags and target to Write back]
Intel® Core™ Microarchitecture – Front End
Macro-Fusion Absent
Enabling Example
for (int i=0; i<100000; i++) { … }
Instruction Queue
  addps xmm0, [EAX+16]
  mulps xmm0, xmm0
  movps [EAX+240], xmm0
  cmp eax, 100000
  jge label
• Read four instructions from Instruction Queue
• Each instruction gets decoded into separate uops
Cycle 1
  dec0: addps xmm0, [EAX+16]
  dec1: mulps xmm0, xmm0
  dec2: movps [EAX+240], xmm0
  dec3: cmp eax, 100000
Cycle 2
  dec0: jge label
Intel® Core™ Microarchitecture – Front End
Macro-Fusion Presented
Instruction Queue addps xmm0, [EAX+16]
Read five Instructions from Instruction Queue
mulps xmm0, xmm0
Send fusable pair to single decoder
movps [EAX+240], xmm0
cmp eax, 100000
Single uop represents two instructions
jae label
Enabling Example for (unsigned int i=0; i<100000; i++) { …
Cycle 1
addps xmm0, [EAX+16] mulps xmm0, xmm0 movps [EAX+240], xmm0
}
cmpjae
eax, 100000, label
dec0 dec1 dec2 dec3
Intel® Processor Micro-architecture - Core® microarchitecture 44 Copyright © 2006, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
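The two loops on the macro-fusion slides differ only in the signedness of the counter. The following is a minimal C sketch of the unsigned form. The function name `sum_first_n` is hypothetical, and whether a given compiler really emits a `cmp` followed by an unsigned conditional jump for it is an assumption that should be checked in the generated assembly.

```c
/* Hypothetical helper: sums the first n floats of data.
 * The unsigned counter lets a compiler test the loop bound with an
 * unsigned condition (cmp + jae/jb), the kind of pair the slides show
 * as fusable; a signed counter would typically use jge/jl instead. */
float sum_first_n(const float *data, unsigned int n)
{
    float sum = 0.0f;
    for (unsigned int i = 0; i < n; i++) {
        sum += data[i];
    }
    return sum;
}
```

Calling `sum_first_n(v, 4)` on an array holding 1, 2, 3, 4 returns 10; the result is the same as with a signed counter, only the branch encoding may differ.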
Instruction Decode / Micro-Op Fusion
• Frequent pairs of micro-operations derived from the same macro instruction can be fused into a single micro-operation
• Micro-op fusion effectively widens the pipeline
Instruction Decode / Micro-Op Fusion (cont.)
u-ops of a store “movps [EAX+240], xmm0”
• Separate u-ops: sta eax+240 (store address) and std xmm0, [eax+240] (store data)
• Fused u-op: st xmm0, [eax+240]
Branch Prediction Improvements
Intel® Pentium® 4 Processor branch prediction PLUS the following two improvements:
• Indirect Branch Predictor
• Loop Detector
Branch mispredictions reduced by >20%
Agenda Introduction Knowledge preparation Notable features Micro-architecture tour • Front End • Out-Of-Order Execution Core • Memory Sub-system Coding considerations
Core® Micro-architecture Execution Core
Accepts decoded u-ops, assigns resources, executes and retires u-ops
• Renamer
• Reservation station (RS)
• Issue ports
• Execution Unit
[Block diagram: register alias table, ALLOC, Reservation Station, Re-Order Buffer; execution resources: integer / FP / SIMD (3x), load, store address, store data]
Intel® Core™ Microarchitecture – Execution Core
Execution Core Building Blocks
• Renamer
• RS
• ROB
• Execution Unit
• Memory Sub-system
Execution Unit: ports (number)
• SIMD Integer: 0,1,5
• SIMD/Integer MUL
• Integer: 0,1,5
• Floating Point: 0,1,5
• Load: 2
• Store: 3,4
Issue Ports and Execution Units
6 dispatch ports from RS
• 3 execution ports (shared for integer / fp / simd)
• load
• store (address)
• store (data)
128-bit SSE implementation
• Port 0 has packed multiply (4 cycles SP, 5 cycles DP, pipelined)
• Port 1 has packed add (3 cycles, all precisions)
Retirement Unit
ReOrder Buffer (ROB)
• Holds micro-ops in various stages of completion
• Buffers completed micro-ops
• Updates the architectural state in order
• Manages ordering of exceptions
[Block diagram: register alias table, ALLOC, Reservation Station, Re-Order Buffer]
Agenda Introduction Knowledge preparation Notable features Micro-architecture tour • Front End • Out-Of-Order Execution Core • Memory Sub-system Coding considerations
Core® Micro-architecture Memory SubSystem
Memory Ordering Buffer
• Store Address Buffer (SAB)
  • Holds the address of each store that has not yet been performed
  • Each load compares its address against every store older than itself
  • If it finds a hole…
• Store Data Buffer
  • Holds the data of each store that has not yet been performed
  • If a load hits in the SAB, its data is forwarded from here
• Load Buffer
  • Holds the addresses of non-retired loads
  • Used for snoops and re-dispatch
• One 128-bit load and one 128-bit store per cycle to different memory locations
• Out-of-order memory operations
Intel® Core™ Microarchitecture – Memory Sub-system
Core® Micro-architecture Memory SubSystem (cont.)
• 32 KB D-Cache (8-way, 64-byte line size)
• Shared second-level (L2) instruction and data cache: 2 MB 8-way or 4 MB 16-way
• Cache-to-cache transfer
  • improves producer / consumer style MP
• Wider interface to L2
  • reduced interference
  • processor line fill is 2 cycles
• Higher bandwidth from the L2 cache to the core
  • ~14 clock latency and 2 clock throughput
Load & Store access order
1. L1 cache of immediate core
2. L1 cache of the other core
3. L2 cache
4. Memory
[Diagram: Core1 and Core2 sharing a 2 MB L2 Cache, connected to the Bus]
Advanced Memory Access / Enhanced Data Pre-fetch Logic
Speculates the next needed data and loads it into cache by HW and/or SW
[Analogy diagram: Door (L1 Cache), Valet Parking Area (L2 Cache), Main Parking Lot (External Memory)]
Advanced Memory Access / Enhanced Data Pre-fetch Logic (cont.)
• L1D cache prefetching
  • Data Cache Unit (DCU) Prefetcher
    • Known as the streaming prefetcher
    • Recognizes ascending access patterns in recently loaded data
    • Prefetches the next line into the processor's cache
  • Instruction Based Stride Prefetcher
    • Prefetches based upon a load having a regular stride
    • Can prefetch forward or backward 2 Kbytes (1/2 default page size)
• L2 cache prefetching: Data Prefetch Logic (DPL)
  • Prefetches data to the 2nd level cache before the DCU requests the data
  • Maintains 2 tables for tracking loads
    • Upstream – 16 entries
    • Downstream – 4 entries
  • Every load is either found in the DPL or generates a new entry
  • Upon recognition of the 2nd load of a “stream” the DPL will prefetch the next load
Advanced Memory Access / Memory Disambiguation
Memory Disambiguation predictor
• Loads that are predicted NOT to forward from a preceding store are allowed to schedule as early as possible
  • increasing the performance of OOO memory pipelines
Disambiguated loads checked at retirement
• Extension to existing coherency mechanism
• Invisible to software and system
Advanced Memory Access / Memory Disambiguation Absent
Load4 must WAIT until previous stores complete
[Diagram: operations in program order Store1 (address Y), Load2 (Y), Store3 (W), Load4 (X); Memory holds Data W, Data Z, Data Y, Data X]
Advanced Memory Access / Memory Disambiguation Present
• Loads can decouple from stores
• Load4 can get its data WITHOUT waiting for stores
[Diagram: Load4 (address X) moves ahead of Store1 (Y), Load2 (Y), Store3 (W); Memory holds Data W, Data Z, Data Y, Data X]
Advanced Memory Access / Store Forwarding
If a load follows a store and reloads the data that the store writes to memory, the micro-architecture can forward the data directly from the store to the load
[Diagram: Store1 (address Y) followed by Load2 (Y); Data Y is supplied from the internal buffers instead of Memory]
Advanced Memory Access / Store Forwarding: Aligned Store Cases
[Diagram: aligned stores of 16, 32, 64 and 128 bits, each shown with the loads of equal or smaller size (128-, 64-, 32-, 16- and 8-bit) that fall within the stored data]
Advanced Memory Access / Store Forwarding: Unaligned Cases
Note that unaligned store forward does not occur when the load crosses a cache line boundary
[Diagram: unaligned stores of 16, 32 and 64 bits, each shown with the 64-, 32-, 16- and 8-bit loads that fall within the stored data. Legend: store forwarded to load / no forwarding. ‡: no forwarding if the load crosses a cache line boundary]
Note: Unaligned 128-bit stores are issued as two 64-bit stores. This provides two alignments for store forwarding
Agenda Introduction Knowledge preparation Notable features Micro-architecture tour Coding considerations
Optimizing for Instruction Fetch and PreDecode
Avoid “Length Changing Prefixes” (LCPs)
• Affects instructions with immediate data or offset
  • Operand Size Override (66H)
  • Address Size Override (67H) [obsolete]
• LCPs change the length decoding algorithm, increasing the processing time from one cycle to six cycles (or eleven cycles when the instruction spans a 16-byte boundary)
• The REX (EM64T) prefix (4xH) is not an LCP
  • The REX prefix does lengthen the instruction by one byte, so use of the first eight general registers in EM64T is preferred
Optimizing for Instruction Queue
Includes a “Loop Stream Detector” (LSD)
• Potentially very high bandwidth instruction streaming
• A number of requirements to make use of the LSD
  • Maximum of 18 instructions in up to four 16-byte packets
  • No RET instructions (hence, little practical use for CALLs)
  • Up to four taken branches allowed
  • Most effective at 70+ iterations
• LSD is after PreDecode so there is no added cost for LCPs
• Trade off LSD with conventional loop unrolling
Optimizing for Decode
Decoder issues up to 4 uOps for renaming/allocation per clock
• This creates a trade-off between more complex instruction uOps versus multiple simple instruction uOps
  • For example, a single four-uOp instruction is all that can be renamed/allocated in a single clock
  • In some cases, multiple simple instructions may be a better choice than a single complex instruction
• Single-uOp instructions allow more decoder flexibility
  • For example, 4-1-1-1 can be decoded in one clock
  • However, 2-2-2-1 takes three clocks to decode
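The 4-1-1-1 versus 2-2-2-1 example can be reproduced with a small model. This is a sketch under a simplifying assumption, not a cycle-accurate simulator: each clock, decoder 0 accepts one instruction of 1 to 4 uOps and decoders 1 to 3 each accept one following single-uOp instruction. The function name `decode_clocks` is hypothetical.

```c
#include <stddef.h>

/* Simplified decoder model (an assumption for illustration):
 * per clock, decoder 0 takes one instruction of 1..4 uOps, and
 * decoders 1..3 each take one following single-uOp instruction.
 * Decoding stops for the clock at the first instruction that does
 * not fit the next free decoder. */
unsigned decode_clocks(const unsigned *uops, size_t count)
{
    unsigned clocks = 0;
    size_t i = 0;
    while (i < count) {
        i++;  /* decoder 0: any instruction of 1..4 uOps */
        for (int dec = 1; dec <= 3 && i < count && uops[i] == 1; dec++) {
            i++;  /* decoders 1..3: single-uOp instructions only */
        }
        clocks++;
    }
    return clocks;
}
```

Under this model the sequence {4, 1, 1, 1} needs one clock and {2, 2, 2, 1} needs three, matching the slide.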
Optimizing for Execution
Up to six uOps can be dispatched per clock
• “Store Data” and “Store Address” dispatch ports are combined on the block diagram
Up to four results can be written back per clock
Single clock latency operations are best
• Differing latency operations can create writeback conflicts
• Separate multiple-clock uOps with several single-uOp instructions
  • Typical instructions here: ADC/SBB, RMW, CMOVcc
• In some cases, separating an RMW instruction into its pieces might be faster (decode and scheduling flexibility)
When equivalent, PS preferred to PD (LCP)
• For example, MOVAPS over MOVAPD, XORPS over XORPD
Optimizing for Execution (cont.)
Bypass register “access” preferred to register reads
Partial register accesses often lead to stalls
• Register size access that ‘conflicts’ with recent previous register write
• Partial XMM updates subject to dependency delays
• Partial flag stalls can occur too, at much higher cost
  • Use a TEST instruction between the shift and the conditional to prevent them
• Common zeroing instructions (e.g., XOR reg,reg) don’t stall
Avoid bypass between execution domains
• For example: FP (ADDPS) and logical ops (PAND) on XMMn
Vectorization: careful packing/unpacking sequence
• Use MXCSR’s FZ and DAZ controls as appropriate
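The FZ and DAZ controls mentioned above can be set from C through the standard SSE control-register macros. A minimal sketch follows; the helper name `enable_fz_daz` is hypothetical, and whether giving up strict denormal handling is acceptable depends on the application.

```c
#include <xmmintrin.h>  /* _MM_SET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>  /* _MM_SET_DENORMALS_ZERO_MODE */

/* Hypothetical helper: sets the FZ and DAZ bits in MXCSR for the
 * calling thread, trading denormal precision for speed. */
void enable_fz_daz(void)
{
    /* FZ: denormal results are flushed to zero */
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    /* DAZ: denormal inputs are treated as zero */
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}
```

MXCSR is per-thread state, so a call like this would normally be made once at the start of each worker thread that runs SSE floating-point code.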
Optimizing for Memory
Software prefetch instructions
• Can reach beyond a page boundary (including page walk)
• Prefetches only when it completes without an exception
General techniques to help these prefetchers
• Organize data in consecutive lines
• In general, increasing addresses are more easily prefetched
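As an illustration of a software prefetch combined with consecutive, ascending accesses, the sketch below walks an array and requests a cache line some distance ahead. The function name `sum_with_prefetch` and the distance `PREFETCH_AHEAD` are assumptions chosen for the example; a useful distance has to be found by measurement.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Prefetch distance in elements: 64 floats = four 64-byte lines.
 * Illustrative value only; it needs tuning per workload. */
#define PREFETCH_AHEAD 64

float sum_with_prefetch(const float *data, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        /* Once per 64-byte line (16 floats), request a line further
         * ahead. Staying inside the array keeps the pointer valid. */
        if ((i % 16) == 0 && i + PREFETCH_AHEAD < n) {
            _mm_prefetch((const char *)(data + i + PREFETCH_AHEAD),
                         _MM_HINT_T0);
        }
        sum += data[i];  /* consecutive, ascending accesses */
    }
    return sum;
}
```

For a simple linear walk like this one the hardware prefetchers described earlier may already cover the pattern, so the explicit prefetch is shown for its form rather than as a guaranteed gain.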
Summary
What has been covered
• Notable features of Core® Micro-architecture
  • Wide Dynamic Execution
  • Advanced Memory Access
  • Advanced Smart Cache
  • Advanced Digital Media Boost
  • Power Efficient Support
• Core® Micro-architecture components
  • Front End
  • OOO execution core
  • Memory sub-system
Platform
Intel provides most of the silicon on any computer
Classical platform partition
• CPU – Computation
• MCH – high speed IO
• ICH – low speed IO
Graphics speed and memory latencies will require different partition
This presentation focuses on the core microarchitecture
[Block diagram: CPU (Core, LLC, Core) connected over FSB to MCH, which connects to MEM (DDR) and over DMI to ICH. Other labels: PEG, PCIe, Graphics, Display, HD video, TVout, Analog, ME, Legacy & Debug I/O, Wireless, PCI (IO), SATA, USB, KBRD, others]
Intel® 64 = Extending IA-32 to 64 Bit
• Extended Memory Addressability: 64-bit pointers, registers
• Additional Registers: 8 SSE & 8 general purpose
• Double Precision (64-bit) Integer Support
With 64-Bit Extension Technology
Added to Intel Xeon™ and Pentium® 4 processors in 2004; today available in all mainstream Intel IA-32 processors, in particular in all processors based on Intel® Core™ Architecture
Intel® 64 - New Modes of Operation
Mode | OS required | Compile required | 64-bit IP | RIP-relative | New registers | GPR width | Default address size | Default operand size
Long Mode, 64-bit Mode | New 64-bit OS | Yes | Yes | Yes | Yes | 64 | 64 | 32
Long Mode, Compatibility Mode | New 64-bit OS | No | No | No | No | 32 | 32 or 16 | 32 or 16
Legacy Mode (IA-32 Mode) | Legacy 32-bit or 16-bit OS | No | No | No | No | 32 | 32 or 16 | 32 or 16
Registers: Extensions and Additions
• General purpose registers extended from 32 bits (EAX, EBX, ECX, EDX, EBP, ESI, EDI, ESP; bits 31..0) to 64 bits (RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP; bits 63..0)
• New 64-bit general purpose registers R8 to R15
• Instruction pointer extended from EIP to RIP (bits 63..0)
• SSE registers XMM0 to XMM7 (bits 127..0) joined by new registers XMM8 to XMM15
• x87/MMX registers (bits 79..0)
Registers: Availability in different modes
64-bit Mode of Operation
Default data size is 32 bits
• Override to 64 bits using new REX prefix
All registers are 64-bit, 32-bit, 16-bit and 8-bit addressable
REX prefixes
• A family of 16 prefixes, encoded 0x40-0x4F
• Allows the use of general purpose registers as 64 bits
• Allows the use of new registers (like r8-r15)
Instructions that set a 32-bit register automatically zero-extend the upper 32 bits
REX Prefix
A new instruction-prefix byte used in 64-bit mode
• Specify the new GPRs and SSE registers
• Specify a 64-bit operand size
• Specify extended control registers (used by system software)
An instruction can have only one REX prefix and, if used, it must immediately precede the opcode or the two-byte opcode escape prefix.
The legacy instruction-size limit of 15 bytes still applies to instructions that contain a REX prefix.
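The 0x40-0x4F range corresponds to the bit layout 0100WRXB, where the low four bits select the extensions listed above. A minimal decoding sketch in C follows; the struct and function names are hypothetical, and the byte is assumed to come from code running in 64-bit mode (outside that mode the same values encode other instructions).

```c
#include <stdbool.h>
#include <stdint.h>

/* Fields of a REX prefix byte (layout 0100WRXB). */
struct rex_fields {
    bool w;  /* 1 = 64-bit operand size */
    bool r;  /* extends the ModRM reg field */
    bool x;  /* extends the SIB index field */
    bool b;  /* extends the ModRM r/m, SIB base, or opcode reg field */
};

/* Returns true and fills *out when byte lies in the REX range
 * 0x40-0x4F; returns false and leaves *out untouched otherwise. */
bool rex_decode(uint8_t byte, struct rex_fields *out)
{
    if ((byte & 0xF0) != 0x40) {
        return false;
    }
    out->w = (byte & 0x08) != 0;
    out->r = (byte & 0x04) != 0;
    out->x = (byte & 0x02) != 0;
    out->b = (byte & 0x01) != 0;
    return true;
}
```

For example, 0x48 decodes to W=1 with R, X and B clear, which is the prefix used to make an instruction operate on 64-bit operands without touching the new registers.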
Physical and Linear Addressing
Linear Addressing
• Initial Intel® 64 implementations support 48 bits of virtual addressing.
• Addresses are required to be in canonical form: bits 47 through 63 must all be 1 or all be 0.
Physical Addressing
• The initial Netburst™ Intel® 64 implementation supported 36 bits; today all current processors support at least 40 bits
• Entries in page tables expanded for up to 52 bits of physical address.
Intel® 64 - Large Memory Considerations
Canonical addressing for 64-bit addresses
• Although the architecture now allows calculating flat addresses to 64 bits, today’s processors limit virtual addressing to 48 bits
• Canonical address definition: an address that has address bits 63 through 47 set to either all ones or all zeros
• Canonical addresses are a requirement
• Values for addresses that are not canonical will cause faults when put into locations expecting a valid address, such as segment registers
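The canonical-form rule can be expressed as a short check on bits 63 through 47. The sketch below assumes the 48-bit virtual address width stated on the slide; the function name `is_canonical_48` is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when bits 63..47 of addr are all zeros or all ones, i.e. the
 * address is canonical for a 48-bit virtual address space. */
bool is_canonical_48(uint64_t addr)
{
    uint64_t upper = addr >> 47;  /* the 17 bits 63..47 */
    return upper == 0 || upper == 0x1FFFF;
}
```

With this definition 0x00007FFFFFFFFFFF and 0xFFFF800000000000 are canonical, while 0x0000800000000000 is not.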
Introducing SIMD: Single Instruction Multiple Data
Scalar processing
• traditional mode
• one operation produces one result
• X + Y = X+Y
SIMD processing
• with SSE / SSE2
• one operation produces multiple results
• (x3, x2, x1, x0) + (y3, y2, y1, y0) = (x3+y3, x2+y2, x1+y1, x0+y0)
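The scalar and SIMD forms of the addition can be written side by side in C with SSE intrinsics. This is a minimal sketch; the function names `add4_scalar` and `add4_sse` are hypothetical, and unaligned loads and stores are used so that no alignment assumption is placed on the caller.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_add_ps */

/* Scalar processing: four separate additions. */
void add4_scalar(const float *x, const float *y, float *out)
{
    for (int i = 0; i < 4; i++) {
        out[i] = x[i] + y[i];
    }
}

/* SIMD processing: one packed addition produces four results. */
void add4_sse(const float *x, const float *y, float *out)
{
    __m128 vx = _mm_loadu_ps(x);  /* x3 x2 x1 x0 */
    __m128 vy = _mm_loadu_ps(y);  /* y3 y2 y1 y0 */
    _mm_storeu_ps(out, _mm_add_ps(vx, vy));
}
```

Both functions produce the same four sums; the SSE version does so with a single ADDPS instead of four scalar additions.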
X86 Register Sets
SSE registers introduced first in Pentium® 3
IA-INT Registers (32 bits: eax … edi)
• Fourteen 32-bit registers
• Scalar data & addresses
• Direct access to regs
MMX™ Technology / IA-FP Registers (80/64 bits: st0 … st7, mm0 … mm7)
• Eight 80/64-bit registers
• Hold data only
• Stack access to FP0..FP7
• Direct access to MM0..MM7
• No MMX™ Technology / FP interoperability
SSE Registers (128 bits: xmm0 … xmm7)
• Eight 128-bit registers
• Hold data only: 4 x single FP numbers, 2 x double FP numbers, 128-bit packed integers
• Direct access to the registers
• Use simultaneously with FP / MMX Technology
Instruction Set Extensions
New Instructions Added to Intel® Processors
[Bar chart: number of new instructions per extension]
Extension | Introduced | Process (nm) | New instructions
MMX™ | Jan-97 | 350 | 56
Streaming SIMD Extensions (SSE) | Feb-99 | 250 | 70
Streaming SIMD Extensions 2 (SSE2) | Dec-00 | 180 | 144
Streaming SIMD Extensions 3 (SSE3) | Feb-04 | 90 | 13
Supplemental SSE3 (SSSE3) | Jul-06 | 65 | 32
SSE-4 (future Intel instruction set extensions) | 2008+ | 45 | ~50
Beginning in 2008: ~50 new instructions in 13 groups
• All function in 32-bit and 64-bit modes
• Improvements in Commercial Data Integrity (i-SCSI), Video Processing, String and Text Processing, 2D & 3D Imaging, Vectorizing Compiler Performance
SSE and SSE-2 Data Types
SSE
• 4x floats
SSE-2
• 2x doubles
• 16x bytes
• 8x 16-bit shorts
• 4x 32-bit integers
• 2x 64-bit integers
• 1x 128-bit(!) integer
SSE Instruction Set Extensions
• Introduced by Pentium® 3 in 1999; now frequently called SSE-1
• Only new data type supported: 4x32-bit (single precision) floating point data
• Some 70 instructions
  • Arithmetic, compare, convert operations on SSE SP FP data (packed, unpacked)
  • Data load/store
  • Prefetch
  • Extension of MMX
  • Streaming Store (store without using cache in between)
  • …
2001 PTE Engineering Enabling Conference
SSE Sample: Branch Removal
R = (A < B) ? C : D //remember: everything packed
Elements listed as 3, 2, 1, 0:
  A:                   0.0,   0.0,   -3.0,  3.0
  B:                   0.0,   1.0,   -5.0,  5.0
  mask = cmplt(A, B):  00000, 11111, 00000, 11111
  C:                   c3,    c2,    c1,    c0
  D:                   d3,    d2,    d1,    d0
  mask and C:          00000, c2,    00000, c0
  mask nand D:         d3,    00000, d1,    00000
  R = or of the two:   d3,    c2,    d1,    c0
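The cmplt / and / nand / or sequence maps directly onto SSE intrinsics. The following is a minimal sketch of that mapping; the function name `select_lt4` is hypothetical, and `_mm_andnot_ps` plays the role of the "nand" step, computing (NOT mask) AND D.

```c
#include <xmmintrin.h>  /* SSE: _mm_cmplt_ps, _mm_and_ps, ... */

/* Computes r[i] = (a[i] < b[i]) ? c[i] : d[i] for four floats
 * without a branch. */
void select_lt4(const float *a, const float *b,
                const float *c, const float *d, float *r)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_loadu_ps(c);
    __m128 vd = _mm_loadu_ps(d);

    __m128 mask   = _mm_cmplt_ps(va, vb);      /* all ones where a < b */
    __m128 from_c = _mm_and_ps(mask, vc);      /* keep c where a < b */
    __m128 from_d = _mm_andnot_ps(mask, vd);   /* keep d elsewhere */
    _mm_storeu_ps(r, _mm_or_ps(from_c, from_d));
}
```

With the slide's inputs (element 0 first: A = 3, -3, 0, 0 and B = 5, -5, 1, 0) the result picks c0, d1, c2, d3, as in the diagram.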
SSE-2 Instruction Set Extensions
Introduced by Intel® Pentium® 4 processor in 2000
Some 140 new instructions
• Added double precision floating point data (2x64-bit) and all related instructions, including conversion
• Again some extensions to MMX
• Added all possible combinations of integer data to SSE (1x128, 2x64, 4x32, 8x16, 16x8) and related operations
SIMD Single vs. SIMD Double
4 x Single Precision: SSE-1
• SIMD SP FP operand = 4 elements (X3, X2, X1, X0) in bits 127..0
• Element = SP FP number: S (bit 31), Exponent (bits 30..23), Significand (bits 22..0)
2 x Double Precision: SSE-2
• SIMD DP FP operand = 2 elements (X1, X0) in bits 127..0
• Element = DP FP number: S (bit 63), Exponent (bits 62..52), Significand (bits 51..0)
Sample for SSE-2: SIMD Double ↔ SIMD Int Conversion
SIMD Double → SIMD Int: conversion to two lower ints, two higher ints cleared
  input:   x1 | x0
  output:  00000 | 00000 | (int)x1 | (int)x0
  __m128d x; __m128i ix;
  ix = _mm_cvtpd_epi32(x);
SIMD Int → SIMD Double: conversion from two lower ints
  input:   ???? | ???? | ix1 | ix0
  output:  (double)ix1 | (double)ix0
  x = _mm_cvtepi32_pd(ix);
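To make the two conversion calls concrete, here is a small sketch that wraps them so the results can be inspected from ordinary arrays. The wrapper names `doubles_to_ints` and `ints_to_doubles` are hypothetical; only `_mm_cvtpd_epi32` and `_mm_cvtepi32_pd` come from the slide.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Convert two doubles to 32-bit ints; the two upper ints are cleared. */
static void doubles_to_ints(const double in[2], int out[4])
{
    __m128d x  = _mm_loadu_pd(in);
    __m128i ix = _mm_cvtpd_epi32(x);
    _mm_storeu_si128((__m128i *)out, ix);
}

/* Convert the two lower ints back to doubles; the upper ints are ignored. */
static void ints_to_doubles(const int in[4], double out[2])
{
    __m128i ix = _mm_loadu_si128((const __m128i *)in);
    __m128d x  = _mm_cvtepi32_pd(ix);
    _mm_storeu_pd(out, x);
}
```

One detail worth keeping in mind: `_mm_cvtpd_epi32` rounds according to the current MXCSR rounding mode (round-to-nearest by default), whereas a C cast truncates. `_mm_cvttpd_epi32` is the truncating variant.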
SSE3: No new Data Types but new Instructions
• FISTTP: FP to integer conversions
• ADDSUBPD, ADDSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP: Complex arithmetic
• LDDQU: Video encoding
• HADDPD, HSUBPD, HADDPS, HSUBPS: SIMD FP using AOS format*
• MONITOR, MWAIT: Thread synchronization
* Also benefits Complex and Vectorization
Streaming SIMD Extensions 3
13 new instructions
Three have limited use for application performance improvement
• FISTTP - x87 to integer conversion (requires –longdouble switch)
• MONITOR/MWAIT - thread synchronization
• Available today in Ring 0 only; being used by newer Windows* and Linux* thread packages
The other ten have some potential for specific application domains
SSE-3 Sample Complex Arithmetic: ADDSUBPS
ADDSUBPS OperandA, OperandB
• OperandA (xmm register; 4 data elements): a3, a2, a1, a0
• OperandB (xmm reg. or memory addr; 4 data elements): b3, b2, b1, b0
• Result (stored in OperandA): a3+b3, a2-b2, a1+b1, a0-b0
__m128 _mm_addsub_ps(__m128 a, __m128 b)
  element:      3       2       1       0
  operation:   Add     Sub     Add     Sub
  result:     a3+b3   a2-b2   a1+b1   a0-b0
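The link between ADDSUBPS and complex arithmetic may be easier to see in a scalar model. The sketch below emulates the instruction in plain C (so it runs without SSE3 compiler switches) and uses it for a complex multiply; `addsub_ps_model` and `complex_mul2` are hypothetical names, and the {re, im, re, im} layout is an assumption for the example.

```c
/* Scalar model of ADDSUBPS: even elements subtract, odd elements add. */
static void addsub_ps_model(const float a[4], const float b[4], float r[4])
{
    for (int i = 0; i < 4; i++)
        r[i] = (i & 1) ? a[i] + b[i] : a[i] - b[i];
}

/* Multiply two pairs of complex numbers stored as {re0, im0, re1, im1}.
 * (a + bi)(c + di) = (ac - bd) + (ad + bc)i */
static void complex_mul2(const float x[4], const float y[4], float r[4])
{
    float t1[4], t2[4];
    for (int k = 0; k < 4; k += 2) {
        t1[k]     = x[k]     * y[k];      /* a*c */
        t1[k + 1] = x[k]     * y[k + 1];  /* a*d */
        t2[k]     = x[k + 1] * y[k + 1];  /* b*d */
        t2[k + 1] = x[k + 1] * y[k];      /* b*c */
    }
    /* real part needs a subtract, imaginary part an add: one ADDSUBPS */
    addsub_ps_model(t1, t2, r);
}
```

With real hardware the two temporaries would come from shuffles and packed multiplies, and the final step would be a single `_mm_addsub_ps`.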
Sample SSSE-3 Inst.: Byte Permute
PSHUFB mm, mm/m64
PSHUFB xmm, xmm/m128
• A complete byte-granularity permutation
• The source operand is used as the control field (variable control)
• The destination operand gets permuted
• Each byte of the source field selects the origin of the corresponding destination byte
• Also includes force-byte-to-zero flag (bit 7)
Example (64-bit form, bytes listed from byte 7 down to byte 0)
  src:            0x07 0x07 0xFF 0x80 0x01 0x00 0x00 0x00
  dest (before):  0x04 0x01 0x07 0x03 0x02 0x02 0xFF 0x01
  dest (after):   0x04 0x04 0x00 0x00 0xFF 0x01 0x01 0x01
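The permutation rule can be checked against the example values with a scalar model. This is a sketch in portable C rather than the SSSE3 intrinsic; `pshufb_model` is a hypothetical name, and `n` is assumed to be 8 (mm form) or 16 (xmm form).

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of PSHUFB for n = 8 or n = 16 bytes.
 * For each byte i: if bit 7 of src[i] is set, dest[i] becomes 0;
 * otherwise the low bits of src[i] select a byte of the old dest. */
static void pshufb_model(uint8_t *dest, const uint8_t *src, size_t n)
{
    uint8_t old[16];
    memcpy(old, dest, n);
    for (size_t i = 0; i < n; i++)
        dest[i] = (src[i] & 0x80) ? 0 : old[src[i] & (n - 1)];
}
```

Arrays are indexed from byte 0 upward, so the example's rows appear reversed when written as C initializers.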
Ways to SSE/SIMD programming
Coding using SSE/SSE2/3/4 assembler instructions
• Very tedious (manually schedule) – discouraged: Don’t do it!
• E.g.: How do you exploit the benefits of having now 16 instead of 8 SSE registers for Intel® 64 without maintaining two versions?
Intel® compiler’s C/C++ SIMD intrinsics
• No need to take care of register allocation, scheduling etc.
Intel® compiler’s C++ Vector Class Library
• Use this if you are heavy into C++ classes
Vectorizer of Intel® C++ and Fortran Compilers
• Recommended for most cases – easy and efficient
Use ready-to-go vectorized code from a library like Intel® Math Kernel Library (MKL)
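As an illustration of the vectorizer route, the loop below is the kind of code an auto-vectorizing compiler can usually map to packed SSE instructions. It is a sketch only: the name `scale_add` and the use of `restrict` are assumptions for the example, and whether a particular compiler vectorizes it depends on its version and switches.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i]
 * restrict promises that x and y do not overlap, which removes the
 * aliasing question that often prevents vectorization. */
static void scale_add(float a, const float *restrict x,
                      float *restrict y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The source stays portable, and the same function can be recompiled for a different SIMD level without maintaining separate versions.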
Compiler Based Vectorization
Processor-specific switches (Linux*): generate code and optimize for
• -xK, -axK: Pentium® 3 compatible and Athlon XP processors, including code generation for MMX and SSE
• -xW, -axW: Pentium® 4 compatible, Athlon 64, Opteron processors in 32 and 64 bit mode, including code generation for MMX, SSE and SSE2
• -xN, -axN: Pentium® 4 processors in 32-bit mode, including code generation for MMX, SSE and SSE2 - deprecated switch: use -xW instead
• -xB, -axB: Pentium® M processors, including code generation for MMX, SSE and SSE-2
• -xP, -axP: Intel® processors with SSE3 capability including Pentium 4 (both 32 and 64 bit mode), including code generation for MMX, SSE, SSE2 and SSE-3
• -xT, -axT: Intel® processors with MNI capability, i.e. Intel® Core™2 Duo processors (Conroe, Merom, Woodcrest), including code generation for MMX, SSE, SSE2, SSE3 and MNI
Intel® Core® Micro-architecture Notable Features (cont.)
New Instructions
Instruction name: Description
• psignb/w/d mm, mm/m64; psignb/w/d xmm, xmm/m128: Per element, if the source operand is negative, multiply the destination operand by -1.
• pabsb/w/d mm, mm/m64; pabsb/w/d xmm, xmm/m128: Per element, overwrite destination with absolute value of source.
• phaddw/d/sw mm, mm/m64; phaddw/d/sw xmm, xmm/m128: Pairwise integer horizontal addition + pack.
• phsubw/d/sw mm, mm/m64; phsubw/d/sw xmm, xmm/m128: Pairwise integer horizontal subtract + pack.
• PMADDUBSW mm, mm/m64; PMADDUBSW xmm, xmm/m128: Multiply signed & unsigned bytes. Accumulate result to signed words. (Multiply Accumulate)
• PMULHRSW mm, mm/m64; PMULHRSW xmm, xmm/m128: Signed 16-bit multiply with round and scale, return high bits.
• PSHUFB mm, mm/m64; PSHUFB xmm, xmm/m128: A complete byte-granularity permutation, including force-to-zero flag.
• PALIGNR mm, mm/m64, imm8; PALIGNR xmm, xmm/m128, imm8: Extract any contiguous 16 (8 in the 64 bit case) bytes from the pair [dst, src] and store them to the dst register.
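The PALIGNR description ("extract contiguous bytes from the pair [dst, src]") can be modelled in a few lines of scalar C. This is a sketch of the 128-bit form only; `palignr_model` is a hypothetical name, and the code emulates the instruction rather than calling the SSSE3 intrinsic.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of the 128-bit PALIGNR: treat [dst : src] as one 32-byte
 * value (src in the low half), shift it right by imm8 bytes, and keep
 * the low 16 bytes in dst. Bytes shifted in from beyond the pair are 0. */
static void palignr_model(uint8_t dst[16], const uint8_t src[16],
                          unsigned imm8)
{
    uint8_t pair[32];
    memcpy(pair, src, 16);        /* low half  */
    memcpy(pair + 16, dst, 16);   /* high half */
    for (unsigned i = 0; i < 16; i++) {
        unsigned j = i + imm8;
        dst[i] = (j < 32) ? pair[j] : 0;
    }
}
```

A typical use is reading a 16-byte window that straddles two aligned 16-byte blocks without issuing an unaligned load.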
Dependencies and Bypasses
“Read-after-Write” dependency - 1 clock stall, assuming the register file can be written-through
  add eax, ecx → eax     F D E W
  sub ebx, eax → ebx       F D D E W
“E to D” bypass - saves the clock penalty
  add eax, ecx → eax     F D E W
  sub ebx, eax → ebx       F D E W
Long latency operations
  load [ecx+edi] → eax   F D E E E W
  add ebx, eax → ebx       F D D D E W
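The three timing diagrams follow one rule, which the toy model below tries to capture. It is a sketch under simplifying assumptions (a 4-stage F/D/E/W pipe, one instruction fetched per clock, the consumer immediately following its producer); the function name `raw_stall_clocks` is hypothetical.

```c
/* Toy model of a 4-stage in-order pipe (F, D, E, W).
 * The producer is fetched in clock 1, the dependent consumer in clock 2.
 * Returns how many extra clocks the consumer spends waiting in D. */
static int raw_stall_clocks(int producer_exec_clocks, int has_bypass)
{
    int producer_last_e = 2 + producer_exec_clocks; /* F=1, D=2, E=3.. */
    int producer_w = producer_last_e + 1;
    int consumer_e_no_hazard = 4;                   /* F=2, D=3, E=4 */

    /* With an E-to-D bypass the consumer can execute right after the
     * producer's last E clock. Without it, the consumer decodes in the
     * producer's W clock (written-through register file) and executes
     * one clock later. */
    int consumer_e = has_bypass ? producer_last_e + 1 : producer_w + 1;
    if (consumer_e < consumer_e_no_hazard)
        consumer_e = consumer_e_no_hazard;
    return consumer_e - consumer_e_no_hazard;
}
```

Calling it with (1, 0), (1, 1) and (3, 1) reproduces the three cases on the slide: 1, 0 and 2 stall clocks.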
Fighting Stalls: Branch Handling
Given the code: for (i=100, a=0; i>0; i--) a+=B[i];
Compiler would generate
// eax initialized with zero, edi initialized with 100
loop:
  load  B[edi] → ebx      // read B[i] from memory
  add   eax, ebx → eax    // a+=B[i]
  add   edi, -1 → edi     // i-=1
  jnz   edi, loop
  store eax → a           // store result
Fighting Stalls: Branch Handling (cont.)
  load  B[edi] → ebx     F D E W
  add   eax, ebx → eax     F D E W
  add   edi, -1 → edi        F D E W
  jnz   edi, loop              F D E W
  store eax → a                  F D E W
  xxx                              F D E W
  load  B[edi] → ebx                 F D E W
Only after the branch's Execute stage do we know that the next fetch was wrong
• Need to flush the pipe
• IPC: 4 instructions in 6 clocks (IPC = 0.66 vs. optimum IPC = 1)
• ‘Pipe break’ penalty = 2 clocks
• Adding a stage?: IPC = 0.57, ~14% slower!!!
Prolonging the pipeline achieves higher frequencies; however, the pipe break penalty increases!
MUST solve the pipe break penalty problem!
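The IPC figures above follow from simple arithmetic, sketched here in C. The helper names are hypothetical, and the model assumes the added stage sits before Execute, so that it lengthens the flush penalty by one clock.

```c
/* Clocks lost after a mispredicted branch: every fetch made before the
 * branch reaches Execute is on the wrong path. With 1-based stage
 * numbering (F=1, D=2, E=3 in the 4-stage pipe) that is index - 1. */
static int flush_penalty(int execute_stage_index)
{
    return execute_stage_index - 1;
}

/* IPC of a loop whose closing branch is resolved only in Execute. */
static double loop_ipc(int instructions_per_iteration,
                       int execute_stage_index)
{
    int clocks = instructions_per_iteration
               + flush_penalty(execute_stage_index);
    return (double)instructions_per_iteration / clocks;
}
```

For the four-instruction loop, `loop_ipc(4, 3)` gives 4/6 and `loop_ipc(4, 4)` gives 4/7; the ratio 6/7 is the roughly 14% slowdown quoted on the slide.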
Fighting Stalls: Branch Handling (cont.)
H/W can ‘learn’ about SW behavior
• Same branch goes same direction in most cases
• Learn branch address and target
  • Branch Target Buffer (BTB)
• Predict based on branch history, surrounding branch behavior, loop behavior.
  • We are at ~95% correct prediction.
• Looks in BTB while fetching instruction
• Lee&Smith or Yeh&Patt algorithms
New (and correct) pointer calculated in Fetch stage of branch
  load  B[edi] → ebx     F D E W
  add   eax, ebx → eax     F D E W
  add   edi, -1 → edi        F D E W
  jnz   edi, loop              F/P D E W
  load  B[edi] → ebx             F D E W
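As a hedged illustration of "learning" a branch's behavior, the sketch below simulates a 2-bit saturating counter, a classic history-based scheme, on the loop-closing `jnz` of the earlier example. It is not a description of the predictor in Core microarchitecture; the function name and the state encoding are assumptions.

```c
/* 2-bit saturating counter: states 0 and 1 predict not-taken,
 * states 2 and 3 predict taken.
 * Simulates the closing branch of a loop that runs `iterations` times:
 * taken on every pass except the last. Returns correct predictions. */
static int count_correct_predictions(int iterations, int initial_state)
{
    int state = initial_state;
    int correct = 0;
    for (int i = 0; i < iterations; i++) {
        int taken = (i < iterations - 1);
        int predicted_taken = (state >= 2);
        if (predicted_taken == taken)
            correct++;
        /* train the counter with the actual outcome */
        if (taken) {
            if (state < 3) state++;
        } else {
            if (state > 0) state--;
        }
    }
    return correct;
}
```

For the 100-iteration loop, starting in the weakly-taken state (2) gives 99 correct predictions; starting in the strongly-not-taken state (0) gives 97, since the counter needs two passes to warm up.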
Advanced Pipeline Techniques
Limitations of the Typical Pipeline Scheme
• IPC is theoretically limited to 1
• In practice IPC is less than 1 because of long latency operations, stalls (e.g. cache miss), pipeline flushes (due to branch misprediction) etc.
• Pipeline stages are frequently not balanced
• Cycle Time (Tc) is determined by the longest pipeline stage
Advanced Pipeline Techniques
• Super pipeline
• Super-scalar
Advanced Pipeline Techniques (cont.)
Super pipeline: shorter stages allow higher frequency
  F1 F2 D1 D2 E1 E2 W1 W2
     F1 F2 D1 D2 E1 E2 W1 W2
        F1 F2 D1 D2 E1 E2 W1 W2
Super-scalar: perform more in a single cycle
  F D E W
  F D E W
    F D E W
    F D E W
Fighting stalls: Out Of Order Execution (OoO)
Instructions are executed based on “data flow” rather than program order (Tomasulo’s algorithm)
1. Instruction Fetch and Decode.
2. Instruction queue @ Reservation Station.
3. Instruction
  • waits in the queue until all input operands are available
  • may leave the queue before earlier, older instructions.
  This avoids the stall that occurs at this stage in an in-order processor.
4. Instruction Execution
5. Results are queued.
6. Instruction Reorder and Writeback.
Fighting stalls: Register Renaming
Creates new opportunities for OOO execution
• Eliminates Write-after-write (WAW) and Write-after-read (WAR) dependencies = hazards.
Architectural vs physical registers
Example 1
  MULTD F4,F2,F2    reads from F2
  ADDD  F2,F0,F6    writes to F2
after renaming
  MULTD F4,F2,F2
  ADDD  F8,F0,F6    (assume F8 is unused)
Example 2 (dispatch)
  1. mov eax, [m1]
  2. add eax, 2
  3. mov [m2], eax
  4. mov eax, [m3]
  5. add eax, 4
  6. mov [m4], eax
4, 5, 6 can be executed in parallel with 1, 2, 3, but only after register renaming!!!
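The six-instruction example can be traced with a very small renamer. The sketch below is a simplified model, not the hardware's register alias table: every destination write gets a fresh physical register, and sources are looked up in a map first. Type and function names are hypothetical, and each instruction is reduced to at most one source and one destination register.

```c
#include <stddef.h>

enum { NUM_ARCH_REGS = 8, NO_REG = -1 };

/* One source and one destination register per instruction is enough
 * for the example; NO_REG marks an unused operand. */
typedef struct { int dst; int src; } Instr;

static void rename_registers(const Instr *in, Instr *out, size_t n)
{
    int map[NUM_ARCH_REGS];
    int next_phys = NUM_ARCH_REGS;   /* physical regs 0..7 hold old state */
    for (int r = 0; r < NUM_ARCH_REGS; r++)
        map[r] = r;

    for (size_t i = 0; i < n; i++) {
        /* read the source mapping before the destination is remapped */
        out[i].src = (in[i].src == NO_REG) ? NO_REG : map[in[i].src];
        if (in[i].dst == NO_REG) {
            out[i].dst = NO_REG;
        } else {
            map[in[i].dst] = next_phys;   /* fresh register per write */
            out[i].dst = next_phys++;
        }
    }
}
```

With eax as architectural register 0, instructions 1 to 3 end up using physical registers 8 and 9 while instructions 4 to 6 use 10 and 11, so the two chains no longer share a register and can proceed independently.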
Fighting Stalls: Re-Order Buffer (ROB)
Mechanism for renaming and retirement
Table contains in-order instructions
• Instructions are entered in order
• Registers renamed by the entry number
• Once assigned: execution order unimportant
• After execution: entries marked
• An executed entry can be “retired” once all prior instructions have retired. That is:
  • Update “real registers” with value of renamed regs
  • Update memory
  • Leave the ROB
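The in-order-entry, in-order-retire rule can be sketched as a circular buffer. This is a minimal model for illustration only: the size `ROB_SIZE`, the type `Rob`, and the function names are assumptions, and the register and memory updates performed at retirement are left out.

```c
enum { ROB_SIZE = 8 };   /* illustrative size, not the real ROB depth */

typedef struct {
    int executed[ROB_SIZE];  /* 1 once the entry has finished executing */
    int head;                /* oldest entry still in the ROB */
    int count;               /* number of occupied entries */
} Rob;

static void rob_init(Rob *rob)
{
    rob->head = 0;
    rob->count = 0;
}

/* Enter an instruction in program order.
 * Returns its entry number, or -1 if the ROB is full. */
static int rob_allocate(Rob *rob)
{
    if (rob->count == ROB_SIZE)
        return -1;
    int entry = (rob->head + rob->count) % ROB_SIZE;
    rob->executed[entry] = 0;
    rob->count++;
    return entry;
}

/* Execution may complete in any order. */
static void rob_mark_executed(Rob *rob, int entry)
{
    rob->executed[entry] = 1;
}

/* Retire from the head while the oldest entry is executed.
 * Returns the number of entries retired. */
static int rob_retire(Rob *rob)
{
    int retired = 0;
    while (rob->count > 0 && rob->executed[rob->head]) {
        rob->head = (rob->head + 1) % ROB_SIZE;
        rob->count--;
        retired++;
    }
    return retired;
}
```

If the youngest of three entries finishes first, `rob_retire` returns 0 until the oldest entry is marked executed, which is the "once all prior instructions have retired" condition.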
Fighting Stalls: Reservation Station(s)
• Pool(s) of all “not yet executed” instructions
• Maintains operand status “ready / not-ready”
• Each cycle, executed instructions make more operands “ready”
• Instructions whose operands are all “ready” can be “dispatched” for execution
• Dispatcher chooses which of the “ready” instructions will be executed next
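A toy reservation station shows how a younger instruction can be dispatched ahead of an older one that is still waiting for an operand. The model below is a sketch under stated assumptions: at most one dispatch per cycle, oldest-ready-first selection, and producers always older than their consumers. All names are hypothetical.

```c
enum { NOT_ISSUED = -1 };

typedef struct {
    int src[2];       /* index of the producing entry, or -1 if ready */
    int latency;      /* cycles from dispatch until the result is ready */
    int issue_cycle;  /* NOT_ISSUED until dispatched */
} RsEntry;

static int operand_ready(const RsEntry *rs, int producer, int cycle)
{
    if (producer < 0)
        return 1;
    if (rs[producer].issue_cycle == NOT_ISSUED)
        return 0;
    return rs[producer].issue_cycle + rs[producer].latency <= cycle;
}

/* Dispatch at most one entry per cycle, oldest ready entry first.
 * Writes the dispatch order to order[] and returns the cycle count. */
static int run_reservation_station(RsEntry *rs, int n, int *order)
{
    int issued = 0, cycle = 0;
    for (int i = 0; i < n; i++)
        rs[i].issue_cycle = NOT_ISSUED;
    while (issued < n) {
        for (int i = 0; i < n; i++) {
            if (rs[i].issue_cycle == NOT_ISSUED &&
                operand_ready(rs, rs[i].src[0], cycle) &&
                operand_ready(rs, rs[i].src[1], cycle)) {
                rs[i].issue_cycle = cycle;
                order[issued++] = i;
                break;
            }
        }
        cycle++;
    }
    return cycle;
}
```

With a 3-cycle load (entry 0), an add that depends on it (entry 1), and an independent add (entry 2), the dispatch order is 0, 2, 1: the independent add fills the time an in-order pipe would spend stalled.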
Fighting Stalls: Memory Order Buffer (MOB)
Idea - allow out-of-order execution among memory operations
Problem
• Memory dependencies cannot be fully resolved statically (memory disambiguation)
Structure similar in concept to ROB
• Every access is allocated an entry
• Address & data (for stores) are updated when known
• Load is checked against all previous stores
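The "load is checked against all previous stores" step can be sketched as a scan over the older store entries. The policy below is deliberately conservative (a load waits behind any older store whose address is still unknown) and assumes same-size, aligned accesses; it is an illustration, not the disambiguation logic of the actual hardware. All type and function names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    int      addr_known;
    uint32_t addr;
    int      data_known;
    uint32_t data;
} StoreEntry;

typedef enum { LOAD_FROM_MEMORY, LOAD_FORWARDED, LOAD_MUST_WAIT } LoadAction;

/* Check a load against all older stores, youngest first.
 * stores[0] is the oldest entry, stores[n_older - 1] the youngest. */
static LoadAction check_load(const StoreEntry *stores, int n_older,
                             uint32_t load_addr, uint32_t *value)
{
    for (int i = n_older - 1; i >= 0; i--) {
        if (!stores[i].addr_known)
            return LOAD_MUST_WAIT;        /* might alias: be conservative */
        if (stores[i].addr == load_addr) {
            if (!stores[i].data_known)
                return LOAD_MUST_WAIT;    /* right store, data not ready */
            *value = stores[i].data;      /* store-to-load forwarding */
            return LOAD_FORWARDED;
        }
    }
    return LOAD_FROM_MEMORY;              /* no older store conflicts */
}
```

A more aggressive design would let the load go ahead speculatively past stores with unknown addresses and recover if the guess turns out wrong.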
Intel® Core® Micro-architecture Notable Features (cont.)
Intelligent Power Capability - Split Busses (core power feature)
• Many buses are sized for worst-case data (e.g. an x86 instruction of 15 bytes; an ALU that can write back 128 bits)
• By splitting buses to deal with varying data widths, we can gain the performance benefit of bus width while keeping dynamic capacitance (C dynamic) closer to that of thinner buses
Improved Energy Efficiency
Agenda Introduction Knowledge refreshment Notable features Micro-architecture drill-down • Front End • Out-Of-Order Execution Core • Memory Sub-system Coding considerations
Intel® Core® Micro-architecture Overview
[Block diagram: System Bus, Bus Unit, 2nd Level Cache, 1st Level Cache (Data); Front End: Instruction Fetch Unit, Decode/IQ, Branch Prediction Unit; Execution Core: Renamer/Allocator, Scheduler, Execution Unit, Buffers (Retirement)]
Intel® Core® Micro-architecture Drill-down
[Block diagram: icache, branch prediction unit, predecode, instruction queue, instruction decode, MS, register alias table, ALLOC, Reservation Station, execution ports (integer / FP / SIMD (3x), load, store address, store data), Re-Order Buffer, memory order buffer, data cache unit, page miss handler]
Example Code to Be Used
  …
  addps xmm0, [EAX+16]
  mulps xmm0, xmm0
  movps [EAX+240], xmm0
  cmp EAX, 100000
  jge label
  …
Core® Micro-architecture Front End
Prepares instructions before they are executed
• Instruction Fetch Unit
• Instruction Queue
• Instruction Decode Unit
• Branch Prediction Unit
Intel® Core™ Microarchitecture – Front End
Instruction Fetch Unit
• Prefetches instructions that are likely to be executed
• Caches frequently-used instructions
• Predecodes and buffers instructions
[Diagram: front-end blocks (icache, branch prediction unit, predecode, instruction queue, instruction decode, MS) shown within the overview block diagram]
Instruction Fetch Unit (cont.)
I-Cache (Instruction Cache)
• 32 KBytes / 8-way / 64-byte line
• 16 aligned bytes fetched per cycle
ITLB (Instruction Translation Lookaside Buffer)
• 128 4k pages, 8 2M pages
Instruction Prefetcher
• 16-byte aligned lookup through the ITLB into the instruction cache and instruction prefetch buffers
Instruction Pre-decoder
• Instruction Length Decode (predecode)
• Avoid Length Changing Prefixes (LCP), for example
  • Avoid in loop: MOV dx, 1234h
  • The REX (EM64T) prefix (4xH) is not an LCP
Instruction format: Instruction Prefixes (66H/67H) | Opcode | ModR/M | SIB | Displacement | Immediate
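The I-cache geometry above implies 32 KB / (8 ways x 64 bytes) = 64 sets. The sketch below works through the usual set-associative address split for those numbers. It is generic cache arithmetic, not a statement about which address bits the hardware actually uses for indexing; the function names are hypothetical.

```c
#include <stdint.h>

enum {
    LINE_BYTES  = 64,
    WAYS        = 8,
    CACHE_BYTES = 32 * 1024,
    NUM_SETS    = CACHE_BYTES / (WAYS * LINE_BYTES),  /* 64 sets */
    FETCH_BYTES = 16
};

/* Set selected by an address: line number modulo the number of sets. */
static uint32_t cache_set_index(uint64_t addr)
{
    return (uint32_t)((addr / LINE_BYTES) % NUM_SETS);
}

/* Remaining upper bits, compared against the 8 ways of the set. */
static uint64_t cache_tag(uint64_t addr)
{
    return addr / ((uint64_t)LINE_BYTES * NUM_SETS);
}

/* Start of the aligned 16-byte block fetched for this address. */
static uint64_t fetch_block_start(uint64_t addr)
{
    return addr & ~(uint64_t)(FETCH_BYTES - 1);
}
```

Because fetch is in aligned 16-byte blocks, a branch target near the end of a block yields only a few useful bytes in that cycle, which is one reason hot loop entries are often aligned.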
Instruction Queue
Buffer between instruction pre-decode unit and decoder
• Up to six predecoded instructions written per cycle
• 18 instructions contained in IQ
• Up to 5 instructions read from IQ
Potential loop cache
Loop Stream Detector (LSD) support
• Re-use of decoded instructions
• Potential power saving
[Diagram: front-end blocks (icache, branch prediction unit, predecode, instruction queue, instruction decode, MS) shown within the overview block diagram]
Instruction Decode
• Decodes the instructions into micro-ops
• Ready for execution in the OOO core
[Diagram: front-end blocks (icache, branch prediction unit, predecode, instruction queue, instruction decode, MS) shown within the overview block diagram]
Instruction Decode
Decoders
Features
• Macro-fusion
• Micro-fusion
• Stack Pointer Tracking
Instruction Decode / Decoders
Instructions converted to micro-ops (uops)
• 1-uop instructions include load+op, stores, indirect jump, RET...
4 decoders: 1 “large” and 3 “small”
• All decoders handle “simple” 1-uop instructions
• One large decoder handles instructions up to 4 uops
All decoders work in parallel
• Four(+) instructions / cycle
Micro-Sequencer takes over for long flows (the large decoder handles instructions of 2~4 uops; the uCode ROM handles more complex ones)
Code Sequence in Front End
Instruction Queue (IQ)
  addps xmm0, [EAX+16]
  mulps xmm0, xmm0
  movps [EAX+240], xmm0
  cmp EAX, 100000
  jne label
• These instructions took more than one fetch, as they are 22 bytes
• IQ buffers them together
• All instructions are decodable by all decoders
• CMP and adjacent JCC are “fused” into a single uop
• Up to 5 instructions decoded per cycle
Decoders: Large (dec0), small (dec1), small (dec2), small (dec3)
Resulting uops
  load_add xmm0, xmm0, [EAX+16]
  mulps xmm0, xmm0, xmm0
  sta_std [EAX+240], xmm0
  cmpjne EAX, 100000, label
Instruction Decode / Macro-Fusion
• Roughly 15% of all instructions are conditional branches.
• Macro-fusion merges two instructions into a single micro-op, as if the two instructions were a single long instruction.
• Enhanced Arithmetic Logic Unit (ALU) for macro-fusion. Each macro-fused instruction executes with a single dispatch.
• Not supported in EM64T long mode
[Diagram: Scheduler → cmpjae eax, [mem], label → Execution → Branch Eval → flags and target to Write back]
Instruction Decode / Macro-Fusion Absent
• Read four instructions from Instruction Queue
• Each instruction gets decoded into separate uops
Enabling Example
  for (int i=0; i<100000; i++) { … }
Instruction Queue
  addps xmm0, [EAX+16]
  mulps xmm0, xmm0
  movps [EAX+240], xmm0
  cmp eax, 100000
  jge label
Cycle 1
  dec0: addps xmm0, [EAX+16]
  dec1: mulps xmm0, xmm0
  dec2: movps [EAX+240], xmm0
  dec3: cmp eax, 100000
Cycle 2
  dec0: jge label
Instruction Decode / Macro-Fusion Present
• Read five instructions from Instruction Queue
• Send fusable pair to single decoder
• Single uop represents two instructions
Enabling Example
  for (unsigned int i=0; i<100000; i++) { … }
Instruction Queue
  addps xmm0, [EAX+16]
  mulps xmm0, xmm0
  movps [EAX+240], xmm0
  cmp eax, 100000
  jae label
Cycle 1
  dec0: addps xmm0, [EAX+16]
  dec1: mulps xmm0, xmm0
  dec2: movps [EAX+240], xmm0
  dec3: cmpjae eax, 100000, label
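The difference between the two decode examples can be reproduced with a toy slot-counting model. It is only a sketch: it assumes four decoders, treats every instruction as a simple one-slot instruction, and recognizes just the fusable pairs that appear in these slides (`cmp` followed by `jae` or `jne`); the full set of fusable pairs and other decoder restrictions are not modelled. Function names are hypothetical.

```c
#include <string.h>

enum { NUM_DECODERS = 4 };

/* Only the pairs shown in the slides; the real rule set is larger. */
static int fuses(const char *first, const char *second)
{
    if (strcmp(first, "cmp") != 0)
        return 0;
    return strcmp(second, "jae") == 0 || strcmp(second, "jne") == 0;
}

/* Number of decode cycles for a program-order list of mnemonics. */
static int decode_cycles(const char *const mnemonics[], int n)
{
    int slots = 0;
    for (int i = 0; i < n; i++) {
        if (i + 1 < n && fuses(mnemonics[i], mnemonics[i + 1]))
            i++;              /* the pair shares one decoder slot */
        slots++;
    }
    return (slots + NUM_DECODERS - 1) / NUM_DECODERS;
}
```

The signed-counter loop (ending in `cmp` + `jge`) needs five slots and therefore two cycles, while the unsigned-counter loop (ending in `cmp` + `jae`) fits in four slots and one cycle, matching the two slides.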
Instruction Decode / Macro-Fusion (cont.)
Benefits
• Reduces latency
• Increased renaming
• Increased retire bandwidth
• Increased virtual storage
• Power savings
Enabling Greater Performance & Efficiency
Intel® Core™ Microarchitecture – Front End
Instruction Decode
Decoders Features
• Macro-fusion
• Micro-fusion
• Stack Pointer Tracking
Intel® Core™ Microarchitecture – Front End
Instruction Decode / Micro-Op Fusion
• Frequent pairs of micro-operations derived from the same Macro Instruction can be fused into a single micro-operation
• Micro-op fusion effectively widens the pipeline
Intel® Core™ Microarchitecture – Front End
Instruction Decode / Micro-Fusion (cont.)
u-ops of a Store “movps [EAX+240], xmm0”
• sta eax+240 (store address)
• std xmm0, [eax+240] (store data)
Fused into a single u-op: st xmm0, [eax+240]
Intel® Core™ Microarchitecture – Front End
Instruction Decode / Stack Pointer Tracker (Extended Stack Pointer folding)
ESP is calculated by dedicated logic
• No explicit Micro-Ops updating ESP
• Micro-Ops saving
• Power saving
[Figure: Decoder 0 … Decoder N processing PUSH EAX, PUSH EDX, POP EBX; labels: ESPd=8, Recovery Information]
Core® Micro-architecture Front End
• Instruction Fetch Unit
• Instruction Queue
• Instruction Decode Unit
• Branch Prediction Unit
Intel® Core™ Microarchitecture – Front End
Branch Prediction Unit
Allow executing instructions long before the branch outcome is decided
• Superset of Prescott / Pentium-M features
• One taken branch every other clock
• Branch predictions for 32 bytes at a time, twice the width of the fetch engine
[Figure: pipeline block diagram; labels: Front End, Instruction Fetch Unit, icache, predecode, branch prediction unit, instruction queue, instruction decode, IQ/Decode, MS, BTBs/Branch Prediction, Execution Core, Renamer/Allocator, Scheduler, Execution Unit, Buffers (Retirement), 1st Level Cache (Data), 2nd Level Cache]
Intel® Core™ Microarchitecture – Front End
Branch Prediction Unit (cont.)
16-entry Return Stack Buffer (RSB)
Front end queuing of BPU lookups
Type of predictions
• Direct Calls and Jumps
• Indirect Calls and Jumps
• Conditional branches
Intel® Core™ Microarchitecture – Front End
Branch Prediction Improvements
Intel® Pentium® 4 Processor branch prediction PLUS the following two improvements:
• Indirect Branch Predictor
• Loop Detector
Branch mispredictions reduced by >20%
Agenda
Introduction
Knowledge preparation
Notable features
Micro-architecture drill-down
• Front End
• Out-Of-Order Execution Core
• Memory Sub-system
Coding considerations
Core® Micro-architecture Execution Core
Accepts decoded u-ops, assigns resources, executes and retires u-ops
• Renamer
• Reservation station (RS)
• Issue ports
• Execution Unit
[Figure: pipeline block diagram; labels: register alias table, ALLOC, Reservation Station, Re-Order Buffer, integer / FP / SIMD (3x), load, store address, store data, Front End, Instruction Fetch Unit, IQ/Decode, BTBs/Branch Prediction, Execution Core, Renamer/Allocator, Scheduler, Execution Unit, Buffers (Retirement), 1st Level Cache (Data), 2nd Level Cache]
Intel® Core™ Microarchitecture – Execution Core
Execution Core Building Blocks
• Renamer
• RS
• ROB
• Execution Unit
• Memory Sub-system
Ports (number)
• 0,1,5 SIMD Integer
• 0,1,5 Integer
• 0,1,5 Floating Point
• SIMD/Integer MUL
• 2 Load
• 3,4 Store
Intel® Core™ Microarchitecture – Execution Core
Rename and Resources
4 uops renamed / retired per clock
• one taken branch, any # of untaken
• one fxch per cycle
Uops written to RS and ROB
• Decoded uops are renamed and allocated resources by the RAT, then sent to ROB read and the RS
• RS waits for sources to arrive, allowing OOO execution
• Registers not “in flight” are read from the ROB during RS write
[Figure: register alias table, ALLOC, Reservation Station, Re-Order Buffer]
Intel® Core™ Microarchitecture – Execution Core
Issue Ports and Execution Units
6 dispatch ports from RS
• 3 execution ports (shared for integer / fp / simd)
• load
• store (address)
• store (data)
128-bit SSE implementation
• Port 0 has packed multiply (4 cycles SP, 5 DP, pipelined)
• Port 1 has packed add (3 cycles all precisions)
FP data has one additional cycle bypass latency
• Do not mix SSE FP and SSE integer ops on same register
Avoid:
addps xmm0, xmm1
pand xmm0, xmm3
addps xmm2, xmm0
Better:
addps xmm0, xmm1
addps xmm2, xmm0
pand xmm0, xmm3
[Figure: dispatch ports; labels: integer / FP / SIMD (3x), load, store address, store data]
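The cost of mixing domains can be estimated with a toy latency model. The sketch below is an illustration under stated assumptions, not a description of the hardware scheduler: the type and function names are hypothetical, the 3-cycle packed add and the 1-cycle bypass penalty come from the slide, and the 1-cycle latency used for pand in the usage note is an assumption.

```c
#include <stddef.h>

/* Execution domain of an SSE operation (hypothetical names). */
typedef enum { DOMAIN_FP, DOMAIN_SIMD_INT } domain_t;

/* One operation in a chain where each op consumes the previous result. */
typedef struct {
    domain_t domain;  /* domain that executes the op */
    int latency;      /* latency of the op in cycles */
} op_t;

/* Total latency of a serial dependency chain: the sum of the op latencies
   plus one bypass cycle each time a result crosses between domains. */
int chain_latency(const op_t *chain, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i > 0 && chain[i].domain != chain[i - 1].domain)
            total += 1;  /* bypass delay between execution domains */
        total += chain[i].latency;
    }
    return total;
}
```

With those numbers, the "Avoid" chain addps, pand, addps comes to 3 + (1 + 1) + (1 + 3) = 9 cycles, while the "Better" ordering leaves only addps, addps on the path to the final add, which is 6 cycles.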
Intel® Core™ Microarchitecture – Execution Core
The Out Of Order
each uop only takes a single RS entry
• load + add dispatches twice (load, then add)
• mulps dispatches once, after load + add has written back
• sta + std dispatches twice
• sta (address) can fire as early as possible
• std must wait for mulps to write back
• cmpjne dispatches only once (functionality is truly fused)
• cmpjne has no dependency, can fire as early as it wants
RS entries (oldest first):
load_add xmm0, xmm0, [EAX+16]
mulps xmm0, xmm0, xmm0
sta_std [EAX+240], xmm0
cmpjne EAX, 100000, label
Intel® Core™ Microarchitecture – Execution Core
Dispatching to OOO EXE
RS holds the uops of four loop iterations. Each iteration contributes:
load_add xmm0, xmm0, [EAX+16]
mulps xmm0, xmm0, xmm0
sta_std [EAX+240], xmm0 (store offsets 240, 244, 248, 24C in the four iterations)
cmpjne EAX, 100000, label
Dispatch ports:
• 0 GP (incl FP mul)
• 1 GP (incl FP add)
• 2 Load
• 3 STA
• 4 STD
• 5 GP (incl jmp)
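The port list can be turned into a rough per-iteration port-pressure count. This is a simplified sketch with hypothetical names: it pins every uop to a single port following the slide's labels (FP mul on port 0, FP add on port 1, the fused compare and jump on port 5), whereas real GP uops may issue on any of ports 0, 1 and 5.

```c
#include <stddef.h>

enum { NUM_PORTS = 6 };

/* Uop kinds of the example loop; each value is the dispatch port
   assumed for that kind in this simplified model. */
typedef enum {
    UOP_FP_MUL  = 0,  /* port 0: GP (incl FP mul) */
    UOP_FP_ADD  = 1,  /* port 1: GP (incl FP add) */
    UOP_LOAD    = 2,  /* port 2: Load */
    UOP_STA     = 3,  /* port 3: store address */
    UOP_STD     = 4,  /* port 4: store data */
    UOP_CMP_JCC = 5   /* port 5: GP (incl jmp) */
} uop_t;

/* Returns the number of dispatches on the most heavily used port.
   In this model that count is a lower bound on cycles per iteration. */
int busiest_port_count(const uop_t *uops, size_t n)
{
    int per_port[NUM_PORTS] = {0};
    int max = 0;
    for (size_t i = 0; i < n; i++) {
        int port = (int)uops[i];
        per_port[port]++;
        if (per_port[port] > max)
            max = per_port[port];
    }
    return max;
}
```

For the example loop (load, add, mulps, sta, std, cmpjne) every port receives exactly one dispatch per iteration, so no single port stands out as the bottleneck in this model.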
Intel® Core™ Microarchitecture – Memory Sub-system
Advanced Memory Access
3 clk latency and 1 clk throughput of L1D; 14 and 2 for L2
Miss Latencies
• L1 miss hits L2 ~ 10 cycles
• L2 miss, access to memory ~300 cycles (server/FBD)
• L2 miss, access to memory ~165 cycles (Desk/DDR2)
• C step Broadwater is reported to have ~50ns latency
Cache Bandwidth
• Bandwidth to cache ~ 8.5 bytes/cycle
Memory Bandwidth
• Desktop ~ 6 GB/sec/socket (linux)
• Server ~3.5 GB/sec/socket
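The figures above mix cycles, nanoseconds and bytes per cycle. A small sketch of the unit conversions follows; the helper names are hypothetical, and the core clock frequency is an input the caller must supply because the slide does not state one.

```c
/* Convert a latency in nanoseconds to core clock cycles.
   core_ghz is the core frequency in GHz (cycles per nanosecond). */
double ns_to_cycles(double ns, double core_ghz)
{
    return ns * core_ghz;
}

/* Convert a bandwidth in bytes per cycle to GB/s (1 GB = 1e9 bytes). */
double bytes_per_cycle_to_gb_per_s(double bytes_per_cycle, double core_ghz)
{
    return bytes_per_cycle * core_ghz;
}
```

As an illustration only, at an assumed 2.4 GHz core clock a 50 ns latency corresponds to about 120 cycles, and 8.5 bytes/cycle to about 20.4 GB/s.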
Optimizing for Intel® Core™ Microarchitecture
Use CMP = employ both Cores
• Go to multithreading!
Prefer SSE as much as possible. If you have not done so yet, vectorize the code now!
• Intel Compiler has a very good vectorization engine
Align data and data layout (sequential)
• To align use
__declspec(align (16)) float a[1000];
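The `__declspec(align(16))` form is compiler-specific. As a hedged alternative, the sketch below uses the standard C11 `alignas` specifier for the same array and adds a hypothetical helper, `is_aligned_16`, to check a pointer before it is used with aligned SSE loads and stores.

```c
#include <stdalign.h>
#include <stdint.h>

/* C11 spelling of a 16-byte aligned array, matching the slide's a[1000]. */
alignas(16) float a[1000];

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
int is_aligned_16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}
```

`is_aligned_16(a)` holds for the array itself, while an address 4 bytes into it (such as `&a[1]`) does not pass the check.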
Optimizing for Intel® Core™ Microarchitecture (advanced)
Use Intel VTune™ Performance Analyzer to reveal performance problems
• CPI
• Specific CPU events for Core-arch: RESOURCE_STALLS.RS_FULL, L2_IFETCH.SELF.MESI, RESOURCE_STALLS.ROB_FULL, etc. (see VTune help)
Front End Issue Debugging
Look for Front End optimization only when code is FE bound
• Reservation station (RS) is the front end and allocation target
• Low RESOURCE_STALLS.RS_FULL and poor CPI should be debugged as a front end issue
• If there are no issues in the FE, the RS should be full above 30% of the time
Front End typical issues:
• Code is too big to fit in the L1:
  • When L2_IFETCH.SELF.MESI happens every 10-15 instructions
  • Code that could have had a CPI of 1 will be around 2
  • 14 cycles penalty for L1 demand miss
• Average instruction size above 6 bytes
  • Happens typically with SSE code and more with EM64T
  • Can have impact only in case of otherwise excellent CPI
• Code with length changing prefix issues (LCP)
  • Penalty of 6 cycles or more
  • Look at ILD_STALL VTune event
Front-End should not be the bottleneck. Focus on Front End issues only if it is the issue.
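The first rule of thumb above (poor CPI together with a low RS_FULL share) can be written as a small check over raw counter values. This is a sketch with hypothetical names: it assumes RESOURCE_STALLS.RS_FULL is reported as a cycle count, and the "poor CPI" threshold is left to the caller because the slide gives no number for it.

```c
#include <stdint.h>

/* Raw counter values collected over one measurement interval. */
typedef struct {
    uint64_t cycles;          /* unhalted core cycles */
    uint64_t instructions;    /* instructions retired */
    uint64_t rs_full_cycles;  /* RESOURCE_STALLS.RS_FULL, assumed to count cycles */
} counters_t;

/* Cycles per instruction; 0.0 if no instructions were retired. */
double cpi(const counters_t *c)
{
    return c->instructions ? (double)c->cycles / (double)c->instructions : 0.0;
}

/* Fraction of cycles in which the RS was full; 0.0 if no cycles were counted. */
double rs_full_ratio(const counters_t *c)
{
    return c->cycles ? (double)c->rs_full_cycles / (double)c->cycles : 0.0;
}

/* Slide heuristic: poor CPI while the RS is full less than 30% of the time
   suggests a front end issue. poor_cpi is a caller-chosen threshold. */
int looks_front_end_bound(const counters_t *c, double poor_cpi)
{
    return cpi(c) > poor_cpi && rs_full_ratio(c) < 0.30;
}
```

A run with a CPI of 2.0 and the RS full 5% of the time would be flagged for a poor-CPI threshold of 1.0, whereas the same CPI with the RS full half of the time would not.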
Execution micro architecture
The busiest port may determine the potential execution speed
Single clock latency operations are best
• Different latency operations can create writeback conflicts, creating a bubble in the port
Look at the dependency chains to see the potential parallelism
• Remember that the RS has only 32 entries and only those instructions are candidates for scheduling to the execution ports
• High RESOURCE_STALLS.RS_FULL percentage if the code is latency bound
• The ROB has 96 entries
• High RESOURCE_STALLS.ROB_FULL percentage only if
  • Code has long latency instructions (for example, L2 misses)
  • Other code can be executed while waiting
Execution stage: The key to good performance. Focus on port utilization and dependency chains
Execution micro architecture
The Divider is a big potential stall source
• DIV for the number of divide operations executed
• IDLE_DURING_DIV for the number of cycles with no port issue while the divider is busy
• Try to find some useful work to do in parallel with divide operations
Extra cycle latency for bypass between execution domains
• For example: FP (ADDPS) and logical ops (PAND) on XMMn
• DELAYED_BYPASS.FP
• DELAYED_BYPASS.LOAD
• DELAYED_BYPASS.SIMD
[Figure: EXE on ports 0,1,5 (SIMD Integer, Integer, Floating Point, integer / SIMD MUL); Data Cache Unit (dtlb, memory ordering, store forwarding) on ports 2 load, 3 store (address), 4 store (data)]
Enhancements and Optimization Opportunities
IP Prefetcher
• Prefetches stride loads associated with the same IP
  • Uses History table
  • Use VTune events to identify misses where prefetches were expected
Memory Disambiguation
• Predicts when it is OK to fire a load before preceding stores with unknown address
  • Misprediction triggers a pipeline flush and load restart
  • Disambiguation is temporarily disabled if it frequently fails
• LOAD_BLOCK.STA counts loads blocked by a preceding store with unknown address
  • In case not to the same address: Possible reasons for not working: Address collision with other load(s)
Other Opportunities for Performance Gain in the memory sub-system
4k Aliasing
• OOO engine can fire a Load before a preceding Store if it does not collide with the Store’s address
  • Address collision serializes execution
• Address checking uses only the last 12 bits (4K)
  • False blocking - if the Load’s & Store’s addresses differ by a multiple of 4KB
  • e.g. accessing large, power of two, sized arrays in a loop
• Resolve 4K aliasing conflicts by changing memory layout
  • VTune event LOAD_BLOCK.OVERLAP_STORE
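The 12-bit address check described above can be mimicked in software to spot layouts that are likely to alias. The sketch below is a simplification with a hypothetical function name: it compares only the low 12 bits of two addresses and ignores access sizes and partial overlaps, which the hardware also has to consider.

```c
#include <stdint.h>

/* Returns 1 if two different addresses share the same low 12 bits,
   i.e. they differ by a non-zero multiple of 4 KB and could be
   falsely treated as colliding by a 12-bit address comparison. */
int is_4k_alias(const void *load_addr, const void *store_addr)
{
    uintptr_t l = (uintptr_t)load_addr;
    uintptr_t s = (uintptr_t)store_addr;
    return l != s && ((l ^ s) & 0xFFFu) == 0;
}
```

Two buffers placed exactly 4096 bytes apart alias element by element; shifting one of them by a small pad (64 bytes, say) is one way to change the memory layout so that the low 12 bits no longer match.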
Load block cases
• Increase the distance between the store and the dependent load, so that the store data/address is known at the time the load is dispatched
  • Store address unknown - LOAD_BLOCK.STA
    • Loads blocked by a preceding store with unknown address
  • Store data unknown - LOAD_BLOCK.STD
    • Loads blocked by a preceding store with unknown data
• Loads blocked until retirement - LOAD_BLOCK.UNTIL_RETIRE
  • This includes mainly uncacheable loads and split loads (loads that cross the cache line boundary)
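Whether a given access is a split load can be checked from its address and size. The following sketch uses a hypothetical helper name and assumes a 64-byte cache line, a value that is not stated on the slide.

```c
#include <stddef.h>
#include <stdint.h>

enum { CACHE_LINE_BYTES = 64 };  /* assumed cache line size */

/* Returns 1 if an access of `size` bytes starting at `addr` touches
   two different cache lines (a split access), 0 otherwise. */
int crosses_cache_line(const void *addr, size_t size)
{
    uintptr_t first = (uintptr_t)addr;
    uintptr_t last;

    if (size == 0)
        return 0;  /* an empty access touches no line */
    last = first + size - 1;
    return (first / CACHE_LINE_BYTES) != (last / CACHE_LINE_BYTES);
}
```

Under that assumption, a 16-byte load starting at byte offset 60 of a line is split, while the same load at offset 48 or at offset 0 of the next line is not.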