Intel Processor Architecture-core

  • Uploaded by: jinish.K.G
  • July 2020

Intel® Core™ Microarchitecture Intel® Software College

Objectives

After completion of this module you will be able to describe:
  • Components of an IA processor
  • Working flow of the instruction pipeline
  • Notable features of the architecture

Intel® Processor Micro-architecture - Core® microarchitecture 2 Copyright © 2006, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Agenda
  • Introduction
  • Knowledge preparation
  • Notable features
  • Micro-architecture tour
  • Coding considerations

Industrial Recognition

  • PC Format, May 2006: “Intel Strikes Back! Conroe is the name. Pistol-whipping Athlon 64s into burger meat is the game.”
  • “Intel's Next Generation Microarchitecture Unveiled”, Real World Tech: “Just as important as the technical innovations in Core MPUs, this microarchitecture will have a profound impact on the industry.”
  • “Intel Dishes the Knockout Punch to AMD with Conroe”, GD Hardware.com: “…the results were far more than we could hope for and it'll be amusing to see AMD's response to this beat-down session”
  • “Intel Regains Performance Crown”, Anandtech: “… At 2.8 or 3.0GHz, a Conroe EE would offer even stronger performance than what we’ve seen here.”
  • “Intel Reveals Conroe Architecture”, Extremetech: “… And not only was the Intel system running at 2.66GHz, a slower clock rate than the top Pentium 4, it was outpacing an overclocked Athlon 64 FX-60. Wrap your brain around that idea for a bit…”
  • “Conroe Benchmarks - Intel Showing Big Strength”, Hot Hardware.com: “… Intel is poised to change the face of the desktop computing landscape…”

Performance Summary

Intel® Core™ Microarchitecture dramatically boosts Intel platform performance
  • Conroe & Woodcrest drive clear Desktop/Server performance leadership
  • Merom extends Intel Mobile performance leadership

Intel® Core™ Microarchitecture-based platforms set the bar in Performance and Energy Efficiency for the Multi-Core era
  • Intel’s 3rd generation dual-core (while competition stuck on 1st generation)
  • New Intel high-performance ‘engine’: Wider, Smarter, Faster, More Efficient

Best Processor on the Planet: Energy-Efficient Performance
  • 20% (Merom), 40% (Conroe), 80% (Woodcrest) Performance Boosts [1]

The “Core™ Effect”: Intel® Core™ Microarchitecture ramp fuels broad roadmap accelerations

[1] Based on SPECint*_rate_base2000

Agenda: Knowledge preparation
  • Architecture vs. Microarchitecture
  • CISC vs. RISC
  • Performance Measurements
  • Pipeline Design
  • Power and Energy
  • Chip Multi-Processing

Architecture and Micro-architecture

What is Computer Architecture?
  • Architecture is the set of features which are externally visible:
      • Instruction set
      • Registers
      • Addressing modes
      • Bus protocols

Intel Architectures (IA)
  • IA32/X86 (8-bit, 16-bit and 32-bit integer architecture)
      • X87 (floating-point extension)
      • MMX (multimedia extension)
      • SSE, SSE2, SSE3 (Streaming SIMD Extensions)
  • Intel® 64/EM64T (64-bit integer extension of IA32)
  • IA64 (Intel's new 64-bit architecture)
      • Itanium/Itanium 2 processor family

Architecture and Micro-architecture (cont.)

What is Micro-architecture?
  • Same as µ-Architecture or u-Architecture
  • “Invisible” features that provide meaningful value to the end user (whatever makes you buy a new compatible PC)
      • Programs run faster → improved performance
      • Reduced power consumption → extended battery life
      • H/W fits into a smaller form factor

Intel® Architecture History

Architecture: instruction set definition and compatibility
  • Examples: EPIC* (Itanium®), IA-32, IXA* (XScale)

Microarchitecture: hardware implementation maintaining instruction set compatibility with the high-level architecture
  • Examples: P5, P6, Intel NetBurst®, Banias

Processors: productized implementation of a microarchitecture
  • P5: Pentium®
  • P6: Pentium® Pro, Pentium® II/III
  • Intel NetBurst®: Pentium® 4, Pentium® D, Xeon®
  • Banias: Pentium® M

* IXA – Intel Internet Exchange Architecture; EPIC – Explicitly Parallel Instruction Computing

Intel® Core™ Microarchitecture Processors

Intel® NetBurst® + Mobile Microarchitecture + New Innovations → Intel® Core™ 2 Duo/Quad/Extreme processors

RISC Approach to CPU design

(RISC = Reduced Instruction Set Computers)

Optimize H/W for common basic operations
  • Fixed instruction length
      • Shorter execution pipeline
      • Ease of instruction-level parallelism
  • Large number of registers
      • Fewer memory accesses
  • ‘Load/Store’ architecture
      • Shorter execution pipeline
      • Ease of advancing loads
  • Branch hints
      • Reduce pipeline flush events
  • ‘Exotic’ stuff to be implemented in S/W with minimal H/W support
      • No ‘complex’ H/W instructions
      • Handle exceptional conditions in S/W

Examples: MIPS, IBM Power and PowerPC, Sun Sparc

Achieve maximum performance by the right partitioning between H/W and S/W

CISC Approach to CPU design

(CISC = Complex Instruction Set Computers)

Rich architecture
  • Variable length instructions
  • Complex addressing modes

On-chip H/W / S/W partitioning required
  • H/W keeps executing ‘simple’ stuff
  • Complex instructions are ‘emulated’ using u-code routines from ROM
  • More instructions treated as ‘simple’ as more H/W is available

COMPATIBILITY has some major advantages:
  • Large (and forever increasing) software base
  • Code development tools
  • Expertise
  • H/W - S/W spiral

Example: Intel IA32, Motorola 680X0

Maximize information passed to the H/W

Performance Measurement

Performance is the reciprocal of the “Time of execution”:

$$\text{Performance} \approx \frac{1}{\text{Time of Execution}} = \frac{1}{L \cdot CPI \cdot T_c}$$

Where:
  • L = Code Length (# of machine instructions)
  • CPI = Clock cycles Per Instruction
  • Tc = Clock period (nSecs)

Substitute IPC = Instructions Per Cycle = 1/CPI and F = Frequency = 1/Tc:

$$\text{Performance} \approx \frac{IPC \cdot F}{L}$$

  • Improve ILP → higher IPC
  • Improve Timing → higher F
  • Arch Enhancements → shorter L
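The relation above can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names and the workload figures (one million instructions, the IPC and frequency values) are hypothetical and chosen only to show that a lower-clocked design with higher IPC can still come out ahead.

```python
def execution_time_ns(code_length, cpi, clock_period_ns):
    """Time of execution = L * CPI * Tc."""
    return code_length * cpi * clock_period_ns


def performance(code_length, ipc, frequency_ghz):
    """Performance ~ IPC * F / L (program runs per nanosecond when F is in GHz)."""
    return ipc * frequency_ghz / code_length


# Hypothetical workload of one million machine instructions.
L = 1_000_000

# Hypothetical design points: a fast narrow core and a slower, wider core.
baseline = performance(L, ipc=1.0, frequency_ghz=3.0)
wider = performance(L, ipc=1.5, frequency_ghz=2.5)

# Ratio of the two performance figures; L cancels out.
speedup = wider / baseline
print(f"speedup of the wider core: {speedup:.2f}x")
```

Because L is the same program on both design points, only the product IPC * F matters for the comparison.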

Performance Measurement (cont.)

Performance considerations:
  • Which code/application to run?
  • Which OS?
  • Which other components in the platform?
  • Under which thermal conditions?
  • Multithreading? Multiprocessing?

Benchmark examples
  • Industry Standard
      • Spec (ISPEC, FSPEC)
      • TPC
      • Specific industries use specific benchmarks
  • Commercial
      • SysMark
      • MobileMark
      • PCMark
      • Sandra
      • ScienceMark
  • Applications
      • Video (Windows Media encoder, DivX)
      • Audio (Lame MP3)
      • Compression (RAR)
      • Content creation (3DSM, Photoshop, Premiere)
      • Latest games (Doom III, FarCry, but changes fast)
      • Linux compilation, POVRay, LinPack, lmbench

Design Considerations for Different Market Segments

Constraints:
  • Thermally, area constrained → Desktop
  • Unconstrained → Extreme
  • Very area constrained → Value
  • Thermally, energy and area constrained → Mobile
  • Thermally, energy constrained → Servers

Micro-architecture is the Art of Tradeoffs between:
  • Schedule
  • Requirements / Standards
  • Performance
  • Features
  • Power / Energy
  • Area / Cost

Design Metrics

IPC = Instructions per Cycle
  • The more the better

Latency (same as Response Time)
  • The time interval between when any request for data is made and when the data transfer completes
  • The less the better

Throughput
  • The amount of work completed by the system per unit of time
  • Measured in ops/sec
  • The more the better

CPU Pipeline

Break the work into smaller pieces
  • Four basic stages of instruction life
      • Fetch - bring instruction to core
      • Decode - read operands from register
      • Execute - perform the operation
      • Writeback - save result to register
  • Execution timing of simple instructions (legend: “op src1,src2 → dst”)

    add eax, ebx → eax    F D E W
    sub ecx, edx → ecx      F D E W

Increased throughput
  • Increased number of completed instructions per cycle

Pipeline Design - Explore Parallelism

A new instruction does not always depend on the previous one
  • Can start a new instruction before the previous one is finished
  • ...if different stages use different H/W resources

Run instructions in parallel (pipeline)

    add eax, ebx → eax    F D E W
    sub ecx, edx → ecx      F D E W
    or  edi, esi → edi        F D E W

Need to balance pipe stages
  • Each stage should take the same time for best throughput and utilization
  • Clock cycle is determined by the longest path!

[Figure: four overlapped Fetch / Decode / Exec / WB sequences, each offset by one stage]
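The overlap in the timing diagram can be reproduced with a few lines of code. This is an illustrative sketch only: it assumes an ideal four-stage pipeline with no stalls, one instruction entering Fetch per cycle, and the helper name `pipeline_diagram` is made up for this example.

```python
STAGES = ["F", "D", "E", "W"]  # Fetch, Decode, Execute, Writeback


def pipeline_diagram(instructions):
    """Return (rows, total_cycles) for an ideal pipeline without stalls.

    Instruction i enters Fetch in cycle i, so its stages occupy
    cycles i .. i + len(STAGES) - 1.
    """
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instructions):
        cells = ["."] * total_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage
        rows.append((name, "".join(cells)))
    return rows, total_cycles


rows, cycles = pipeline_diagram(["add", "sub", "or"])
for name, row in rows:
    print(f"{name:4s}{row}")

# Without overlap, every instruction would occupy all four stages alone.
unpipelined_cycles = 3 * len(STAGES)
print(f"pipelined: {cycles} cycles, unpipelined: {unpipelined_cycles} cycles")
```

Each printed row is shifted one cycle to the right of the previous one, which is the staircase pattern drawn on the slide.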

Pipeline Design – Fighting Stalls

Data flow dependency (instructions output/input)
  • Solved by bypasses, renaming etc.

Control flow dependencies
  • Solved by branch prediction

Others (cache misses, long latency instructions)
  • Solved by other dynamic scheduling techniques

Race of CISC vs. RISC

In modern CPUs, advanced µ-Architecture techniques minimize the advantages of RISC over CISC
  • Branch Prediction
      • Reduces the effect of extra pipeline stages
  • Register Renaming
      • Effectively increases the number of registers
  • Out Of Order
      • Reduces the number of stalls caused by shortage of registers
  • Speculative Execution
      • Further reduces the number of stalls
  • Power saving features
      • Reduce the overhead when not needed

µop – Intel’s Take on the CISC/RISC Race

(CISC) instructions are translated into one or more (RISC) uops (micro-operations)
  • Fixed format
  • Wide and simple
  • Temp registers

Usually one uop per instruction
Complex instructions can be thousands of uops
Stores are divided into two uops (STA and STD)
Fusion plays games here

Power and Energy

Maximum power (TDP):
  • → Cooling requirements
  • → Cooling solution
  • → Computer form factor and acoustic noise

Average power
  • → Battery life
  • → Electricity bill

General calculation:
  • P = frequency * voltage^2 * activity factor * capacitance + leakage

Reducing TDP
  • Fewer transistors and wires
  • Smaller transistors and wires
  • Power features → less activity
  • Low leakage transistors

Reducing average power
  • Energy efficiency
  • Power states
  • Lower leakage
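The general calculation explains why lowering voltage together with frequency is so effective: the dynamic term scales with frequency times voltage squared. The sketch below is illustrative only; the function name and every numeric value (frequency, voltage, activity factor, capacitance, leakage) are hypothetical and not taken from any Intel datasheet.

```python
def power_watts(frequency_hz, voltage_v, activity, capacitance_f, leakage_w):
    """P = frequency * voltage^2 * activity factor * capacitance + leakage."""
    return frequency_hz * voltage_v ** 2 * activity * capacitance_f + leakage_w


# Hypothetical operating point.
LEAKAGE_W = 10.0
base = power_watts(3.0e9, 1.20, 0.2, 5e-8, LEAKAGE_W)

# Same chip with frequency and voltage both lowered by 10%.
scaled = power_watts(2.7e9, 1.08, 0.2, 5e-8, LEAKAGE_W)

# Ratio of the dynamic (non-leakage) parts: 0.9 * 0.9^2 = 0.729.
dynamic_ratio = (scaled - LEAKAGE_W) / (base - LEAKAGE_W)
print(f"base: {base:.1f} W, scaled: {scaled:.1f} W, dynamic ratio: {dynamic_ratio:.3f}")
```

In this simplified model leakage is held constant, so the total saving is smaller than the saving in the dynamic term alone.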

Dual/Multi Core and SMT

Put more than one core per package

Architectural change:
  • Software must be multi-threaded or multi-process
  • …but backward compatible with multiprocessor systems (MP)

Several ways of implementing it
  • All of them being used

[Figure: alternative package layouts built from Core, LLC, and I/O blocks]

SMT: Run two (or more) threads on the same core, simultaneously

Intel Approach

[Figure: timeline of Intel processors and their thread counts: Intel® Pentium® (Q4 2000, 1 thread), Intel® Pentium® with HT (Q2 2003, 2 threads), Intel® Pentium® D (Q2 2005, 2 threads), Intel® Core™ 2 Duo (Q3 2006, 2 threads), Intel® QX6700* (Q4 2006, 4 threads), and a future “?” part with 80 threads. Legend: State, Execution Units, Cache, Bus.]

While single core performance has increased due to clock speed, increased cache and improved ILP, the biggest performance increases have come from thread level parallelism.

An “Acronym Cheat Sheet” of Parallel Computing

  • CMP: Chip Multi Processor (two or more cores per package)
      • Dual Core: two cores in same package
      • Quad Core: four cores in same package
  • DP: Dual Processor (two packages)
  • MP: Multi Processor (four or more packages)
  • SMT: Simultaneous Multi Threading (virtual multi core: HyperThreading)

Agenda: Notable features
  • Wide Dynamic Execution
  • Smart Memory Access
  • Advanced Smart Cache
  • Advanced Digital Media Boost
  • Intelligent Power Capability

Intel® Core® Micro-architecture Notable Features

Intel® Wide Dynamic Execution
  • 14-stage efficient pipeline
  • Wider execution path
  • Advanced branch prediction
  • Macro-fusion
      • Roughly 15% of all instructions are conditional branches
      • Macro-fusion fuses a comparison and jump to reduce micro-ops running down the pipeline
  • Micro-fusion
      • Merges the load and operation micro-ops into one fused micro-op
  • 64-Bit Support
      • Merom, Conroe, and Woodcrest support EM64T

[Figure: block diagram. Blocks shown: Instruction Fetch and PreDecode; Instruction Queue; Decode with uCode ROM; Rename/Alloc; Retirement Unit (ReOrder Buffer); Schedulers; three execution ports (ALU / Branch / MMX/SSE / FPmove, ALU / FAdd / MMX/SSE / FPmove, ALU / FMul / MMX/SSE / FPmove); Load; Store; L1 D-Cache and D-TLB; 2M/4M shared L2 Cache; up to 10.4 Gb/s FSB. Path widths marked 5, 4 and 4.]

Intel® Core® Micro-architecture Notable Features (cont.)

Intel® Advanced Memory Access
  • Improved prefetching
  • Memory disambiguation
      • Advance load before a possible data dependency (pointer conflict)
      • Earlier loads hide memory latencies

Intel® Core® Micro-architecture Notable Features (cont.)

Intel® Advanced Smart Cache
  • Multi-core optimization
      • Shared between the two cores
      • Advanced Transfer Cache architecture
      • Reduced bus traffic
      • Both cores have full access to the entire cache
      • Dynamic Cache sizing

Intel® Core® Micro-architecture Notable Features (cont.)

Advantages of Shared Cache
  • Separate L2 caches: shipping an L2 cache line between CPU1 and CPU2 takes ~half an access to memory
  • L2 is shared: no need to ship the cache line

[Figure: CPU1, CPU2, the cache line, the Front Side Bus (FSB) and Memory, drawn once for each case]

Intel® Core® Micro-architecture Notable Features (cont.)

Intel® Advanced Digital Media Boost
  • Single Cycle SIMD Operation
      • 8 Single Precision Flops/cycle
      • 4 Double Precision Flops/cycle
  • Wide Operations
      • 128-bit packed Add
      • 128-bit packed Multiply
      • 128-bit packed Load
      • 128-bit packed Store
  • Support for Intel® EM64T instructions

SIMD Operation (SSE/SSE2/SSE3/SSSE3)

    bit 127                          bit 0
    SOURCE    X4      X3      X2      X1
              Y4      Y3      Y2      Y1
                    SSE/2/3 OP
    DEST      X4opY4  X3opY3  X2opY2  X1opY1

    Core™ µarch:  CLOCK CYCLE 1   X4opY4  X3opY3  X2opY2  X1opY1
    Previous:     CLOCK CYCLE 1   X2opY2  X1opY1
                  CLOCK CYCLE 2   X4opY4  X3opY3
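The lane-by-lane behaviour in the figure can be modelled in software. The sketch below is an illustration, not Intel code: `packed_op` and `cycles_needed` are hypothetical helpers, and the 64-bit datapath used for the "previous" case is an assumption read off the figure, where the 128-bit result is produced in two halves.

```python
import operator


def packed_op(op, x, y):
    """Apply op to each of four lanes, as one packed 128-bit operation
    on four 32-bit elements would (X1..X4 with Y1..Y4)."""
    if len(x) != 4 or len(y) != 4:
        raise ValueError("expected four 32-bit lanes per operand")
    return [op(a, b) for a, b in zip(x, y)]


def cycles_needed(datapath_bits, vector_bits=128):
    """Cycles to push one vector through a datapath of the given width."""
    return -(-vector_bits // datapath_bits)  # ceiling division


x = [1, 2, 3, 4]       # X1..X4
y = [10, 20, 30, 40]   # Y1..Y4
dest = packed_op(operator.add, x, y)

core_cycles = cycles_needed(128)      # full-width datapath
previous_cycles = cycles_needed(64)   # assumed half-width datapath
print(dest, core_cycles, previous_cycles)
```

Halving the cycle count per packed operation is what doubles the peak flops per cycle quoted in the bullet list.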

Intel® Core® Micro-architecture Notable Features (cont.)

Intel® Advanced Digital Media Boost
  • Additional Media Instructions - Supplemental Streaming SIMD Extensions 3 (SSSE3)
      • 16 new packed integer instructions
      • Targeting video encode/decode
  • Significantly improved strings
      • REP MOVS and REP STOS
      • ~8 bytes / cycle throughput (mileage may vary)

Intel® Core® Micro-architecture Notable Features (cont.)

Intel® Advanced Digital Media Boost
  • Supplemental SSE-3 (SSSE-3)

    Horizontal Addition/Subtraction                  PHADDW, PHADDSW, PHADDD, PHSUBW, PHSUBSW, PHSUBD
    Packed Absolute Values                           PABSB, PABSW, PABSD
    Multiply and Add Packed Signed/Unsigned Bytes    PMADDUBSW
    Packed Multiply High with Round and Scale        PMULHRSW
    Packed Shuffle Bytes                             PSHUFB
    Packed Sign                                      PSIGNB/W/D
    Packed Align Right                               PALIGNR
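To make two of the instruction names above concrete, here is a small software model of the 128-bit forms of PHADDW and PSHUFB. It is a sketch for illustration: registers are represented as Python lists (eight 16-bit words or sixteen bytes, element 0 being the least significant), and the function names simply mirror the mnemonics.

```python
def phaddw(dst, src):
    """Horizontal add of adjacent 16-bit words.

    The low four result words are the pair sums of dst, the high four
    are the pair sums of src; sums wrap around at 16 bits.
    """
    def pair_sums(v):
        return [(v[i] + v[i + 1]) & 0xFFFF for i in range(0, 8, 2)]

    return pair_sums(dst) + pair_sums(src)


def pshufb(dst, mask):
    """Byte shuffle: result byte i is dst[mask[i] & 0x0F], or zero
    when the top bit of mask[i] is set."""
    return [0 if m & 0x80 else dst[m & 0x0F] for m in mask]


words = phaddw([1, 2, 3, 4, 5, 6, 7, 8],
               [10, 20, 30, 40, 50, 60, 70, 80])

data = list(range(16))
reversed_bytes = pshufb(data, list(range(15, -1, -1)))  # reverse byte order
cleared = pshufb(data, [0x80] * 16)                     # top bit set: all zero
print(words, reversed_bytes, cleared)
```

A byte reversal like the one shown is a typical PSHUFB use, for example when converting between byte orders.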

Intel® Core® Micro-architecture Notable Features (cont.)

Intelligent Power Capability
  • Advanced power gating & dynamic power coordination
      • Multi-point demand-based switching
      • Voltage-Frequency switching separation
      • Supports transitions to deeper sleep modes
      • Event blocking
      • Clock partitioning and recovery
      • Dynamic Bus Parking
      • During periods of high performance execution, many parts of the chip core can be shut off

Agenda: Micro-architecture tour
  • Front End
  • Out-Of-Order Execution Core
  • Memory Sub-system

Intel® Core® Micro-architecture Drill-down

[Figure: block diagram. Blocks shown: icache, branch prediction unit, predecode, instruction queue, instruction decode, MS, register alias table, ALLOC, Reservation Station, Re-Order Buffer, execution units (integer / FP / SIMD, 3x), load, store address, store data, memory order buffer, data cache unit, page miss handler.]


Core® Micro-architecture Front End

Prepares instructions before they are executed
  • Instruction Fetch Unit
  • Instruction Queue
  • Instruction Decode Unit
  • Branch Prediction Unit

[Figure: front-end blocks: icache, branch prediction unit, predecode, instruction queue, instruction decode, MS]

Intel® Core™ Microarchitecture – Front End

Instruction Queue

Buffer between instruction pre-decode unit and decoder
  • Up to six predecoded instructions written per cycle
  • 18 instructions contained in IQ
  • Up to 5 instructions read from IQ

Potential loop cache

Loop Stream Detector (LSD) support
  • Re-use of decoded instructions
  • Potential power saving
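The three numbers on this slide (six written, 18 held, five read) describe a bounded buffer between two stages of different width. The sketch below models only that buffering behaviour; the function name, the read-before-write ordering within a cycle, and the assumption that both sides always run at their maximum rate are simplifications introduced here, not details from the slides.

```python
from collections import deque

IQ_CAPACITY = 18  # instructions held in the queue
MAX_WRITE = 6     # predecoded instructions written per cycle
MAX_READ = 5      # instructions read by the decoders per cycle


def simulate_iq(total_instructions):
    """Return (cycles, max_occupancy) for streaming instructions through the IQ.

    Assumption: each cycle the decoders read first, then predecode writes.
    """
    queue = deque()
    pending = total_instructions
    delivered = 0
    cycles = 0
    max_occupancy = 0
    while delivered < total_instructions:
        reads = min(MAX_READ, len(queue))
        for _ in range(reads):
            queue.popleft()
        delivered += reads

        writes = min(MAX_WRITE, pending, IQ_CAPACITY - len(queue))
        queue.extend(range(writes))
        pending -= writes

        max_occupancy = max(max_occupancy, len(queue))
        cycles += 1
    return cycles, max_occupancy


print(simulate_iq(60))
print(simulate_iq(200))
```

Because the writer is one instruction per cycle faster than the reader, the queue fills gradually until the 18-entry limit throttles predecode to the decoders' rate.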

Intel® Core™ Microarchitecture – Front End

Macro-Fusion
  • Roughly 15% of all instructions are conditional branches.
  • Macro-fusion merges two instructions into a single micro-op, as if the two instructions were a single long instruction.
  • Enhanced Arithmetic Logic Unit (ALU) for macro-fusion. Each macro-fused instruction executes with a single dispatch.
  • Not supported in EM64T long mode

[Figure: the fused micro-op “cmpjae eax, [mem], label” passes from the Scheduler to Execution and Branch Eval, then sends flags and target to Write back.]

Intel® Software College

Intel® Core™ Microarchitecture – Front End

Macro-Fusion Absent

Enabling example:

for (int i=0; i<100000; i++) { … }

Instruction Queue:

addps xmm0, [EAX+16]
mulps xmm0, xmm0
movps [EAX+240], xmm0
cmp eax, 100000
jge label

• Read four instructions from the Instruction Queue
• Each instruction gets decoded into separate uops

Cycle 1: addps xmm0, [EAX+16] (dec0), mulps xmm0, xmm0 (dec1), movps [EAX+240], xmm0 (dec2), cmp eax, 100000 (dec3)
Cycle 2: jge label (dec0)


Macro-Fusion Present

Enabling example:

for (unsigned int i=0; i<100000; i++) { … }

Instruction Queue:

addps xmm0, [EAX+16]
mulps xmm0, xmm0
movps [EAX+240], xmm0
cmp eax, 100000
jae label

• Read five instructions from the Instruction Queue
• Send fusable pair to a single decoder
• Single uop represents two instructions

Cycle 1: addps xmm0, [EAX+16] (dec0), mulps xmm0, xmm0 (dec1), movps [EAX+240], xmm0 (dec2), cmpjae eax, 100000, label (dec3)
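The two enabling examples differ only in the signedness of the loop counter. A minimal C sketch of that contrast follows; the function names and the loop body are hypothetical, and whether a given compiler really emits cmp + jae for the unsigned loop (or cmp + jge for the signed one) depends on the compiler and its flags, so the generated assembly should be inspected before relying on it.

```c
#include <stddef.h>

/* Unsigned counter: the loop test is an unsigned compare, which compilers
 * typically lower to cmp followed by an unsigned conditional jump (jae/jb).
 * The slides show this pair as fusable. */
float scale_sum_unsigned(const float *data, unsigned int n) {
    float sum = 0.0f;
    for (unsigned int i = 0; i < n; i++) {
        sum += data[i] * 2.0f;
    }
    return sum;
}

/* Signed counter: the loop test is a signed compare, typically lowered to
 * cmp followed by jge/jl. The slides show this pair decoded separately. */
float scale_sum_signed(const float *data, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += data[i] * 2.0f;
    }
    return sum;
}
```

Both functions compute the same result; only the instruction pair at the loop back-edge is expected to differ.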


Instruction Decode / Micro-Op Fusion

Frequent pairs of micro-operations derived from the same macro instruction can be fused into a single micro-operation.

Micro-op fusion effectively widens the pipeline.


Instruction Decode / Micro-Op Fusion (cont.)

u-ops of a store “movps [EAX+240], xmm0”:

sta eax+240 (store address)
std xmm0, [eax+240] (store data)

Fused into a single u-op:

st xmm0, [eax+240]


Branch Prediction Improvements

Intel® Pentium® 4 processor branch prediction plus the following two improvements:
• Indirect Branch Predictor
• Loop Detector

Branch mispredictions reduced by more than 20%


Agenda

Introduction
Knowledge preparation
Notable features
Micro-architecture tour
• Front End
• Out-Of-Order Execution Core
• Memory Sub-system
Coding considerations


Core® Micro-architecture Execution Core

Accepts decoded u-ops, assigns resources, executes and retires u-ops
• Renamer
• Reservation station (RS)
• Issue ports
• Execution Unit

[Figure: register alias table and ALLOC feed the Reservation Station and Re-Order Buffer; issue ports for integer, FP, SIMD (3x), load, store address, and store data]


Intel® Core™ Microarchitecture – Execution Core

Execution Core Building Blocks

• Renamer
• RS
• ROB
• Execution Unit

Ports (number)
• 0, 1, 5: Integer
• 0, 1, 5: SIMD Integer (SIMD/Integer MUL)
• 0, 1, 5: Floating Point
• 2: Load
• 3, 4: Store

[Figure: execution units connected to the Memory Sub-system]


Issue Ports and Execution Units

6 dispatch ports from RS
• 3 execution ports (shared for integer / fp / simd)
• load
• store (address)
• store (data)

128-bit SSE implementation
• Port 0 has packed multiply (4 cycles SP, 5 cycles DP, pipelined)
• Port 1 has packed add (3 cycles, all precisions)


Retirement Unit

ReOrder Buffer (ROB)
• Holds micro-ops in various stages of completion
• Buffers completed micro-ops
• Updates the architectural state in order
• Manages ordering of exceptions

[Figure: register alias table, ALLOC, Reservation Station, Re-Order Buffer]


Core® Micro-architecture Memory Sub-system

Memory Ordering Buffer
• Store Address Buffer (SAB)
  • Stores the address of each store not yet performed
  • Each load compares its address to any store older than itself
  • If it finds a hole…
• Store Data Buffer
  • Stores the data of each store not yet performed
  • If a load hits in the SAB, the data is forwarded from here
• Load Buffer
  • Stores the addresses of non-retired loads
  • For snoops and re-dispatch
• One 128-bit load and one 128-bit store per cycle to different memory locations
• Out-of-order memory operations


Intel® Core™ Microarchitecture – Memory Sub-system

Core® Micro-architecture Memory Sub-system (cont.)

32 KB D-Cache (8-way, 64-byte line size)

Shared second-level (L2) instruction and data cache: 2 MB 8-way or 4 MB 16-way

Cache-to-cache transfer
• Improves producer / consumer style MP

Wider interface to L2
• Reduced interference
• Processor line fill is 2 cycles

Higher bandwidth from the L2 cache to the core
• ~14 clock latency and 2 clock throughput

Load & store access order
1. L1 cache of immediate core
2. L1 cache of the other core
3. L2 cache
4. Memory

[Figure: Core1 and Core2 share a 2 MB L2 cache connected to the bus]


Advanced Memory Access / Enhanced Data Pre-fetch Logic

Speculates on the next needed data and loads it into the cache by HW and/or SW

[Figure: valet parking analogy: Door (L1 Cache), Valet Parking Area (L2 Cache), Main Parking Lot (External Memory)]


Advanced Memory Access / Enhanced Data Pre-fetch Logic (cont.)

• L1D cache prefetching
  • Data Cache Unit Prefetcher
    • Known as the streaming prefetcher
    • Recognizes ascending access patterns in recently loaded data
    • Prefetches the next line into the processor's cache
  • Instruction Based Stride Prefetcher
    • Prefetches based upon a load having a regular stride
    • Can prefetch forward or backward 2 Kbytes (1/2 default page size)
• L2 cache prefetching: Data Prefetch Logic (DPL)
  • Prefetches data to the 2nd level cache before the DCU requests the data
  • Maintains 2 tables for tracking loads
    • Upstream – 16 entries
    • Downstream – 4 entries
  • Every load is either found in the DPL or generates a new entry
  • Upon recognition of the 2nd load of a “stream” the DPL will prefetch the next load


Advanced Memory Access / Memory Disambiguation

Memory Disambiguation predictor
• Loads that are predicted NOT to forward from a preceding store are allowed to schedule as early as possible
• Increases the performance of OOO memory pipelines

Disambiguated loads checked at retirement
• Extension to existing coherency mechanism
• Invisible to software and system


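Advanced Memory Access / Memory Disambiguation Absent

Load4 must WAIT until previous stores complete

Order of memory operations:
1. Store1 to address Y
2. Load2 from address Y
3. Store3 to address W
4. Load4 from address X

[Figure: memory holding Data W, Data Z, Data Y, Data X]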


Advanced Memory Access / Memory Disambiguation Present

Loads can decouple from stores: Load4 can get its data WITHOUT waiting for stores

Order of memory operations:
1. Load4 from address X
2. Store1 to address Y
3. Load2 from address Y
4. Store3 to address W

[Figure: memory holding Data W, Data Z, Data Y, Data X]


Advanced Memory Access / Store Forwarding

If a load follows a store and reloads the data that the store writes to memory, the micro-architecture can forward the data directly from the store to the load.

[Figure: Store1 writes Data Y to address Y; Load2 from address Y receives Data Y from the internal buffers rather than from memory]
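The store-then-reload pattern described above can be sketched in C. This is only an illustration under stated assumptions: the function names are hypothetical, `volatile` is used merely to keep the compiler from removing the memory accesses, and the claim that a load wider than any single preceding store cannot be forwarded is a general rule of thumb assumed here, not something stated on the slides. C code cannot observe forwarding directly; only timing measurements on real hardware would show the difference.

```c
#include <stdint.h>
#include <string.h>

/* Store 32 bits, then reload the same 32 bits from the same address.
 * The load is fully contained in the preceding store, which is the
 * case the slide describes as forwardable. */
uint32_t reload_same_size(uint32_t *slot, uint32_t value) {
    *(volatile uint32_t *)slot = value;
    return *(volatile uint32_t *)slot;
}

/* Store two 16-bit halves, then reload all 32 bits at once.
 * No single store covers the whole load, so (by assumption) the load
 * has to wait for both stores to reach the cache. */
uint32_t reload_wider(uint16_t halves[2], uint16_t first, uint16_t second) {
    volatile uint16_t *p = halves;
    p[0] = first;
    p[1] = second;
    uint32_t out;
    memcpy(&out, halves, sizeof out); /* one 32-bit read of both halves */
    return out;
}
```

Both functions return the data just written; the difference is only in how the hardware is expected to satisfy the load.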


Advanced Memory Access / Store Forwarding: Aligned Store Cases

[Figure: aligned store-forwarding cases. For each store size (16, 32, 64, and 128 bit), the figure shows the loads of equal or smaller size (8, 16, 32, 64, and 128 bit) that lie within the stored data]


Advanced Memory Access / Store Forwarding: Unaligned Cases

Note that unaligned store forwarding does not occur when the load crosses a cache line boundary.

[Figure: unaligned store-forwarding cases for 16-, 32-, and 64-bit stores, showing the 8-, 16-, 32-, and 64-bit loads within each store. Legend: store forwarded to load; no forwarding. Loads marked ‡: no forwarding if the load crosses a cache line boundary]

Note: Unaligned 128-bit stores are issued as two 64-bit stores. This provides two alignments for store forwarding.


Optimizing for Instruction Fetch and PreDecode

Avoid “Length Changing Prefixes” (LCPs)
• Affects instructions with immediate data or offset
  • Operand Size Override (66H)
  • Address Size Override (67H) [obsolete]
• LCPs change the length decoding algorithm, increasing the processing time from one cycle to six cycles (or eleven cycles when the instruction spans a 16-byte boundary)
• The REX (EM64T) prefix (4xH) is not an LCP
  • The REX prefix does lengthen the instruction by one byte, so use of the first eight general registers in EM64T is preferred


Optimizing for Instruction Queue

Includes a “Loop Stream Detector” (LSD)
• Potentially very high bandwidth instruction streaming
• A number of requirements to make use of the LSD
  • Maximum of 18 instructions in up to four 16-byte packets
  • No RET instructions (hence, little practical use for CALLs)
  • Up to four taken branches allowed
  • Most effective at 70+ iterations
• LSD is after PreDecode so there is no added cost for LCPs
• Trade off LSD with conventional loop unrolling


Optimizing for Decode

Decoder issues up to 4 uOps for renaming/allocation per clock
• This creates a trade-off between more complex instruction uOps versus multiple simple instruction uOps
  • For example, a single four uOp instruction is all that can be renamed/allocated in a single clock
  • In some cases, multiple simple instructions may be a better choice than a single complex instruction
• Single uOp instructions allow more decoder flexibility
  • For example, 4-1-1-1 can be decoded in one clock
  • However, 2-2-2-1 takes three clocks to decode
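The 4-1-1-1 versus 2-2-2-1 arithmetic can be checked with a small model. The sketch below assumes a deliberately simplified decoder: each clock, decoder 0 takes the next instruction (1 to 4 uOps) and decoders 1 to 3 each take one following instruction only if it is a single-uOp instruction, in program order. The function name `decode_clocks` is hypothetical, and the model ignores fetch bandwidth, macro-fusion, and the microcode sequencer.

```c
#include <assert.h>
#include <stddef.h>

/* Count decode clocks for a sequence of instructions, where uops[i] is the
 * number of uOps instruction i decodes into (1..4 in this model). */
unsigned decode_clocks(const unsigned *uops, size_t n) {
    unsigned clocks = 0;
    size_t i = 0;
    while (i < n) {
        /* Decoder 0 accepts any instruction of up to four uOps. */
        assert(uops[i] >= 1 && uops[i] <= 4);
        i++;
        /* Decoders 1..3 accept only single-uOp instructions, in order. */
        for (int slot = 1; slot <= 3 && i < n && uops[i] == 1; slot++) {
            i++;
        }
        clocks++;
    }
    return clocks;
}
```

Under this model the sequence 4-1-1-1 decodes in one clock, while 2-2-2-1 needs three, because every two-uOp instruction has to wait for decoder 0.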


Optimizing for Execution

Up to six uOps can be dispatched per clock
• “Store Data” and “Store Address” dispatch ports are combined on the block diagram

Up to four results can be written back per clock

Single clock latency operations are best
• Differing latency operations can create writeback conflicts
• Separate multiple-clock uOps with several single uOp instructions
  • Typical instructions here: ADC/SBB, RMW, CMOVcc
• In some cases, separating a RMW instruction into its pieces might be faster (decode and scheduling flexibility)

When equivalent, PS preferred to PD (LCP)
• For example, MOVAPS over MOVAPD, XORPS over XORPD


Optimizing for Execution (cont.)

Bypass register “access” preferred to register reads

Partial register accesses often lead to stalls
• Register size access that ‘conflicts’ with recent previous register write
• Partial XMM updates subject to dependency delays
• Partial flag stall can occur, too, at much higher cost
  • Use TEST instruction between shift and conditional to prevent
• Common zeroing instructions (e.g., XOR reg,reg) don’t stall

Avoid bypass between execution domains
• For example: FP (ADDPS) and logical ops (PAND) on XMMn

Vectorization: careful packing/unpacking sequence
• Use MXCSR’s FZ and DAZ controls as appropriate


Optimizing for Memory

Software prefetch instructions
• Can reach beyond a page boundary (including page walk)
• Prefetches only when it completes without an exception

General techniques to help these prefetchers
• Organize data in consecutive lines
• In general, increasing addresses are more easily prefetched
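A minimal C sketch of these two ideas follows: data laid out consecutively and walked in ascending address order, with an optional software prefetch a fixed distance ahead. The names `sum_ascending`, `PREFETCH_READ`, and `PREFETCH_AHEAD` are hypothetical, the distance of 16 floats (one 64-byte line) is an untuned assumption, and `__builtin_prefetch` is a GCC/Clang builtin, so the macro compiles to nothing elsewhere. On a simple ascending walk like this the hardware prefetchers may already do the job, so the explicit hint is shown only for illustration.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
/* Read prefetch with high temporal locality (GCC/Clang builtin). */
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 3)
#else
#define PREFETCH_READ(p) ((void)0)
#endif

/* Assumed prefetch distance in elements: 16 floats = one 64-byte line. */
enum { PREFETCH_AHEAD = 16 };

/* Sum a contiguous array in ascending address order. */
float sum_ascending(const float *data, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            PREFETCH_READ(&data[i + PREFETCH_AHEAD]);
        }
        sum += data[i];
    }
    return sum;
}
```

The prefetch is only a hint and does not change the result; the right distance, if any, has to be found by measurement.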


Summary

What has been covered
• Notable features of Core® Micro-architecture
  • Wide Dynamic Execution
  • Advanced Memory Access
  • Advanced Smart Cache
  • Advanced Digital Media Boost
  • Power Efficient Support
• Core® Micro-architecture components
  • Front End
  • OOO execution core
  • Memory sub-system


Platform

Intel provides most of the silicon on any computer

Classical platform partition
• CPU – Computation
• MCH – high speed IO
• ICH – low speed IO

Graphics speed and memory latencies will require a different partition

This presentation focuses on the core microarchitecture

[Figure: platform block diagram. The CPU (Core, LLC) connects over the FSB to the MCH, which attaches MEM and connects over DMI to the ICH. Other labels: DDR, PEG, Graphics, Display, HD video, TVout, Analog, ME, PCIe, Wireless, PCI (IO), SATA, USB, KBRD, Legacy & Debug I/O, others]


Intel® 64 = Extending IA-32 to 64 Bit

With 64-Bit Extension Technology
• Extended Memory Addressability: 64-bit pointers, registers
• Additional Registers: 8 SSE & 8 general purpose
• Double Precision (64-bit) Integer Support

Added to the Intel XEON™ and Pentium® 4 processors in 2004; today available in all mainstream Intel IA-32 processors, in particular in all processors based on Intel® Core™ Architecture


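Intel® 64 - New Modes of Operation

Long Mode: 64-bit Mode
• OS required: new 64-bit OS
• Compile required: Yes
• New features: 64-bit IP Yes, RIP Rel. Yes, New Regs Yes
• Defaults: GPR width 64, address size 64, operand size 32

Long Mode: Compatibility Mode
• OS required: new 64-bit OS
• Compile required: No
• New features: 64-bit IP Yes, RIP Rel. No, New Regs No
• Defaults: GPR width 32, address size 32 or 16, operand size 32 or 16

Legacy Mode (IA32 Mode)
• OS required: legacy 32-bit or 16-bit OS
• Compile required: No
• New features: 64-bit IP No, RIP Rel. No, New Regs No
• Defaults: GPR width 32, address size 32 or 16, operand size 32 or 16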


Registers: Extensions and Additions

• General purpose registers: EAX, EBX, ECX, EDX, EBP, ESI, EDI, ESP (bits 31 to 0) extended to 64-bit RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP (bits 63 to 0)
• New 64-bit general purpose registers: R8-R15
• SSE registers (bits 127 to 0): XMM0-XMM7, plus new XMM8-XMM15
• Instruction pointer: EIP extended to 64-bit RIP
• X87/MMX registers (bits 79 to 0)


Registers : Availability in different modes


64-bit Mode of Operation

Default data size is 32 bits
• Override to 64 bits using the new REX prefix

All registers are 64-bit, 32-bit, 16-bit and 8-bit addressable

REX prefixes
• A family of 16 prefixes, encoded 0x40-0x4F
• Allow the use of general purpose registers as 64 bits
• Allow the use of new registers (like r8-r15)

Instructions that set a 32-bit register automatically zero-extend the upper 32 bits


REX Prefix

A new instruction-prefix byte used in 64-bit mode
• Specify the new GPRs and SSE registers
• Specify a 64-bit operand size
• Specify extended control registers (used by system software)

An instruction can only have one REX prefix and, if used, it must immediately precede the opcode or the two-byte opcode escape prefix.

The legacy instruction-size limit of 15 bytes still applies to instructions that contain a REX prefix.
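The 0x40-0x4F range follows from the layout of the prefix byte: a fixed high nibble of 0100 and four one-bit fields, conventionally named W, R, X, and B. A short C sketch of that encoding is shown below; the helper name `rex_prefix` is hypothetical, and the field names come from the architecture manuals rather than from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Build a REX prefix byte: 0100 W R X B.
 *   w: 1 selects a 64-bit operand size
 *   r: extends the ModRM reg field (reaches r8-r15, xmm8-xmm15)
 *   x: extends the SIB index field
 *   b: extends the ModRM r/m, SIB base, or opcode register field */
uint8_t rex_prefix(unsigned w, unsigned r, unsigned x, unsigned b) {
    assert(w <= 1 && r <= 1 && x <= 1 && b <= 1);
    return (uint8_t)(0x40u | (w << 3) | (r << 2) | (x << 1) | b);
}
```

With all four bits clear the byte is 0x40, and with all four set it is 0x4F, which gives the family of 16 prefixes.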


Physical and Linear Addressing

Linear Addressing
• Initial Intel® 64 implementations support 48 bits of virtual addressing
• Addresses are required to be in canonical form: bits 47 through 63 must all be 1 or all be 0

Physical Addressing
• Initial Netburst™ Intel® 64 implementations support 36 bits; today all current processors support at least 40 bits
• Entries in page tables expanded for up to 52 bits of physical address


Intel®64 - Large Memory Considerations

Canonical addressing for 64-bit addresses
• Although the architecture now allows calculating flat addresses to 64 bits, today’s processors limit virtual addressing to 48 bits
• Canonical address definition: an address that has address bits 63 through 47 set to either all ones or all zeros
• Canonical addresses are a requirement
• Values for addresses that are not canonical will cause faults when put into locations expecting a valid address, such as segment registers
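The canonical-form rule can be written as a small check. The sketch below is illustrative only: the function name `is_canonical` and its `va_bits` parameter are hypothetical, with `va_bits = 48` matching the implementations described here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* An address is canonical for va_bits of virtual addressing when bits
 * 63 down to (va_bits - 1) are all zeros or all ones. For va_bits = 48
 * that is bits 63 through 47. */
bool is_canonical(uint64_t addr, unsigned va_bits) {
    assert(va_bits >= 1 && va_bits <= 64);
    uint64_t upper = addr >> (va_bits - 1);        /* bits 63..va_bits-1 */
    uint64_t all_ones = UINT64_MAX >> (va_bits - 1);
    return upper == 0 || upper == all_ones;
}
```

For example, with 48 bits the highest canonical address in the lower half is 0x00007FFFFFFFFFFF and the lowest in the upper half is 0xFFFF800000000000; everything in between is non-canonical.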


Introducing SIMD: Single Instruction Multiple Data

Scalar processing
• traditional mode
• one operation produces one result

SIMD processing
• with SSE / SSE2
• one operation produces multiple results

[Figure: scalar X + Y = X+Y, next to SIMD (x3, x2, x1, x0) + (y3, y2, y1, y0) = (x3+y3, x2+y2, x1+y1, x0+y0)]
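The scalar versus SIMD contrast can be sketched in C with SSE intrinsics. The function names `add4_scalar` and `add4_simd` are hypothetical; the intrinsics come from `<xmmintrin.h>`, and the `__SSE__` guard (a GCC/Clang convention) falls back to the scalar loop where SSE is not available.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Scalar processing: one add produces one result, four times over. */
void add4_scalar(const float *x, const float *y, float *out) {
    for (size_t i = 0; i < 4; i++) {
        out[i] = x[i] + y[i];
    }
}

/* SIMD processing: one packed add produces four results. */
void add4_simd(const float *x, const float *y, float *out) {
#if defined(__SSE__)
    __m128 vx = _mm_loadu_ps(x);          /* x3 x2 x1 x0 */
    __m128 vy = _mm_loadu_ps(y);          /* y3 y2 y1 y0 */
    _mm_storeu_ps(out, _mm_add_ps(vx, vy));
#else
    add4_scalar(x, y, out);               /* fallback without SSE */
#endif
}
```

Both functions produce the same four sums; the SIMD version issues a single ADDPS instead of four scalar adds.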


X86 Register Sets

SSE registers were first introduced in the Pentium® 3

IA-INT Registers (32-bit: eax … edi)
• Fourteen 32-bit registers
• Scalar data & addresses
• Direct access to regs

MMX™ Technology / IA-FP Registers (80-bit: st0 … st7; 64-bit: mm0 … mm7)
• Eight 80/64-bit registers
• Hold data only
• Stack access to FP0..FP7
• Direct access to MM0..MM7
• No MMX™ Technology / FP interoperability

SSE Registers (128-bit: xmm0 … xmm7)
• Eight 128-bit registers
• Hold data only:
  • 4 x single FP numbers
  • 2 x double FP numbers
  • 128-bit packed integers
• Direct access to the registers
• Use simultaneously with FP / MMX Technology


Instruction Set Extensions

New instructions added to Intel® processors
• Jan-97: MMX™ (350 nm process)
• Feb-99: Streaming SIMD Extensions (SSE), 70 instructions (250 nm)
• Dec-00: Streaming SIMD Extensions 2 (SSE2), 144 instructions (180 nm)
• Feb-04: Streaming SIMD Extensions 3 (SSE3), 13 instructions (90 nm)
• Jul-06: Supplemental SSE3 (SSSE3), 32 instructions (65 nm)
• 2008+: future Intel instruction set extensions (SSE-4), ~50 instructions (45 nm)

Beginning in 2008: ~50 new instructions in 13 groups
• All function in 32-bit and 64-bit modes
• Improvements in commercial data integrity, i-SCSI, video processing, string and text processing, 2D & 3D imaging, vectorizing compiler performance


SSE and SSE-2 Data Types

SSE
• 4x floats

SSE-2
• 2x doubles
• 16x bytes
• 8x 16-bit shorts
• 4x 32-bit integers
• 2x 64-bit integers
• 1x 128-bit(!) integer


SSE Instruction Set Extensions

Introduced by the Pentium® III processor in 1999; now frequently called SSE-1
Only new data type supported: 4 x 32-bit (single-precision) floating-point data
Some 70 instructions
• Arithmetic, compare, and convert operations on SSE SP FP data
  • Packed, unpacked
• Data load/store
• Prefetch
• Extension of MMX
• Streaming store (store without using the cache in between)
• …

2001 PTE Engineering Enabling Conference


SSE Sample: Branch Removal

R = (A < B) ? C : D   // remember: everything packed

A                     0.0     0.0     -3.0    3.0
B                     0.0     1.0     -5.0    5.0
mask = cmplt(A, B)    00000   11111   00000   11111
C                     c3      c2      c1      c0
D                     d3      d2      d1      d0
mask and C            00000   c2      00000   c0
(not mask) and D      d3      00000   d1      00000
R = or of the two     d3      c2      d1      c0
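The branch-removal data flow of this slide can be sketched in portable C. This is a scalar model written for illustration, not SSE code: the helper names `f2u`, `u2f` and `select_lt4` are invented here, and array index 0 stands for lane 0. With SSE intrinsics the same steps would map to `_mm_cmplt_ps`, `_mm_and_ps`, `_mm_andnot_ps` and `_mm_or_ps`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a float's bit pattern as a 32-bit integer (no numeric conversion). */
static uint32_t f2u(float f)
{
    uint32_t u;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Reinterpret a 32-bit pattern as a float. */
static float u2f(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* r[i] = (a[i] < b[i]) ? c[i] : d[i] for four lanes, with no data-dependent branch. */
static void select_lt4(const float a[4], const float b[4],
                       const float c[4], const float d[4], float r[4])
{
    for (int i = 0; i < 4; i++) {
        /* cmplt: all ones where a < b, all zeros otherwise */
        uint32_t mask   = (uint32_t)0 - (uint32_t)(a[i] < b[i]);
        uint32_t from_c = mask & f2u(c[i]);    /* and */
        uint32_t from_d = ~mask & f2u(d[i]);   /* and-not */
        r[i] = u2f(from_c | from_d);           /* or */
    }
}
```

Feeding in the slide's values (lane 0: A = 3.0, B = 5.0, up to lane 3: A = 0.0, B = 0.0) selects c0, d1, c2, d3, matching the result row of the figure.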


SSE-2 Instruction Set Extensions

Introduced by the Intel® Pentium® 4 processor in 2000
Some 140 new instructions
• Added double-precision floating-point data (2 x 64-bit) and all related instructions, including conversion
• Again some extensions to MMX
• Added all possible combinations of integer data to SSE (1x128, 2x64, 4x32, 8x16, 16x8) and related operations


SIMD Single vs. SIMD Double

4 x Single Precision: SSE-1
• SIMD SP FP operand = 4 elements (X3, X2, X1, X0 in bits 127..0)
• Element = SP FP number: sign (S) in bit 31, exponent in bits 30..23, significand in bits 22..0

2 x Double Precision: SSE-2
• SIMD DP FP operand = 2 elements (X1, X0 in bits 127..0)
• Element = DP FP number: sign (S) in bit 63, exponent in bits 62..52, significand in bits 51..0
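To make the two element layouts concrete, here is a small C sketch that splits a `float` and a `double` into the sign, exponent and significand fields shown on the slide. It assumes the platform uses IEEE 754 binary32/binary64 (as x86 does); the type `fp_fields` and the function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t sign;         /* 1 bit */
    uint32_t exponent;     /* biased: 8 bits (SP) or 11 bits (DP) */
    uint64_t significand;  /* fraction: 23 bits (SP) or 52 bits (DP) */
} fp_fields;

/* Single precision: S = bit 31, exponent = bits 30..23, significand = bits 22..0. */
static fp_fields split_float(float x)
{
    uint32_t bits;
    fp_fields f;
    assert(sizeof(float) == 4);
    memcpy(&bits, &x, sizeof bits);
    f.sign        = bits >> 31;
    f.exponent    = (bits >> 23) & 0xFFu;
    f.significand = bits & 0x7FFFFFu;
    return f;
}

/* Double precision: S = bit 63, exponent = bits 62..52, significand = bits 51..0. */
static fp_fields split_double(double x)
{
    uint64_t bits;
    fp_fields f;
    assert(sizeof(double) == 8);
    memcpy(&bits, &x, sizeof bits);
    f.sign        = (uint32_t)(bits >> 63);
    f.exponent    = (uint32_t)((bits >> 52) & 0x7FFu);
    f.significand = bits & 0xFFFFFFFFFFFFFull;
    return f;
}
```

For instance, `split_float(1.0f)` yields a biased exponent of 127 and a zero significand, while `split_double(1.0)` yields a biased exponent of 1023.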


Sample for SSE-2: SIMD Double ↔ SIMD Int Conversion

SIMD Double → SIMD Int: conversion to the two lower ints; the two higher ints are cleared

x:     x1        x0
ix:    00000     00000     (int)x1     (int)x0

__m128d x; __m128i ix; ix = _mm_cvtpd_epi32(x);

Note: _mm_cvtpd_epi32 rounds according to the current MXCSR rounding mode (round-to-nearest by default); _mm_cvttpd_epi32 truncates like a C (int) cast.

SIMD Int → SIMD Double: conversion from the two lower ints; the two higher ints are ignored

ix:    ????      ????      ix1         ix0
x:     (double)ix1         (double)ix0

x = _mm_cvtepi32_pd(ix);

SSE3: No New Data Types but New Instructions

• FP to integer conversions: FISTTP
• Complex arithmetic: ADDSUBPD, ADDSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP
• Video encoding: LDDQU
• SIMD FP using AOS format*: HADDPD, HSUBPD, HADDPS, HSUBPS
• Thread synchronization: MONITOR, MWAIT

* Also benefits Complex and Vectorization

Streaming SIMD Extensions 3

13 new instructions

Three have limited use for application performance improvement
• FISTTP - x87 to integer conversion (requires –longdouble switch)
• MONITOR/MWAIT - thread synchronization
  • Available today in Ring 0 only; being used by newer Windows* and Linux* thread packages

The other ten have some potential for specific application domains

SSE-3 Sample, Complex Arithmetic: ADDSUBPS

ADDSUBPS OperandA, OperandB
• OperandA (xmm register; 4 data elements): a3, a2, a1, a0
• OperandB (xmm register or memory address; 4 data elements): b3, b2, b1, b0
• Result (stored in OperandA): a3+b3, a2-b2, a1+b1, a0-b0

__m128 _mm_addsub_ps(__m128 a, __m128 b)

OperandA     a3       a2       a1       a0
OperandB     b3       b2       b1       b0
Operation    Add      Sub      Add      Sub
Result       a3+b3    a2-b2    a1+b1    a0-b0
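Why an alternating subtract/add helps complex arithmetic can be seen in a scalar C model. The sketch below is an illustration under assumptions, not SSE3 code: `addsub_ps` mimics the lane behavior described on the slide, and `complex_mul2` (a name invented here) multiplies two pairs of complex numbers stored as {re0, im0, re1, im1}, with array index 0 standing for lane 0.

```c
#include <assert.h>

/* Scalar model of ADDSUBPS: even lanes subtract, odd lanes add. */
static void addsub_ps(const float a[4], const float b[4], float r[4])
{
    assert(a && b && r);
    for (int i = 0; i < 4; i++)
        r[i] = (i & 1) ? a[i] + b[i] : a[i] - b[i];
}

/* Multiply two pairs of complex numbers laid out as {re0, im0, re1, im1}. */
static void complex_mul2(const float x[4], const float y[4], float r[4])
{
    float t[4], u[4];
    for (int i = 0; i < 4; i += 2) {
        t[i]     = x[i]     * y[i];      /* x.re * y.re (y.re duplicated, as MOVSLDUP would) */
        t[i + 1] = x[i + 1] * y[i];      /* x.im * y.re */
        u[i]     = x[i + 1] * y[i + 1];  /* x.im * y.im (y.im duplicated, as MOVSHDUP would) */
        u[i + 1] = x[i]     * y[i + 1];  /* x.re * y.im */
    }
    /* real part = re*re - im*im, imaginary part = im*re + re*im */
    addsub_ps(t, u, r);
}
```

The final step is a single add/subtract over all four lanes, which is the operation ADDSUBPS provides in one instruction.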


Sample SSSE-3 Instruction: Byte Permute

PSHUFB mm, mm/m64
PSHUFB xmm, xmm/m128
• A complete byte-granularity permutation
• The source operand is used as the control field (variable control)
• The destination operand gets permuted
• Each byte of the source field selects the origin of the corresponding destination byte
• Also includes a force-byte-to-zero flag (bit 7)

Example (64-bit form; byte 7 on the left, byte 0 on the right):
src (control)    0x07 0x07 0xFF 0x80 0x01 0x00 0x00 0x00
dest (before)    0x04 0x01 0x07 0x03 0x02 0x02 0xFF 0x01
dest (after)     0x04 0x04 0x00 0x00 0xFF 0x01 0x01 0x01
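The permutation rule can be written out as a scalar C model of the 64-bit form. This is a sketch for illustration rather than SSSE3 code; the function name `pshufb64` is invented, and index 0 is taken to be the least significant byte. The 128-bit form works the same way but uses the low four bits of each control byte as the index.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of PSHUFB mm, mm/m64: dest is permuted in place under control of src. */
static void pshufb64(uint8_t dest[8], const uint8_t src[8])
{
    uint8_t out[8];
    assert(dest && src);
    for (int i = 0; i < 8; i++) {
        if (src[i] & 0x80)
            out[i] = 0;                    /* bit 7 set: force this byte to zero */
        else
            out[i] = dest[src[i] & 0x07];  /* low 3 bits select the origin byte */
    }
    memcpy(dest, out, sizeof out);
}
```

Applied to the slide's example (written with byte 0 first), the control bytes {0x00, 0x00, 0x00, 0x01, 0x80, 0xFF, 0x07, 0x07} turn {0x01, 0xFF, 0x02, 0x02, 0x03, 0x07, 0x01, 0x04} into {0x01, 0x01, 0x01, 0xFF, 0x00, 0x00, 0x04, 0x04}.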


Ways to SSE/SIMD Programming

Coding using SSE/SSE2/3/4 assembler instructions
• Very tedious (manual scheduling), discouraged: don’t do it!
• E.g.: how do you exploit the benefit of now having 16 instead of 8 SSE registers on Intel® 64 without maintaining two versions?

Intel® compiler’s C/C++ SIMD intrinsics
• No need to take care of register allocation, scheduling, etc.

Intel® compiler’s C++ Vector Class Library
• Use this if you are heavy into C++ classes

Vectorizer of Intel® C++ and Fortran Compilers
• Recommended for most cases: easy and efficient

Use ready-to-go vectorized code from a library like Intel® Math Kernel Library (MKL)

Compiler Based Vectorization: Processor-Specific Switches

Generate code and optimize for (Linux* switches):
• -xK, -axK: Pentium® III compatible and Athlon XP processors, including code generation for MMX and SSE
• -xW, -axW: Pentium® 4 compatible, Athlon 64 and Opteron processors in 32-bit and 64-bit mode, including code generation for MMX, SSE and SSE2
• -xN, -axN: Pentium® 4 processors in 32-bit mode, including code generation for MMX, SSE and SSE2 (deprecated switch: use -xW instead)
• -xB, -axB: Pentium® M processors, including code generation for MMX, SSE and SSE2
• -xP, -axP: Intel® processors with SSE3 capability, including Pentium® 4 (both 32-bit and 64-bit mode), including code generation for MMX, SSE, SSE2 and SSE3
• -xT, -axT: Intel® processors with MNI (SSSE3) capability, i.e. Intel® Core™2 Duo processors (Conroe, Merom, Woodcrest), including code generation for MMX, SSE, SSE2, SSE3 and MNI
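For context, a loop of the following shape is the kind of code a vectorizing compiler can typically turn into packed SSE instructions. This is only a sketch: the function `saxpy` is an illustrative example, not taken from the slides, and whether a given compiler version actually vectorizes it depends on its options and heuristics.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = alpha * x[i] + y[i]; restrict promises that x and y do not overlap,
   which removes a common obstacle to vectorization. */
static void saxpy(size_t n, float alpha,
                  const float *restrict x, float *restrict y)
{
    assert(n == 0 || (x && y));
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```

Independent iterations, unit-stride access and non-overlapping pointers are what make each group of four `float` elements map onto one packed operation.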


Intel® Core® Micro-architecture Notable Features (cont.): New Instructions

Instruction name: Description
• psignb/w/d mm, mm/m64 and psignb/w/d xmm, xmm/m128: Per element, if the source operand is negative, multiply the destination operand by -1.
• pabsb/w/d mm, mm/m64 and pabsb/w/d xmm, xmm/m128: Per element, overwrite destination with absolute value of source.
• phaddw/d/sw mm, mm/m64 and phaddw/d/sw xmm, xmm/m128: Pairwise integer horizontal addition + pack.
• phsubw/d/sw mm, mm/m64 and phsubw/d/sw xmm, xmm/m128: Pairwise integer horizontal subtract + pack.
• PMADDUBSW mm, mm/m64 and PMADDUBSW xmm, xmm/m128: Multiply signed & unsigned bytes. Accumulate result to signed words. (Multiply Accumulate)
• PMULHRSW mm, mm/m64 and PMULHRSW xmm, xmm/m128: Signed 16-bit multiply, return high bits.
• PSHUFB mm, mm/m64 and PSHUFB xmm, xmm/m128: A complete byte-granularity permutation, including force-to-zero flag.
• PALIGNR mm, mm/m64, imm8 and PALIGNR xmm, xmm/m128, imm8: Extract any contiguous 16 (8 in the 64-bit case) bytes from the pair [dst, src] and store them to the dst register.

Dependencies and Bypasses

“Read-after-Write” dependency: 1 clock stall, assuming the register file can be written through
add eax, ecx → eax      F D E W
sub ebx, eax → ebx        F D D E W

“E to D” bypass: saves the clock penalty
add eax, ecx → eax      F D E W
sub ebx, eax → ebx        F D E W

Long-latency operations
load [ecx+edi] → eax    F D E E E W
add ebx, eax → ebx        F D D D E W


Fighting Stalls: Branch Handling

Given the code:
for (i=100, a=0; i>0; i--) a+=B[i];

Compiler would generate:
// eax initialized with zero, edi initialized with 100
loop:
  load  B[edi] → ebx       // read B[i] from memory
  add   eax, ebx → eax     // a+=B[i]
  add   edi, -1 → edi      // i-=1
  jnz   edi, loop          // repeat while i != 0
  store eax → a            // store result

Fighting Stalls: Branch Handling (cont.)

load  B[edi] → ebx      F D E W
add   eax, ebx → eax      F D E W
add   edi, -1 → edi         F D E W
jnz   edi, loop               F D E W
store eax → a                   F D E W      (wrong fetch, flushed)
xxx                               F D E W    (wrong fetch, flushed)
load  B[edi] → ebx                  F D E W

Only after the branch's Execute stage do we know that the next fetch was wrong
• Need to flush the pipe
• IPC: 4 instructions in 6 clocks (IPC = 0.67 vs. optimum IPC = 1)
• ‘Pipe break’ penalty = 2 clocks
• Adding a stage? The penalty grows to 3 clocks: IPC = 4/7 = 0.57, ~14% slower!

Prolonging the pipeline achieves higher frequencies, but the pipe break penalty increases. We MUST solve the pipe break penalty problem!
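The IPC figures on this slide follow from simple arithmetic, which the sketch below reproduces. It assumes a single-issue pipeline in which each loop iteration pays the full pipe-break penalty once; the function names `loop_ipc` and `slowdown` are invented for this example.

```c
#include <assert.h>

/* IPC of a loop body whose closing branch costs `pipe_break_penalty` extra clocks. */
static double loop_ipc(int instructions_per_iteration, int pipe_break_penalty)
{
    assert(instructions_per_iteration > 0 && pipe_break_penalty >= 0);
    return (double)instructions_per_iteration /
           (double)(instructions_per_iteration + pipe_break_penalty);
}

/* Relative slowdown (0..1) when the penalty grows, at an unchanged clock frequency. */
static double slowdown(int instructions, int old_penalty, int new_penalty)
{
    return 1.0 - loop_ipc(instructions, new_penalty) /
                 loop_ipc(instructions, old_penalty);
}
```

With four instructions per iteration, a 2-clock penalty gives 4/6 ≈ 0.67 and a 3-clock penalty gives 4/7 ≈ 0.57, so `slowdown(4, 2, 3)` is 1/7, roughly 14%.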


Fighting Stalls: Branch Handling (cont.)

H/W can ‘learn’ about S/W behavior
• Same branch goes the same direction in most cases
• Learn branch address and target
  • Branch Target Buffer (BTB)
• Predict based on branch history, surrounding branch behavior, loop behavior
  • We are at ~95% correct prediction
• Looks in the BTB while fetching the instruction
• Lee & Smith or Yeh & Patt algorithms

New (and correct) pointer calculated in the Fetch stage of the branch
load  B[edi] → ebx      F D E W
add   eax, ebx → eax      F D E W
add   edi, -1 → edi         F D E W
jnz   edi, loop               F/P D E W
load  B[edi] → ebx                F D E W
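How hardware can "learn" a branch is easiest to see with a classic 2-bit saturating counter, one of the simplest history-based schemes. The sketch below is illustrative only and is not claimed to be the predictor used in the Core microarchitecture; the type `predictor2` and the function names are invented here. It replays the `jnz` of the 100-iteration example loop.

```c
#include <assert.h>
#include <stdbool.h>

/* 2-bit saturating counter: 0 and 1 predict not taken; 2 and 3 predict taken. */
typedef struct { unsigned counter; } predictor2;

static bool predict(const predictor2 *p)
{
    assert(p->counter <= 3);
    return p->counter >= 2;
}

/* Move one step toward the observed outcome, saturating at 0 and 3. */
static void train(predictor2 *p, bool taken)
{
    if (taken && p->counter < 3) p->counter++;
    if (!taken && p->counter > 0) p->counter--;
}

/* Replay the loop-closing branch: taken on every iteration except the last.
   Returns the number of correct predictions. */
static int simulate_loop_branch(predictor2 *p, int iterations)
{
    int correct = 0;
    for (int i = iterations; i > 0; i--) {
        bool taken = (i - 1) != 0;   /* jnz after edi = i - 1 */
        if (predict(p) == taken) correct++;
        train(p, taken);
    }
    return correct;
}
```

Starting from a cold counter, the predictor needs two iterations to warm up and mispredicts the loop exit; on a second run of the same loop only the exit is mispredicted.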


Advanced Pipeline Techniques

Limitations of the typical pipeline scheme
• IPC is theoretically limited to 1
  • In practice IPC is less than 1 because of long-latency operations, stalls (e.g. cache misses), pipeline flushes (due to branch misprediction), etc.
• Pipeline stages are frequently not balanced
  • Cycle time (Tc) is determined by the longest pipeline stage

Advanced pipeline techniques
• Super pipeline
• Super-scalar

Advanced Pipeline Techniques (cont.)

Super pipeline: shorter stages allow higher frequency
F1 F2 D1 D2 E1 E2 W1 W2
   F1 F2 D1 D2 E1 E2 W1 W2
      F1 F2 D1 D2 E1 E2 W1 W2

Super-scalar: perform more in a single cycle
F D E W
F D E W
  F D E W
  F D E W

Fighting Stalls: Out-of-Order Execution (OoO)

Instructions are executed based on “data flow” rather than program order (Tomasulo’s algorithm)
1. Instruction fetch and decode.
2. Instruction is queued at the reservation station.
3. Instruction
   • waits in the queue until all input operands are available
   • may leave the queue before earlier, older instructions
   (This avoids the stall that occurs at this stage in an in-order processor.)
4. Instruction execution.
5. Results are queued.
6. Instruction reorder and writeback.

Fighting Stalls: Register Renaming

Creates new opportunities for OOO execution
• Eliminates Write-after-Write (WAW) and Write-after-Read (WAR) dependencies (hazards)
• Architectural vs. physical registers

Example 1:
1. mov eax, [m1]
2. add eax, 2
3. mov [m2], eax
4. mov eax, [m3]
5. add eax, 4
6. mov [m4], eax
4, 5, 6 can be executed in parallel with 1, 2, 3, but only after register renaming!

Example 2:
MULTD F4,F2,F2   (reads from F2)
ADDD  F2,F0,F6   (writes to F2)
After renaming (assume F8 is unused):
MULTD F4,F2,F2
ADDD  F8,F0,F6
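A minimal renaming table shows why instructions 4 to 6 become independent of 1 to 3. The sketch below is a simplification made for illustration: register counts, the type `rename_table` and the function names are assumptions, and physical registers are never recycled.

```c
#include <assert.h>

enum { ARCH_REGS = 8, PHYS_REGS = 32 };

typedef struct {
    int map[ARCH_REGS];  /* architectural register -> current physical register */
    int next_free;       /* next unused physical register (no recycling in this sketch) */
} rename_table;

static void rename_init(rename_table *t)
{
    for (int i = 0; i < ARCH_REGS; i++)
        t->map[i] = i;
    t->next_free = ARCH_REGS;
}

/* A source operand reads the physical register that currently holds the value. */
static int rename_source(const rename_table *t, int arch)
{
    return t->map[arch];
}

/* A destination gets a fresh physical register, which removes WAW and WAR hazards. */
static int rename_dest(rename_table *t, int arch)
{
    assert(t->next_free < PHYS_REGS);
    t->map[arch] = t->next_free++;
    return t->map[arch];
}
```

Renaming the six-instruction example in program order gives `mov eax, [m3]` a physical register different from the ones used by instructions 1 to 3, so only true read-after-write dependencies remain within each group.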


Fighting Stalls: Re-Order Buffer (ROB)

Mechanism for renaming and retirement
Table contains instructions in program order
• Instructions are entered in order
• Registers are renamed by the entry number
• Once assigned: execution order is unimportant
• After execution: entries are marked
• An executed entry can be “retired” once all prior instructions have retired. That is:
  • Update the “real registers” with the values of the renamed registers
  • Update memory
  • Leave the ROB
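The in-order-allocate, out-of-order-complete, in-order-retire discipline can be sketched as a small circular buffer. This is a toy model under assumptions: the size `ROB_SIZE`, the type `rob` and the function names are illustrative, and register or memory updates at retirement are left out.

```c
#include <assert.h>
#include <stdbool.h>

enum { ROB_SIZE = 8 };

typedef struct {
    bool done[ROB_SIZE];  /* entry has finished executing */
    int head;             /* oldest entry, the next one allowed to retire */
    int count;            /* entries currently in the buffer */
} rob;

static void rob_init(rob *r)
{
    r->head = 0;
    r->count = 0;
}

/* Allocate an entry in program order; returns its index, or -1 if the ROB is full. */
static int rob_allocate(rob *r)
{
    if (r->count == ROB_SIZE)
        return -1;
    int idx = (r->head + r->count) % ROB_SIZE;
    r->done[idx] = false;
    r->count++;
    return idx;
}

/* Execution may complete in any order. */
static void rob_mark_done(rob *r, int idx)
{
    assert(idx >= 0 && idx < ROB_SIZE);
    r->done[idx] = true;
}

/* Retire from the head only while the oldest entry is done; returns how many retired. */
static int rob_retire(rob *r)
{
    int retired = 0;
    while (r->count > 0 && r->done[r->head]) {
        r->head = (r->head + 1) % ROB_SIZE;
        r->count--;
        retired++;
    }
    return retired;
}
```

If the youngest of three entries finishes first, nothing retires until the oldest is done, after which finished entries leave strictly in program order.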


Fighting Stalls: Reservation Station(s)

• Pool(s) of all “not yet executed” instructions
• Maintains operand status: “ready / not-ready”
• Each cycle, executed instructions make more operands “ready”
• Instructions whose operands are all “ready” can be “dispatched” for execution
• Dispatcher chooses which of the “ready” instructions will be executed next

Fighting Stalls: Memory Order Buffer (MOB)

Idea: allow out-of-order execution among memory operations
Problem: memory dependencies cannot be fully resolved statically (memory disambiguation)
• Structure similar in concept to the ROB
• Every access is allocated an entry
• Address & data (for stores) are updated when known
• Each load is checked against all previous stores
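The check of a load against all previous stores can be sketched as follows. This is a deliberately conservative model made up for illustration: it assumes same-sized aligned accesses, that a store's data is known once its address is, and that a load simply waits behind any older store with an unknown address. Real hardware may speculate instead. The types and the function `check_load` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     address_known;
    uint32_t address;
    uint32_t data;
} store_entry;

typedef enum { LOAD_FROM_MEMORY, LOAD_FORWARDED, LOAD_MUST_WAIT } load_action;

/* Check a load against all older stores (index 0 = oldest), youngest first. */
static load_action check_load(const store_entry *older_stores, int n,
                              uint32_t load_address, uint32_t *value)
{
    assert(value);
    for (int i = n - 1; i >= 0; i--) {
        if (!older_stores[i].address_known)
            return LOAD_MUST_WAIT;          /* might alias: stay conservative */
        if (older_stores[i].address == load_address) {
            *value = older_stores[i].data;  /* youngest matching store wins */
            return LOAD_FORWARDED;
        }
    }
    return LOAD_FROM_MEMORY;                /* no older store to this address */
}
```

A load that matches an older store takes its data from that store entry; a load with no match can go to the cache ahead of the stores.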


Intel® Core® Micro-architecture Notable Features (cont.)
Intelligent Power Capability - Split Busses (core power feature)

Many buses are sized for worst-case data
• x86 instruction of 15 bytes
• ALU can write back 128 bits

Improved Energy Efficiency

Intel® Core® Micro-architecture Notable Features (cont.)
Intelligent Power Capability - Split Busses (core power feature)

By splitting buses to deal with varying data widths, we gain the performance benefit of wide buses while keeping dynamic capacitance (C dynamic) closer to that of thinner buses

Improved Energy Efficiency

Agenda Introduction Knowledge refreshment Notable features Micro-architecture drill-down • Front End • Out-Of-Order Execution Core • Memory Sub-system Coding considerations


Intel® Core® Micro-architecture Overview

[Block diagram: System Bus, Bus Unit, 2nd Level Cache, 1st Level Cache (Data); Front End: Instruction Fetch Unit, Decode/IQ, Branch Prediction Unit; Execution Core: Renamer/Allocator, Scheduler, Execution Unit, Buffers (Retirement)]

Intel® Core® Micro-architecture Drill-down

[Block diagram: icache, branch prediction unit, predecode, instruction queue, instruction decode, MS, register alias table, ALLOC, Reservation Station, execution ports (integer / FP / SIMD (3x), load, store address, store data), Re-Order Buffer, memory order buffer, data cache unit, page miss handler]

Example Code to Be Used

…
addps xmm0, [EAX+16]
mulps xmm0, xmm0
movps [EAX+240], xmm0
cmp EAX, 100000
jge label
…


Core® Micro-architecture Front End

Instruction preparation before execution
• Instruction Fetch Unit
• Instruction Queue
• Instruction Decode Unit
• Branch Prediction Unit


Intel® Core™ Microarchitecture – Front End

Instruction Fetch Unit
• Prefetches instructions that are likely to be executed
• Caches frequently used instructions
• Predecodes and buffers instructions

[Block diagram: front end with icache, branch prediction unit, predecode, instruction queue, instruction decode, MS]

Instruction Fetch Unit (cont.)

I-Cache (Instruction Cache)
• 32 KBytes / 8-way / 64-byte line
• 16 aligned bytes fetched per cycle
ITLB (Instruction Translation Lookaside Buffer)
• 128 entries for 4K pages, 8 entries for 2M pages
Instruction Prefetcher
• 16-byte aligned lookup through the ITLB into the instruction cache and instruction prefetch buffers
Instruction Pre-decoder
• Instruction Length Decode (predecode)
• Avoid Length Changing Prefixes (LCP; instruction prefixes 66H/67H), for example:
  • Avoid in loops: MOV dx, 1234h (the 66H prefix changes the length of the immediate)
  • The REX (EM64T) prefix (4xH) is not an LCP

Instruction format: Instruction Prefixes | Opcode | ModR/M | SIB | Displacement | Immediate


Instruction Queue

Buffer between the instruction pre-decode unit and the decoder
• Up to six predecoded instructions written per cycle
• 18 instructions contained in the IQ
• Up to 5 instructions read from the IQ
Potential loop cache: Loop Stream Detector (LSD) support
• Re-use of decoded instructions
• Potential power saving

[Block diagram: front end with icache, branch prediction unit, predecode, instruction queue, instruction decode, MS]


Instruction Decode

• Decodes the instructions into micro-ops
• Ready for execution in the OOO core

[Block diagram: front end with icache, branch prediction unit, predecode, instruction queue, instruction decode, MS]

Instruction Decode Decoders Features • Macro-fusion • Micro-fusion • Stack Pointer Tracking


Instruction Decode / Decoders

Instructions are converted to micro-ops (uops)
• 1-uop instructions include load+op, stores, indirect jump, RET...
4 decoders: 1 “large” and 3 “small”
• All decoders handle “simple” 1-uop instructions
• The one large decoder handles instructions of up to 4 uops
All decoders work in parallel
• Four(+) instructions / cycle
Micro-Sequencer takes over for long flows (the large decoder handles instructions containing 2~4 uops; the uCode ROM handles more complex ones)

Code Sequence in Front End

Instructions entering the IQ:
  addps xmm0, [EAX+16]
  mulps xmm0, xmm0
  movps [EAX+240], xmm0
  cmp EAX, 100000
  jne label

• These instructions took more than one fetch, as they are 22 bytes
• The IQ buffers them together
• All instructions are decodable by all decoders
• CMP and adjacent JCC are “fused” into a single uop
• Up to 5 instructions decoded per cycle

Decoders: Large (dec0), small (dec1), small (dec2), small (dec3)
Resulting uops:
  load_add xmm0, xmm0, [EAX+16]
  mulps xmm0, xmm0, xmm0
  sta_std [EAX+240], xmm0
  cmpjne EAX, 100000, label


Instruction Decode / Macro - Fusion Scheduler Roughly ~15% of all instructions are conditional branches.

cmpjae eax, [mem], label

Macro-fusion merges two instructions into a single micro-op, as if the two instructions were a single long instruction.

Execution

Enhanced Arithmetic Logic Unit (ALU) for macro-fusion. Each macro-fused instruction executes with a single dispatch.

Branch Eval

Not supported in EM64T long mode

flags and target to Write back Intel® Processor Micro-architecture - Core® microarchitecture 128 Copyright © 2006, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.


Intel® Core™ Microarchitecture – Front End

Instruction Decode / Macro-Fusion Absent
• Read four instructions from Instruction Queue
• Each instruction gets decoded into separate uops

Enabling Example
for (int i=0; i<100000; i++) { … }

Instruction Queue
addps xmm0, [EAX+16]
mulps xmm0, xmm0
movps [EAX+240], xmm0
cmp eax, 100000
jge label

Cycle 1
• dec0: addps xmm0, [EAX+16]
• dec1: mulps xmm0, xmm0
• dec2: movps [EAX+240], xmm0
• dec3: cmp eax, 100000
Cycle 2
• dec0: jge label


Intel® Core™ Microarchitecture – Front End

Instruction Decode / Macro-Fusion Present
• Read five instructions from Instruction Queue
• Send fusable pair to single decoder
• Single uop represents two instructions

Enabling Example
for (unsigned int i=0; i<100000; i++) { … }

Instruction Queue
addps xmm0, [EAX+16]
mulps xmm0, xmm0
movps [EAX+240], xmm0
cmp eax, 100000
jae label

Cycle 1
• dec0: addps xmm0, [EAX+16]
• dec1: mulps xmm0, xmm0
• dec2: movps [EAX+240], xmm0
• dec3: cmpjae eax, 100000, label
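The two enabling examples differ only in the type of the loop counter. The sketch below puts both forms side by side in C. The function names and the summing loop body are illustrative assumptions, not part of the slides, and which compare/branch pair a given compiler emits (for example cmp + jge versus cmp + jae) is not guaranteed; inspect the generated assembly to confirm.

```c
#include <stddef.h>

/* Signed counter: the loop test is a signed comparison, which the slides
   show as "cmp eax, 100000" followed by "jge label" (two separate uops). */
float sum_signed(const float *a, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unsigned counter: the loop test is an unsigned comparison, which the
   slides show as "cmp eax, 100000" + "jae label" fused into one cmpjae. */
float sum_unsigned(const float *a, unsigned int n)
{
    float s = 0.0f;
    for (unsigned int i = 0; i < n; i++)
        s += a[i];
    return s;
}
```

Both functions compute the same result; only the comparison the compiler has to generate for the loop condition changes.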

Intel® Core™ Microarchitecture – Front End

Instruction Decode / Macro-Fusion (cont.)
Benefits
• Reduces latency
• Increased renaming
• Increased retire bandwidth
• Increased virtual storage
• Power savings

Enabling Greater Performance & Efficiency

Intel® Core™ Microarchitecture – Front End

Instruction Decode / Micro-Op Fusion
• Frequent pairs of micro-operations derived from the same macro instruction can be fused into a single micro-operation
• Micro-op fusion effectively widens the pipeline


Intel® Core™ Microarchitecture – Front End

Instruction Decode / Micro-Fusion (cont.)
u-ops of a Store “movps [EAX+240], xmm0”
• sta eax+240
• std xmm0, [eax+240]
Fused u-op: st xmm0, [eax+240]


Intel® Core™ Microarchitecture – Front End

Instruction Decode / Stack Pointer Tracker (Extended Stack Pointer folding)
ESP is calculated by dedicated logic
• No explicit Micro-Ops updating ESP
• Micro-Ops saving
• Power saving

[Diagram: PUSH EAX, PUSH EDX, POP EBX flow through Decoder 0 … Decoder N; the tracker holds the ESP delta (ESPd=8 after the two pushes) and Recovery Information]
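A toy model may make the folding idea concrete: instead of emitting a micro-op that updates ESP for every PUSH or POP, dedicated logic accumulates a delta and each stack access is addressed as ESP plus that delta. The C sketch below is only an illustration under assumptions: the type and function names are hypothetical, operands are 4 bytes wide, and the sign convention (negative for pushes) is chosen here, whereas the slide labels the magnitude as ESPd.

```c
#include <stdint.h>

enum stack_op { OP_PUSH, OP_POP };

/* Hypothetical tracker state: ESP offset accumulated but not yet
   written back to the architectural register, in bytes. */
struct sp_tracker {
    int32_t delta;
};

/* Apply one stack instruction and return the offset from the
   architectural ESP at which its memory access takes place. */
int32_t sp_track(struct sp_tracker *t, enum stack_op op)
{
    if (op == OP_PUSH) {
        t->delta -= 4;          /* PUSH: decrement first, then store */
        return t->delta;
    }
    int32_t at = t->delta;      /* POP: load first, then increment */
    t->delta += 4;
    return at;
}
```

For the sequence on the slide (PUSH EAX, PUSH EDX, POP EBX), the model stores at offsets -4 and -8, loads from -8, and ends with a pending delta of -4, without any explicit ESP-update operation in between.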


Core® Micro-architecture Front End
• Instruction Fetch Unit
• Instruction Queue
• Instruction Decode Unit
• Branch Prediction Unit


Intel® Core™ Microarchitecture – Front End

Branch Prediction Unit
Allows executing instructions long before the branch outcome is decided
• Superset of Prescott / Pentium-M features
• One taken branch every other clock
• Branch predictions for 32 bytes at a time, twice the width of the fetch engine

[Block diagram: Front End (Instruction Fetch Unit with icache, branch prediction unit and predecode; instruction queue; instruction decode; MS; BTBs/Branch Prediction), Execution Core (Renamer/Allocator, Scheduler, Execution Unit, Buffers (Retirement)), 1st Level Cache (Data), 2nd Level Cache]

Intel® Core™ Microarchitecture – Front End


Branch Prediction Unit (cont.)
16-entry Return Stack Buffer (RSB)
Front end queuing of BPU lookups
Types of predictions
• Direct Calls and Jumps
• Indirect Calls and Jumps
• Conditional branches


Intel® Core™ Microarchitecture – Front End

Branch Prediction Improvements
Intel® Pentium® 4 Processor branch prediction PLUS the following two improvements:
• Indirect Branch Predictor
• Loop Detector

Branch mispredictions reduced by >20%


Agenda
Introduction
Knowledge preparation
Notable features
Micro-architecture drill-down
• Front End
• Out-Of-Order Execution Core
• Memory Sub-system
Coding considerations


Core® Micro-architecture Execution Core
Accepts decoded u-ops, assigns resources, executes and retires u-ops
• Renamer
• Reservation station (RS)
• Issue ports
• Execution Unit

[Block diagram: register alias table, ALLOC, Reservation Station, Re-Order Buffer; issue ports for store address, store data, load, and integer / FP / SIMD (3x); shown in context of Front End (Instruction Fetch Unit, IQ/Decode, BTBs/Branch Prediction), Execution Core (Renamer/Allocator, Scheduler, Execution Unit, Buffers (Retirement)), 1st Level Cache (Data), 2nd Level Cache]


Intel® Core™ Microarchitecture – Execution Core

Execution Core Building Blocks
• Renamer
• RS
• ROB
• Execution Unit
• Memory Sub-system

Ports (number)
• 0,1,5: SIMD Integer
• 0,1,5: Integer
• 0,1,5: Floating Point
• SIMD/Integer MUL
• 2: Load
• 3,4: Store


Intel® Core™ Microarchitecture – Execution Core

Rename and Resources
4 uops renamed / retired per clock
• one taken branch, any # of untaken
• one fxch per cycle
Uops written to RS and ROB
• Decoded uops are renamed and allocated resources by the RAT, then sent to ROB read and the RS
• RS waits for sources to arrive, allowing OOO execution
• Registers not “in flight” are read from the ROB during RS write

[Diagram: register alias table, ALLOC, Reservation Station, Re-Order Buffer]


Intel® Core™ Microarchitecture – Execution Core

Issue Ports and Execution Units
6 dispatch ports from RS
• 3 execution ports (shared for integer / fp / simd)
• load
• store (address)
• store (data)
128-bit SSE implementation
• Port 0 has packed multiply (4 cycles SP, 5 cycles DP, pipelined)
• Port 1 has packed add (3 cycles, all precisions)
FP data has one additional cycle of bypass latency
• Do not mix SSE FP and SSE integer ops on the same register

Avoid:
addps xmm0, xmm1
pand xmm0, xmm3
addps xmm2, xmm0

Better:
addps xmm0, xmm1
addps xmm2, xmm0
pand xmm0, xmm3
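The difference between the two orderings can be made countable with a toy model: each time an instruction reads a register whose last writer executed in the other domain, one bypass crossing is charged. The C sketch below is an assumption-laden illustration, not a simulator of the real pipeline; the enum, struct and function names are hypothetical, and registers with no writer inside the sequence are treated as free of penalty.

```c
#include <stddef.h>

/* Execution domains in this toy model (names are illustrative). */
enum domain { DOM_NONE = 0, DOM_FP, DOM_SIMD_INT };

/* A two-operand SSE instruction: dst = dst OP src. */
struct op {
    enum domain dom;  /* domain the instruction executes in */
    int dst;          /* destination and first source, XMM index 0..7 */
    int src;          /* second source, XMM index 0..7 */
};

/* Count source reads whose last writer ran in a different domain. */
int count_domain_crossings(const struct op *ops, size_t n)
{
    enum domain last_writer[8] = { DOM_NONE };
    int crossings = 0;

    for (size_t i = 0; i < n; i++) {
        const int srcs[2] = { ops[i].dst, ops[i].src };
        for (int s = 0; s < 2; s++) {
            enum domain w = last_writer[srcs[s]];
            if (w != DOM_NONE && w != ops[i].dom)
                crossings++;
        }
        last_writer[ops[i].dst] = ops[i].dom;
    }
    return crossings;
}

/* "Avoid" ordering from the slide. */
const struct op avoid_seq[3] = {
    { DOM_FP,       0, 1 },  /* addps xmm0, xmm1 */
    { DOM_SIMD_INT, 0, 3 },  /* pand  xmm0, xmm3 */
    { DOM_FP,       2, 0 },  /* addps xmm2, xmm0 */
};

/* "Better" ordering from the slide. */
const struct op better_seq[3] = {
    { DOM_FP,       0, 1 },  /* addps xmm0, xmm1 */
    { DOM_FP,       2, 0 },  /* addps xmm2, xmm0 */
    { DOM_SIMD_INT, 0, 3 },  /* pand  xmm0, xmm3 */
};
```

In this model the "Avoid" ordering crosses domains twice on xmm0 (FP to integer, then integer back to FP), while the "Better" ordering crosses once, after the dependent addps has already consumed the FP result.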

Intel® Core™ Microarchitecture – Execution Core

The Out Of Order
Each uop only takes a single RS entry
• load + add dispatches twice (load, then add)
• mulps dispatches once, when load + add has written back
• sta + std dispatches twice
  • sta (address) can fire as early as possible
  • std must wait for mulps to write back
• cmpjne dispatches only once (functionality is truly fused)
  • no dependency, can fire as early as it wants

RS
cmpjne EAX, 100000, label
sta_std [EAX+240], xmm0
mulps xmm0, xmm0, xmm0
load_add xmm0, xmm0, [EAX+16]
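As a small bookkeeping aid, the dispatch counts stated above can be tabulated and summed. The C sketch below only restates the slide's numbers; the struct and identifier names are hypothetical.

```c
#include <stddef.h>

/* One RS entry and the number of times it dispatches to a port. */
struct rs_entry {
    const char *uop;
    int dispatches;
};

/* The four fused uops of the loop body, with the counts from the slide. */
const struct rs_entry loop_body[] = {
    { "load_add xmm0, xmm0, [EAX+16]", 2 },  /* load, then add */
    { "mulps xmm0, xmm0, xmm0",        1 },
    { "sta_std [EAX+240], xmm0",       2 },  /* sta, then std */
    { "cmpjne EAX, 100000, label",     1 },  /* truly fused */
};
const size_t loop_body_len = sizeof loop_body / sizeof loop_body[0];

/* Total port dispatches needed for the given RS entries. */
int total_dispatches(const struct rs_entry *e, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += e[i].dispatches;
    return total;
}
```

One loop iteration therefore occupies four RS entries but needs six dispatches, which is the saving fusion brings to RS capacity without reducing execution work.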


Intel® Core™ Microarchitecture – Execution Core

Dispatching to OOO EXE

RS (four loop iterations)
cmpjne EAX, 100000, label
sta_std [EAX+240], xmm0
mulps xmm0, xmm0, xmm0
load_add xmm0, xmm0, [EAX+16]
cmpjne EAX, 100000, label
sta_std [EAX+244], xmm0
mulps xmm0, xmm0, xmm0
load_add xmm0, xmm0, [EAX+16]
cmpjne EAX, 100000, label
sta_std [EAX+248], xmm0
mulps xmm0, xmm0, xmm0
load_add xmm0, xmm0, [EAX+16]
cmpjne EAX, 100000, label
sta_std [EAX+24C], xmm0
mulps xmm0, xmm0, xmm0
load_add xmm0, xmm0, [EAX+16]

Ports
• 0: GP (incl FP mul)
• 1: GP (incl FP add)
• 2: Load
• 3: STA
• 4: STD
• 5: GP (incl jmp)

Intel® Core™ Microarchitecture – Memory Sub-system


Advanced Memory Access
3 clk latency and 1 clk throughput of L1D; 14 and 2 for L2
Miss Latencies
• L1 miss hits L2 ~10 cycles
• L2 miss, access to memory ~300 cycles (server/FBD)
• L2 miss, access to memory ~165 cycles (Desk/DDR2)
• C step broadwater is reported to have ~50ns latency
Cache Bandwidth
• Bandwidth to cache ~8.5 bytes/cycle
Memory Bandwidth
• Desktop ~6 GB/sec/socket (linux)
• Server ~3.5 GB/sec/socket


Optimizing for Intel® Core™ Microarchitecture
Use CMP = employ both cores
• Go to multithreading!
Prefer SSE as much as possible. If you have not done so yet, vectorize the code now!
• Intel Compiler has a very good vectorization engine
Align data and keep the data layout sequential
• To align, use

__declspec(align (16)) float a[1000];
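The __declspec(align(16)) form is Microsoft-style syntax. As a hedged, portable sketch, the same 16-byte alignment can be requested with C11 alignas, and a small helper (is_aligned_16, a hypothetical name introduced here) can confirm that a pointer is suitable for aligned 128-bit SSE loads and stores.

```c
#include <stdalign.h>
#include <stdint.h>

/* C11 equivalent of the slide's declaration: a float array whose
   first element sits on a 16-byte boundary. */
alignas(16) float a[1000];

/* Returns 1 if p is 16-byte aligned, 0 otherwise. */
int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

Checking alignment at the point where buffers are created is usually cheaper than debugging a faulting aligned load later.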


Optimizing for Intel® Core™ Microarchitecture (advanced)
Use Intel VTune™ Performance Analyzer to reveal performance problems
• CPI
• Specific CPU events for Core-arch: RESOURCE_STALLS.RS_FULL, L2_IFETCH.SELF.MESI, RESOURCE_STALLS.ROB_FULL, etc.; see VTune help


Front End Issue Debugging
Look for Front End optimization only when code is FE bound
• Reservation station (RS) is the front end and allocation target
• Low RESOURCE_STALLS.RS_FULL and poor CPI should be debugged as a front end issue
  • If there are no issues in the FE, the RS should be full above 30% of the time
Front End typical issues:
• Code is too big to fit in the L1:
  • When L2_IFETCH.SELF.MESI happens every 10-15 instructions
  • Code that could have run at a CPI of 1 will be around 2
  • 14 cycles penalty for L1 demand miss
• Average instruction size above 6 bytes
  • Happens typically with SSE code, and more with EM64T
  • Can have impact only in case of otherwise excellent CPI
• Code with length changing prefix (LCP) issues
  • Penalty of 6 cycles or more
  • Look at ILD_STALL VTune event

Front-End should not be the bottleneck. Focus on Front End issues only if it is the issue.
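The triage rule above (poor CPI together with a rarely full RS) can be written down as a small check over raw counter values. This is a sketch under assumptions: the function name is hypothetical, RESOURCE_STALLS.RS_FULL is taken to count cycles, and what counts as a "poor" CPI is left to the caller because the slides give no threshold for it. Only the 30% figure comes from the slide.

```c
#include <stdint.h>

/* From the slide: with a healthy front end the RS should be full
   more than 30% of the time. */
#define RS_FULL_HEALTHY_FRACTION 0.30

/* Returns 1 if the counters suggest a front-end-bound workload:
   CPI at or above the caller's "poor" threshold while the RS is
   full less than 30% of the cycles. Returns 0 otherwise. */
int looks_front_end_bound(uint64_t cycles, uint64_t instructions,
                          uint64_t rs_full_cycles, double poor_cpi)
{
    if (cycles == 0 || instructions == 0)
        return 0;  /* no data, no verdict */

    double cpi = (double)cycles / (double)instructions;
    double rs_full = (double)rs_full_cycles / (double)cycles;

    return cpi >= poor_cpi && rs_full < RS_FULL_HEALTHY_FRACTION;
}
```

A positive result is only a hint to look at the events listed above (L2_IFETCH.SELF.MESI, ILD_STALL); it does not identify which front-end issue is responsible.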


Execution micro architecture
The busiest port may determine the potential execution speed
Single clock latency operations are best
• Different latency operations can create writeback conflicts, creating a bubble in the port
Look at the dependency chains to see the potential parallelism
• Remember that the RS has only 32 entries and only those instructions are candidates for scheduling to the execution ports
• High RESOURCE_STALLS.RS_FULL percentage if the code is latency bound
• The ROB has 96 entries
• High RESOURCE_STALLS.ROB_FULL percentage only if
  • Code has long latency instructions (L2 misses)
  • Other code can be executed while waiting

Execution stage: The key for good performance. Focus on port utilization and dependency chains


Execution micro architecture
The Divider is a big potential stall source
• DIV for the number of divide operations executed
• IDLE_DURING_DIV for the number of cycles with no port issue while the divider is busy
• Try to find some useful work to do in parallel with divide operations
Extra cycle latency for bypass between execution domains
• For example: FP (ADDPS) and logical ops (PAND) on XMMn
• DELAYED_BYPASS.FP
• DELAYED_BYPASS.LOAD
• DELAYED_BYPASS.SIMD

[Diagram: EXE ports 0,1,5 (SIMD Integer, Integer, Floating Point, integer / SIMD MUL); port 2 load, port 3 store (address), port 4 store (data); Data Cache Unit with dtlb, memory ordering, store forwarding]


Enhancements and Optimization Opportunities
IP Prefetcher
• Prefetches stride loads associated with the same IP
  • Uses History table
  • Use VTune events to identify misses where prefetches were expected
Memory Disambiguation
• Predicts when it is OK to fire a load before preceding stores with unknown address
  • Misprediction triggers pipeline flush and load restart
  • Disambiguation is temporarily disabled if it fails frequently
• LOAD_BLOCK.STA counts loads blocked by a preceding store with unknown address
  • In case not to the same address: possible reasons for not working: address collision with other load(s)


Other Opportunities for Performance Gain in the memory sub-system
4k Aliasing
• OOO engine can fire a load before a preceding store if it does not collide with the store’s address
  • Address collision serializes execution
• Address checking uses only the last 12 bits (4K)
  • False blocking if the load’s and store’s addresses have a 4KB offset
  • e.g. accessing large, power-of-two sized arrays in a loop
• Resolve 4K aliasing conflicts by changing memory layout
  • VTune event LOAD_BLOCK.OVERLAP_STORE
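Because the check uses only the low 12 address bits, candidate pairs can be screened ahead of time. The C sketch below is a simplified illustration: the function name is hypothetical, addresses are passed as integers, and access sizes and partial overlaps are ignored, so it flags only the exact case of equal low 12 bits on different addresses.

```c
#include <stdint.h>

/* Returns 1 if a load and an earlier store touch different addresses
   whose low 12 bits are equal, i.e. the pair may be falsely treated
   as colliding. Returns 0 for identical addresses (a true dependence)
   and for addresses that differ in their low 12 bits. */
int may_alias_4k(uintptr_t load_addr, uintptr_t store_addr)
{
    if (load_addr == store_addr)
        return 0;
    return ((load_addr ^ store_addr) & (uintptr_t)0xFFF) == 0;
}
```

Two arrays whose base addresses differ by an exact multiple of 4096 bytes and are walked with the same index trip this check on every iteration; padding one of them changes the low bits and removes the false conflict.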

Load block cases
• Increase the distance between the store and the dependent load, so that the store data/address is known at the time the load is dispatched
  • Store address unknown - LOAD_BLOCK.STA: loads blocked by a preceding store with unknown address
  • Store data unknown - LOAD_BLOCK.STD: loads blocked by a preceding store with unknown data
• Loads blocked until retirement - LOAD_BLOCK.UNTIL_RETIRE
  • This includes mainly uncacheable loads and split loads (loads that cross the cache line boundary)
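Whether a given access is a split load can be checked from its address and size. The C sketch below assumes a 64-byte cache line, which the slides do not state; the function name and the constant are introduced here for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size in bytes (not taken from the slides). */
#define CACHE_LINE_BYTES 64u

/* Returns 1 if an access of `size` bytes starting at `addr` touches
   two cache lines, 0 if it stays within one line or size is 0. */
int load_splits_cache_line(uintptr_t addr, size_t size)
{
    if (size == 0)
        return 0;
    uintptr_t first_line = addr / CACHE_LINE_BYTES;
    uintptr_t last_line = (addr + size - 1) / CACHE_LINE_BYTES;
    return first_line != last_line;
}
```

A 16-byte SSE load that starts 56 bytes into a line crosses into the next one, while the same load on a 16-byte aligned address never does, which is one more reason to align data as recommended earlier.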
