What is the Itanium® Architecture?

Thomas Siebold, Technology Consultant, Alpha Systems Division

Agenda

• Terminology
• What is the Itanium® Architecture?

Terminology

Processor Architectures and Implementations

• Alpha Architecture
  – Implementations: EV4, EV5, EV6, EV68, EV7
• Intel Itanium® Architecture (IA64 Architecture)
  – Implementations: Merced (Itanium® processor), McKinley (Itanium® 2 processor), Madison (future Itanium® processor)
  – Together, the implementations form the Itanium® Processor Family

Itanium® Processor Family Roadmap

Intel has enhanced the Itanium® Processor Family roadmap:

– To deliver the most competitive product offerings for enterprise customers
– To pull in dual-core technology as early as possible and deliver a significant performance boost
– To maintain a consistent introduction rate on new Itanium® Processor Family product offerings

Year   Processor                                    Details
2002   Itanium® 2 Processor                         1 GHz, 3MB L3
2003   Itanium® 2 Processor (Madison & Deerfield)   1.5GHz, 6MB L3
2004   Itanium® 2 Processor (Madison 9M)            >1.5GHz, 9MB L3
2005   Montecito                                    Dual Core

Silicon process: 0.18 µm, 0.13 µm, 90 nm

Montecito processor will enable dual-core technology

– Continues PAC611 and maintains the same bus protocol
– Extends Itanium® 2 microarchitecture to 90nm process technology
– Platform release target of 2005

Roadmap maintains world class performance

Next generation processor technologies: new features!

[Figure: innovation steps in processor technology, from CISC through RISC and SuperScalar to Explicitly Parallel Instruction Computing (EPIC) and Multiple Cores & Integrated Interconnects. Processors shown: IA-32 Processor Family, MIPS 14K, SPARC-III, PA-8700, POWER4, Alpha EV68, Itanium, Itanium 2, Alpha EV7, Alpha EV79, PA-8800, PA-8800+. © 2002]

Itanium® 2 Processor

• 221M FETs
• 421 mm² die (21.6 mm × 19.5 mm)
• 90+% of the transistors and 50+% of the die area are devoted to cache and cache support logic!

What is the Itanium® Architecture?


Traditional CPU Architectures

• Performance barriers:
  – Memory latency
  – Branches
  – Loop pipelining
  – Procedure call / return overhead
• Headroom constraints:
  – Hardware-based instruction scheduling
  – Unable to efficiently schedule parallel execution
• Resource constraints:
  – Too few registers
  – Unable to fully utilize multiple execution units

EPIC – Explicitly Parallel Instruction Computing: Basic Ideas

• Static hardware design
  – Compiler creates record of execution
  – Machine plays record
  – No runtime changes like out-of-order execution
    • Instructions in bundles
    • Distribute among execution units
• High scalability of "execution units"
  – Very Long Instruction Word (VLIW) concept
  – Focus is parallelism
  – High number of execution units
    • 6 instructions in parallel (2 bundles per cycle)
• Enhancement of VLIW concepts with
  – Predication
  – Indication of parallelism in machine code
  – Speculative data loading

Improving Performance

• Itanium® architecture boosts performance by
  – allowing the compiler to provide information to the chip
  – using available compile time information
  – moving the performance burden from the microarchitecture (chip) to the compiler
• Itanium® architecture code accomplishes the following:
  – Increases instruction level parallelism (ILP)
  – Improves branch handling
  – Reduces memory access cost
  – Supports modular code

Increasing Instruction Level Parallelism

• Improving instruction level parallelism (ILP) by:
  – Compiler/assembly writer is able to explicitly indicate parallelism
    • Instruction groups
  – Three-instruction-wide word
    • Instruction bundle
    • Two executed per cycle
  – Massive resources on chip
    • Large number of registers to avoid register contention

Instruction Format: Bundles & Templates

• Bundle (128 bits)
  – Set of three instructions (41 bits each)
• Template (5 bits)
  – Identifies types of instructions in bundle
  – One of Integer, Memory, Branch, Floating, eXtended
  – Identifies independent operations ("stops") -> MM_F
  – Defines execution units to be invoked executing the bundle
• Compiler can schedule functional units to avoid contention
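The bundle arithmetic (5 template bits plus three 41-bit instruction slots giving 128 bits) can be illustrated with a minimal sketch. It assumes the template occupies the low 5 bits and slots 0 to 2 follow at increasing bit positions; the function names `pack_bundle` and `unpack_bundle` are hypothetical helpers for illustration, not part of any toolchain.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slot0, slot1, slot2):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    bundle = template
    for index, slot in enumerate((slot0, slot1, slot2)):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("slot must fit in 41 bits")
        # Assumed layout: slot 0 starts at bit 5, slot 1 at bit 46, slot 2 at bit 87.
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, (slot0, slot1, slot2))."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & SLOT_MASK
        for index in range(3)
    )
    return template, slots


if __name__ == "__main__":
    raw = pack_bundle(0x10, 1, 2, 3)
    print(unpack_bundle(raw))
```

The raw slot values here are arbitrary placeholders; real instruction encodings are not modeled.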

Explicitly Parallel Instruction Computing (EPIC)

[Figure: 128-bit instruction bundles (slots S2, S1, S0 and template T) arrive from the I-cache and are dispatched to the processor's functional units.]

• Fetch one or more bundles for execution (implementation dependent; Itanium® takes two)
• Processor functional units: 2 × MEM, 2 × INT, 2 × FP, 3 × B
• Try to execute all instructions in parallel, depending on available units
• Retired instruction bundles

Instruction Groups

• Instruction groups:
  – Set of instructions
  – No dependencies (read-after-write) within group
  – May execute in parallel
• The processor executes as many instructions per instruction group as possible, based on its resources
• Must contain at least one instruction (no upper limit)
• Instruction groups are indicated by cycle breaks (;;)

Instruction groups and bundles

  ld8 r5 = [r7]
  sub r1 = r2, r3
  add r10 = r20, r21 ;;
  add r1 = r1, r5 ;;
  st8 [r7] = r1

Instructions within a group may not have any register dependencies within the group.

;; indicates the end of a group.
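As a rough illustration of how a code generator might place the ;; stops, the sketch below starts a new group whenever an instruction reads or writes a register already written in the current group. The tuple representation and the name `split_into_groups` are assumptions made for this example, and memory dependencies are deliberately ignored.

```python
def split_into_groups(instructions):
    """Split (text, writes, reads) tuples into instruction groups.

    A stop is placed before any instruction that reads or writes a register
    written earlier in the current group (register dependencies only).
    """
    groups = []
    current = []
    written = set()
    for text, writes, reads in instructions:
        touched = set(writes) | set(reads)
        if current and (written & touched):
            groups.append(current)  # corresponds to emitting ";;"
            current = []
            written = set()
        current.append(text)
        written |= set(writes)
    if current:
        groups.append(current)
    return groups


# The five-instruction sequence from the slide.
SLIDE_CODE = [
    ("ld8 r5 = [r7]", {"r5"}, {"r7"}),
    ("sub r1 = r2, r3", {"r1"}, {"r2", "r3"}),
    ("add r10 = r20, r21", {"r10"}, {"r20", "r21"}),
    ("add r1 = r1, r5", {"r1"}, {"r1", "r5"}),
    ("st8 [r7] = r1", set(), {"r7", "r1"}),
]

if __name__ == "__main__":
    for group in split_into_groups(SLIDE_CODE):
        print(" | ".join(group), ";;")
```

Applied to the slide's sequence, this reproduces the three groups marked by the two ;; stops.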

Instruction bundles

  { .mii                  // template
    ld8 r10 = [r5]        // slot 0, Memory
    add r1 = r2, r3       // slot 1, Integer
    add r4 = r5, r6       // slot 2, Integer
  }

Instructions are fetched and executed in bundles.

Instruction groups and bundles

Itanium® and Itanium® 2 fetch 2 bundles at a time for execution. They may or may not execute in parallel.

[Figure: handwritten code (a stream of instructions separated by ;; stops) is packed by the code generator into instruction bundles, each holding three instruction slots plus a template and padded with nops where needed. The processor fetches the bundles in pairs for execution. Can the bundle pair execute in parallel?]

Code generator creates bundles, possibly including nops.

Forgetting end-of-group may be fatal:

  add r1 = r1, r5
  st8 [r7] = r1

A ;; is required between these two instructions, because st8 reads the r1 that add writes.

There are two difficulties:
1) Finding instruction triplets matching the defined templates.
2) Matching pairs of bundles that can execute in parallel.
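The first difficulty can be made concrete with a small greedy sketch. It assumes each instruction needs exactly one unit type (M, I, F or B), uses only a subset of templates without stop variants, and pads unmatched slots with nops (shown as None). The names `TEMPLATES`, `fill_template` and `make_bundles` are hypothetical; a production code generator is considerably more sophisticated.

```python
# Assumed subset of bundle templates (stop variants and MLX omitted).
TEMPLATES = ["MII", "MMI", "MFI", "MMF", "MIB", "MMB", "MFB", "MBB", "BBB"]


def fill_template(template, units, start):
    """Fill one template in program order; return (slots, instructions used).

    Each slot holds the index of the instruction placed there, or None for a nop.
    """
    slots = []
    position = start
    for unit in template:
        if position < len(units) and units[position] == unit:
            slots.append(position)
            position += 1
        else:
            slots.append(None)
    return slots, position - start


def make_bundles(units):
    """Greedily pick, for each bundle, the template consuming most instructions."""
    bundles = []
    position = 0
    while position < len(units):
        best = max(TEMPLATES, key=lambda t: fill_template(t, units, position)[1])
        slots, used = fill_template(best, units, position)
        if used == 0:
            raise ValueError("no template accepts unit type " + units[position])
        bundles.append((best, slots))
        position += used
    return bundles


if __name__ == "__main__":
    print(make_bundles("MIIMFB"))  # fits two bundles without nops
    print(make_bundles("IIM"))     # needs nop padding
```

In the second call, three instructions end up occupying two bundles, which is the kind of nop padding the slide warns about.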

Massive On Chip Resources

• Several register files visible to the programmer:
  – 128 General registers
  – 128 Floating-point registers
  – 64 Predicate registers
  – 8 Branch registers
  – 128 Application registers
  – Instruction Pointer (IP) register
  – Control Registers
  – Processor Status Register (includes slot index within current bundle)

Improving Branch Handling

What is the problem?

• Traditional CPUs:
  – Branch prediction is used to predict the most likely set of instructions
  – Correct branch prediction keeps the execution pipelines full
  – A mispredicted branch flushes the pipeline with a large penalty
• Itanium® architecture improves branch handling:
  – Provide a way to minimize branches using predicates
  – Provide support for special branch instructions (counted loop)

Branch Handling

• Predication
  – Conditional execution of instructions
  – When the predicate is true, the instruction is executed
  – When it is false, the instruction is treated as a NOP
• Predication converts a control dependency into a data dependency
• Predication eliminates branches in the code

Predication

• Traditional code:

    if (a>b) c = c + 1
    else d = d * e + f

• Avoid branch by using predicated code:

    p1, p2 = compare(a>b)
    if (p1) c = c + 1
    if (p2) d = d * e + f

  – Predicate p1 is set to 1 if the compare is true, and to 0 if it evaluates to false
  – p2 is the complement of p1
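The predicated form can be modeled in a few lines. This is only a behavioral sketch: the hypothetical helper `commit` stands in for the hardware rule that an instruction with a false predicate is treated as a NOP, and Python itself of course still evaluates a conditional internally.

```python
def branchy(a, b, c, d, e, f):
    """Traditional code: one of the two paths is chosen by a branch."""
    if a > b:
        c = c + 1
    else:
        d = d * e + f
    return c, d


def commit(predicate, new_value, old_value):
    """Model of predicated execution: keep the old value when the predicate is false."""
    return new_value if predicate else old_value


def predicated(a, b, c, d, e, f):
    """Predicated code: both operations are issued, predicates decide which one commits."""
    p1 = a > b       # p1, p2 = compare(a > b)
    p2 = not p1      # p2 is the complement of p1
    c = commit(p1, c + 1, c)        # (p1) c = c + 1
    d = commit(p2, d * e + f, d)    # (p2) d = d * e + f
    return c, d


if __name__ == "__main__":
    print(branchy(5, 3, 10, 2, 4, 1), predicated(5, 3, 10, 2, 4, 1))
    print(branchy(1, 3, 10, 2, 4, 1), predicated(1, 3, 10, 2, 4, 1))
```

Both versions return the same results for any inputs; only the control structure differs.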


Predication

• Before: the instructions c = c + 1 and d = d * e + f are control dependent on the comparison a>b
• After: they are data dependent on the values of p1 and p2
  – The predicates determine execution
  – The branch is eliminated

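Predication

[Figure: Traditional architecture: cmp a,b is followed by a branch (br NEQ) that selects either the "then" path (Y=3, then a jump to END) or the "else" path (Y=4). Itanium® architecture: cmp a,b sets the predicates pT and pF; (pT) Y=3 and (pF) Y=4 are both issued without a branch.]

Code for both paths is loaded and routed to different execution pipelines. Only one "branch" will have a valid predicate and be executed.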

Reducing Memory Access Cost

• Itanium® architecture eliminates many memory accesses through:
  – large register files to manage work in progress
  – better control of the memory hierarchy (cache hints)
• Itanium® architecture reduces the cost of the remaining memory accesses by:
  – moving load instructions earlier in the code
    • Data speculation: the execution of a load before a preceding store
    • Control speculation: the execution of a load before its guarding branch
  – This hides memory latency
  – It enables the processor to bring in the data in time
  – It avoids stalling the processor

"Data Speculation": Advanced Loads

• Load is performed before a store that logically precedes it
  – may potentially use the same address
  – also referred to as "advanced load"
  – at compile time memory addresses need to be "disambiguated" (relationship)

Traditional sequence:

  store(st_addr,data)
  load(ld_addr,target)
  use(target)

Itanium® architecture sequence:

  aload(ld_addr,target)
  /* other operations including uses of target */
  store(st_addr,data)
  acheck(target,recovery_addr)
  use(target)
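The aload/store/acheck sequence can be sketched as a small software model. The class name `AdvancedLoadTracker` and its methods are hypothetical and mirror the pseudo-operations on the slide; the model assumes the hardware remembers the address of each advanced load and that a later store to the same address invalidates that record, so the check must run recovery code.

```python
class AdvancedLoadTracker:
    """Toy model of data speculation with advanced loads."""

    def __init__(self, memory):
        self.memory = memory    # address -> value
        self.registers = {}     # register name -> value
        self.tracked = {}       # register name -> address of its advanced load

    def aload(self, addr, target):
        """Advanced load: load early and remember the address."""
        self.registers[target] = self.memory[addr]
        self.tracked[target] = addr

    def store(self, addr, data):
        """Store: invalidate every advanced load that used the same address."""
        self.memory[addr] = data
        for target in [t for t, a in self.tracked.items() if a == addr]:
            del self.tracked[target]

    def acheck(self, target, recovery):
        """Check: run the recovery code if the advanced load was invalidated."""
        if target in self.tracked:
            del self.tracked[target]
            return True
        recovery()
        return False


if __name__ == "__main__":
    cpu = AdvancedLoadTracker({0x100: 1, 0x200: 2})
    cpu.aload(0x100, "r5")
    cpu.store(0x100, 7)  # same address: the speculation fails

    def reload():
        cpu.registers["r5"] = cpu.memory[0x100]

    print(cpu.acheck("r5", reload), cpu.registers["r5"])
```

When the store hits a different address, the check succeeds and the early-loaded value is used as is.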

"Control Speculation"

• Load is performed before the branch that guards it
  – Need to check for exceptions

Traditional sequence:

  if a>b
  then load(ld_addr1,target1)
  else load(ld_addr2,target2)

Itanium® architecture sequence:

  sload(ld_addr1,target1)
  sload(ld_addr2,target2)
  /* other operations including usage of target1/target2 */
  if a>b
  then scheck(target1,recovery_addr1)
  else scheck(target2,recovery_addr2)
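The "need to check for exceptions" point can be sketched as follows. The model assumes a speculative load that would fault instead returns a deferred-exception marker (called `NAT` here), and that the check runs recovery code only if the marked value is actually needed. The names `sload`, `scheck` and `speculative_select` are illustrative, and a missing dictionary key stands in for a faulting address.

```python
NAT = object()  # marker for a deferred exception ("not a thing")


def sload(memory, addr):
    """Speculative load: defer a fault by returning NAT instead of raising."""
    try:
        return memory[addr]
    except KeyError:
        return NAT


def scheck(value, recovery):
    """Speculation check: run recovery only if the loaded value is NAT."""
    return recovery() if value is NAT else value


def speculative_select(memory, a, b, addr1, addr2):
    # Both loads are hoisted above the branch.
    target1 = sload(memory, addr1)
    target2 = sload(memory, addr2)
    # ... other operations would overlap with the load latency here ...
    if a > b:
        return scheck(target1, lambda: memory[addr1])  # recovery: plain load
    return scheck(target2, lambda: memory[addr2])


if __name__ == "__main__":
    memory = {0x10: 5}
    # The load from 0x99 would fault, but its result is never needed here.
    print(speculative_select(memory, 3, 1, 0x10, 0x99))
```

If the faulting value is needed after all, the recovery code repeats the load non-speculatively and the exception surfaces at that point.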


Massive Memory Resources

• Physical memory
  – Full implementation will address 16 EB of physical memory (2^64)
    • 16,000,000,000GB
  – Itanium® processor has a 44-bit address bus
    • 16TB (16,000GB) physical memory addressable
  – Itanium® 2 processor has a 50-bit address bus
• Virtual memory
  – Itanium® processor uses 50 bits
  – Itanium® 2 processor uses 64 bits
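The sizes quoted above follow directly from the address widths. The short sketch below computes them in binary units; the helper names `addressable_bytes` and `format_size` are assumptions for this example. It also shows that the slide's "16,000GB" and "16,000,000,000GB" are rounded figures.

```python
UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]


def addressable_bytes(address_bits):
    """Number of bytes reachable with the given address width."""
    return 1 << address_bits


def format_size(size):
    """Render an exact power-of-two byte count in the largest fitting binary unit."""
    index = 0
    while size >= 1024 and size % 1024 == 0 and index < len(UNITS) - 1:
        size //= 1024
        index += 1
    return f"{size} {UNITS[index]}"


if __name__ == "__main__":
    for bits in (44, 50, 64):
        print(bits, "bits:", format_size(addressable_bytes(bits)))
    # Exact gigabyte counts behind the rounded figures on the slide.
    print(addressable_bytes(44) // 2**30, "GiB")
    print(addressable_bytes(64) // 2**30, "GiB")
```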

Supporting Modular Code


Procedure Call Overhead

• Modular programs create more overhead
  – Programs tend to be call intensive
  – Register space shared by caller and callee
  – Call/Returns require register save/restores
  – Frequent memory access
  – Limitations due to resource shortage
• Itanium® solution
  – Massive register resources
    • Renaming, rotating
    • Integer registers stackable
  – Register Stack Engine (RSE)
  – Eliminates memory accesses
  – Allows local registers to be allocated dynamically

Register Stack

The general register file is divided into two subsets:

• Static: 32 permanent registers (r0-r31)
  – visible to all procedures
  – used for global variables
• Stacked: the 96 other registers behave like a stack
  – procedure code allocates up to 96 registers for a frame
• Frame allocation:
  – previous frame is hidden
  – first register is renamed to logical register r32
  – small frames eliminate/reduce saving/restoring registers to/from memory


Procedure Call Overhead

IA-32:

  Procedure A
    call B
  Procedure B
    save current register state
    ...
    restore previous register state
    return

Itanium® Architecture:

  Procedure A
    call B
  Procedure B
    alloc, no save!
    ...
    no restore! (remap)
    return

Register Stack Engine (RSE)

• When a procedure is called
  – New frame of registers is made available
  – Caller's register content remains in registers, invisible and inaccessible to the called procedure
  – If deep nesting exhausts physical registers, the RSE will save contents of hidden registers to memory to free up resources
  – On return to caller, caller's register content is automatically restored
• RSE works in background, utilizing unused memory bandwidth
• Activity not visible to application programs
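The call/return behavior described above can be approximated with a toy model. It is a sketch under simplifying assumptions: whole frames are spilled and filled at once, caller and callee frames do not overlap, and the class name `RegisterStack` with its methods is invented for illustration.

```python
class RegisterStack:
    """Toy model of the stacked registers r32 and up, with an RSE-like spill/fill."""

    def __init__(self, physical=96):
        self.physical = physical
        self.frames = []   # oldest first; each frame: {"regs": [...], "resident": bool}
        self.spilled = 0   # registers written to memory by the RSE
        self.filled = 0    # registers read back from memory by the RSE

    def _resident_size(self):
        return sum(len(f["regs"]) for f in self.frames if f["resident"])

    def call(self, frame_size):
        """Enter a procedure that allocates frame_size stacked registers."""
        if frame_size > self.physical:
            raise ValueError("frame larger than the stacked register file")
        for frame in self.frames:  # spill the oldest hidden frames first
            if self._resident_size() + frame_size <= self.physical:
                break
            if frame["resident"]:
                frame["resident"] = False
                self.spilled += len(frame["regs"])
        self.frames.append({"regs": [0] * frame_size, "resident": True})

    def ret(self):
        """Return to the caller; refill its frame if it had been spilled."""
        self.frames.pop()
        if self.frames and not self.frames[-1]["resident"]:
            self.frames[-1]["resident"] = True
            self.filled += len(self.frames[-1]["regs"])

    def write(self, logical, value):
        """Write logical register r<logical>; r32 is the first register of the frame."""
        self.frames[-1]["regs"][logical - 32] = value

    def read(self, logical):
        return self.frames[-1]["regs"][logical - 32]


if __name__ == "__main__":
    stack = RegisterStack()
    stack.call(40)          # procedure A
    stack.write(32, 111)
    stack.call(40)          # procedure B: fits, nothing is saved
    stack.call(40)          # procedure C: A's frame is spilled in the background
    stack.ret()
    stack.ret()             # back in A: the frame is refilled automatically
    print(stack.read(32), stack.spilled, stack.filled)
```

Shallow call chains never touch memory in this model, which is the effect the slide attributes to small frames.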


Loop Optimization Overhead

• Enhance loop performance:
  – Done by unrolling loops
  – Causes code expansion
  – Prologue/epilogue add to code size
• Itanium® solution
  – Software pipelining
  – Architecture support
    • Minimal prologue/epilogue code
    • Predication
    • Loop control registers (LC, EC)
    • Loop branches (br.ctop, br.wtop)
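How predication, the LC and EC registers and a counted loop branch work together can be sketched with a small simulation of a three-stage software pipeline (load, multiply by two, store). This is a simplified model, not the full instruction semantics: the stage predicates and data registers are plain lists that shift by one position per iteration, and the function name `run_pipelined_loop` is hypothetical.

```python
def run_pipelined_loop(src):
    """Simulate a 3-stage software-pipelined loop driven by a br.ctop-style branch.

    Returns (results, number of kernel iterations executed).
    """
    if not src:
        return [], 0
    stages = 3
    lc = len(src) - 1              # loop count: remaining new iterations to start
    ec = stages                    # epilogue count: iterations needed to drain
    preds = [True, False, False]   # stage predicates (stage 1, 2, 3)
    regs = [None, None, None]      # rotating data registers, one per stage
    results = []
    next_load = 0
    kernel_iterations = 0
    while True:
        kernel_iterations += 1
        if preds[0]:                       # stage 1: load
            regs[0] = src[next_load]
            next_load += 1
        if preds[1]:                       # stage 2: compute
            regs[1] = regs[1] * 2
        if preds[2]:                       # stage 3: store
            results.append(regs[2])
        # Counted loop branch: decide whether to start another iteration,
        # drain the pipeline, or fall through.
        if lc > 0:
            lc -= 1
            new_pred, taken = True, True
        elif ec > 1:
            ec -= 1
            new_pred, taken = False, True
        else:
            ec = 0
            new_pred, taken = False, False
        # Register rotation: every value moves on to the next stage.
        preds = [new_pred] + preds[:-1]
        regs = [None] + regs[:-1]
        if not taken:
            break
    return results, kernel_iterations


if __name__ == "__main__":
    print(run_pipelined_loop([1, 2, 3, 4]))
```

The same kernel code serves as prologue, steady state and epilogue; the predicates simply switch stages on and off, so no separate prologue or epilogue code is needed in this model.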

IA64 Instruction Peculiarities

• There is a floating point multiply and add instruction, fma (f = a*b+c).
  – A simple floating point multiply is a fma with c=0.
  – A simple floating point add is a fma with b=1.
• There is an integer multiply and add instruction, which executes in fp registers!
• There is a memory fence instruction: mf (Alpha: MB)
• There are three atomic semaphore instructions: xchg, cmpxchg and fetchadd.
• There are no load/store instructions with immediate offsets a la LDQ R1, 32(R5) on Alpha.
• There are speculative and advanced loads that do not exist on Alpha.
• The Register Stack Engine (RSE) is a powerful tool in procedure nestings.
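The relationship between fma, multiply and add can be checked numerically. The sketch below emulates a fused multiply-add for finite double-precision inputs by computing the exact result with rational arithmetic and rounding once; the function names are illustrative, and special values (infinities, NaN, signed zeros) are not handled.

```python
from fractions import Fraction


def fma(a, b, c):
    """a*b + c computed exactly and rounded once (finite inputs only)."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def fmul(a, b):
    """Plain multiply expressed as fma with c = 0."""
    return fma(a, b, 0.0)


def fadd(a, c):
    """Plain add expressed as fma with b = 1."""
    return fma(a, 1.0, c)


if __name__ == "__main__":
    print(fmul(1.5, 2.25) == 1.5 * 2.25)
    print(fadd(0.1, 0.2) == 0.1 + 0.2)
    # A fused operation keeps low-order bits that a separate multiply rounds away.
    a, b = 1.0 + 2.0**-30, 1.0 - 2.0**-30
    print(a * b - 1.0, fma(a, b, -1.0))
```

The last line illustrates why a fused operation is not merely a convenience: with a single rounding step, the small difference survives instead of collapsing to zero.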




Itanium® Architecture Training

• https://shale.intel.com/softwarecollege/

Q & A
