Computer Organization and Architecture
Pentium Processor
Pentium 4 Diagram (Simplified)
Pentium 4 Core Processor
Fetch/Decode Unit
• Fetches instructions from L2 cache
• Decodes them into micro-ops
• Stores micro-ops in L1 cache
Out-of-order execution logic
• Schedules micro-ops
• Based on data dependence and resources
• May speculatively execute
Execution units
• Execute micro-ops
• Data from L1 cache
• Results in registers
Memory subsystem
• L2 cache and system bus
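The scheduling idea on this slide (a micro-op may issue as soon as its source operands are available, even if earlier micro-ops are still waiting) can be sketched roughly as follows. This is a toy model, not Intel's scheduler: the micro-op names, the tuple layout and the issue width of 2 are assumptions for illustration, and name hazards are ignored on the assumption that register renaming has already removed them.

```python
# Toy model of dependence-based (out-of-order) scheduling.
# Each micro-op is (name, destination register, source registers).
# All names and the issue width are illustrative assumptions.

def schedule(micro_ops, ready_regs, width=2):
    """Return a list of cycles; each cycle lists the micro-ops issued in it."""
    ready = set(ready_regs)      # registers whose values are available
    pending = list(micro_ops)    # micro-ops not yet issued, in program order
    cycles = []
    while pending:
        issued = []
        for op in pending:
            name, dest, srcs = op
            # Issue only if a unit is free and every source is ready.
            if len(issued) < width and all(s in ready for s in srcs):
                issued.append(op)
        if not issued:
            raise ValueError("no micro-op can issue: unsatisfied dependence")
        for op in issued:
            pending.remove(op)
        # Results become visible to later cycles only.
        ready.update(dest for _, dest, _ in issued)
        cycles.append([name for name, _, _ in issued])
    return cycles


program = [
    ("load_a", "r1", []),            # no dependences
    ("add_b",  "r2", ["r1"]),        # needs the result of load_a
    ("load_c", "r3", []),            # independent, may overtake add_b
    ("mul_d",  "r4", ["r2", "r3"]),  # needs add_b and load_c
]
print(schedule(program, ready_regs=[]))
```

In this example load_c issues in the first cycle alongside load_a, ahead of add_b, which has to wait for r1.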
Pentium 4 Core Processor
• System bus speed: 400 MHz
• The datapath between the L2 memory cache and the L1 data cache is 256 bits wide
• The datapath between the L2 memory cache and the prefetch unit remains 64 bits wide
• 128 internal registers
• Five execution units working in parallel, plus two units for loading and storing data in RAM
• BTB (branch target buffer) was increased to 4,096 entries
Pentium 4 Core Processor
Each CPU uses its own RISC-like microinstructions, which are not publicly documented and are incompatible with the microinstructions of other CPUs. For example, Pentium III microinstructions are different from Pentium 4 microinstructions.
Intel does not disclose the depth (size) of this queue.
Pentium 4 Design Reasoning
Decodes instructions into RISC-like micro-ops before L1 cache
Micro-ops are fixed length
• Superscalar pipelining and scheduling
Pentium instructions are long and complex
Performance improved by separating decoding from scheduling and pipelining
Data cache is write back
• Can be configured to write through
L1 cache controlled by 2 bits in a control register
• CD = cache disable
• NW = not write through
• 2 instructions: one to invalidate (flush) the cache (INVD) and one to write back and then invalidate (WBINVD)
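As a rough illustration of the two control bits, the sketch below extracts CD and NW from a control register value. It assumes the register is CR0 with CD at bit 30 and NW at bit 29, as in the x86 layout; the function names and the short mode descriptions are illustrative only.

```python
# Decode the two cache-control bits from a CR0 value.
# Assumed layout: CD = bit 30, NW = bit 29 (x86 CR0).

CD_BIT = 30  # cache disable
NW_BIT = 29  # not write through

def cache_control(cr0):
    """Return (cd, nw) as 0/1 integers extracted from a CR0 value."""
    cd = (cr0 >> CD_BIT) & 1
    nw = (cr0 >> NW_BIT) & 1
    return cd, nw

def describe(cr0):
    """Short, informal description of the selected cache mode."""
    cd, nw = cache_control(cr0)
    if (cd, nw) == (0, 0):
        return "cache enabled"
    if (cd, nw) == (1, 0):
        return "cache fills disabled, write-throughs enabled"
    if (cd, nw) == (1, 1):
        return "cache fills and write-throughs disabled"
    return "invalid combination (CD=0, NW=1)"

print(describe(0))
print(describe(1 << CD_BIT))
```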
Pentium Data Types
8 bit byte
16 bit word
32 bit double word
64 bit quad word
Addressing is by 8 bit unit
A 32 bit double word is read at addresses divisible by 4
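The alignment rule for double words can be checked with a one-line test. The sketch below generalises it to the other sizes (each type aligned to a multiple of its own size), which is an assumption beyond what the slide states; the names SIZES and is_aligned are illustrative.

```python
# Sizes of the basic Pentium data types, in bytes.
SIZES = {"byte": 1, "word": 2, "doubleword": 4, "quadword": 8}

def is_aligned(address, type_name):
    """True if the byte address is a multiple of the type's size."""
    return address % SIZES[type_name] == 0

print(is_aligned(8, "doubleword"))  # True: 8 is divisible by 4
print(is_aligned(6, "doubleword"))  # False: 6 is not divisible by 4
```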
Specific Data Types
• General - arbitrary binary contents
• Integer - signed binary value
• Ordinal - unsigned integer
• Unpacked BCD - one digit per byte
• Packed BCD - 2 BCD digits per byte
• Floating point
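The difference between the two BCD forms is easiest to see on a concrete number. The sketch below is a minimal illustration; the function names are hypothetical, and the bytes are listed most significant digit first for readability rather than in the order they would occupy in memory.

```python
def to_unpacked_bcd(n):
    """One decimal digit per byte, most significant digit first."""
    if n < 0:
        raise ValueError("expected a non-negative integer")
    return bytes(int(d) for d in str(n))

def to_packed_bcd(n):
    """Two decimal digits per byte; the high nibble holds the more significant digit."""
    if n < 0:
        raise ValueError("expected a non-negative integer")
    digits = str(n)
    if len(digits) % 2:
        digits = "0" + digits  # pad to an even number of digits
    return bytes((int(digits[i]) << 4) | int(digits[i + 1])
                 for i in range(0, len(digits), 2))

print(to_unpacked_bcd(1234).hex())  # 01020304
print(to_packed_bcd(1234).hex())    # 1234
```

The same four digits take four bytes unpacked and two bytes packed.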
Pentium Floating Point Data Types
Pentium Operation Types
• Arithmetic
• Logical
• Data movement
• Control transfer
• String operations
• MMX
• Segment register
• Protection
• Cache management
Pentium Addressing Modes
• Immediate
• Register operand
• Displacement
• Base
• Base with displacement
• Scaled index with displacement
• Base with index and displacement
• Base scaled index with displacement
• Relative
Pentium Addressing Mode Calculation
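Most of the memory addressing modes listed above are special cases of one formula, effective address = base + index × scale + displacement, with unused terms set to zero. The sketch below computes only the effective address (the offset within a segment) and leaves out the segment base; the function name and the example values are illustrative assumptions.

```python
def effective_address(base=0, index=0, scale=1, displacement=0):
    """EA = base + index * scale + displacement, wrapped to 32 bits."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    return (base + index * scale + displacement) & 0xFFFFFFFF

# Base with displacement, e.g. [ebx + 8]
print(hex(effective_address(base=0x1000, displacement=8)))  # 0x1008

# Base scaled index with displacement, e.g. [ebx + esi*4 + 16]
print(hex(effective_address(base=0x1000, index=3, scale=4, displacement=16)))  # 0x101c
```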
Pentium Instruction Format
Pentium 4 Registers
EFLAGS Register
Control Registers
MMX Register Mapping
MMX uses several 64 bit data types
Uses 3 bit register address fields
• 8 registers
No MMX specific registers
• Aliasing to lower 64 bits of existing floating point registers
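One of the 64 bit MMX data types is eight packed bytes. The sketch below imitates a packed byte add with wraparound (the behaviour of the PADDB instruction) on plain Python integers; it is an illustration of the data type, not of how the hardware is built.

```python
def paddb(a, b):
    """Add two 64-bit values as eight independent bytes, with wraparound."""
    result = 0
    for lane in range(8):
        shift = lane * 8
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        result |= ((x + y) & 0xFF) << shift  # a carry never crosses into the next byte
    return result

total = paddb(0x01020304050607FF, 0x0101010101010101)
print(f"{total:016x}")  # 0203040506070800
```

The lowest byte wraps from FF to 00 without disturbing its neighbour, which is what distinguishes a packed add from an ordinary 64 bit add.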
MMX Register Mapping Diagram
Pentium 4 Diagram
BRIEF DESCRIPTION OF EACH PIPELINE STAGE
PIPELINE STAGES
BRIEF DESCRIPTION OF EACH PIPELINE STAGE
TC Nxt IP: Trace cache next instruction pointer. Looks in the BTB (branch target buffer) for the next microinstruction to be executed. This step takes two stages.
TC Fetch: Trace cache fetch. Loads this microinstruction from the trace cache. This step takes two stages.
Drive: Sends the microinstruction to be processed to the resource allocator and register renaming circuit.
BRIEF DESCRIPTION OF EACH PIPELINE STAGE
Alloc: Allocate. Checks which CPU resources will be needed by the microinstruction.
Rename: If the program uses one of the eight standard x86 registers, it is renamed to one of the 128 internal registers present on the Pentium 4. This step takes two stages.
Que: Queue. The microinstructions are put in queues according to their types (for example, integer or floating point).
Sch: Schedule. Microinstructions are scheduled for execution according to their type (integer, floating point, etc.). Before reaching this stage, all instructions are in program order. This step takes three stages.
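The Rename step can be pictured as a small lookup table from the eight x86 register names to the 128 internal registers. The sketch below is a simplified model: the class name, the initial mapping and the allocation order are assumptions, and internal registers are never returned to the free list.

```python
class RenameTable:
    """Maps the 8 x86 register names onto a pool of internal registers."""

    ARCH = ["eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"]

    def __init__(self, num_internal=128):
        # Assumption: x86 register i initially lives in internal register i.
        self.mapping = {name: i for i, name in enumerate(self.ARCH)}
        self.free = list(range(len(self.ARCH), num_internal))

    def rename(self, dest, srcs):
        """Rename one micro-op; return (internal dest, internal sources)."""
        internal_srcs = [self.mapping[s] for s in srcs]  # read current mappings first
        internal_dest = self.free.pop(0)                 # fresh register for the result
        self.mapping[dest] = internal_dest
        return internal_dest, internal_srcs


table = RenameTable()
print(table.rename("eax", ["ebx"]))  # (8, [1])
print(table.rename("eax", ["eax"]))  # (9, [8])
```

Because each write to eax gets a new internal register, two micro-ops that reuse the same x86 register name no longer conflict unless one truly needs the other's result.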
BRIEF DESCRIPTION OF EACH PIPELINE STAGE
Disp: Dispatch. Sends the microinstructions to their corresponding execution engines. This step takes two stages.
RF: Register file. The internal registers, stored in the instruction pool, are read. This step takes two stages.
Ex: Execute. Microinstructions are executed.
Flgs: Flags. The microprocessor flags are updated.
Br Ck: Branch check. Checks whether the branch taken by the program is the same one predicted by the branch prediction circuit.
Drive: Sends the results of this check to the branch target buffer (BTB) at the processor’s entrance.
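Adding up the stage counts from the descriptions gives the overall pipeline depth. In the sketch below, any step whose description does not state a count is assumed to take one stage.

```python
# (step name, number of pipeline stages it occupies)
# Steps without a stated count are assumed to take 1 stage.
STAGES = [
    ("TC Nxt IP", 2), ("TC Fetch", 2), ("Drive", 1), ("Alloc", 1),
    ("Rename", 2), ("Que", 1), ("Sch", 3), ("Disp", 2), ("RF", 2),
    ("Ex", 1), ("Flgs", 1), ("Br Ck", 1), ("Drive", 1),
]

total = sum(count for _, count in STAGES)
print(total)  # 20
```

Under that assumption the thirteen steps occupy 20 pipeline stages.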
PowerPC Processors Summary

Model          First Ship Date   Clock Speeds   L1 cache               L2 cache      Number of transistors (10^6)
601            1993              50-120 MHz     -                      -             2.8
603/603e       1994              100-300 MHz    16KB inst, 16KB data   -             1.6-2.6
604/604e       1994              166-350 MHz    32KB inst, 32KB data   -             3.6-5.1
740/750 (G3)   1997              200-366 MHz    32KB inst, 32KB data   256KB - 1MB   6.35
G4             1999              500 MHz        32KB inst, 32KB data   256KB - 1MB   -
G5             2003              2.5 GHz        64KB inst, 32KB data   512KB         58
POWER PC BLOCK DIAGRAM
PowerPC G5 Cache
L1: eight-way set associative
L2: two-way set associative (256KB, 512KB or 1MB)
L3: off-chip, up to 1MB
PowerPC Data Types
8 (byte), 16 (halfword), 32 (word) and 64 (doubleword) bit data types
Fixed point processor recognises:
• Unsigned byte, unsigned halfword, signed halfword, unsigned word, signed word, unsigned doubleword, byte string
Floating point
• IEEE 754
• Single or double precision
PowerPC Addressing Modes
Load/store architecture
• Indirect
  Instruction includes 16 bit displacement to be added to base register (may be GP register)
  Can replace base register content with new address
• Indirect indexed
  Instruction references base register and index register (both may be GP)
  EA is sum of contents
Branch address
• Absolute
• Relative
• Indirect
Arithmetic
• Operands in registers or part of instruction
• Floating point is register only
PowerPC Memory Operand Addressing Modes
lwz r3, 4(r1)      (without update)   r3 = mem[r1+4]
lwzu r3, 4(r1)     (with update)      r3 = mem[r1+4]; r1 = r1 + 4
lwzx r3, r1, r2    (without update)   r3 = mem[r1+r2]
lwzux r3, r1, r2   (with update)      r3 = mem[r1+r2]; r1 = r1 + r2
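The four load forms above can be mimicked with a small register-and-memory model to show what "with update" changes. This is a simplified sketch: the Machine class is hypothetical, memory is a dictionary keyed by byte address, and special cases of the real instructions (such as register 0 used as a base) are not modelled.

```python
class Machine:
    """Minimal model: 32 registers and a sparse word memory."""

    def __init__(self):
        self.reg = [0] * 32
        self.mem = {}  # byte address -> word value; missing entries read as 0

    def _load(self, ea):
        return self.mem.get(ea, 0)

    def lwz(self, rt, d, ra):    # indirect, without update
        self.reg[rt] = self._load(self.reg[ra] + d)

    def lwzu(self, rt, d, ra):   # indirect, with update
        ea = self.reg[ra] + d
        self.reg[rt] = self._load(ea)
        self.reg[ra] = ea        # base register now holds the effective address

    def lwzx(self, rt, ra, rb):  # indirect indexed, without update
        self.reg[rt] = self._load(self.reg[ra] + self.reg[rb])

    def lwzux(self, rt, ra, rb): # indirect indexed, with update
        ea = self.reg[ra] + self.reg[rb]
        self.reg[rt] = self._load(ea)
        self.reg[ra] = ea


m = Machine()
m.reg[1], m.reg[2] = 0x100, 8
m.mem[0x104], m.mem[0x10C] = 111, 333

m.lwz(3, 4, 1)
print(m.reg[3], hex(m.reg[1]))   # 111 0x100  (r1 unchanged)
m.lwzu(3, 4, 1)
print(m.reg[3], hex(m.reg[1]))   # 111 0x104  (r1 updated)
m.lwzux(3, 1, 2)
print(m.reg[3], hex(m.reg[1]))   # 333 0x10c  (r1 = r1 + r2)
```

The update forms are convenient for walking through an array, since the base register advances as a side effect of each load.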
PowerPC instruction format
PowerPC User Visible Registers
PowerPC Register Formats