Low Power Design Strategies
Daniele Folegnani
Talk outline
• Why Low Power is Important
• Power Consumption in CMOS Circuits
• New Trends for Future Microprocessors
• Low Power Strategies
• Power Consumption Evaluation of a Superscalar Processor
• An Architectural Technique to Reduce the Power Consumption of the Issue Logic
• Conclusions
Why Low Power is Important
High performance microprocessors
• PowerPC704 consumes 85 Watt
• Alpha 21364 consumes 100 Watt
Problems involved: thermal runaway, gate dielectric, junction fatigue, electromigration diffusion, electrical parameters shift, silicon interconnections fatigue, package related failure.
THE FUNCTIONALITY AND THE CLOCK SPEED CAN BE LIMITED
[Figure: Thermal and power dissipation costs. Total dissipation cost (0 to 60) versus power in Watt (0 to 60); labels: CPU, 1$/1W.]
Low performance processors
• High demand for portable devices (mobile phones, laptops, smart cards, videogames, etc.) >>> 95% of production !!!
• Extensive use of multimedia features
Problems involved: >>> Battery life !!!
Battery energy will not grow drastically in the near future, for technology and safety reasons (today's batteries have the same energy as a grenade !!!)
• One of the selling points is: hours of use and hours of standby
• Need for techniques that improve energy efficiency without penalizing performance
Power Consumption in CMOS Circuits
• Static
  • Theoretically 0; in practice leakage and threshold currents exist in transistors
• Dynamic
  • Transients (the linear zone)
  • Capacitance switching: THE MOST IMPORTANT FACTOR
$$ P = \frac{1}{2} C V^{2} f $$
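As a quick illustration of the capacitance-switching formula above, the following minimal sketch evaluates P = 1/2 C V^2 f. The function name and the operating point (1 nF, 1.8 V, 500 MHz) are hypothetical values chosen only for the example; they do not come from the slides.

```python
def dynamic_power(capacitance_f: float, voltage_v: float, frequency_hz: float) -> float:
    """Capacitance-switching power P = 1/2 * C * V^2 * f, in watts."""
    return 0.5 * capacitance_f * voltage_v ** 2 * frequency_hz


# Hypothetical operating point (not from the slides): 1 nF switched, 1.8 V, 500 MHz.
base = dynamic_power(1e-9, 1.8, 500e6)

# Same circuit with the supply voltage scaled by 0.7: power scales by 0.7^2.
scaled = dynamic_power(1e-9, 1.8 * 0.7, 500e6)
ratio = scaled / base

print(f"base = {base:.2f} W, ratio after voltage scaling = {ratio:.2f}")
```

Because the voltage enters quadratically, lowering V is the most effective single lever in this formula.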
New Trends for Future Microprocessors
[Figure: Power and Perf versus microarch complexity.]
Moore's Law: the number of transistors doubles every 18 months
Power is proportional to DIE AREA and FREQUENCY
• In the same technology, a new architecture takes 2-3X the die area
• Changing technology implies 2X frequency
SCALING TECHNOLOGY ...
• Decreasing voltage (0.7 scaling factor)
• Decreasing die area (0.5 scaling factor)
• Increasing C per unit area by 43% !!!
This implies that the power density increases by 40% every generation !!!
Temperature is a function of power density and determines the type of cooling system needed.
VARIABLES
• PEAK POWER (worst case): today's packages can sustain a power dissipation over 100W for up to 100msec >>> cheaper package if peaks are reduced
• ENERGY SPENT (for a workload): more correlated to battery life
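The 40% per-generation growth in power density quoted above can be reproduced from the listed scaling factors. The sketch below assumes that power density is proportional to (C per unit area) x V^2 x f, which follows from P = 1/2 C V^2 f; the constant and function names are illustrative only.

```python
# Per-generation scaling factors quoted on the slide.
VOLTAGE_SCALE = 0.7        # supply voltage
FREQUENCY_SCALE = 2.0      # clock frequency
CAP_PER_AREA_SCALE = 1.43  # capacitance per unit area (+43%)


def power_density_scale(cap_scale: float, v_scale: float, f_scale: float) -> float:
    """Scaling of power per unit area, assuming density ~ C_area * V^2 * f."""
    return cap_scale * v_scale ** 2 * f_scale


density_growth = power_density_scale(CAP_PER_AREA_SCALE, VOLTAGE_SCALE, FREQUENCY_SCALE)
print(f"power density scales by {density_growth:.2f}x per generation")
```

Under this assumption the three factors multiply to about 1.40, which matches the 40% figure.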
Low Power Strategies
• OS level: PARTITIONING, POWER DOWN
• Software level: REGULARITY, LOCALITY, CONCURRENCY (compiler technology for low power, instruction scheduling)
• Architecture level: PIPELINING, REDUNDANCY, DATA ENCODING (ISA, architectural design, memory hierarchy, HW extensions, etc.)
• Circuit/logic level: LOGIC STYLES, TRANSISTOR SIZING, ENERGY RECOVERY (logic families, conditional clocking, adiabatic circuits, asynchronous design)
• Technology level: threshold reduction, multi-threshold devices, etc.
Power Consumption Estimation
[Figure: Error estimation and power consumption across levels of abstraction (Arch, RTL, Circuit, Layout).]
The architectural estimation has a relatively high error rate (no view of the total area, circuit types, technology, block activity, etc.)
IMPORTANT DESIGN DECISIONS MUST BE MADE AT ARCHITECTURAL LEVEL
• Accurate power evaluation is done at late design phases
• Need for good feedback between all the design phases
  - Correlation between power estimations from low level to high level: TRY TO IMPROVE ACCURACY AT HIGH LEVEL
  - Critical path based power consumption analysis (CIRCUIT TYPES, TECHNOLOGY, ACTIVITY FACTOR)
  - Thermal image based correlation analysis (HOTTEST SPOTS LOCATION, COOLEST SPOTS LOCATION, TEMPERATURE DIFFERENCES, TEMPERATURE DISTRIBUTION)
Architectural Power Evaluation [ G.Cai, Intel ]
• Architectural design partition
• Power consumption evaluation at block level
  - Power density of blocks (SPICE simulation, statistical input set, technology and circuit types definition)
  - Activity of blocks and sub-blocks (running benchmarks)
  - Area (feedback from VLSI design, circuits and technology defined)
• TRY TO DEFINE SCALING FACTORS THAT ALLOW THE ARCHITECTURAL POWER SIMULATOR TO BE REMAPPED WHEN TECHNOLOGY, AREA AND CIRCUIT TYPES CHANGE
• TRY TO REDUCE THE ESTIMATION ERROR AT HIGH LEVEL
POW OUT ORDER
• Technology assumed: CMOS 0.18 micron
• 5 types of circuit logic (static, dynamic, SRAM, clock distribution, PLA)
• 32 architectural blocks and area associated
• Blocks built with custom design
• Two types of power density (active and inactive power density)
$$ P_j(i) = \sum_k a_k(i) \, A_k \, APD_k + \sum_k \bigl(1 - a_k(i)\bigr) \, A_k \, IPD_k $$
$$ E_j = \sum_i P_j(i) $$
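The two formulas above can be sketched in a few lines. This is only an illustration under one reading of the symbols, which the slides do not spell out: k indexes architectural blocks, i indexes cycles, a_k(i) is the activity of block k in cycle i (between 0 and 1), A_k is its area, and APD_k and IPD_k are its active and inactive power densities. The `Block` class, the function names and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Block:
    area: float  # A_k
    apd: float   # active power density APD_k
    ipd: float   # inactive power density IPD_k


def cycle_power(blocks: Sequence[Block], activity: Sequence[float]) -> float:
    """P(i): each block contributes active power for the active fraction
    of its area and inactive power for the remainder."""
    return sum(
        a * b.area * b.apd + (1.0 - a) * b.area * b.ipd
        for b, a in zip(blocks, activity)
    )


def energy(blocks: Sequence[Block], activity_trace: List[Sequence[float]]) -> float:
    """E: sum of the per-cycle power over the whole run."""
    return sum(cycle_power(blocks, act) for act in activity_trace)


# Hypothetical two-block design and a two-cycle activity trace.
blocks = [Block(area=2.0, apd=1.0, ipd=0.1), Block(area=1.0, apd=4.0, ipd=0.5)]
trace = [[1.0, 0.0], [0.5, 1.0]]
total = energy(blocks, trace)
print(f"energy over the trace = {total:.1f}")
```

In a real simulator the activity trace would come from running benchmarks, and the areas and densities from the VLSI feedback described earlier.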
Power Consumption Evaluation of a Superscalar Processor
Architectural parameters:
• 4 instr. fetch, issue and commit
• 128 entries instruction queue size
• I-Cache 128Kbytes, direct mapped, 32 byte line, 1 cycle hit, 3 cycle miss
• D-Cache 128Kbytes, 4 way set assoc., 32 byte line, 1 cycle hit, 3 cycle miss
• UL2-Cache, 1024Kbytes, 4 way set assoc., 64 byte line, 3 cycle hit
• Combined predictor of 1K entries, with Gshare with 1K 2-bit counters, 8 bit global history, and bimodal pred. of 2K entries with 2-bit counters
• 4 intALU, 4 fpALU, 1 int mul/div, 1 fp mul/div
• Out of order issue, oldest ready first selection policy
Block          applu      swim       tomcatv    wave       su2cor     hydro4     Avg (%)
inst dec       340.952    336.48     351.133    341.916    344.195    349.875    2.751
BTB            143.346    119.679    195.688    149.563    156.424    187.337    1.269
TLB            63.492     53.009     86.676     66.246     69.284     82.977     0.562
IL1            677.202    565.397    924.48     706.574    738.986    855.03     5.955
DL1            621.026    518.495    847.79     647.961    677.684    811.613    5.497
UL2            1353.916   1130.387   1848.292   1412.638   1477.439   1769.421   11.986
rename table   1627.983   1672.283   1725.565   1668.854   1724.716   1738.017   13.539
instr queue    3124.82    3136.195   3282.977   3160.858   3170.821   3269.56    25.201
ROB            3429.455   3445.777   3394.513   3489.683   3221.504   3348.813   27.099
int FU         111.612    109.172    110.205    112.288    103.285    108.362    0.873
fp FU          147.722    144.934    145.859    148.617    136.701    143.42     1.156
I/O logic      244.201    203.884    333.371    254.793    266.481    319.145    2.161
Other          189.273    180.103    214.89     192.667    200.816    242.276    1.951
Total          12075.833  11615.795  13461.439  12352.658  12288.336  13225.846  100
Block          perl       li         m88ksim    vortex     compress   gcc        Avg (%)
inst dec       345.874    334.607    355.221    346.63     333.885    349.317    2.761
BTB            164.525    114.926    107.751    169.376    108.817    109.063    1.045
TLB            72.873     50.904     47.726     75.021     48.198     84.184     0.511
IL1            777.26     542.941    509.046    800.174    514.082    897.905    5.456
DL1            712.783    497.902    466.819    733.796    471.437    823.42     5.004
UL2            1490.043   1040.843   1017.726   1412.638   1477.439   1795.162   11.117
rename table   1812.117   1879.014   1742.077   1999.771   1027.793   1773.833   13.819
instr queue    3351.535   3420.231   3214.888   3645.328   3106.252   2906.202   26.524
ROB            3247.036   3227.355   3315.514   3558.173   3225.669   3499.98    27.104
int FU         105.166    100.344    104.111    114.311    103.075    109.19     0.859
fp FU          139.19     132.808    137.794    151.294    136.423    144.517    1.136
I/O logic      280.028    195.786    183.564    288.545    185.38     323.788    1.967
Other          267.57     227.973    178.148    364.64     403.729    549.925    2.697
Total          12766.783  11360.385  11360.385  13659.697  11142.179  13366.486  100
An Architectural Technique to Reduce the Power Consumption of the Issue Logic
• IQ + ROB are responsible for about 53% of the power consumption
• The cache hierarchy is not the most important power consumption factor in the superscalar paradigm
• Power consumption is almost independent of the instruction mix
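The "about 53%" figure can be checked against the Avg (%) columns of the two power tables above. The sketch below copies the instruction queue and ROB averages from those tables; the suite labels (SpecFP for the applu ... hydro4 table, SpecINT for the perl ... gcc table) and the equal weighting of the two suites are assumptions made for the illustration.

```python
# Avg (%) share of total power, copied from the two per-block power tables.
AVG_SHARE = {
    "SpecFP": {"instr queue": 25.201, "ROB": 27.099},
    "SpecINT": {"instr queue": 26.524, "ROB": 27.104},
}


def issue_logic_share(shares: dict) -> float:
    """Combined share of the instruction queue and the reorder buffer."""
    return shares["instr queue"] + shares["ROB"]


per_suite = {suite: issue_logic_share(s) for suite, s in AVG_SHARE.items()}

# Assumption: both suites are weighted equally.
overall = sum(per_suite.values()) / len(per_suite)
print(per_suite, f"overall = {overall:.1f}%")
```

Both suites land between 52% and 54%, consistent with the slide's claim.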
TRENDS IN SUPERSCALAR
• Increasing issue width (IW)
• The size of the instruction window grows more than linearly with respect to the IW
• The area of the IQ grows more than linearly with respect to the number of entries
IQ power contribution may grow in the future
Every cycle the wakeup logic broadcasts the result tags through the result buses to all the entries, and each entry compares them with its own operand tags to find a match.
THE ISSUE ENGINE SPENDS A LARGE AMOUNT OF POWER EVERY CYCLE ONLY TO CHECK WHETHER SOME INSTRUCTIONS ARE AVAILABLE FOR EXECUTION
Considering
• Periods of execution with high parallelism: just a subpart of the IQ may satisfy the IW
• Periods of execution with poor parallelism: some parts of the IQ may not provide any useful instruction ready to execute
The issue engine is very power inefficient
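To give a feel for why the wakeup broadcast is costly, the sketch below counts tag comparisons per cycle. It is a rough illustration, not a model from the slides: it assumes two source operand tags per queue entry and one result tag per issued instruction (4 per cycle for the 4-wide machine described earlier), and the function name is hypothetical.

```python
def wakeup_comparisons(active_entries: int, result_tags: int,
                       operands_per_entry: int = 2) -> int:
    """Tag comparisons per cycle when every result tag is broadcast
    to every operand of every active queue entry."""
    return active_entries * operands_per_entry * result_tags


# Assumed: 2 operand tags per entry, 4 result tags broadcast per cycle.
full_queue = wakeup_comparisons(128, 4)  # all 128 entries listening
half_queue = wakeup_comparisons(64, 4)   # only half of the queue enabled

print(full_queue, half_queue)
```

Under these assumptions the comparison count scales linearly with the number of entries that take part in wakeup, which is what makes disabling part of the queue attractive.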
[Figure: Issue in the window, SpecFP. Percentage (0 to 100) per benchmark (APPLU, HYDRO, SU2COR, SWIM, TOMCATV, WAVE5); series: 1 part, 2 part, 3 part, 4 part.]
[Figure: Issue in the window, SpecINT. Percentage (0 to 100) per benchmark (COMPRESS, GCC, LI, M88KSIM, PERL, VORTEX); series: 1 part, 2 part, 3 part, 4 part.]
[Figure: Commit in the window, SpecFP. Percentage (0 to 100) per benchmark (APPLU, HYDRO, SU2COR, SWIM, TOMCATV, WAVE5); series: 1 part, 2 part, 3 part, 4 part.]
[Figure: Commit in the window, SpecINT. Percentage (0 to 100) per benchmark (COMPRESS, GCC, LI, M88KSIM, PERL, VORTEX); series: 1 part, 2 part, 3 part, 4 part.]
Dynamically Resizing the Instruction Queue
• We propose a run-time mechanism that adapts the size of the IQ based on its contribution to IPC
• We avoid the wake-up function in the parts that are temporarily disabled
• Resize decisions are commit based
IQ implemented as a circular FIFO with head and tail pointers, no collapsing
What we do is ...
Partition the queue into 16 parts of 8 entries
Define a new pointer for the queue, called the limit pointer
• At start time it has the same value as the head pointer, and it is updated together with the head pointer
• When a resize decision is made, an offset (one portion) is added to or subtracted from it
• The zone between the head and the limit pointer is the disabled zone (no wake-up)
• If the tail grows past the limit, we allow the correct wake-up and we stop the insertion until the limit reaches the tail
Heuristic to reduce size
• Collect statistics about the instructions committed from the youngest portion of the queue every quantum of time (1000 cycles). We propose to insert a bit in each ROB entry that is set at dispatch time if the physical position of the instruction in the IQ is in the current youngest part
• The resize decision is threshold-based >>> 0.025 of IPC in the current portion
• No limit to cut
Heuristic to increase size
• Blind >>> grow by one portion every 5 quanta and let the cut heuristic decide whether the decision was correct or not (period of high parallelism or not)
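The two heuristics above can be sketched as a small controller that is called once per quantum. This is a minimal illustration under stated assumptions, not the authors' implementation: the threshold is read as the IPC contributed by instructions committed from the youngest portion, the shrink check runs before the blind growth when both fall in the same quantum, and at least one portion is always kept enabled. The class and method names are hypothetical.

```python
from dataclasses import dataclass

PORTION_ENTRIES = 8     # entries per portion (from the slides)
NUM_PORTIONS = 16       # portions in the 128-entry queue (from the slides)
QUANTUM_CYCLES = 1000   # cycles per quantum (from the slides)
IPC_THRESHOLD = 0.025   # shrink threshold (from the slides)
GROW_PERIOD = 5         # blind growth every 5 quanta (from the slides)


@dataclass
class IQResizer:
    active_portions: int = NUM_PORTIONS
    quanta_seen: int = 0

    def end_of_quantum(self, youngest_commits: int) -> int:
        """Update the enabled size at the end of a quantum.

        youngest_commits: instructions committed in this quantum whose
        ROB bit marks them as dispatched into the youngest portion.
        Returns the number of enabled IQ entries.
        """
        self.quanta_seen += 1
        ipc_contribution = youngest_commits / QUANTUM_CYCLES
        # Shrink: the youngest portion contributes too little IPC.
        # Assumption: keep at least one portion enabled.
        if ipc_contribution < IPC_THRESHOLD and self.active_portions > 1:
            self.active_portions -= 1
        # Grow blindly every GROW_PERIOD quanta; a later shrink undoes it
        # if the extra portion turns out not to be useful.
        if self.quanta_seen % GROW_PERIOD == 0 and self.active_portions < NUM_PORTIONS:
            self.active_portions += 1
        return self.active_portions * PORTION_ENTRIES


# Low-parallelism phase: the youngest portion never commits anything.
idle = IQResizer()
idle_sizes = [idle.end_of_quantum(0) for _ in range(5)]

# Useful youngest portion: no shrinking, one blind growth at quantum 5.
busy = IQResizer(active_portions=4)
busy_sizes = [busy.end_of_quantum(100) for _ in range(5)]

print(idle_sizes, busy_sizes)
```

The asymmetry is deliberate in the scheme described above: shrinking is driven by measurement, while growth is periodic and unconditional, so the queue can recover when a high-parallelism phase begins.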
Results
[Figure: IPC (0 to 3.500) per benchmark (Applu, Swim, Tomcatv, Wave, Su2cor, Hydro, Perl, Li, M88ksim, Vortex, Compress, Gcc, Avg); series: 128, dynamic, 64.]
Benchmark   % PowSav   % TPowSav   Avg entries
Applu       62.907     16.276      47.5
Swim        30.000     8.100       89.6
Tomcatv     58.613     14.291      52.9
Wave        41.989     10.721      74.2
Su2cor      61.821     15.946      48.9
Hydro       66.129     16.339      43.3
Perl        55.379     14.517      57.1
Li          59.847     17.383      51.4
M88ksim     65.890     18.620      43.6
Vortex      61.084     16.278      49.8
Compress    65.658     18.287      43.9
Gcc         60.243     13.088      50.9
Avg         57.463     14.987      54.4
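The slides do not define the two saving columns. One reading that fits the numbers is that % PowSav is the saving within the instruction queue and % TPowSav is the resulting saving in total processor power, so that TPowSav is roughly PowSav multiplied by the instruction queue's share of total power. The sketch below checks this inferred relationship for two benchmarks, using the instr queue and Total rows of the per-block power table; the function name is hypothetical.

```python
def total_saving_pct(iq_saving_pct: float, iq_power: float, total_power: float) -> float:
    """Total-power saving implied by a saving inside the instruction queue,
    assuming the rest of the processor is unaffected."""
    return iq_saving_pct * iq_power / total_power


# (% PowSav, instr queue power, total power), all taken from the tables.
applu = total_saving_pct(62.907, 3124.82, 12075.833)
swim = total_saving_pct(30.000, 3136.195, 11615.795)

print(f"applu: {applu:.3f}% (table: 16.276), swim: {swim:.3f}% (table: 8.100)")
```

Both values agree with the % TPowSav column to within rounding, which supports this interpretation without proving it.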
Conclusions
• Power consumption is a new constraint in the design of computer systems, like cost and performance
• The problem must be attacked from different levels of abstraction
• Power decisions must be made at the early steps of the design
• There is a need for power estimation models and tools, especially at architectural level
Q&A ?