Architectures for Next Generation HiPC
Prof. V. Kamakoti
Reconfigurable Intelligent Systems Engineering group (RISE)
Dept. of CSE, IIT Madras
11/17/08
How Chips Have Shrunk
• 1946, at UPenn
• Measured in cubic feet
ENIAC on a Chip
• 1997: 174,569 transistors
• 7.44 mm x 5.29 mm
• 0.5 µm technology
Integrated Circuit Revolution
• 1958: First integrated circuit (germanium)
  – Built by Jack Kilby at Texas Instruments
  – Contained five components: transistors, resistors, and capacitors
• 2000: Intel Pentium 4 processor
  – Clock speed: 1.5 GHz
  – Transistors: 42 million
  – Technology: 0.18 µm CMOS
Costs Over Time
Evolution in IC Complexity
If Transistors are Counted as Seconds
4004: < 1 hr
8080: < 2 hrs
8086: 8 hrs
80286: 1.5 days
80386 DX: 3 days
80486: 13 days
Pentium: > 1 month
Pentium Pro: 2 months
P II: 3 months
P III: ~1 year
P4: ~1.5 years
Comparison of Sizes
How Small Are The Transistors?
[Chart: feature size vs. year, 1983-2007, falling from 2.0 µm (80286) through 1.0 µm (80386), then 486, Pentium, and Pentium II near 0.3-0.2 µm, down to roughly 0.1-0.03 µm (Pentium IV, Itanium); the regimes are labeled micron, sub-micron, deep-sub-micron, ultra-deep-sub-micron, and nano.]
Compare that to the diameter of a human hair: 56 µm.
Moore’s Law
• Transistors double almost every 2.3 years
  – Gordon Moore of Intel
  – Visionary prediction
  – Observed in practice for more than 3 decades
• Implication
  – More functionality
  – More complexity
  – Cost??
Processor Frequency Trends
Frequency doubles each generation
Processor Power Trends
Power Density Increase
[Chart: power density (W/cm²) on a log scale from 1 to 10,000, vs. year from 1970 to 2010; Intel processors from the 4004, 8008, 8080, 8085, 8086, 286, 386, and 486 up through Pentium and P6 climb steadily, with reference levels marked for a hot plate, a nuclear reactor, a rocket nozzle, and the Sun’s surface.]
Complexity
What can we do to bridge the gap?
• Increase frequency
• Increase voltage of operation
• Increase the amount of work done per time unit
• Increase the number of hardware units
• Use clever techniques
Increase Frequency
• Has been attempted for a long time
• Increase in frequency != better performance
• Has been around 4 GHz for almost 2 years
• Companies don’t play this number game anymore
Increase Voltage of Operation
• Has been done by overclockers
  – Mostly avid gamers
• There is a limit beyond which voltage cannot be increased: electrical breakdown
• Power = C • V^2 • f + g1 • V^3 (subthreshold leakage) + g2 • V^5 (gate leakage); a numeric sketch follows
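As a rough illustration of how the leakage terms in this model overtake switching power as voltage rises, here is a minimal sketch in host-side C++; the constants C, f, g1, and g2 are made-up values chosen only for demonstration, not figures from the slide.

    #include <cstdio>

    // Slide's model: P = C*V^2*f + g1*V^3 + g2*V^5.
    // The cubic and quintic terms are subthreshold and gate leakage.
    double power(double C, double f, double g1, double g2, double V) {
        double switching    = C * V * V * f;           // C * V^2 * f
        double subthreshold = g1 * V * V * V;          // g1 * V^3
        double gate         = g2 * V * V * V * V * V;  // g2 * V^5
        return switching + subthreshold + gate;
    }

    int main() {
        // Hypothetical constants, for illustration only.
        const double C = 1e-9, f = 3e9, g1 = 5.0, g2 = 3.0;
        for (double V = 0.8; V <= 1.45; V += 0.2)
            printf("V = %.1f V -> P = %.1f W\n", V, power(C, f, g1, g2, V));
        return 0;
    }

With these constants the V^5 gate-leakage term grows fastest, which is one reason raising the supply voltage runs out of road well before electrical breakdown.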
Increase Amount of Work Done per Time Unit
• By dedicating more resources
• Don’t waste computation
• Make hardware faster
Basis of Hyperthreading
Conventional Superscalar
[Diagram: a wide-issue front end (4 instructions per cycle) feeding pipelined functional units; different tasks fill the issue slots, and the executed job is shown across cycles.]
Opportunities Lost
• Functional units have pipeline bubbles
  – Lost opportunity to execute some other instruction
• Correspondingly, the front-end issue also has holes
  – Front end = instruction fetch, decode, out-of-order execution unit, register rename logic, etc.
Symmetric Multiprocessing to the Rescue
Multiple processors
Discussion on SMP
• Two programs are running
  – More work done per unit time
• Number of execution units doubled
  – Number of empty slots also doubled
• Execution efficiency has not improved
Multithreading
One instruction stream per slot
Multithreading
• Alleviates some of the memory latency problems
• Still has problems
  – What if the red thread waits for data from memory and there is a cache miss?
  – The yellow thread waits unnecessarily
Hyperthreading
More than one instruction stream per slot
Multicores
• Two or more processors on the same chip
• Each has an independent interface to the front-side bus
• Both the OS and the applications must support thread-level parallelism (see the sketch below)
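As a minimal sketch of what application-level thread-level parallelism looks like, here is plain C++ with std::thread; the array-summing workload and all names are just an illustration.

    #include <cstdio>
    #include <thread>
    #include <vector>

    // Each software thread sums one half of the array; on a multicore chip
    // the OS is free to schedule the two threads onto different cores.
    void partial_sum(const std::vector<int>& data, size_t lo, size_t hi,
                     long long* out) {
        long long s = 0;
        for (size_t i = lo; i < hi; ++i) s += data[i];
        *out = s;
    }

    int main() {
        std::vector<int> data(1000000, 1);
        long long s0 = 0, s1 = 0;
        size_t mid = data.size() / 2;
        std::thread t0(partial_sum, std::cref(data), size_t{0}, mid, &s0);
        std::thread t1(partial_sum, std::cref(data), mid, data.size(), &s1);
        t0.join();
        t1.join();
        printf("sum = %lld\n", s0 + s1);   // expect 1000000
        return 0;
    }

Unless the application is split into threads like this, a second core gives it nothing; that is the sense in which both the OS and the application must cooperate.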
Typically……
Challenges & Solutions
• Multiple cores face a serious programmability problem
  – Writing correct parallel programs is very difficult
  – The problem is synchronization
  – Traditional lock-based synchronization
Problem: Lock-Based Synchronization
Lock-based synchronization of shared data access is fundamentally problematic.
• Software engineering problems
  – Lock-based programs do not compose (see the sketch after this slide)
  – Performance and correctness tightly coupled
  – Timing-dependent errors are difficult to find and debug
• Performance problems
  – High performance requires finer-grain locking
  – More and more locks add more overhead
Need a better concurrency model for multi-core software.
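A minimal sketch of the composition problem, using a hypothetical Account type (all names here are illustrative): each transfer is correct in isolation, yet two concurrent transfers in opposite directions can deadlock because each acquires the two locks in a different order.

    #include <mutex>

    struct Account {
        std::mutex m;
        int balance = 0;
    };

    // Correct on its own, but it locks 'from' before 'to'. If one thread
    // runs transfer(a, b) while another runs transfer(b, a), each can take
    // its first lock and then block forever waiting for the other: deadlock.
    void transfer(Account& from, Account& to, int amount) {
        std::lock_guard<std::mutex> l1(from.m);
        std::lock_guard<std::mutex> l2(to.m);
        from.balance -= amount;
        to.balance   += amount;
    }

C++ does offer std::lock to acquire both mutexes together for this particular shape, but that fix stops working once the locks hide behind module boundaries, which is exactly the sense in which lock-based programs do not compose.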
Transactional Memory
• Transactional Memory addresses a key part of the problem
  – Makes parallel programming easier by simplifying coordination
  – Requires hardware support for performance
• Allow multiple atomic operations to proceed until they conflict (read-write / write-write on the same address); a code sketch follows the examples below
Examples: transactions T1 and T2 (time runs downward)

CONFLICT:
  T1: start_transaction; R(A); W(B); end_transaction
  T2: start_transaction; R(B); R(C); end_transaction
  T1 writes B while T2 reads B, a conflict on the same address.

NO CONFLICT:
  T1: start_transaction; R(A); R(B); R(C); end_transaction
  T2: start_transaction; R(B); end_transaction
  Both transactions only read B, so both commit.
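A minimal sketch of the commit/abort behavior in these schedules, on a single shared word; this uses ordinary C++ atomics as a stand-in rather than any real STM or HTM interface, and all names are illustrative.

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> shared_counter{0};

    // "Transaction": read the shared word, compute a new value, then try
    // to commit with a compare-exchange. If another thread wrote the word
    // in the meantime (a write-write conflict on the same address), the
    // commit fails and the transaction aborts and re-executes.
    void transactional_increment(int times) {
        for (int i = 0; i < times; ++i) {
            int seen, next;
            do {
                seen = shared_counter.load();  // start_transaction + R(counter)
                next = seen + 1;               // transaction body
            } while (!shared_counter.compare_exchange_weak(seen, next)); // commit or abort+retry
        }
    }

    int main() {
        std::thread t1(transactional_increment, 100000);
        std::thread t2(transactional_increment, 100000);
        t1.join();
        t2.join();
        printf("counter = %d\n", shared_counter.load());  // always 200000
        return 0;
    }

The compare-exchange at commit time behaves like lazy conflict detection: a conflicting write is discovered only when a transaction tries to commit, and the loser aborts and retries, which connects to the design issues on the next slide.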
TM Design Issues
• Detect conflicts: eager or lazy
• Resolve conflicts
• Commit/abort: version management
• STM and HTM: two approaches
The History
• Massively parallel processors
  – Distributed-memory machines
  – Shared-memory architectures
• Multicores: the new generation
  – Share a large amount of cache
• Research on how we can use them
The History (Contd)
• Yesterday’s software is today’s coprocessor and tomorrow’s hardware
• Examples
  – Segmentation/overlay to paging
  – Single-user to multi-user OS
  – Floating point: software, to coprocessor, to processor capability
  – Graphics cards become compute cards: GPUs
The Rise of GPUs
• A quiet revolution and potential build-up
  – Calculation: 367 GFLOPS vs. 32 GFLOPS
  – Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
  – Until last year, programmed through graphics APIs
[Chart: GFLOPS over time for NVIDIA GPUs vs. CPUs; G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800.]
• GPU in every PC and workstation: massive volume and potential impact
GeForce 8800
16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to the CPU.
[Block diagram: the host feeds an input assembler and a thread execution manager, which dispatch work across an array of SMs; each SM pairs a parallel data cache with texture and load/store units, all backed by global memory.]
G80 Characteristics
• 367 GFLOPS peak performance (25-50 times that of current high-end microprocessors)
• 265 GFLOPS sustained for typical applications
• Massively parallel: 128 cores, 90 W
• Massively threaded: sustains 1000s of threads per app
• 30-100x speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics
CUDA
• “Compute Unified Device Architecture”
• General-purpose programming model
  – User kicks off batches of threads on the GPU
  – GPU = dedicated super-threaded, massively data-parallel co-processor
• Targeted software stack
  – Compute-oriented drivers, language, and tools
• Driver for loading computation programs into the GPU
  – Standalone driver, optimized for computation
  – Interface designed for compute: graphics-free API
  – Data sharing with OpenGL buffer objects
  – Guaranteed maximum download and readback speeds
  – Explicit GPU memory management
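A minimal sketch of this “batches of threads” model, using the standard CUDA vector-add pattern (array size, block size, and names are illustrative): the host allocates device memory, copies data over, launches a grid of threads, and copies the result back.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each GPU thread computes one element of c = a + b.
    __global__ void vecAdd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *ha = new float[n], *hb = new float[n], *hc = new float[n];
        for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

        float *da, *db, *dc;                  // device (GPU) buffers
        cudaMalloc(&da, bytes);
        cudaMalloc(&db, bytes);
        cudaMalloc(&dc, bytes);
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

        // Kick off a batch of ~n threads, 256 per block.
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        vecAdd<<<blocks, threads>>>(da, db, dc, n);

        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %f\n", hc[0]);         // expect 3.0
        cudaFree(da); cudaFree(db); cudaFree(dc);
        delete[] ha; delete[] hb; delete[] hc;
        return 0;
    }

Note the explicit cudaMalloc/cudaMemcpy calls: this is the “explicit GPU memory management” the slide mentions.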
GeForce 7800 GTX Board Details
• SLI connector
• Single-slot cooling
• sVideo TV out, DVI x 2
• 16x PCI-Express
• 256 MB / 256-bit DDR3 at 600 MHz (8 pieces of 8Mx32)
Online Materials
• http://courses.ece.uiuc.edu/ece498/al1/Syllabus.html - a course on Programming Massively Parallel Processors, by David Kirk, UIUC