Architectures for Next Generation HiPC

Prof. V. Kamakoti
Reconfigurable Intelligent Systems Engineering group (RISE)
Dept. of CSE, IIT Madras
11/17/08

How Chips Have Shrunk

• 1946, at UPenn
• Measured in cubic feet

ENIAC on a Chip

• 1997
• 174,569 transistors
• 7.44 mm x 5.29 mm
• 0.5 µm technology

Integrated Circuit Revolution

• 1958: First integrated circuit (germanium)
  – Built by Jack Kilby at Texas Instruments
  – Contained five components: transistors, resistors, and capacitors
• 2000: Intel Pentium 4 processor
  – Clock speed: 1.5 GHz
  – Transistors: 42 million
  – Technology: 0.18 µm CMOS

Costs Over Time


Evolution in IC Complexity


If Transistors are Counted as Seconds

  4004         < 1 hr
  8080         < 2 hrs
  8086         8 hrs
  80286        1.5 days
  80386 DX     3 days
  80486        13 days
  Pentium      > 1 month
  Pentium Pro  2 months
  P II         3 months
  P III        ~1 year
  P4           ~1.5 years

Comparison of Sizes


How Small Are The Transistors?

[Chart: feature size (µm) vs. year, 1983-2007, from 2.0 µm down to 0.03 µm]
• 80286: micron regime (~2.0 µm)
• 80386: sub-micron (~1.0 µm)
• 486, Pentium, Pentium II: deep sub-micron (~0.3-0.2 µm)
• Pentium IV, Itanium: ultra deep sub-micron to nano (~0.1-0.03 µm)

Compare that to the diameter of a human hair: 56 µm.

Moore’s Law

• Transistors double almost every 2.3 years
  – Gordon Moore of Intel
  – Visionary prediction
  – Observed in practice for more than 3 decades
• Implication
  – More functionality
  – More complexity
  – Cost??

Processor Frequency Trends

• Frequency doubles each generation

Processor Power Trends


Power Density Increase

[Chart: power density (W/cm²) vs. year, 1970-2010, log scale]
• 4004, 8008, 8080, 8085, 8086, 286, 386, 486: at or below hot-plate levels (~10 W/cm²)
• Pentium, P6: climbing toward nuclear-reactor levels (~100 W/cm²)
• Extrapolation heads for a rocket nozzle (~1,000 W/cm²) and the Sun's surface (~10,000 W/cm²)

Complexity


What can we do to bridge the gap?

• Increase frequency
• Increase voltage of operation
• Increase the amount of work done per time unit
• Increase the number of hardware units
• Use clever techniques

Increase Frequency

• Has been attempted for a long time
• Increase in frequency != better performance
• Has stayed around 4 GHz for almost 2 years
• Companies don’t play this number game anymore

Increase Voltage of Operation

• Has been done by overclockers, mostly avid gamers
• There is a limit beyond which voltage cannot be increased – electrical breakdown
• Power = C·V²·f + g1·V³ + g2·V⁵
  – C·V²·f: dynamic switching power
  – g1·V³: subthreshold leakage
  – g2·V⁵: gate leakage

Increase Amount of Work Done per Time Unit

• By dedicating more resources
• Don’t waste computation
• Make hardware faster

Basis of Hyperthreading


Conventional Superscalar

[Figure: wide issue (4 instructions per cycle) – different tasks flow through pipelined functional units to form the executed job]

Opportunities Lost

• Functional units have pipeline bubbles
  – Lost opportunity to execute some other instruction
• Correspondingly, the front-end issue slots also have holes
  – Front end = instruction fetch, decode, out-of-order execution unit, register rename logic, etc.

Symmetric Multiprocessing to the Rescue

[Figure: multiple processors]

Discussion on SMP

• Two programs are running
  – More work done per unit time
• Number of execution units doubled
  – Number of empty slots also doubled
• Execution efficiency has not improved

Multithreading

• One instruction stream per slot

Multithreading

• Alleviates some of the memory latency problems
• Still has problems
  – What if the red thread waits for data from memory on a cache miss? The yellow thread waits unnecessarily.

Hyperthreading

• More than one instruction stream per slot

Multicores

• Two or more processors on the same chip
• Each has an independent interface to the front-side bus
• Both the OS and the applications must support thread-level parallelism

Typically…

Challenges & Solutions

• Multiple cores face a serious programmability problem
  – Writing correct parallel programs is very difficult
  – The problem is synchronization
  – Traditional lock-based synchronization

Problem: Lock-Based Synchronization

Lock-based synchronization of shared-data access is fundamentally problematic.

• Software engineering problems
  – Lock-based programs do not compose
  – Performance and correctness are tightly coupled
  – Timing-dependent errors are difficult to find and debug
• Performance problems
  – High performance requires finer-grain locking
  – More and more locks add more overhead

Need a better concurrency model for multi-core software.

Transactional Memory

• Transactional Memory addresses a key part of the problem
  – Makes parallel programming easier by simplifying coordination
  – Requires hardware support for performance
• Allow multiple atomic operations to proceed until they conflict (read-write / write-write on the same address)

Examples (transactions T1 and T2, time running downward):

CONFLICT:
  T1: start_transaction, R(A), W(B), end_transaction
  T2: start_transaction, R(B), R(C), end_transaction
T1 writes B while T2 reads B – a read-write conflict on the same address.

NO CONFLICT:
  T1: start_transaction, R(A), R(B), end_transaction
  T2: start_transaction, R(B), R(C), end_transaction
Both transactions only read B, so both can commit.

TM Design Issues

• Detect conflicts: eager or lazy
• Resolve conflicts
• Commit/abort: version management
• STM and HTM: two approaches

The History

• Massively parallel processors
  – Distributed memory machines
  – Shared memory architectures
• Multicores: the new generation
  – Share a large amount of cache
• Research on how we can use them

The History (Contd.)

• Yesterday’s software is today’s coprocessor and tomorrow’s hardware
• Examples
  – Segmentation overlay to paging
  – Single-user to multi-user OS
  – Floating point: soft, to coprocessor, to processor capability
  – Graphics cards become compute cards: GPUs

The Rise of GPUs

• A quiet revolution and potential build-up
  – Calculation: 367 GFLOPS vs. 32 GFLOPS
  – Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
  – Until last year, programmed through graphics APIs
• [Chart: GFLOPS over time] G80 = GeForce 8800 GTX; G71 = GeForce 7900 GTX; G70 = GeForce 7800 GTX; NV40 = GeForce 6800 Ultra; NV35 = GeForce FX 5950 Ultra; NV30 = GeForce FX 5800
• GPU in every PC and workstation: massive volume and potential impact

GeForce 8800

16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to the CPU.

[Block diagram: Host → Input Assembler → Thread Execution Manager → arrays of streaming multiprocessors, each with a parallel data cache and texture units, connected through load/store units to global memory]

G80 Characteristics

• 367 GFLOPS peak performance (25-50 times current high-end microprocessors)
• 265 GFLOPS sustained for typical applications
• Massively parallel: 128 cores, 90 W
• Massively threaded: sustains 1000s of threads per app
• 30-100x speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics

CUDA

• “Compute Unified Device Architecture”
• General-purpose programming model
  – User kicks off batches of threads on the GPU
  – GPU = dedicated super-threaded, massively data-parallel co-processor
• Targeted software stack
  – Compute-oriented drivers, language, and tools
• Driver for loading computation programs into the GPU
  – Standalone driver, optimized for computation
  – Interface designed for compute: graphics-free API
  – Data sharing with OpenGL buffer objects
  – Guaranteed maximum download & readback speeds
  – Explicit GPU memory management

GeForce 7800 GTX Board Details

• SLI connector
• Single-slot cooling
• sVideo TV out
• DVI x 2
• 16x PCI Express
• 256 MB / 256-bit DDR3 at 600 MHz (8 pieces of 8M x 32)

Online Materials

• http://courses.ece.uiuc.edu/ece498/al1/Syllabus.html – "Programming Massively Parallel Processors", a course by David Kirk, UIUC
