Copperhead: A Python-like Data Parallel Language & Compiler

Bryan Catanzaro, UC Berkeley
Michael Garland, NVIDIA Research
Kurt Keutzer, UC Berkeley

Universal Parallel Computing Research Center
University of California, Berkeley

Intro to CUDA

Overview:
• Multicore/Manycore
• SIMD
• Programming with millions of threads

2/28

The CUDA Programming Model

• CUDA is a recent programming model, designed for
  ▪ Manycore architectures
  ▪ Wide SIMD parallelism
  ▪ Scalability
• CUDA provides:
  ▪ A thread abstraction to deal with SIMD
  ▪ Synchronization & data sharing between small groups of threads
• CUDA programs are written in C + extensions
• OpenCL is inspired by CUDA, but HW & SW vendor neutral
  ▪ Programming model essentially identical

3/28
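As a concrete illustration of the thread abstraction (a minimal hand-written sketch, not taken from these slides): every thread runs the same scalar code, and a thread block is the "small group" whose members can share data through __shared__ memory and synchronize with __syncthreads().

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread handles one element; threads within a block stage data in
// shared memory, synchronize, and then read a neighbor's value.
__global__ void sumWithRightNeighbor(const float* in, float* out, int n)
{
    __shared__ float tile[256];                      // one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's element
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                                 // the whole block waits here

    if (i < n) {
        // Use the value written by the neighboring thread in the same block
        // (fall back to our own value at a block or array boundary).
        float right = (i + 1 < n && threadIdx.x + 1 < blockDim.x)
                          ? tile[threadIdx.x + 1]
                          : tile[threadIdx.x];
        out[i] = tile[threadIdx.x] + right;
    }
}

int main()
{
    const int n = 1000;
    std::vector<float> h_in(n, 1.0f), h_out(n);
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch one 256-thread block per 256 elements.
    sumWithRightNeighbor<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[0] = %f\n", h_out[0]);   // 2.0 for an all-ones input
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}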

Multicore and Manycore

• Multicore: yoke of oxen
  ▪ Each core optimized for executing a single thread
• Manycore: flock of chickens
  ▪ Cores optimized for aggregate throughput, deemphasizing individual performance

4/28

Multicore & Manycore, cont.

Specifications         | Core i7 960                                  | GTX285
Processing Elements    | 4 cores, 4-way SIMD @ 3.2 GHz                | 30 cores, 8-way SIMD @ 1.5 GHz
Resident Threads (max) | 4 cores, 2 threads, 4-wide SIMD: 32 strands  | 30 cores, 32 SIMD vectors, 32-wide SIMD: 30,720 strands
SP GFLOP/s             | 102                                          | 1080
Memory Bandwidth       | 25.6 GB/s                                    | 159 GB/s
Register File          | -                                            | 1.875 MB
Local Store            | -                                            | 480 kB

5/28

SIMD: Neglected Parallelism

• It is difficult for a compiler to exploit SIMD
  ▪ How do you deal with sparse data & branches?
  ▪ Many languages (like C) are difficult to vectorize
  ▪ Fortran is somewhat better
• Most common solution:
  ▪ Either forget about SIMD
    ▪ Pray the autovectorizer likes you
  ▪ Or instantiate intrinsics (assembly language)
    ▪ Requires a new code version for every SIMD extension

6/28

What to do with SIMD?

[Figure: 4-way SIMD vs. 16-way SIMD lanes]

• Neglecting SIMD in the future will be more expensive
  ▪ AVX: 8-way SIMD, Larrabee: 16-way SIMD
• This problem composes with thread-level parallelism

7/28

CUDA

• CUDA addresses this problem by abstracting both SIMD and task parallelism into threads
  ▪ The programmer writes a serial, scalar thread with the intention of launching thousands of threads
• Being able to launch 1 million threads changes the parallelism problem
  ▪ It's often easier to find 1 million threads than 32: just look at your data & launch a thread per element
• CUDA is designed for data parallelism
  ▪ Not coincidentally, data parallelism is the only way for most applications to scale to 1000(+)-way parallelism

8/28

Hello World
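The slide's code listing did not survive this text extraction. As an illustrative stand-in only (not the original slide's code), a minimal CUDA "hello world" on hardware that supports device-side printf (compute capability 2.0 or later) might look like:

#include <cstdio>
#include <cuda_runtime.h>

// Every launched thread executes this same scalar function.
__global__ void hello()
{
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main()
{
    hello<<<2, 4>>>();          // a grid of 2 blocks, each with 4 threads
    cudaDeviceSynchronize();    // wait for the GPU before the host exits
    return 0;
}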

9/28

CUDA Summary

• CUDA is a programming model for manycore processors
• It abstracts SIMD, making it easy to use wide SIMD vectors
• It provides good performance on today's GPUs
• In the near future, CUDA-like approaches will map well to many processors & GPUs
• CUDA encourages SIMD-friendly, highly scalable algorithm design and implementation

10/28

A Parallel Scripting Language

• What is a scripting language?
  ▪ Lots of opinions on this
  ▪ I'm using an informal definition:
    ▪ A language where performance is happily traded for productivity
  ▪ Weak performance requirement of scalability
    ▪ "My code should run faster tomorrow"
• What is the analog of today's scripting languages for manycore?

11/28

Data Parallelism

• Assertion: scaling to 1000 cores requires data parallelism
• Accordingly, manycore scripting languages will be data parallel
  ▪ They should allow the programmer to express data parallelism naturally
  ▪ They should compose and transform the parallelism to fit target platforms

12/28

Warning: Evolving Project

• Copperhead is still in embryo
• We can compile a few small programs
• Lots more work to be done in both language definition and code generation
• Feedback is encouraged

13/28

Copperhead = Cu + python

• Copperhead is a subset of Python, designed for data parallelism
• Why Python?
  ▪ Extant, well-accepted high-level scripting language
    ▪ Free simulator(!!)
  ▪ Already understands things like map and reduce
  ▪ Comes with a parser & lexer
• The current Copperhead compiler takes a subset of Python and produces CUDA code
  ▪ Copperhead is not CUDA specific, but the current compiler is

14/28

Copperhead is not Pure Python

• Copperhead is not for arbitrary Python code
  ▪ Most features of Python are unsupported
• Copperhead is compiled, not interpreted
• Connecting Python code & Copperhead code will require binding the programs together, similar to Python-C interaction
• Copperhead is statically typed

[Figure: Copperhead shown as a subset inside Python]

15/28

Saxpy: Hello World

def saxpy(a, x, y):
    return map(lambda xi, yi: a*xi + yi, x, y)

Some things to notice:
• Types are implicit
  ▪ The Copperhead compiler uses a Hindley-Milner type system with typeclasses, similar to Haskell
  ▪ Typeclasses are fully resolved in CUDA via C++ templates
• Functional programming:
  ▪ map, lambda (or equivalent in list comprehensions)
  ▪ You can pass functions around to other functions
  ▪ Closure: the variable 'a' is free in the lambda function, but bound to the 'a' in its enclosing scope

16/28
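For instance, saxpy(2.0, [1.0, 2.0], [3.0, 4.0]) would evaluate to [5.0, 8.0] (a hand-worked example, not taken from the slides).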

Type Inference, cont.

    c = a + b
    (+) : (Num0, Num0) -> Num0
    A145   A207   A52       (each variable starts with a unique generic type)
    Num52  Num52  Num52     (after resolution, all unify to the same Num type)

• Copperhead includes function templates for intrinsics like add, subtract, map, scan, gather
• Expressions are mapped against templates
• Every variable starts out with a unique generic type, then types are resolved by union-find on the abstract syntax tree
• Tuple and function types are also inferred

17/28

Data Parallelism

• Copperhead computations are organized around data parallel arrays
• map performs a "forall" for each element in an array
  ▪ Accesses must be local
• Accessing non-local elements is done explicitly
  ▪ shift, rotate, or gather
• No side effects allowed

18/28
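As a rough sketch of what an explicit gather can look like at the CUDA level (hand-written for illustration; the names and types are mine, not the Copperhead compiler's output):

// out[i] = data[indices[i]] for every element i: each thread performs one
// explicit, possibly non-local read.
__global__ void gatherKernel(const float* data, const int* indices,
                             float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = data[indices[i]];
}

// Launched with one thread per output element, e.g.:
// gatherKernel<<<(n + 255) / 256, 256>>>(d_data, d_indices, d_out, n);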

Copperhead Primitives

• map
• reduce
• Scans:
  ▪ scan, rscan, segscan, rsegscan
  ▪ exscan, exrscan, exsegscan, exrsegscan
• Shuffles:
  ▪ shift, rotate, gather, scatter

19/28
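For intuition (hand-worked, and the exact argument conventions are my assumption): an inclusive scan of + over [1, 2, 3, 4] produces [1, 3, 6, 10], while the exclusive exscan with identity 0 produces [0, 1, 3, 6]; the r-prefixed variants run right-to-left, and the seg-prefixed variants restart the scan at each segment boundary.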

Implementing Copperhead

def saxpy(a, x, y):
    return map(lambda xi, yi: a*xi + yi, x, y)

• The Copperhead compiler is written in Python
• Python provides its own Abstract Syntax Tree
• Type inference, code generation, etc. are done by walking the AST

Module(None,
  Stmt(
    Function(None, 'saxpy', ['a', 'x', 'y'], 0, None,
      Stmt(
        Return(
          CallFunc(Name('map'),
            Lambda(['xi', 'yi'], 0,
              Add(Mul(Name('a'), Name('xi')), Name('yi'))),
            Name('x'), Name('y'), None, None))))))

20/28

Compiling Copperhead to CUDA

• Every Copperhead function creates at least one CUDA device function
• Top-level Copperhead functions create a CUDA global function, which orchestrates the device function calls
  ▪ The global function takes care of allocating shared memory and returning data (storing it to DRAM)
• Global synchronizations are implemented through multiple phases
• All intermediate arrays & plumbing are handled by the Copperhead compiler

21/28

Saxpy Revisited

def saxpy(a, x, y):
    return map(lambda xi, yi: a*xi + yi, x, y)

template<typename Num>
__device__ Num lambda0(Num xi, Num yi, Num a) {
    return ((a * xi) + yi);
}

template<typename Num>
__device__ void saxpy0Dev(Array<Num> x, Array<Num> y, Num a,
                          uint _globalIndex, Num& _returnValueReg) {
    Num _xReg, _yReg;
    if (_globalIndex < x.length) _xReg = x[_globalIndex];
    if (_globalIndex < y.length) _yReg = y[_globalIndex];
    if (_globalIndex < x.length) _returnValueReg = lambda0(_xReg, _yReg, a);
}

template<typename Num>
__global__ void saxpy0(Array<Num> x, Array<Num> y, Num a, Array<Num> _returnValue) {
    uint _blockMin = IMUL(blockDim.x, blockIdx.x);
    uint _blockMax = _blockMin + blockDim.x;
    uint _globalIndex = _blockMin + threadIdx.x;
    Num _returnValueReg;
    saxpy0Dev(x, y, a, _globalIndex, _returnValueReg);
    if (_globalIndex < _returnValue.length) _returnValue[_globalIndex] = _returnValueReg;
}

22/28

Phases

[Figure: a reduction executes in two phases (phase 0, phase 1); a scan executes in three phases (phase 0, phase 1, phase 2)]

23/28

Copperhead to CUDA, cont.

[Figure: B = reduce(map(A)) and D = reduce(map(C)) scheduled together into phase 0 and phase 1]

• Compiler schedules computations into phases
  ▪ Right now, this composition is done greedily
• Compiler tracks global and local availability of all variables and creates a phase boundary when necessary
• Fusing work into phases is important for performance

24/28
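To make the phase structure concrete, here is a hand-written sketch (not actual Copperhead compiler output) of what a fused map+reduce phase 0 can look like: each block applies the mapped function to its elements and reduces them in shared memory, leaving one partial result per block for phase 1, since thread blocks cannot synchronize with each other within a single kernel launch.

// Phase 0 of B = reduce(+, map(square, A)): each block maps its elements and
// reduces them in shared memory, writing one partial sum per block.
// Phase 1 (a separate kernel launch) would reduce the per-block partials.
__global__ void fusedMapReducePhase0(const float* A, float* blockSums, int n)
{
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // "map": square each element (a stand-in for an arbitrary mapped function)
    sdata[tid] = (i < n) ? A[i] * A[i] : 0.0f;
    __syncthreads();

    // "reduce": tree reduction within the block's shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = sdata[0];   // one value per block, for phase 1
}

// Launched with a power-of-two block size matching the shared array, e.g.:
// fusedMapReducePhase0<<<(n + 255) / 256, 256>>>(d_A, d_blockSums, n);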

Copperhead to CUDA, cont.

• Shared memory is used only for communicating between threads
  ▪ Caching unpredictable accesses (gather)
  ▪ Accessing elements with a uniform stride (shift & rotate)
• Each device function returns its intermediate results through registers

25/28

Split

def split(input, value):
    flags = map(lambda a: 1 if a <= value else 0, input)        # phase 0
    notFlags = map(lambda a: not a, flags)                      # phase 0
    leftPositions = exscan(lambda a, b: a + b, 0, flags)        # phases 0-2
    rightPositions = exrscan(lambda a, b: a + b, 0, notFlags)   # phases 0-2
    positions = map(lambda a, b, flag: a if flag else len(input) - b - 1,
                    leftPositions, rightPositions, flags)       # phase 2
    return scatter(input, positions)                            # phase 2

• This code is decomposed into 3 phases (annotated above)
• The Copperhead compiler takes care of intermediate variables
• The Copperhead compiler uses shared memory for the temporaries used in the scans here
  ▪ Everything else is in registers

26/28
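Worked through by hand (assuming scatter(input, positions) writes input[i] to output[positions[i]]): split([3, 7, 1, 9, 2], 3) gives flags = [1, 0, 1, 0, 1], leftPositions = [0, 1, 1, 2, 2], rightPositions = [2, 1, 1, 0, 0], positions = [0, 3, 1, 4, 2], and finally [3, 1, 2, 7, 9]: elements <= 3 move to the front and the rest to the back, with relative order preserved within each group.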

Interpreting to Copperhead

• If the interpreter harvested dynamic type information, it could use the Copperhead compiler as a backend
• Fun project: see what kinds of information could be gleaned from the Python interpreter at runtime to figure out what should be compiled via Copperhead to a manycore chip

27/28

Future Work

• Finish support for the basics
• Compiler transformations
  ▪ Nested data parallelism flattening
    ▪ Segmented scans
• Retargetability
  ▪ Threading Building Blocks / OpenMP / OpenCL
• Bridge Python and Copperhead
• Implement real algorithms with Copperhead
  ▪ Vision / Machine Learning, etc.

28/28
