Data-Parallel Algorithms and Data Structures
John Owens, UC Davis
Programming a GPU for Graphics
• Application specifies geometry -> rasterized
• Each fragment is shaded w/ SIMD program
• Shading can use values from texture memory
• Image can be used as texture on future passes
Programming a GPU for GP Programs (Old-School)
• Draw a screen-sized quad • Run a SIMD program over each fragment
• “Gather” is permitted from texture memory
• Resulting buffer can be treated as texture on next pass
Programming a GPU for GP Programs (New School)
• Define a computation domain over elements
• Run a SIMD program over each element/thread
• Global memory supports both gather and scatter
• Resulting buffer can be used in other general-purpose kernels or as texture
One Slide Summary of Today • GPUs are great at running many closely-coupled but independent threads in parallel – The programming model is SPMD (single program, multiple data)
• GPU computing boils down to: – Define a computation domain that generates many parallel threads • This is the data structure
– Iterate in parallel over that computation domain, running a program over all threads • This is the algorithm
Outline • Data Structures – GPU Memory Model – Taxonomy
• Algorithmic Building Blocks
– Sample Application
– Map
– Gather & Scatter
– Reductions
– Scan (parallel prefix)
– Sort, search, …
Memory Hierarchy • CPU and GPU Memory Hierarchy
– CPU side: CPU Registers, CPU Caches, CPU Main Memory, Disk
– GPU side: GPU Temporary Registers, GPU Constant Registers, GPU Caches (Texture, Shared), GPU Main (Video) Memory
Big Picture: GPU Memory Model • GPUs are a mix of: – Historical, fixed-function capabilities – Newer, flexible, programmable capabilities
• Fixed-function: – Known access patterns, behaviors – Accelerated by special-purpose hardware
• Programmable: – Unknown access patterns – Generality good
• Memory model must account for both – Consequence: Ample special-purpose functionality – Consequence: Restricting flexibility may improve performance » Take advantage of special-purpose functionality!
Taxonomy of Data Structures • Dense arrays – Address translations: n-dimension to m-dimension
• Sparse arrays – Static sparse arrays (e.g. sparse matrix) – Dynamic sparse arrays (e.g. page tables)
• Adaptive data structures – Static adaptive data structures (e.g. k-d trees) – Dynamic adaptive data structures (e.g. adaptive multiresolution grids)
• Non-indexable structures (e.g. stack) Details? Lefohn et al.’s Glift (TOG Jan. ‘06), GPGPU survey
GPU Memory Model • More restricted memory access than CPU – Allocate/free memory only before computation – Transfers to and from CPU are explicit – GPU is controlled by CPU, can’t initiate transfers, access disk, etc.
• To generalize … – GPUs are better at accessing data structures – CPUs are better at building data structures – As CPU-GPU bandwidth improves, consider doing data structure tasks on their “natural” processor
GPU Memory Model • Limited memory access during computation (kernel) – Registers (per fragment/thread) • Read/write
– Local memory (shared among threads) • Does not exist in general • CUDA allows access to shared memory between threads
– Global memory (historical) • Read-only during computation • Write-only at end of computation (precomputed address)
– Global memory (new) • Allows general scatter/gather (read/write) – No collision rules! – Exposed in AMD R520+ GPUs, NVIDIA G80+ GPUs
Data-Parallel Algorithms • The GPU is a data-parallel processor – Data-parallel kernels of applications can be accelerated on the GPU
• Efficient algorithms require efficient building blocks
• This talk: data-parallel building blocks
– Map
– Gather & Scatter
– Reduce
– Scan
• Next talk (Naga Govindaraju): sorting and searching
Sample Motivating Application • How bumpy is a surface that we represent as a grid of samples? • Algorithm: – Loop over all elements – At each element, compare the value of that element to the average of its neighbors (“difference”). Square that difference. – Now sum up all those differences. • But we don’t want to sum all the diffs that are 0. • So only sum up the non-zero differences.
– This is a fake application—don’t take it too seriously. Picture courtesy http://www.artifice.com
Sample Motivating Application
for all samples:
    neighbors[x,y] = 0.25 * ( value[x-1,y] + value[x+1,y] +
                              value[x,y+1] + value[x,y-1] )
    diff = (value[x,y] - neighbors[x,y])^2
result = 0
for all samples where diff != 0:
    result += diff
return result
The Map Operation • Given: – Array or stream of data elements A – Function f(x)
• map(A, f) = applies f(x) to all ai ∈ A • GPU implementation is straightforward – Graphics: A is a texture, ai are texels – General-purpose: A is a computation domain, ai are elements in that domain – Pixel shader/kernel implements f(x), reads ai as x – (Graphics): Draw a quad with as many pixels as texels in A with f(x) pixel shader active – Output stored in another texture/buffer
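A minimal CUDA sketch of such a map kernel, using the "difference" pass of the motivating application as f(x); the array names, launch configuration, and border handling below are illustrative assumptions, not part of the original deck:

// One thread per interior grid element: compute the squared difference between
// the element and the average of its four neighbors (the map step of the app).
__global__ void diffMap(const float *value, float *diff, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1) return;  // skip border

    int idx = y * width + x;
    float neighbors = 0.25f * (value[idx - 1] + value[idx + 1] +
                               value[idx - width] + value[idx + width]);
    float d = value[idx] - neighbors;
    diff[idx] = d * d;   // each thread reads its neighborhood, writes one output
}
// Illustrative launch: dim3 block(16, 16), grid((width+15)/16, (height+15)/16);
// diffMap<<<grid, block>>>(d_value, d_diff, width, height);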
Scatter vs. Gather • Gather: p = a[i] – Vertex or Fragment programs
• Scatter: a[i] = p – Historically vertex programs, but exposed in fragment programs in latest GPUs
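A hedged CUDA sketch of the two access patterns inside a kernel; the index array idx and the kernel names are illustrative:

// Gather: each thread does an indexed read. Assumes idx[i] is a valid index into a[].
__global__ void gatherKernel(float *out, const float *a, const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[idx[i]];       // p = a[i]
}

// Scatter: each thread does an indexed write. No collision rules: if two threads
// write the same location, one of them wins.
__global__ void scatterKernel(float *a, const float *p, const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[idx[i]] = p[i];         // a[i] = p
}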
Parallel Reductions • Given: – Binary associative operator ⊕ with identity I – Ordered set s = [a0, a1, …, an-1] of n elements
• reduce(⊕, s) returns a0 ⊕ a1 ⊕ … ⊕ an-1 • Example: reduce(+, [3 1 7 0 4 1 6 3]) = 25 • Reductions common in parallel algorithms – Common reduction operators are +, ×, min and max – Note floating point is only pseudo-associative
Parallel Reductions on the GPU
• 1D parallel reduction: add two halves of the texture together, repeatedly (N -> N/2 -> N/4 -> …), until we're left with a single row of texels
• O(log₂ N) steps, O(N) work
Multiple 1D Parallel Reductions
• Can run many reductions in parallel: use a 2D texture and reduce along one dimension (M×N -> M×N/2 -> M×N/4 -> … -> M×1)
• O(log₂ N) steps, O(MN) work
2D reductions • Like 1D reduction, only reduce in both directions simultaneously
– Note: can add more than 2x2 elements per pixel • Trade per-pixel work for step complexity • Best perf depends on specific GPU (cache, etc.)
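The same reduction idea expressed as a CUDA sketch with shared memory rather than texture halving: each block produces one partial sum, and the per-block sums are reduced in a follow-up pass (or on the CPU). Names and launch parameters are assumptions:

// One block reduces 2 * blockDim.x input elements to a single partial sum.
// Assumes blockDim.x is a power of two.
__global__ void reduceSum(const float *in, float *blockSums, int n)
{
    extern __shared__ float sdata[];
    unsigned tid = threadIdx.x;
    unsigned i = blockIdx.x * (blockDim.x * 2) + tid;

    // Load (up to) two elements per thread, adding them on the way in
    float sum = (i < n) ? in[i] : 0.0f;
    if (i + blockDim.x < n) sum += in[i + blockDim.x];
    sdata[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory: halve the number of active threads each step
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) blockSums[blockIdx.x] = sdata[0];
}
// Illustrative launch: reduceSum<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_sums, n);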
Stream Compaction
• Input: stream of 1s and 0s [1 0 1 1 0 0 1 0]
• Operation: "sum up all elements before you"
• Output: scatter addresses for "1" elements [0 1 1 2 3 3 3 4]
• Note the scatter addresses for the kept ("1") elements are packed!
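A possible CUDA sketch of the final compaction step, assuming addr[] already holds the exclusive +-scan of the flags (the scatter addresses above); names are illustrative:

// Keep only the flagged elements: scatter each one to its packed output slot.
__global__ void compact(const float *in, const int *flags, const int *addr,
                        float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i])
        out[addr[i]] = in[i];   // addr[i] = number of flagged elements before i
}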
Parallel Scan (aka prefix sum) • Given: – Binary associative operator ⊕ with identity I – Ordered set s = [a0, a1, …, an-1] of n elements
• scan(⊕, s) returns [I, a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an-2)]
• Example: scan(+, [3 1 7 0 4 1 6 3]) = [0 3 4 11 11 15 16 22] (From Blelloch, 1990, “Prefix Sums and Their Applications”)
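For reference, a sequential host-side exclusive scan matching this definition (out[0] = identity, out[k] = a[0] ⊕ … ⊕ a[k-1]); names are illustrative:

// Sequential exclusive +-scan: O(n) work, the baseline the parallel versions chase.
void exclusiveScanRef(const float *a, float *out, int n)
{
    float running = 0.0f;            // identity for +
    for (int k = 0; k < n; ++k) {
        out[k] = running;            // sum of a[0..k-1]
        running += a[k];
    }
}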
Applications of Scan
• Radix sort
• Quicksort
• String comparison
• Lexical analysis
• Stream compaction
• Sparse matrices
• Polynomial evaluation
• Solving recurrences
• Tree operations
• Histograms
A Naive Parallel Scan Algorithm
• log(n) iterations; worked example for input [3 1 7 0 4 1 6 3]:
– T0 (input): 3 1 7 0 4 1 6 3
– Stride 1 -> T1: 3 4 8 7 4 5 7 9
– Stride 2 -> T0: 3 4 11 11 12 12 11 14
– Stride 4 -> Out: 3 4 11 11 15 16 22 25
• For i from 1 to log(n)-1:
– Specify domain from 2^i to n. Element k computes vout = v[k] + v[k-2^i].
– Due to ping-pong, specify 2nd domain from 2^(i-1) to 2^i with a simple pass-through shader vout = vin.
• Note: With graphics API, can't read and write the same texture, so must "ping-pong". Not necessary with newer APIs.
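One step of this naive scan written as a CUDA kernel, with two global-memory buffers playing the ping-pong role the graphics API forces; the host loop and names are illustrative:

// stride = 2^i. Elements below the stride are simply passed through (the 2nd domain above).
__global__ void naiveScanStep(const float *in, float *out, int stride, int n)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n) return;
    out[k] = (k >= stride) ? in[k] + in[k - stride] : in[k];
}
// Illustrative host loop: O(log n) launches, swapping buffers each step.
// for (int stride = 1; stride < n; stride *= 2) {
//     naiveScanStep<<<(n + 255) / 256, 256>>>(d_ping, d_pong, stride, n);
//     std::swap(d_ping, d_pong);     // result ends up in d_ping (inclusive scan)
// }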
A Naive Parallel Scan Algorithm • Algorithm given in more detail by Horn [‘05] • Step-efficient, but not work-efficient – O(log n) steps, but O(n log n) adds – Sequential version is O(n) – A factor of log(n) hurts: 20x for 10^6 elements!
• Dig into parallel algorithms literature for a better solution – See Blelloch 1990, “Prefix Sums and Their Applications”
Balanced Trees • Common parallel algorithms pattern – Build a balanced binary tree on the input data and sweep it to and from the root – Tree is conceptual, not an actual data structure
• For scan: – Traverse down from leaves to root building partial sums at internal nodes in the tree • Root holds sum of all leaves
– Traverse back up the tree building the scan from the partial sums – 2 log n passes, O(n) work
Balanced Tree Scan
• "Balanced tree": O(n) computation, space
• First build sums in place up the tree: "reduce"
• Traverse back down and use partial sums to generate scan: "downsweep"
• See Mark Harris's article in GPU Gems 3 (figure courtesy Shubho Sengupta)
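A single-block CUDA sketch of the reduce/downsweep scan in shared memory, in the spirit of the GPU Gems 3 chapter but without its bank-conflict padding; assumes n is a power of two and equal to twice the block size:

// Exclusive +-scan of n elements, n = 2 * blockDim.x, using O(n) adds.
__global__ void blellochScan(float *g_out, const float *g_in, int n)
{
    extern __shared__ float temp[];
    int tid = threadIdx.x;
    int offset = 1;
    temp[2 * tid]     = g_in[2 * tid];        // each thread loads two elements
    temp[2 * tid + 1] = g_in[2 * tid + 1];

    // Up-sweep (reduce): build partial sums at the internal tree nodes
    for (int d = n >> 1; d > 0; d >>= 1) {
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }
    if (tid == 0) temp[n - 1] = 0;            // clear the root (identity)

    // Down-sweep: push partial sums back down to produce the exclusive scan
    for (int d = 1; d < n; d *= 2) {
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            float t  = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();
    g_out[2 * tid]     = temp[2 * tid];
    g_out[2 * tid + 1] = temp[2 * tid + 1];
}
// Illustrative launch: blellochScan<<<1, n / 2, n * sizeof(float)>>>(d_out, d_in, n);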
Scan with Scatter • Scatter in pixel shaders makes algorithms that use scan easier to implement – NVIDIA CUDA and AMD CTM enable this – NVIDIA CUDA SDK includes example implementation of scan primitive – http://www.gpgpu.org/scan-gpugems3
• CUDA Parallel Data Cache improves efficiency – All steps executed in a single kernel – Threads communicate through shared memory – Drastically reduces bandwidth bottleneck!
Parallel Sorting • Given an unordered list of elements, produce list ordered by key value – Kernel: compare and swap
• GPU’s constrained programming environment limits viable algorithms – Bitonic merge sort [Batcher 68] – Periodic balanced sorting networks [Dowd 89]
• Recent research results impressive – Naga Govindaraju will cover the algorithms in detail
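As a concrete example of the compare-and-swap kernel, a hedged sketch of one stage of a bitonic sort network; assumes n is a power of two and a multiple of the block size, and names are illustrative:

// One compare-and-swap pass; the host loops over the (k, j) stages of the network.
__global__ void bitonicStep(float *a, int j, int k)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned partner = i ^ j;
    if (partner > i) {
        bool ascending = ((i & k) == 0);      // direction of this bitonic sequence
        if ((ascending && a[i] > a[partner]) ||
            (!ascending && a[i] < a[partner])) {
            float t = a[i]; a[i] = a[partner]; a[partner] = t;
        }
    }
}
// Illustrative host stages: O(log^2 n) passes in total.
// for (int k = 2; k <= n; k <<= 1)
//     for (int j = k >> 1; j > 0; j >>= 1)
//         bitonicStep<<<n / 256, 256>>>(d_a, j, k);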
Binary Search • Find a specific element in an ordered list • Implement just like CPU algorithm – Finds the first element of a given value v • If v does not exist, find the smallest element > v
• Search is sequential, but many searches can be executed in parallel – Number of pixels drawn determines number of searches executed in parallel • 1 pixel == 1 search
• For details see: – “A Toolkit for Computation on GPUs”. Ian Buck and Tim Purcell. In GPU Gems. Randy Fernando, ed. 2004
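A sketch of many independent binary searches in CUDA, one thread per query, mirroring the "1 pixel == 1 search" mapping; array and kernel names are illustrative:

// Thread t finds the index of the first element of data[] that is >= keys[t]
// (i.e. the element itself if present, otherwise the next larger element).
__global__ void batchLowerBound(const float *data, int n,
                                const float *keys, int *result, int numKeys)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numKeys) return;

    float key = keys[t];
    int lo = 0, hi = n;              // search the half-open range [lo, hi)
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (data[mid] < key) lo = mid + 1;
        else                 hi = mid;
    }
    result[t] = lo;                  // == n if every element is < key
}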
Summary • Think parallel! – Effective programming on GPUs requires mapping computation & data structures into parallel formulations
• Future thoughts – Definite need for: • Efficient implementations of parallel primitives • Research on what is the right set of primitives
– How will high-bandwidth connections between CPU and GPU change algorithm and data structure design?
References
• “A Toolkit for Computation on GPUs”. Ian Buck and Tim Purcell. In GPU Gems, Randy Fernando, ed. 2004.
• “Parallel Prefix Sum (Scan) with CUDA”. Mark Harris, Shubhabrata Sengupta, and John D. Owens. In GPU Gems 3, Hubert Nguyen, ed. Aug. 2007.
• “Stream Reduction Operations for GPGPU Applications”. Daniel Horn. In GPU Gems 2, Matt Pharr, ed. 2005.
• “Glift: Generic, Efficient, Random-Access GPU Data Structures”. Aaron Lefohn et al. ACM TOG, Jan. 2006.