Data-Parallel Algorithms and Data Structures
John Owens, UC Davis

Programming a GPU for Graphics

• Application specifies geometry -> rasterized
• Each fragment is shaded with a SIMD program
• Shading can use values from texture memory
• Image can be used as texture on future passes

Programming a GPU for GP Programs (Old-School)

• Draw a screen-sized quad
• Run a SIMD program over each fragment
• “Gather” is permitted from texture memory
• Resulting buffer can be treated as texture on next pass

Programming a GPU for GP Programs (New School)

• Define a computation domain over elements
• Run a SIMD program over each element/thread
• Global memory supports both gather and scatter
• Resulting buffer can be used in other general-purpose kernels or as texture

One Slide Summary of Today

• GPUs are great at running many closely coupled but independent threads in parallel
  – The programming model is SPMD: single program, multiple data
• GPU computing boils down to:
  – Define a computation domain that generates many parallel threads
    • This is the data structure
  – Iterate in parallel over that computation domain, running a program over all threads
    • This is the algorithm

Outline

• Data Structures
  – GPU Memory Model
  – Taxonomy
• Algorithmic Building Blocks
  – Sample Application
  – Map
  – Gather & Scatter
  – Reductions
  – Scan (parallel prefix)
  – Sort, search, …

Memory Hierarchy

• CPU and GPU memory hierarchies, from largest/slowest to smallest/fastest:
  – CPU: Disk, CPU Main Memory, CPU Caches, CPU Registers
  – GPU: GPU Main (Video) Memory, GPU Caches (Texture, Shared), GPU Constant Registers, GPU Temporary Registers

Big Picture: GPU Memory Model

• GPUs are a mix of:
  – Historical, fixed-function capabilities
  – Newer, flexible, programmable capabilities
• Fixed-function:
  – Known access patterns, behaviors
  – Accelerated by special-purpose hardware
• Programmable:
  – Unknown access patterns
  – Generality is good
• Memory model must account for both
  – Consequence: ample special-purpose functionality
  – Consequence: restricting flexibility may improve performance
    • Take advantage of special-purpose functionality!

Taxonomy of Data Structures

• Dense arrays
  – Address translations: n-dimension to m-dimension
• Sparse arrays
  – Static sparse arrays (e.g. sparse matrices)
  – Dynamic sparse arrays (e.g. page tables)
• Adaptive data structures
  – Static adaptive data structures (e.g. k-d trees)
  – Dynamic adaptive data structures (e.g. adaptive multiresolution grids)
• Non-indexable structures (e.g. stacks)
• Details? Lefohn et al.’s Glift (ACM TOG, Jan. 2006), GPGPU survey

GPU Memory Model

• More restricted memory access than the CPU
  – Allocate/free memory only before computation
  – Transfers to and from the CPU are explicit
  – GPU is controlled by the CPU; it can’t initiate transfers, access disk, etc.
• To generalize:
  – GPUs are better at accessing data structures
  – CPUs are better at building data structures
  – As CPU-GPU bandwidth improves, consider doing data structure tasks on their “natural” processor

GPU Memory Model

• Limited memory access during computation (kernel)
  – Registers (per fragment/thread)
    • Read/write
  – Local memory (shared among threads)
    • Does not exist in general
    • CUDA allows access to shared memory between threads
  – Global memory (historical)
    • Read-only during computation
    • Write-only at end of computation (precomputed address)
  – Global memory (new)
    • Allows general scatter/gather (read/write)
      – No collision rules!
      – Exposed in AMD R520+ GPUs, NVIDIA G80+ GPUs

Data-Parallel Algorithms

• The GPU is a data-parallel processor
  – Data-parallel kernels of applications can be accelerated on the GPU
• Efficient algorithms require efficient building blocks
• This talk: data-parallel building blocks
  – Map
  – Gather & Scatter
  – Reduce
  – Scan
• Next talk (Naga Govindaraju): sorting and searching

Sample Motivating Application

• How bumpy is a surface that we represent as a grid of samples?
• Algorithm:
  – Loop over all elements
  – At each element, compare the value of that element to the average of its neighbors (“difference”). Square that difference.
  – Now sum up all those differences.
    • But we don’t want to sum all the diffs that are 0.
    • So only sum up the non-zero differences.
  – This is a fake application; don’t take it too seriously.
• (Picture courtesy http://www.artifice.com)

Sample Motivating Application

    for all samples:
        neighbors[x,y] = 0.25 * ( value[x-1,y] + value[x+1,y] +
                                  value[x,y+1] + value[x,y-1] )
        diff = (value[x,y] - neighbors[x,y])^2

    result = 0
    for all samples where diff != 0:
        result += diff

    return result
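
A minimal CUDA sketch of the first loop (the per-sample map step). The kernel name, the flattened value/diff arrays, and the border handling are assumptions for illustration, not from the talk:

    // Hypothetical CUDA kernel for the per-sample step above: one thread per sample.
    // Border samples are skipped here; the talk does not specify a border rule.
    __global__ void diff_map(const float* value, float* diff, int w, int h)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < 1 || y < 1 || x >= w - 1 || y >= h - 1) return;   // skip borders
        float neighbors = 0.25f * (value[y * w + (x - 1)] + value[y * w + (x + 1)] +
                                   value[(y + 1) * w + x] + value[(y - 1) * w + x]);
        float d = value[y * w + x] - neighbors;
        diff[y * w + x] = d * d;   // squared difference, summed in a later pass
    }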


The Map Operation

• Given:
  – Array or stream of data elements A
  – Function f(x)
• map(A, f) applies f(x) to all aᵢ ∈ A
• GPU implementation is straightforward
  – Graphics: A is a texture, aᵢ are texels
  – General-purpose: A is a computation domain, aᵢ are elements in that domain
  – Pixel shader/kernel implements f(x), reads aᵢ as x
  – (Graphics): Draw a quad with as many pixels as texels in A, with the f(x) pixel shader active
  – Output stored in another texture/buffer
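
As a concrete sketch, a CUDA map over a 1D domain might look like the following. Kernel and variable names are assumptions, and f(x) = x*x is just an arbitrary example function:

    // One thread per element: out[i] = f(A[i]), no inter-thread communication.
    __global__ void map_f(const float* A, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = A[i] * A[i];   // example f(x) = x*x
    }
    // Launch (sketch): map_f<<<(n + 255) / 256, 256>>>(d_A, d_out, n);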


Scatter vs. Gather

• Gather: p = a[i]
  – Vertex or fragment programs
• Scatter: a[i] = p
  – Historically vertex programs, but exposed in fragment programs in the latest GPUs
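
In CUDA terms, the two operations might be sketched as follows (all names are illustrative assumptions):

    // Gather: each thread reads from a computed address, writes in place.
    __global__ void gather(const float* a, const int* idx, float* p, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] = a[idx[i]];   // p = a[i] in the slide's notation
    }

    // Scatter: each thread writes to a computed address.
    __global__ void scatter(const float* p, const int* idx, float* a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[idx[i]] = p[i];   // a[i] = p; no collision rules!
    }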


Parallel Reductions

• Given:
  – Binary associative operator ⊕ with identity I
  – Ordered set s = [a₀, a₁, …, aₙ₋₁] of n elements
• reduce(⊕, s) returns a₀ ⊕ a₁ ⊕ … ⊕ aₙ₋₁
• Example: reduce(+, [3 1 7 0 4 1 6 3]) = 25
• Reductions are common in parallel algorithms
  – Common reduction operators are +, ×, min, and max
  – Note: floating point is only pseudo-associative

Parallel Reductions on the GPU

• 1D parallel reduction:
  – Add the two halves of the texture together, repeatedly halving the domain (N → N/2 → N/4 → …)
  – … until we’re left with a single row of texels
• O(log₂ N) steps, O(N) work
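
A minimal CUDA sketch of the same idea, expressed as a per-block shared-memory tree reduction. It assumes blockDim.x is a power of two; a second pass over blockSums (or an atomic add) finishes the job. All names are assumptions:

    __global__ void reduce_sum(const float* in, float* blockSums, int n)
    {
        extern __shared__ float s[];
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        s[tid] = (i < n) ? in[i] : 0.0f;   // 0 is the identity for +
        __syncthreads();
        // Halve the number of active threads each step: O(log n) steps, O(n) work.
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride) s[tid] += s[tid + stride];
            __syncthreads();
        }
        if (tid == 0) blockSums[blockIdx.x] = s[0];   // one partial sum per block
    }
    // Launch (sketch): reduce_sum<<<blocks, 256, 256 * sizeof(float)>>>(d_in, d_sums, n);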

Multiple 1D Parallel Reductions

• Can run many reductions in parallel
  – Use a 2D texture and reduce along one dimension (M×N → M×N/2 → M×N/4 → … → M×1)
• O(log₂ N) steps, O(MN) work

2D Reductions

• Like a 1D reduction, only reduce in both directions simultaneously
  – Note: can add more than 2×2 elements per pixel
    • Trade per-pixel work for step complexity
    • Best performance depends on the specific GPU (cache, etc.)


Stream Compaction

• Input: stream of 1s and 0s, e.g. [1 0 1 1 0 0 1 0]
• Operation: “sum up all elements before you” (an exclusive scan)
• Output: scatter addresses for the “1” elements: [0 1 1 2 3 3 3 4]
• Note that the scatter addresses for the “1” elements are packed!
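
A CUDA sketch of the final scatter step, assuming the flags and their exclusive scan (the addresses above) have already been computed; all names are assumptions:

    // Each thread with a set flag writes its element to its scanned address,
    // producing a densely packed output stream.
    __global__ void compact(const float* in, const int* flag,
                            const int* addr, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && flag[i])
            out[addr[i]] = in[i];   // addr[i] = number of kept elements before i
    }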

Parallel Scan (aka Prefix Sum)

• Given:
  – Binary associative operator ⊕ with identity I
  – Ordered set s = [a₀, a₁, …, aₙ₋₁] of n elements
• scan(⊕, s) returns [I, a₀, (a₀ ⊕ a₁), …, (a₀ ⊕ a₁ ⊕ … ⊕ aₙ₋₂)]
• Example: scan(+, [3 1 7 0 4 1 6 3]) = [0 3 4 11 11 15 16 22]
  – This is an exclusive scan (from Blelloch, 1990, “Prefix Sums and Their Applications”)

Applications of Scan

• Radix sort
• Quicksort
• String comparison
• Lexical analysis
• Stream compaction
• Sparse matrices
• Polynomial evaluation
• Solving recurrences
• Tree operations
• Histograms

A Naive Parallel Scan Algorithm

• log₂ n iterations, ping-ponging between two textures (T0 and T1):

    T0 (input):      3  1  7  0  4  1  6  3
    T1 (stride 1):   3  4  8  7  4  5  7  9
    T0 (stride 2):   3  4 11 11 12 12 11 14
    Out (stride 4):  3  4 11 11 15 16 22 25

• Note: with a graphics API, we can’t read and write the same texture, so we must “ping-pong”. (Not necessary with newer APIs.)
• For i from 0 to log₂ n − 1:
  – Specify the domain from 2^i to n. Element k computes v_out = v[k] + v[k − 2^i].
  – Due to ping-pong, specify a second domain from 2^(i−1) to 2^i with a simple pass-through shader: v_out = v_in.
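
One pass of this naive scan could be sketched in CUDA as below (names are assumptions). Since the kernel covers the whole array, the pass-through branch takes the place of the separate second domain:

    // One ping-pong pass with the given stride (2^i); the host swaps in/out
    // and doubles the stride each iteration, for log2(n) launches in total.
    __global__ void naive_scan_pass(const float* in, float* out, int stride, int n)
    {
        int k = blockIdx.x * blockDim.x + threadIdx.x;
        if (k >= n) return;
        out[k] = (k >= stride) ? in[k] + in[k - stride]   // v_out = v[k] + v[k - 2^i]
                               : in[k];                   // pass-through for k < 2^i
    }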

A Naive Parallel Scan Algorithm

• Algorithm given in more detail by Horn [’05]
• Step-efficient, but not work-efficient
  – O(log n) steps, but O(n log n) adds
  – The sequential version is O(n)
  – A factor of log n hurts: 20× for 10^6 elements!
• Dig into the parallel algorithms literature for a better solution
  – See Blelloch 1990, “Prefix Sums and Their Applications”

Balanced Trees

• A common pattern in parallel algorithms
  – Build a balanced binary tree on the input data and sweep it to and from the root
  – The tree is conceptual, not an actual data structure
• For scan:
  – Traverse up from the leaves to the root, building partial sums at internal nodes
    • The root holds the sum of all leaves
  – Traverse back down the tree, building the scan from the partial sums
  – 2 log n passes, O(n) work

Balanced Tree Scan

• “Balanced tree”: O(n) computation and space
• First build sums in place up the tree: the “reduce” phase
• Then traverse back down and use the partial sums to generate the scan: the “downsweep” phase
• See Mark Harris’s article in GPU Gems 3
• (Figure courtesy Shubho Sengupta)
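
A single-block CUDA sketch of this balanced-tree (Blelloch) exclusive scan, under the assumptions that n equals blockDim.x, is a power of two, and fits in one block’s shared memory; see the Harris article for the full multi-block version:

    __global__ void blelloch_scan(float* data, int n)   // assumes n == blockDim.x
    {
        extern __shared__ float s[];
        int tid = threadIdx.x;
        s[tid] = data[tid];
        __syncthreads();
        // Up-sweep ("reduce"): build partial sums toward the root in place.
        for (int stride = 1; stride < n; stride *= 2) {
            int i = (tid + 1) * stride * 2 - 1;
            if (i < n) s[i] += s[i - stride];
            __syncthreads();
        }
        // Down-sweep: clear the root, then push partial sums back down the tree.
        if (tid == 0) s[n - 1] = 0;
        __syncthreads();
        for (int stride = n / 2; stride >= 1; stride /= 2) {
            int i = (tid + 1) * stride * 2 - 1;
            if (i < n) {
                float t = s[i - stride];
                s[i - stride] = s[i];
                s[i] += t;
            }
            __syncthreads();
        }
        data[tid] = s[tid];   // exclusive scan result
    }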

Scan with Scatter

• Scatter in pixel shaders makes algorithms that use scan easier to implement
  – NVIDIA CUDA and AMD CTM enable this
  – The NVIDIA CUDA SDK includes an example implementation of the scan primitive
  – http://www.gpgpu.org/scan-gpugems3
• The CUDA Parallel Data Cache improves efficiency
  – All steps executed in a single kernel
  – Threads communicate through shared memory
  – Drastically reduces the bandwidth bottleneck!

Parallel Sorting

• Given an unordered list of elements, produce a list ordered by key value
  – Kernel: compare and swap
• The GPU’s constrained programming environment limits the viable algorithms
  – Bitonic merge sort [Batcher 68]
  – Periodic balanced sorting networks [Dowd 89]
• Recent research results are impressive
  – Naga Govindaraju will cover the algorithms in detail
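
For a flavor of the compare-and-swap kernel, here is a hedged CUDA sketch of one step of a bitonic sorting network (all names are assumptions; the host launches it O(log² n) times with varying j and k, and n is a power of two):

    __global__ void bitonic_step(float* a, int j, int k)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int partner = i ^ j;               // index this thread compares against
        if (partner > i) {                 // only one thread of each pair acts
            bool up = ((i & k) == 0);      // sort direction for this subsequence
            if (up == (a[i] > a[partner])) {
                float t = a[i]; a[i] = a[partner]; a[partner] = t;   // compare & swap
            }
        }
    }
    // Host loop (sketch): for (k = 2; k <= n; k <<= 1)
    //                       for (j = k >> 1; j > 0; j >>= 1)  launch bitonic_step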

Binary Search

• Find a specific element in an ordered list
• Implement just like the CPU algorithm
  – Finds the first element with a given value v
    • If v does not exist, finds the smallest element > v
• Each search is sequential, but many searches can be executed in parallel
  – The number of pixels drawn determines the number of searches executed in parallel
    • 1 pixel == 1 search
• For details see:
  – “A Toolkit for Computation on GPUs”. Ian Buck and Tim Purcell. In GPU Gems, Randy Fernando, ed. 2004.
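
A CUDA sketch of many independent binary searches, one per thread, mirroring the talk’s “1 pixel == 1 search”; it finds the first element >= each query, and all names are assumptions:

    __global__ void batch_search(const float* sorted, int n,
                                 const float* queries, int* result, int m)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= m) return;
        float v = queries[t];
        int lo = 0, hi = n;                 // search window [lo, hi)
        while (lo < hi) {                   // classic sequential binary search
            int mid = (lo + hi) / 2;
            if (sorted[mid] < v) lo = mid + 1;
            else                 hi = mid;
        }
        result[t] = lo;                     // first index with sorted[index] >= v (n if none)
    }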

Summary

• Think parallel!
  – Effective programming on GPUs requires mapping computation and data structures into parallel formulations
• Future thoughts
  – There is a definite need for:
    • Efficient implementations of parallel primitives
    • Research on what the right set of primitives is
  – How will high-bandwidth connections between the CPU and GPU change algorithm and data structure design?

References

• “A Toolkit for Computation on GPUs”. Ian Buck and Tim Purcell. In GPU Gems, Randy Fernando, ed. 2004.
• “Parallel Prefix Sum (Scan) with CUDA”. Mark Harris, Shubhabrata Sengupta, and John D. Owens. In GPU Gems 3, Hubert Nguyen, ed. Aug. 2007.
• “Stream Reduction Operations for GPGPU Applications”. Daniel Horn. In GPU Gems 2, Matt Pharr, ed. 2005.
• “Glift: Generic, Efficient, Random-Access GPU Data Structures”. Aaron Lefohn et al. ACM TOG, Jan. 2006.
