How to Develop Solaris Parallel Applications

Vijay Tatkar Sr. Engineering Manager Sun Studio Developer Tools http://blogs.sun.com/tatkar


The GHz Chip Clock Race is Over...

Classic CPU efficiencies:
  • Clock speed
  • Execution optimization
  • Cache

Where is my 10GHz chip? Design impediments:
  • Heat
  • Power
  • Memory slower than chips


Putting Transistors to Work in a New Way...
  • UltraSPARC T2: 1.4GHz * 8 cores (64 threads in a chip)
  • Intel Penryn: 4 cores * 3.1GHz; AMD Barcelona: 4 cores * 2.3GHz (4 threads in a chip)

The Multicore Revolution

Every new system now has a multi-core chip in it.

Things to Know about Parallelism
• Parallel processing is not just for massively parallel supercomputers anymore. (HPC ≠ High Priced Computing)
• CPU clock speed has doubled every 18 months, whereas memory speed has doubled only every 6 years! Heat, memory, and power constraints lead to multi-core CPUs.
• The free ride is over for serial programs that rely on the hardware to boost performance.
• Parallel programming is the BEST BET for speedups.
  > Parallelism is all about performance, first and foremost
  > Program correctness is often harder to establish for parallel programs
• Parallelism is often considered hard, but there are several models to choose from, and compiler support for each model eases the choice.


Programming Models
• Shared Memory Model
  > OpenMP (de-facto standard)
  > Java, native multi-threaded programming
• Distributed Memory Model
  > Message Passing Interface – MPI (de-facto standard)
  > Parallel Virtual Machine – PVM (less popular)
• Global Address Space
  > Unified Parallel C – UPC (research technology)
• Grid Computing
  > Sun Grid Computing (www.network.com)
  > Sun Grid Engine (www.sun.com/software/gridware)

Compiler Support... To The Rescue?

[Diagram: the spectrum of Sun Studio-supported technologies for an application, from easiest to hardest — AutoPar, libumem, OpenMP, MT atomic operations, Solaris threads, POSIX threads, event ports, MPI — layered over Solaris on UltraSPARC T1/T2, SPARC64 VI, UltraSPARC IV+, and Intel/AMD x86/x64.]

Automatic Parallelization and Vectorization

[Diagram: the same tools spectrum, highlighting the easiest options — instruction-level parallelism, automatic parallelization, automatic vectorization, and tuned MT libraries.]

Instruction-Level Parallelism
• Chips have figured out how to dispatch multiple instructions in parallel
• Compilers have figured out how to schedule for such processors
• Chips + compilers are very mature in this regard: no programmer action is required, and the gain is automatic wherever possible
• It IS possible to chew gum and walk at the same time!


Automatic Parallelization
• Supported for Fortran, C and C++ applications
  > First introduced for the 4-20 way SPARCserver 600 MP in 1991
• Useful for loop-oriented programs
  > Every (nested) loop is analyzed for data dependencies and parallelized if safe to do so
  > Non-loop code fragments are not analyzed
  > Loops are versioned with serial and parallel code (selected at runtime)
• Combined with powerful loop optimizations
  > There can be subtle interactions between loop transformations and parallelization
  > Compilers have limited knowledge about the application
• Overall gains can be impressive
  > The entire SPECfp 2006 suite gains 16% with PARALLEL=2
  > Individual gains can be up to 2x for suitable programs; libquantum from SPEC CPU2006 speeds up 6-7x on 8 cores!
  > Not every program will see a gain

Automatic Parallelization Options
• -xautopar
  > Automatic parallelization (Fortran, C and C++ compilers); requires -xO3 or higher (-xautopar implies -xdepend)
• -xreduction
  > Parallelize reduction operations (a hedged sketch follows below); recommended to use -fsimple=2 as well
• -xloopinfo
  > Show parallelization messages on screen
• Only apply to the most time-consuming parts of the program
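As a hedged illustration of the -xreduction case (the loop and names are hypothetical, not from the deck), a sum that -xautopar alone must leave serial because of the loop-carried accumulation:

/* sum.c -- compile with: cc -xO3 -xautopar -xreduction -xloopinfo -c sum.c */
double sum_array(const double *a, int n)
{
    int i;
    double sum = 0.0;
    for (i = 0; i < n; i++)
        sum += a[i];        /* accumulation recognized as a reduction */
    return sum;
}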


AutoPar: SPECfp 2006 Improvements

[Bar chart: base flags vs. base flags + AutoPar with PARALLEL=2 on a Woodcrest box (3.0GHz dual-core), across the SPECfp 2006 benchmarks: bwaves, gamess, milc, zeusmp, gromacs, cactusADM, leslie3d, namd, dealII, soplex, povray, calculix, GemsFDTD, tonto, lbm, wrf, sphinx3. Overall gain: 16%.]

Automatic Vectorization
• Supported for Fortran, C and C++ applications
• -xvector=simd exploits special SSE2+ instructions
• Works on data in adjacent memory locations
• Gains are smaller than with -xautopar
  > SPECfp 2006 gains 3% overall, and up to 14% individually
• Best suited for loop-level SIMD parallelism:

  for (i = 0; i < 1024; i++)
      c[i] = a[i] * b[i];

becomes, conceptually:

  for (i = 0; i < 1024; i += 4)
      c[i:i+3] = a[i:i+3] * b[i:i+3];

Case Study: Vectorizing STREAM
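The demo source isn't reproduced in the deck; below is a minimal sketch of the STREAM triad kernel (the standard formulation, assumed rather than taken from the demo) of the kind -xvector=simd targets:

/* Triad: a[i] = b[i] + scalar * c[i]. Unit-stride accesses and no
   loop-carried dependence let the compiler emit packed SSE2 operations. */
void triad(double *a, const double *b, const double *c,
           double scalar, int n)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}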


Tuned MT Libraries – Sun Perf Lib
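This slide is also a demo placeholder. Sun Performance Library ships tuned, multithreaded BLAS/LAPACK routines, so an ordinary dgemm call inherits the parallelism. A minimal sketch using the generic Fortran BLAS binding (the -xlic_lib=sunperf link flag is from the Sun Studio documentation of that era; treat both as assumptions):

/* C := 1.0*A*B + 0.0*C for n-by-n column-major matrices, via the
   standard Fortran BLAS symbol; all arguments pass by reference.
   Link with: cc ... -xlic_lib=sunperf */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

void mat_mul_blas(int n, const double *a, const double *b, double *c)
{
    const double one = 1.0, zero = 0.0;
    dgemm_("N", "N", &n, &n, &n, &one, a, &n, b, &n, &zero, c, &n);
}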


Compiler Support: OpenMP

[Diagram: the same tools spectrum, now highlighting OpenMP in the middle of the easiest-to-hardest range.]

What is OpenMP?
• De-facto industry standard API for writing shared-memory parallel applications in C, C++ and Fortran
  > See: http://www.openmp.org
• Consists of:
  > Compiler directives (pragmas)
  > Runtime routines (libmtsk)
  > Environment variables
• Advantages:
  > Incremental parallelization of source code
  > Small(er) amount of programming effort
  > Good performance and scalability
  > Portable across a variety of vendor compilers
• Sun Studio has consistently led on OpenMP
  > Support for the latest version (2.5 now; the v3.0 API is underway)
  > Consistent world-record SPEC OMP submissions for several years now

OpenMP – Directives with Intelligence


A Loop Parallelized With OpenMP

C/C++:

    #pragma omp parallel default(none) \
            shared(n, x, y) private(i)
    {
      #pragma omp for
      for (i = 0; i < n; i++)
          x[i] += y[i];
    }  /* -- End of parallel region -- */

Fortran:

    !$omp parallel default(none) &
    !$omp   shared(n, x, y) private(i)
    !$omp do
          do i = 1, n
             x(i) = x(i) + y(i)
          end do
    !$omp end do
    !$omp end parallel

(default, shared and private are the data-scoping clauses.)

Components Of OpenMP

Directives/Pragmas:
  • Parallel regions
  • Work sharing
  • Synchronization
  • Data scope attributes: private, firstprivate, lastprivate, shared, reduction
  • Orphaning

Runtime Environment:
  • Number of threads
  • Thread ID
  • Dynamic thread adjustment
  • Nested parallelism
  • Timers
  • API for locking

A minimal sketch of the runtime-routine side follows.
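The sketch below exercises two of the runtime routines listed above (standard OpenMP API; the program is illustrative, not from the deck):

#include <omp.h>
#include <stdio.h>

/* Each thread in the parallel region reports its ID and the team size. */
int main(void)
{
    #pragma omp parallel
    {
        printf("thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}

With Sun Studio this would compile as, e.g., cc -xopenmp omp_id.c (file name hypothetical).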


An OpenMP Example

Find the primes up to 3,000,000 (216,816 of them). Run on a Sun Fire 6800, Solaris 9, 24 processors (1.2GHz US-III+), with 9.8GB main memory.

  Model    # threads   Time (secs)   % change
  Serial   N/A         6.636         Base
  OpenMP   1           7.210         8.65% drop
  OpenMP   2           3.771         1.76x faster
  OpenMP   4           1.988         3.34x faster
  OpenMP   8           1.090         6.09x faster
  OpenMP   16          0.638         10.40x faster
  OpenMP   20          0.550         12.06x faster
  OpenMP   24          0.931         Saturation drop

Compiler Support: Programming Threads

[Diagram: the same tools spectrum, now highlighting Solaris threads and POSIX threads toward the harder end.]

Programming Threads
• Use the POSIX APIs – pthread_create, pthread_join, pthread_exit, et al.
  > Recommendation: consider reducing the thread stack size (the default is 1MB)
  > See pthread_attr_init(3C) for this and other attributes which can be adjusted
• Do not use the native Solaris threading API (e.g., thr_create)
  > Though applications which use it are still supported, it is non-portable


Data Synchronization
• Concurrent access to shared data requires synchronization:
  > Mutexes (pthread_mutex_lock/pthread_mutex_unlock) – a minimal sketch follows below
  > Condition variables (pthread_cond_wait)
  > Reader/writer locks (pthread_rwlock_rdlock/pthread_rwlock_wrlock)
  > Spin locks (pthread_spin_lock)
• Objects can be local to a process or shared between processes via shared memory.
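A minimal sketch of the first primitive in that list (standard POSIX API; the counter is hypothetical):

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

/* Every update to the shared counter happens under the mutex. */
void increment(void)
{
    pthread_mutex_lock(&lock);
    counter++;
    pthread_mutex_unlock(&lock);
}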


MT Demo: Multithreading Primes


int is_prime(int v)
{
    int i;
    int bound = floor(sqrt((double)v)) + 1;

    for (i = 2; i < bound; i++) {
        if (v % i == 0)
            return 0;
    }
    return (v > 1);
}

void *work(void *arg)
{
    int start, end, i;
    int val = *((int *) arg);

    start = (N / THREADS) * val;
    end   = start + N / THREADS;
    for (i = start; i < end; i++) {
        if (is_prime(i)) {
            primes[total] = i;
            total++;
        }
    }
    return NULL;
}

int main(int argc, char **argv)
{
    for (i = 0; i < N; i++) {
        pflag[i] = 1;
    }
    for (i = 0; i < (THREADS - 1); i++) {
        pthread_create(&tids[i], NULL, work, (void *) &i);
    }
    i = THREADS - 1;
    work((void *) &i);
    for (i = 0; i < THREADS; i++) {
        pthread_join(tids[i], NULL);
    }
}


STOP! Problem Ahead
RDT Demo, please


Data Race Condition
• A data race condition occurs when:
  > multiple threads access a shared memory location,
  > without a synchronized access order,
  > and at least one access is a write.
• Data races often occur in shared-memory parallel programming models such as Pthreads and OpenMP.
  > The effect of a data race is unpredictable and may show up only once in hundreds of runs.
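The primes demo a few slides back has exactly this bug: the workers update primes[] and total with no synchronization, and main hands every thread the address of the same loop variable i. A hedged sketch of the usual repair, reusing the demo's names plus a hypothetical total_lock mutex and ids array:

static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;
static int ids[THREADS];          /* one stable argument slot per thread */

/* In work(): serialize the update of the shared results. */
pthread_mutex_lock(&total_lock);
primes[total] = i;
total++;
pthread_mutex_unlock(&total_lock);

/* In main(): hand each thread its own slot instead of &i. */
for (i = 0; i < THREADS - 1; i++) {
    ids[i] = i;
    pthread_create(&tids[i], NULL, work, (void *) &ids[i]);
}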


Thread Analyzer
• Detects data races and deadlocks in a multithreaded application
  > Points to non-deterministic or incorrect execution
  > Such bugs are notoriously difficult to detect by examination
  > Points out actual and potential deadlock situations
• Process (commands sketched below):
  > Instrument the code with -xinstrument=datarace
  > Detect the runtime condition with collect -r race (or all, for deadlock detection too)
  > Use the graphical analyzer, tha, to identify conflicts and critical regions
• Works with OpenMP, Pthreads, Solaris threads
  > API provided for user-defined synchronization primitives
• Works on Solaris (SPARC, x86/x64) and Linux
• Static lock_lint tool detects inconsistent use of locks
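As a command-line sketch of that process (file name illustrative; collect names the experiment itself, typically tha.1.er, so adjust to what it reports):

% cc -g -xinstrument=datarace -o primes primes.c -lpthread
% collect -r race ./primes     # record a data-race experiment
% tha tha.1.er                 # browse detected races in the GUI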

A True SPEC Story
• SPEC OMP benchmark fma3d: 101 source files; 61,000 lines of Fortran code
• A data race in platq.f90 caused sporadic core dumps
• It took several engineers and 6 weeks of work to find the data race manually

Perils of Having a Data Race Condition
• The program exhibits non-deterministic behavior
• Failure may be hard to reproduce
• The program may continue to execute, leading to failure in unrelated code
• A data race is hard to detect using conventional debugging methods and tools

How Did Thread Analyzer Help?

With the Sun Studio Thread Analyzer, the same fma3d data race was detected in just a few hours!

Compiler Support: Message Passing Interface

[Diagram: the same tools spectrum, now highlighting MPI at the hardest end.]

Message Passing Interface (MPI)
• The MPI programming model is a de-facto standard for distributed-memory parallel programming
• The MPI API set is quite large (323 subroutines)
  > Yet an MPI application can be programmed with fewer than 10 different calls (see the sketch below)
• Implemented on top of a very small set of low-level device interconnect routines

Open MPI: http://www.open-mpi.org/
MPI home page at Argonne National Laboratory: http://www-unix.mcs.anl.gov/mpi/
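As a minimal sketch backing the "fewer than 10 calls" point, four calls already make a complete program (standard MPI API; file name hypothetical):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?  */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes?  */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

Run it with, e.g., mpirun -np 4 hello.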


Message Passing Interface (MPI)
• Open MPI, with MPI 2.0 conformance
• ClusterTools 7.0 ships with Sun Studio
• Multiple processes run under the Open Runtime Environment
• Data messages pass between processes in point-to-point, blocking communication mode (sketched below)
• No race conditions with correct use of MPI message-passing calls
• MPI profiling under the Performance Analyzer
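A hedged sketch of that point-to-point, blocking mode (standard MPI calls; the value and tag are illustrative):

/* Rank 0 sends one int to rank 1; both calls block until the
   buffer is safe to reuse. */
int value = 42;
if (rank == 0)
    MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
else if (rank == 1)
    MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);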


Launching an MPI Application
• For Single Program Multiple Data (SPMD):
  > mpirun -np x program1
• For Multiple Program Multiple Data (MPMD):
  > mpirun -np x program1 : -np y program2
• Launching on different nodes (hosts):
  > mpirun -np x -host <machineA,...> program1
• And more... a very flexible way of launching


Comparing OpenMP and MPI

  OpenMP                            MPI
  De-facto industry standard        De-facto industry standard
  Limited to one (SMP) system       Runs on any number of systems
  Not (yet?) GRID-ready             GRID-ready
  Easier to get started             High and steep learning curve
  Assistance from compilers         You're on your own
  Mix-and-match model               All-or-nothing model
  Requires data scoping             No data scoping required
  Increasingly popular (CMT?)       More widely used (but...)
  Preserves sequential code         No sequential version
  Needs a compiler                  No compiler; just a library
  No special environment            Requires a runtime environment
  Performance issues implicit       Easy to control performance

Thank you!

Vijay Tatkar Sr. Engineering Manager Sun Studio Developer Tools http://blogs.sun.com/tatkar


Case Study: AutoPar Matrix Multiply


AutoPar Example Program

// Matrix multiplication
32  #define MAX 1024
33  void matrix_mul(float (*x_mat)[MAX],
34                  float (*y_mat)[MAX], float (*z_mat)[MAX])
35  {
36      for (int j = 0; j < MAX; j++) {
37          for (int k = 0; k < MAX; k++) {
38              z_mat[j][k] = 0.0;
39              for (int t = 0; t < MAX; t++) {
40                  z_mat[j][k] += x_mat[j][t] * y_mat[t][k];
41              }
42          }
43      }
44  }


AutoPar Example Compilation

% CC -c mat_mul.cc -g -fast -xrestrict -xautopar -xloopinfo -o mat_mul.o
"mat_mul.cc", line 36: PARALLELIZED
"mat_mul.cc", line 37: not parallelized, not profitable
"mat_mul.cc", line 39: not parallelized, unsafe dependence

You can run the er_src command on the executable binary to see the compiler's internal messages.


% CC mat_mul.cc -g -fast -xrestrict -xinline=no -o noautopar
% CC mat_mul.cc -g -fast -xrestrict -xloopinfo -xautopar -xinline=no -o autopar

% ptime noautopar
Finish multiplication of matrix of 1024
real        1.536
user        1.521
sys         0.018

% ptime autopar
Finish multiplication of matrix of 1024
real        1.542
user        1.520
sys         0.016

% setenv PARALLEL 2
% ptime ./autopar
Finish multiplication of matrix of 1024
real        0.817
user        1.572
sys         0.016

With PARALLEL=2, wall-clock (real) time nearly halves while total CPU (user) time stays essentially flat.

OpenMP Demo: Parallelizing Primes


Parallelizing Primes Example (OpenMP)
• Partition the problem space into smaller chunks and dispatch the processing of each partition to individual (micro)tasks
  > A popular and practical example of how parallel software deals with large data
  > The basic design of this example applies to many other parallel processing tasks
• The overall program structure is very simple:
  > A thread worker routine
  > A main program creating multiple worker threads/microtasks

int main_omp(int argc, char **argv)
{
#ifdef _OPENMP
    omp_set_num_threads(NTHRS);
    omp_set_dynamic(0);
#endif
    for (i = 0; i < N; i++) {
        pflag[i] = 1;
    }
    #pragma omp parallel for
    for (i = 2; i < N; i++) {
        if (is_prime(i)) {
            primes[total] = i;
            total++;
        }
    }
    printf("Number of prime numbers between 2 and %d: %d\n", N, total);
}
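Note that primes[total] = i; total++ is still an unsynchronized shared update inside the parallel loop. A hedged sketch of one standard repair available in OpenMP 2.5 is a critical section around the update:

    #pragma omp parallel for
    for (i = 2; i < N; i++) {
        if (is_prime(i)) {
            #pragma omp critical
            {
                primes[total] = i;   /* shared update now serialized */
                total++;
            }
        }
    }

This preserves correctness at some cost in scalability; the race-condition slides below discuss why the unprotected version is dangerous.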

int is_prime(int v)
{
    int i, bound = floor(sqrt((double)v)) + 1;

    for (i = 2; i < bound; i++) {
        if (v % i == 0)
            return 0;
    }
    return (v > 1);
}


General Race Condition
• A general race condition is caused by an undetermined sequence of executions that violates the integrity of the program state
  > A data race is a simple form of general race condition
  > A general race can occur in both shared-memory and distributed-memory parallel programming


General Race Example

void shuffle_objects(...)
{
    // Remove target objects from the source container
    mutex_lock();
    source.remove_array(target_objects);
    mutex_unlock();

    // Here the state is unstable: the objects are in neither
    // container, which may cause a general race

    // Add target objects to the destination container
    mutex_lock();
    destination.add_array(target_objects);
    mutex_unlock();
}
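A hedged sketch of the usual repair, in the same pseudo-API: hold the lock across both steps so no other thread can observe the objects in neither container:

void shuffle_objects_fixed(...)
{
    mutex_lock();
    source.remove_array(target_objects);
    destination.add_array(target_objects);   // no unstable window
    mutex_unlock();
}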

Design Practices to Avoid Races
• Adopt a higher design abstraction such as OpenMP
• Use pass-by-value instead of pass-by-pointer to communicate between threads
• Design data structures to limit global variable usage and restrict access to shared memory
• Analyze a race to decide whether it is a harmful program bug or a benign race
• Understand and fix the real cause of a race condition instead of its symptom


MPI: Single Program Multiple Data
• The processes launched are in the same communicator
  > mpirun -np 8 msorts
  > The 8 processes launched belong to the MPI_COMM_WORLD communicator
  > 8 ranks: 0, 1, 2, 3, 4, 5, 6, 7; total size: 8
• All 8 processes run the same program; control flow differs by checking the rank:

MPI_Init(...);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (rank == 0) {
    ...
} else if (rank == 1) {
    ...
} else if (rank == 2) {
    ...
}
MPI_Finalize();


MPI Example: 7 Sorting Processes
Eight processes all together: a driver plus seven sorters
  • Bubblesort
  • Shakersort
  • Binary insertion sort
  • Heapsort
  • Straight insertion sort
  • Quicksort
  • Straight selection sort


MPI Demo: 7 Sorting Processes


MPI: Non-Uniform Memory Performance

[Plot: performance vs. working-set size, stepping down across the memory hierarchy — registers, 64KB L1 cache, 8MB L2 cache, main memory, virtual memory — with the tuning area in the main-memory region.]

• The length of a plateau is related to the size of that memory component
• The size of the drop is related to the latency (or bandwidth) of that memory component
• MPI can help reduce per-process program size to fit into the good regions

Sun Studio and HPC
• Sun HPC: http://www.sun.com/servers/HPC/index.jsp
• Sun HPC ClusterTools 7 software: http://www.sun.com/software/products/clustertools
• N1 Grid Engine manager software

Other MPI Libraries
• Open-source MPICH library for Solaris SPARC: http://www-unix.mcs.anl.gov/mpi/mpich
• LAM/MPI ported library for Solaris x86/x64: http://apstc.sun.com.sg/popup.php?l1=research&l2=projects&l3=s10port&f=applications#LAM/MPI
• MVAPICH – MPI over InfiniBand for Solaris x86/x64: http://nowlab.cse.ohio-state.edu/projects/mpi-iba


Parallel Computing Environment

[Diagram: the parallel computing spectrum from loosely coupled to tightly coupled — global and enterprise-level Grid & SOA (web services) at one end; UPC/GAS and multi-process MPI in the middle; multi-threaded OpenMP at the tightly coupled end. A local cluster grid (N1 Grid) runs MPI, OpenMP, MT, and serial applications.]
