[slides] Parallel And Distributed Computing On Low Latency Clusters

Parallel and Distributed Computing on Low Latency Clusters
Vittorio Giovara
M.S. Electrical Engineering and Computer Science
University of Illinois at Chicago
May 2009

Contents
• Motivation
  - Application
• Strategy
  - Compiler Optimizations
• Technologies
  - OpenMP and MPI over Infiniband
  - OpenMP
  - MPI
  - Infiniband
• Results
• Conclusions

Motivation

Motivation
• The scaling trend of CMOS technology has to stop:
  ✓ Direct-tunneling limit in SiO2 ~3 nm
  ✓ Distance between Si atoms ~0.3 nm
  ✓ Variability
• Fundamental reason: rising fab cost

Motivation
• Easy to build multi-core processors
• Requires human action to modify and adapt concurrent software
• New classification for computer architectures

Classification
[Diagram: Flynn's taxonomy — SISD, SIMD, MISD, and MIMD, each drawn as one or more CPUs connected to an instruction pool and a data pool]

Levels
[Diagram: parallelization levels by abstraction — algorithm, loop level, process management (SMP, multiprogramming, multithreading and scheduling) — annotated with the related concerns (recursion, memory management, profiling, data dependency, branching overhead, control flow) and an "easier to parallelize" axis]

Backfire
• Difficulty in fully exploiting the offered parallelism
• Automatic tools are required to adapt software to parallelism
• Compiler support for manual or semi-automatic enhancement

Applications
• OpenMP and MPI are two popular tools used to simplify the parallelization of both new and existing software
• Mathematics and Physics
• Computer Science
• Biomedicine

Specific Problem and Background

• Sally3D is a micromagnetism program suite for field analysis and modeling, developed at Politecnico di Torino (Department of Electrical Engineering)
• Computationally intensive (runs can take days of CPU time); a speedup is required
• Previous work does not fully encompass the problem (no Infiniband or OpenMP+MPI solutions)

Strategy

Strategy
• Install a Linux kernel with an ad-hoc configuration for scientific computation
• Compile an OpenMP-enabled GCC (supported from 4.3.1 onwards)
• Add the Infiniband link among the clusters, with the proper drivers in kernel and user space
• Select an MPI implementation library

Strategy
• Verify the Infiniband network through some MPI test examples
• Install the target software
• Proceed to include OpenMP and MPI directives in the code
• Run test cases

OpenMP
• standard
• supported by most modern compilers
• requires little knowledge of the software
• very simple construction methods

OpenMP - example
[Diagram: fork/join model — the master thread forks parallel tasks 1-4, which run on threads A and B and then join back into the master thread]
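A minimal sketch of the fork/join pattern from the diagram, written in C with OpenMP; the numbered tasks are placeholders, not functions from the target software:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        /* The master thread forks a team; each section runs as an
           independent parallel task, then all threads join. */
        #pragma omp parallel sections
        {
            #pragma omp section
            { printf("task 1 on thread %d\n", omp_get_thread_num()); }
            #pragma omp section
            { printf("task 2 on thread %d\n", omp_get_thread_num()); }
            #pragma omp section
            { printf("task 3 on thread %d\n", omp_get_thread_num()); }
            #pragma omp section
            { printf("task 4 on thread %d\n", omp_get_thread_num()); }
        }   /* implicit join back into the master thread */
        return 0;
    }

Each section is handed to whichever thread is free, matching the two-thread layout shown in the diagram.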

OpenMP Scheduler
• Which scheduler is available for the hardware?
  - Static
  - Dynamic
  - Guided
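As an illustration, a sketch in C of how the three policies are requested on a loop; the loop body is a placeholder, and the chunk size of 100 simply mirrors one of the chunk sizes benchmarked in the charts below:

    #define N 100000

    void scale(double *a) {
        int i;

        /* static: iterations split into fixed chunks assigned round-robin */
        #pragma omp parallel for schedule(static, 100)
        for (i = 0; i < N; i++) a[i] *= 2.0;

        /* dynamic: chunks handed to threads as they become idle */
        #pragma omp parallel for schedule(dynamic, 100)
        for (i = 0; i < N; i++) a[i] *= 2.0;

        /* guided: chunk size shrinks as the remaining work decreases */
        #pragma omp parallel for schedule(guided, 100)
        for (i = 0; i < N; i++) a[i] *= 2.0;
    }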

OpenMP Scheduler
[Chart: OpenMP static scheduler — execution time in microseconds (0 to 80,000) vs. number of threads (1 to 16), for chunk sizes 1, 10, 100, 1000, and 10000]

OpenMP Scheduler
[Chart: OpenMP dynamic scheduler — execution time in microseconds (0 to 117,000) vs. number of threads (1 to 16), for the same chunk sizes]

OpenMP Scheduler
[Chart: OpenMP guided scheduler — execution time in microseconds (0 to 80,000) vs. number of threads (1 to 16), for the same chunk sizes]

OpenMP Scheduler
[Charts: side-by-side comparison of the static, dynamic, and guided schedulers]

MPI
• standard
• widely used in cluster environments
• many transport links supported
• different implementations available
  - OpenMPI
  - MVAPICH
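For reference, a minimal MPI program in C using the send()/recv() mechanism mentioned later in the implementation; this is illustrative only, not code from the target software:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* rank 0 sends one integer to rank 1 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Both OpenMPI and MVAPICH implement this same standard interface, so the benchmark below only swaps the underlying library.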

Infiniband
• standard
• widely used in cluster environments
• very low latency for small packets
• up to 16 Gb/s transfer speed

MPI over Infiniband
[Chart: transfer time in µs (log scale, 1 µs to 10,000,000 µs) vs. message size from 1 kB to 16 GB, comparing OpenMPI and MVAPICH2]

MPI over Infiniband
[Chart: detail of the same benchmark for message sizes from 1 kB to 8 MB, comparing OpenMPI and MVAPICH2]

Optimizations
• Active at compile time
• Available only after porting the software to standard FORTRAN
• Consistent documentation available
• Unexpected positive results

Optimizations
• -march=native
• -O3
• -ffast-math
• -Wl,-O1
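For illustration, with the OpenMP-enabled GCC toolchain these flags would typically all appear on the compile line, e.g. gfortran -fopenmp -march=native -O3 -ffast-math -Wl,-O1 followed by the source files; the -fopenmp flag and the file list are assumptions added here, not stated in the slides.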

Target Software

Target Software
• Sally3D
• micromagnetic equation solver
• written in FORTRAN with some C libraries
• the program uses a linear formulation of the mathematical models

Implementation Scheme
[Diagram: a sequential loop in the standard programming model becomes a parallel loop executed by OpenMP threads on each host, and a distributed loop split between Host 1 and Host 2 via MPI]

Implementation Scheme
• Data structure: not embarrassingly parallel
• Three-dimensional matrix
• Several temporary arrays – synchronization objects required (see the sketch below)
  ➡ send() and recv() mechanism
  ➡ critical regions using OpenMP directives
  ➡ function merging
  ➡ matrix conversion
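A minimal sketch of the hybrid pattern described above, in C with OpenMP and MPI: each host works on its slab of the 3-D matrix with OpenMP threads, a critical region guards a shared value, and send()/recv() exchanges a boundary plane between hosts. The array names, sizes, and the update itself are placeholders, not Sally3D's actual code:

    #include <mpi.h>
    #include <omp.h>

    #define NX 32    /* local slab size along x (placeholder) */
    #define NY 64
    #define NZ 64

    double step(double field[NX][NY][NZ], int rank, int nprocs) {
        double max_val = 0.0;      /* shared across threads          */
        double halo[NY][NZ];       /* plane received from neighbour  */
        int i, j, k;

        /* OpenMP threads split the loop over the local slab */
        #pragma omp parallel for private(j, k)
        for (i = 0; i < NX; i++)
            for (j = 0; j < NY; j++)
                for (k = 0; k < NZ; k++) {
                    field[i][j][k] *= 0.5;       /* placeholder update */
                    #pragma omp critical         /* protect shared max */
                    if (field[i][j][k] > max_val)
                        max_val = field[i][j][k];
                }

        /* MPI exchanges one boundary plane between neighbouring hosts */
        if (rank + 1 < nprocs)
            MPI_Send(field[NX - 1], NY * NZ, MPI_DOUBLE,
                     rank + 1, 0, MPI_COMM_WORLD);
        if (rank > 0)
            MPI_Recv(halo, NY * NZ, MPI_DOUBLE,
                     rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        (void)halo;   /* the received plane would feed the next update */
        return max_val;
    }

In practice a reduction clause would usually replace the critical region for speed; the critical form is shown only because the slides list it explicitly.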

Results

Results

OMP   MPI   OPT   seconds
 *     *     *      133
 *     *     -      400
 *     -     *      186
 *     -     -      487
 -     *     *      200
 -     *     -      792
 -     -     *      246
 -     -     -     1062

Total Speed Increase: 87.52%
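Presumably the headline figure compares the fully enabled and fully disabled configurations, i.e. roughly (1062 − 133) / 1062 ≈ 87.5%.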

Actual Results

OMP   MPI   seconds
 *     *      59
 *     -     129
 -     *     174
 -     -     249

Function Name      Normal   OpenMP   MPI      OpenMP+MPI
calc_intmudua      24.5 s   4.7 s    14.4 s   2.8 s
calc_hdmg_tet      16.9 s   3.0 s    10.8 s   1.7 s
calc_mudua         12.1 s   1.9 s    7.0 s    1.1 s
campo_effettivo    17.7 s   4.5 s    9.9 s    2.3 s

Actual Results
• OpenMP: 6-8x
• MPI: 2x
• OpenMP + MPI: 14-16x
Total Raw Speed Increase: 76%

Conclusions

Conclusions and Future Work

• Computational time has been significantly decreased
• Speedup is consistent with the expected results
• Submitted to COMPUMAG '09

Future work:
• Continue inserting OpenMP and MPI directives
• Perform algorithm optimizations
• Increase the cluster size
