Parallel and Distributed Computing on Low Latency Clusters
Vittorio Giovara
M.S. Electrical Engineering and Computer Science
University of Illinois at Chicago
May 2009
Contents
• Motivation
• Application
• Strategy
• Compiler Optimizations
• Technologies
  • OpenMP and MPI over Infiniband
  • OpenMP
  • MPI
  • Infiniband
• Results
• Conclusions
Motivation
Motivation
• The CMOS scaling trend cannot continue indefinitely:
  ✓ Direct-tunneling limit in SiO2 ~3 nm
  ✓ Distance between Si atoms ~0.3 nm
  ✓ Variability
• Fundamental reason: rising fabrication cost
Motivation
• Building multi-core processors is easy
• Adapting software to run concurrently still requires human effort
• New classification for computer architectures
Classification
[Diagram: Flynn's taxonomy; SISD, SIMD, MISD, and MIMD arranged by instruction pool and data pool feeding one or more CPUs, with architectures toward MIMD being easier to parallelize]
Levels
[Diagram: parallelization abstraction levels (algorithm, loop level, process management) and the associated concerns: recursion, memory management, profiling, data dependency, branching overhead, control flow, SMP, multiprogramming, multithreading and scheduling]
Backfire
• Difficulty in fully exploiting the parallelism offered
• Automatic tools required to adapt software to parallelism
• Compiler support for manual or semiautomatic enhancement
Applications
• OpenMP and MPI are two popular tools used to simplify the parallelization of both new and existing software
• Mathematics and Physics
• Computer Science
• Biomedicine
Specific Problem and Background
• Sally3D is a micromagnetism program suite for field analysis and modeling developed at Politecnico di Torino (Department of Electrical Engineering)
• Computationally intensive (runs can take days of CPU time); a speedup is required
• Previous work does not fully cover the problem (no Infiniband or OpenMP+MPI solutions)
Strategy
Strategy
• Install a Linux kernel with an ad-hoc configuration for scientific computation
• Compile an OpenMP-enabled GCC (supported from 4.3.1 onwards)
• Add the Infiniband link among the cluster nodes, with the proper drivers in kernel and user space
• Select an MPI implementation library
Strategy
• Verify the Infiniband network with some MPI test examples
• Install the target software
• Include OpenMP and MPI directives in the code
• Run test cases
OpenMP
• standard
• supported by most modern compilers
• requires little knowledge of the software
• very simple constructs
OpenMP - example
[Diagram: fork/join model; the master thread forks Parallel Tasks 1-4 onto Threads A and B, which then join back into the master thread]
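To make the fork/join model concrete, here is a minimal OpenMP sketch in Fortran (Sally3D's language); the array names, sizes, and loop body are hypothetical placeholders rather than Sally3D code. The master thread forks a team of threads at the parallel do directive and joins them at the implicit barrier after the loop.

program omp_fork_join
  use omp_lib
  implicit none
  integer, parameter :: n = 100000
  integer :: i
  real(8), allocatable :: a(:), b(:)

  allocate(a(n), b(n))
  b = 1.0d0                        ! master thread prepares the input data

  ! fork: loop iterations are divided among the team of threads
  !$omp parallel do
  do i = 1, n
     a(i) = 2.0d0 * b(i) + 1.0d0   ! independent iterations, no data dependency
  end do
  !$omp end parallel do
  ! join: implicit barrier, then the master thread continues alone

  print *, 'threads available:', omp_get_max_threads(), '  a(1) =', a(1)
end program omp_fork_join

Built with, e.g., gfortran -fopenmp; the number of threads is chosen through the OMP_NUM_THREADS environment variable.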
OpenMP Scheduler
• Which scheduler is best suited to the hardware?
  - Static
  - Dynamic
  - Guided
• The policy and chunk size are selected on the directive, as in the sketch below
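An illustrative fragment showing how the three policies are requested (the loop body is a placeholder, not taken from Sally3D); the charts that follow compare them at several chunk sizes.

program omp_schedule_demo
  implicit none
  integer, parameter :: n = 100000
  integer :: i
  real(8), allocatable :: a(:)

  allocate(a(n))

  ! static: iterations split into fixed chunks of 100, assigned round-robin in advance
  !$omp parallel do schedule(static, 100)
  do i = 1, n
     a(i) = sqrt(dble(i))
  end do
  !$omp end parallel do

  ! dynamic: chunks of 100 handed to threads on demand as they become idle
  !$omp parallel do schedule(dynamic, 100)
  do i = 1, n
     a(i) = sqrt(dble(i))
  end do
  !$omp end parallel do

  ! guided: chunks start large and shrink down to the given minimum size
  !$omp parallel do schedule(guided, 100)
  do i = 1, n
     a(i) = sqrt(dble(i))
  end do
  !$omp end parallel do

  print *, 'done, a(n) =', a(n)
end program omp_schedule_demo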
OpenMP Scheduler
[Chart: OpenMP static scheduler; execution time in microseconds vs. number of threads (1-16) for chunk sizes 1, 10, 100, 1000, 10000]

OpenMP Scheduler
[Chart: OpenMP dynamic scheduler; execution time in microseconds vs. number of threads (1-16) for chunk sizes 1, 10, 100, 1000, 10000]

OpenMP Scheduler
[Chart: OpenMP guided scheduler; execution time in microseconds vs. number of threads (1-16) for chunk sizes 1, 10, 100, 1000, 10000]
OpenMP Scheduler
[Slide: side-by-side comparison of the static, dynamic, and guided scheduler charts]
MPI
• standard
• widely used in cluster environments
• many transport links supported
• different implementations available
  - OpenMPI
  - MVAPICH
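A minimal MPI sketch in Fortran showing the send/recv pattern between two ranks (the buffer size and tag are arbitrary illustration values, not taken from Sally3D); both OpenMPI and MVAPICH build it with their mpif90 wrapper.

program mpi_sendrecv_demo
  use mpi
  implicit none
  integer :: ierr, rank, nprocs
  integer :: status(MPI_STATUS_SIZE)
  real(8) :: buf(1024)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  if (rank == 0) then
     buf = 3.14d0
     ! rank 0 sends one message to rank 1 with tag 0
     call MPI_Send(buf, size(buf), MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, ierr)
  else if (rank == 1) then
     ! rank 1 blocks until the matching message arrives
     call MPI_Recv(buf, size(buf), MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, status, ierr)
     print *, 'rank 1 received', buf(1)
  end if

  call MPI_Finalize(ierr)
end program mpi_sendrecv_demo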
Infiniband
• standard
• widely used in cluster environments
• very low latency for small packets
• up to 16 Gb/s transfer speed
MPI over Infiniband
[Chart: latency in µs (log scale, 1 µs to 10,000,000 µs) vs. message size from 1 kB to 16 GB, for OpenMPI and Mvapich2]
MPI over Infiniband
[Chart: latency in µs (log scale, 1 µs to 10,000,000 µs) vs. message size from 1 kB to 8 MB, for OpenMPI and Mvapich2]
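The latency curves above come from MPI benchmarks; a simplified ping-pong measurement of the kind such charts are based on is sketched below (fixed 1 kB message size and repetition count chosen for illustration, not the exact benchmark used here).

program pingpong
  use mpi
  implicit none
  integer, parameter :: nbytes = 1024        ! message size under test (1 kB)
  integer, parameter :: reps = 1000
  integer :: ierr, rank, i
  integer :: status(MPI_STATUS_SIZE)
  character :: buf(nbytes)
  real(8) :: t0, t1

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  buf = 'x'

  t0 = MPI_Wtime()
  do i = 1, reps
     if (rank == 0) then
        call MPI_Send(buf, nbytes, MPI_CHARACTER, 1, 0, MPI_COMM_WORLD, ierr)
        call MPI_Recv(buf, nbytes, MPI_CHARACTER, 1, 0, MPI_COMM_WORLD, status, ierr)
     else if (rank == 1) then
        call MPI_Recv(buf, nbytes, MPI_CHARACTER, 0, 0, MPI_COMM_WORLD, status, ierr)
        call MPI_Send(buf, nbytes, MPI_CHARACTER, 0, 0, MPI_COMM_WORLD, ierr)
     end if
  end do
  t1 = MPI_Wtime()

  ! one-way latency: half the average round-trip time, reported in microseconds
  if (rank == 0) print *, 'latency (us):', (t1 - t0) / (2.0d0 * reps) * 1.0d6
  call MPI_Finalize(ierr)
end program pingpong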
Optimizations
• Enabled at compile time
• Available only after porting the software to standard FORTRAN
• Consistent documentation available
• Unexpected positive results
Optimizations
• -march=native
• -O3
• -ffast-math
• -Wl,-O1
Target Software
Target Software
• Sally3D
• micromagnetic equation solver
• written in FORTRAN with some C libraries
• the program uses a linear formulation of the mathematical models
Implementation Scheme
[Diagram: a sequential loop in the standard programming model becomes a parallel loop run by OpenMP threads on one host, and then a distributed loop with OpenMP threads on Host 1 and Host 2 coupled through MPI]
Implementation Scheme
• Data structure: not embarrassingly parallel
• Three-dimensional matrix
• Several temporary arrays; synchronization objects required
  ➡ send() and recv() mechanism
  ➡ critical regions using OpenMP directives
  ➡ function merging
  ➡ matrix conversion
• A minimal sketch of the resulting hybrid loop is shown below
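In the sketch, MPI distributes contiguous slices of the iteration space across hosts, OpenMP threads split each slice locally, and a shared accumulator is protected with an OpenMP critical region. The work array, its size, and the loop body are hypothetical stand-ins for Sally3D's three-dimensional matrices, and the slice size assumes the iteration count is divisible by the number of ranks.

program hybrid_loop
  use mpi
  implicit none
  integer, parameter :: n = 1200
  integer :: ierr, rank, nprocs, i, nlocal, istart, iend
  real(8), allocatable :: work(:)
  real(8) :: local_sum, global_sum

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  allocate(work(n))
  work = 1.0d0

  ! MPI level: each host takes a contiguous slice of the loop
  nlocal = n / nprocs
  istart = rank * nlocal + 1
  iend   = istart + nlocal - 1

  local_sum = 0.0d0
  ! OpenMP level: threads on this host split the local slice
  !$omp parallel do
  do i = istart, iend
     work(i) = 2.0d0 * work(i)
     !$omp critical
     local_sum = local_sum + work(i)   ! shared accumulator guarded by a critical region
     !$omp end critical
  end do
  !$omp end parallel do

  ! combine the partial results from all hosts
  call MPI_Allreduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                     MPI_SUM, MPI_COMM_WORLD, ierr)

  if (rank == 0) print *, 'global sum =', global_sum
  call MPI_Finalize(ierr)
end program hybrid_loop

A build of this kind is compiled with the MPI wrapper plus -fopenmp and launched with mpirun across the hosts.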
Results
Results
[Table: total run time in seconds for the combinations of OMP, MPI, and OPT enabled (*) or disabled (-): 133, 400, 186, 487, 200, 792, 246, and 1062 seconds; the configuration with all three enabled is the fastest, the plain build the slowest]
Total Speed Increase: 87.52%
Actual Results
[Table: total run time in seconds for the OMP/MPI combinations enabled (*) or disabled (-): 59, 129, 174, and 249 seconds]

Per-function times:
Function Name       Normal    OpenMP    MPI       OpenMP+MPI
calc_intmudua       24.5 s    4.7 s     14.4 s    2.8 s
calc_hdmg_tet       16.9 s    3.0 s     10.8 s    1.7 s
calc_mudua          12.1 s    1.9 s     7.0 s     1.1 s
campo_effettivo     17.7 s    4.5 s     9.9 s     2.3 s
Actual Results
• OpenMP: 6-8x
• MPI: 2x
• OpenMP + MPI: 14-16x
Total Raw Speed Increment: 76%
Conclusions
Conclusions and Future Work
• Computational time has been significantly decreased
• Speedup is consistent with expected results
• Submitted to COMPUMAG '09
• Continue inserting OpenMP and MPI directives
• Perform algorithm optimizations
• Increase cluster size