Hardware- and Software-Based Collective Communication on the Quadrics Network
Fabrizio Petrini, Salvador Coll, Eitan Frachtenberg and Adolfy Hoisie
CCS-3 Modeling, Algorithms, and Informatics, Los Alamos National Laboratory / Technical University of Valencia, Spain
[email protected]
Hardware- and Software-Based Collective Communication on the Quadrics Network – p.1
Outline
- Introduction
- Quadrics network design overview: hardware, communication/programming libraries
- Collective communication on the QsNET: barrier synchronization, broadcast
- Performance analysis: experimental framework, results
- Conclusions
Introduction
- The efficient implementation of collective communication is a challenging design effort
- Very important to guarantee scalability of barrier synchronization, broadcast, gather, scatter, reduce, etc.
- Essential to implement system primitives that enhance fault tolerance
- Software or hardware support for multicast communication can improve the performance and resource utilization of a parallel computer
  - Software multicast: based on unicast messages, simple to implement, no network topology constraints, slower
  - Hardware multicast: requires dedicated hardware, network dependent, faster
Introduction
Some of the most powerful systems in the world use the Quadrics interconnection network and the collective communication services analyzed in this work:
- The Terascale Computing System (TCS) at the Pittsburgh Supercomputing Center, the second most powerful computer in the world
Introduction
Some of the most powerful systems in the world use the Quadrics interconnection network and the collective communication services analyzed in this work:
- The Terascale Computing System (TCS) at the Pittsburgh Supercomputing Center, the second most powerful computer in the world
[Figure: TCS barrier test, latency (4.5–6.5 µs) vs. number of nodes (2–512)]
Introduction
Some of the most powerful systems in the world use the Quadrics interconnection network and the collective communication services analyzed in this work:
- The Terascale Computing System (TCS) at the Pittsburgh Supercomputing Center, the second most powerful computer in the world
- The ASCI Q machine, currently under development at Los Alamos National Laboratory (30 TeraOps, expected to be delivered by the end of 2002)
Quadrics Network Design Overview
- QsNET provides an abstraction of distributed virtual shared memory
- Each process can map a portion of its address space into the global memory
- These address spaces constitute the virtual shared memory
- This shared memory is fully integrated with the native operating system
- Based on two building blocks: a network interface card called Elan, and a crossbar switch called Elite
Elan
[Block diagram of the Elan network interface: 200 MHz link mux with two FIFOs, µcode processor, thread processor, SDRAM interface, DMA buffers, inputter, 100 MHz data bus, MMU & TLB with 4-way set-associative cache, table walk engine, clock & statistics registers, 66 MHz/64-bit PCI interface]
Elan
[Same block diagram, highlighting the link: 400 MB/s bidirectional, 200 MHz / 10 bits]
Elan
[Same block diagram, highlighting the 2 virtual channels per link]
Elan
[Same block diagram, highlighting the thread processor: 32-bit SPARC-based, runs the communication protocols and DMA]
Elan
[Same block diagram, highlighting the TLB, which is synchronized with the host]
Elan
[Same block diagram, highlighting the 66 MHz / 64-bit PCI interface]
Elite
- 8 bidirectional links with 2 virtual channels in each direction
- An internal 16x8 full crossbar switch
- 400 MB/s in each link direction
- Packet error detection and recovery, with routing and data transactions CRC-protected
- 2 priority levels plus an aging mechanism
- Adaptive routing
- Hardware support for broadcast
Network Topology: Quaternary Fat-Tree
[Figures: a quaternary fat-tree built from Elite switches, shown at increasing dimension]
Packet Format
A packet consists of the route (routing tags, CRC), one or more transactions, and a final EOP token
Each transaction carries a header (transaction type, context, memory address), data, and a CRC
320-byte data payload (5 transactions with 64 bytes each), 74–80 bytes of overhead
Programming Libraries
- Elan3lib: event notification, memory mapping and allocation, remote DMA
- Elanlib and Tports: collective communication, tagged message passing
- MPI, shmem
[Software stack: user applications, shmem / MPI, elanlib (tport), elan3lib in user space; system calls and elan kernel comms in kernel space]
Collective communication on the QsNET
[Figure: broadcast tree for a 16-node network]
Collective communication on the QsNET
[Figure: serialization through the root switch to avoid deadlocks]
Collective communication on the QsNET
[Figure: a deadlocked situation]
Barrier Synchronization
QsNET implements two synchronization primitives:
- Software-based, elan_gsync(): uses a balanced tree and point-to-point messages
- Hardware-based, elan_hgsync() (busy-wait) and elan_hgsyncevent() (event-based): use the hardware multicast support
Software-Based Barrier
Each process waits for 'ready' signals from its children
[Figure: balanced quaternary tree over 16 processes, numbered 0–15]
Software-Based Barrier
Each process waits for 'ready' signals from its children (1) ...
[Figure: 'ready' signals (1) flowing from the leaves up toward the root node 0]
Software-Based Barrier
... and sends its own signal up to the parent process (2)
[Figure: each process forwards its own signal (2) to its parent, up to the root node 0]
Hardware-Based Barrier
[Figure: example for a 16-node network]
Hardware-Based Barrier
Init barrier: (1) init barrier, (2) update sequence #, (3) wait
Hardware-Based Barrier
Multicast transaction: test sequence #
Hardware-Based Barrier
Acknowledgment: finish barrier, return OK or FAIL
Hardware-Based Barrier
Final 'EOP' (End-Of-Packet) token: finish barrier
Broadcast
QsNET implements two broadcast primitives:
- Software-based, elan_bcast(): uses a balanced tree and point-to-point messages
- Hardware-based, elan_hbcast(): uses the hardware multicast support
Both implementations perform an initial barrier to guarantee resource allocation
Performance Analysis
- The experimental results are obtained on a 64-node cluster of Compaq AlphaServer ES40s running Tru64 Unix. Each AlphaServer is attached to a quaternary fat-tree of dimension three through a 64-bit, 33 MHz PCI bus using the Elan3 card.
- In order to expose the real network performance, we place the communication buffers in Elan memory.
- We present unidirectional ping results, as a reference, and barrier and broadcast results, analyzing the effect of additional background traffic
Unidirectional Ping
[Figure: ping bandwidth vs. message size (1 byte – 4 MB) for MPI, Elan3 Elan-to-Elan, and Elan3 Main-to-Main]
- Peak data bandwidth (Elan to Elan) of 335 MB/s; 396 MB/s including packet overhead (99% of the nominal bandwidth)
- Main-to-main asymptotic bandwidth of 200 MB/s
Unidirectional Ping
[Figure: ping latency vs. message size (1 byte – 4 KB) for MPI, Elan3 Elan-to-Elan, and Elan3 Main-to-Main]
- Latency of 2.4 µs up to 64-byte messages (Elan to Elan memory)
- Higher MPI latency due to message tag matching
Barrier Synchronization
[Figure: barrier latency vs. number of nodes (4–64), 1 CPU per node, for elan_gsync(), elan_hgsync(), and elan_hgsyncEvent()]
- Good hardware barrier scalability
Barrier Synchronization with Background Traffic
[Figures: barrier latency vs. number of nodes (4–64), 1 CPU per node, under complement traffic and under uniform traffic]
- The software barrier is significantly affected (the slowdown is 40x in the worst case)
- Little impact on the hardware barriers, whose average latency is only doubled
Hardware Barrier with Background Traffic
[Figure: latency distribution of elan_hgsync() on 64 nodes, 1 CPU per node, with no traffic, complement traffic, and uniform traffic]
- 94% of the operations take less than 9 µs with no background traffic
- 93% of the tests take less than 20 µs with uniform traffic
Software Barrier with Background Traffic
[Figure: latency distribution of elan_gsync() on 64 nodes, 1 CPU per node, with no traffic, complement traffic, and uniform traffic]
- 99% of the barriers take less than 30 µs with no background traffic
- 93% of the synchronizations complete in less than 605 µs with uniform traffic
Broadcast Bandwidth
[Figure: broadcast bandwidth vs. message size (1 byte – 1 MB) on 64 nodes, 1 CPU per node, for elan_bcast() and elan_hbcast() with buffers in main and in Elan memory]
- Asymptotic bandwidth of 288 MB/s when using Elan memory, for both implementations
Broadcast Latency
[Figure: broadcast latency vs. message size (1 byte – 4 KB) on 64 nodes, 1 CPU per node, for elan_bcast() and elan_hbcast() with buffers in main and in Elan memory]
- Hardware latency with Elan buffers below 13 µs for messages up to 256 bytes
- Software latencies are 3.5 µs higher than hardware latencies
Broadcast Scalability
[Figures: broadcast bandwidth and latency vs. number of nodes (4–64), 1 CPU per node, 256 KB messages]
- No significant effect when using buffers in main memory
- With buffers in Elan memory, performance depends on the number of switch layers traversed
Broadcast with Background Traffic
[Figures: broadcast bandwidth and latency vs. message size on 64 nodes, 1 CPU per node, under complement traffic]
Broadcast with Background Traffic
[Figures: broadcast bandwidth and latency vs. message size on 64 nodes, 1 CPU per node, under uniform traffic]
- The latency differences between the hardware and software implementations increase
- Better performance with buffers in main memory (due to the background-traffic application)
Broadcast with Background Traffic
[Figures: broadcast bandwidth and latency vs. number of nodes (4–64), 1 CPU per node, 256 KB messages, under uniform traffic]
- Significant performance degradation for all the alternatives
Conclusions
- Hardware-based synchronization takes as little as 6 µs on a 64-node AlphaServer cluster, with very good scalability
- Good latency and scalability are achieved with the software-based synchronization too, which takes about 15 µs
- The hardware barrier is almost insensitive to background traffic, with 93% of the synchronizations completed in less than 20 µs
- With the broadcast, both implementations can deliver a sustained bandwidth of 288 MB/s Elan memory to Elan memory and 200 MB/s main memory to main memory