Hardware- and Software-Based Collective Communication on the Quadrics Network



Hardware- and Software-Based Collective Communication on the Quadrics Network

Fabrizio Petrini, Salvador Coll, Eitan Frachtenberg and Adolfy Hoisie
CCS-3 Modeling, Algorithms, and Informatics, Los Alamos National Laboratory
Technical University of Valencia, Spain
[email protected]

Outline

  • Introduction
  • Quadrics network design overview
      • Hardware
      • Communication/programming libraries
  • Collective communication on the QsNET
      • Barrier synchronization
      • Broadcast
  • Performance analysis
      • Experimental framework
      • Results
  • Conclusions

Introduction

  • The efficient implementation of collective communication is a challenging design effort
  • Very important to guarantee the scalability of barrier synchronization, broadcast, gather, scatter, reduce, etc.
  • Essential to implement system primitives that enhance fault tolerance
  • Software or hardware support for multicast communication can improve the performance and resource utilization of a parallel computer
      • Software multicast: based on unicast messages, simple to implement, no network topology constraints, slower
      • Hardware multicast: requires dedicated hardware, network dependent, faster

Introduction

  • Some of the most powerful systems in the world use the Quadrics interconnection network and the collective communication services analyzed in this work:
      • The Terascale Computing System (TCS) at the Pittsburgh Supercomputing Center, the second most powerful computer in the world
      • The ASCI Q machine, currently under development at Los Alamos National Laboratory (30 TeraOps, expected to be delivered by the end of 2002)

[Figure: Barrier Test, latency (µs) vs. number of nodes (2 to 512)]

Quadrics Network Design Overview

  • QsNET provides an abstraction of distributed virtual shared memory
      • Each process can map a portion of its address space into the global memory
      • These address spaces constitute the virtual shared memory
      • This shared memory is fully integrated with the native operating system
  • Based on two building blocks:
      • a network interface card called Elan
      • a crossbar switch called Elite

Elan

[Figure: Elan functional block diagram, shown over several slides with a different unit highlighted each time. Main components: a 200 MHz, 10-bit link (400 MB/s bidirectional) feeding a link mux with two input FIFOs (2 virtual channels); a µcode processor; a thread processor (32-bit SPARC-based, runs the communication protocols); DMA buffers and an inputter; an SDRAM interface; a 64-bit, 100 MHz data bus; an MMU and TLB (the TLB is kept synchronized with the host) with a 4-way set-associative cache and a table-walk engine; clock and statistics registers; and a 66 MHz, 64-bit PCI interface.]

Elite

  • 8 bidirectional links with 2 virtual channels in each direction
  • An internal 16x8 full crossbar switch
  • 400 MB/s on each link direction
  • Packet error detection and recovery, with routing and data transactions CRC protected
  • 2 priority levels plus an aging mechanism
  • Adaptive routing
  • Hardware support for broadcast

Network Topology: Quaternary Fat-Tree

[Figure: quaternary fat-tree topology built from Elite switches, shown over three slides]
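As a side note on the topology (an illustration added here, not taken from the slides): in the standard 4-ary n-tree construction, a quaternary fat-tree of dimension n connects 4^n processing nodes through n stages of Elite switches, with n * 4^(n-1) switches in total. The short sketch below evaluates these counts for dimensions 1 to 3; dimension three corresponds to the 64-node cluster used later in the performance analysis.

```c
/*
 * Sketch (not from the slides): sizing a 4-ary n-tree.
 * nodes    = 4^n
 * switches = n * 4^(n-1)   (each Elite has 4 "down" and 4 "up" links)
 */
#include <stdio.h>

static long ipow(long base, int exp)
{
    long r = 1;
    while (exp-- > 0)
        r *= base;
    return r;
}

int main(void)
{
    for (int dim = 1; dim <= 3; dim++) {
        long nodes    = ipow(4, dim);           /* 4^n processing nodes       */
        long switches = dim * ipow(4, dim - 1); /* n * 4^(n-1) Elite switches */
        printf("dimension %d: %ld nodes, %ld Elite switches\n",
               dim, nodes, switches);
    }
    return 0;
}
```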

Packet Format

[Figure: packet layout. A packet consists of a packet header (routing tags and CRC), one or more transactions, and a final EOP token. Each transaction carries a transaction type, a context, a memory address, the data, and its own CRC.]

  • 320 bytes of data payload (5 transactions with 64 bytes each)
  • 74-80 bytes of overhead
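For context, the payload-to-packet-size ratio implied by these figures bounds how much of the link traffic can carry user data. The fragment below (an added illustration, not part of the original slides) simply evaluates that ratio for the 320-byte payload and the 74-80 bytes of overhead quoted above.

```c
/* Sketch (illustration only): fraction of a full QsNET packet that is
 * data payload, using the figures quoted on the slide:
 * 320 bytes of payload (5 transactions x 64 bytes) and
 * 74-80 bytes of header/CRC/token overhead. */
#include <stdio.h>

int main(void)
{
    const double payload = 5 * 64;   /* bytes of data per full packet */

    for (double overhead = 74.0; overhead <= 80.0; overhead += 6.0) {
        double total = payload + overhead;
        printf("overhead %.0f bytes: packet size %.0f bytes, "
               "payload fraction %.1f%%\n",
               overhead, total, 100.0 * payload / total);
    }
    return 0;
}
```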

Programming Libraries

  • Elan3lib
      • event notification
      • memory mapping and allocation
      • remote DMA
  • Elanlib and Tports
      • collective communication
      • tagged message passing
  • MPI, shmem

[Figure: library hierarchy. User applications sit on top of shmem and MPI, which are built over elanlib/tport and elan3lib in user space; system calls and the elan kernel comms module lie in kernel space.]
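Most applications reach these layers through MPI or shmem rather than calling elanlib directly. The fragment below is a generic, standard-MPI example (not Quadrics-specific code) of the barrier and broadcast collectives whose QsNET implementations are examined in the rest of the talk.

```c
/* Generic MPI example (standard MPI-1 calls, not Quadrics-specific):
 * on a QsNET cluster these collectives are layered over elanlib/tport,
 * which in turn can use the software tree or the hardware multicast. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, buf = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        buf = 42;                       /* value to be broadcast  */

    MPI_Barrier(MPI_COMM_WORLD);        /* global synchronization */
    MPI_Bcast(&buf, 1, MPI_INT, 0, MPI_COMM_WORLD); /* one-to-all */

    printf("rank %d received %d\n", rank, buf);
    MPI_Finalize();
    return 0;
}
```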

Collective communication on the QsNET

[Figure: broadcast tree for a 16-node network]

  • Serialization through the root switch to avoid deadlocks

[Figure: deadlocked situation]

Barrier Synchronization

QsNET implements two synchronization primitives:

  • Software-based: uses a balanced tree and point-to-point messages
      • elan_gsync()
  • Hardware-based: uses the hardware multicast support
      • elan_hgsync(): busy-wait
      • elan_hgsyncEvent(): event-based

Software-Based Barrier

  • Each process waits for 'ready' signals from its children (1) ...
  • ... and sends its own signal up to the parent process (2)

[Figure: balanced tree over 16 processes with node 0 as the root node; 'ready' signals propagate from the leaves up to the root, shown over three slides]
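To make the algorithm concrete, here is a sketch of a tree barrier written with plain MPI point-to-point messages: wait for the children, signal the parent, and let the root's release propagate back down. It only illustrates the scheme pictured above; it is not the elan_gsync() source, and the 4-ary tree shape and MPI transport are assumptions.

```c
/* Sketch of a software tree barrier over point-to-point messages.
 * Illustrative only: not the elan_gsync() implementation; the 4-ary
 * tree shape and the MPI transport are assumptions. */
#include <mpi.h>

#define ARITY 4

static void tree_barrier(MPI_Comm comm)
{
    int rank, size, dummy = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int parent = (rank - 1) / ARITY;
    int first_child = rank * ARITY + 1;

    /* (1) gather phase: wait for 'ready' from every child ... */
    for (int c = first_child; c < first_child + ARITY && c < size; c++)
        MPI_Recv(&dummy, 1, MPI_INT, c, 0, comm, MPI_STATUS_IGNORE);

    /* (2) ... then signal our own parent */
    if (rank != 0)
        MPI_Send(&dummy, 1, MPI_INT, parent, 0, comm);

    /* (3) release phase: the root's wake-up propagates back down the tree */
    if (rank != 0)
        MPI_Recv(&dummy, 1, MPI_INT, parent, 1, comm, MPI_STATUS_IGNORE);
    for (int c = first_child; c < first_child + ARITY && c < size; c++)
        MPI_Send(&dummy, 1, MPI_INT, c, 1, comm);
}
```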

Hardware-Based Barrier

Example for 16 nodes:

  • (1) init barrier, (2) update sequence #, (3) wait
  • Multicast transaction: test the sequence # on every node
  • Acknowledgment: return OK or FAIL
  • Final 'EOP' (End-Of-Packet) token: finish barrier

[Figure: hardware barrier steps on a 16-node fat-tree, shown over five slides]
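The handshake above can be summarized as a sequence-number check. The following single-process simulation (not Elan firmware and not the elan_hgsync() code) models the steps on these slides: each node updates its barrier sequence number and waits, the root issues a multicast test of that value, and a successful test is acknowledged with the final EOP token.

```c
/* Conceptual simulation of the hardware barrier handshake on the slides
 * (init barrier / update sequence # / multicast test / ack / EOP).
 * NOT Elan firmware or elan_hgsync() code; nodes are modelled as
 * entries of a plain array. */
#include <stdbool.h>
#include <stdio.h>

#define NODES 16

static int seq[NODES];                 /* per-node barrier sequence #  */

/* (1)-(2): a node enters the barrier and updates its sequence # */
static void enter_barrier(int node, int barrier_seq)
{
    seq[node] = barrier_seq;
}

/* (3) multicast transaction: test the sequence # on every node; the
 * network combines the per-node results into a single OK/FAIL ack */
static bool multicast_test(int barrier_seq)
{
    for (int n = 0; n < NODES; n++)
        if (seq[n] != barrier_seq)
            return false;              /* ack = FAIL                   */
    return true;                       /* ack = OK -> send final EOP   */
}

int main(void)
{
    int barrier_seq = 1;

    for (int n = 0; n < NODES; n++)
        enter_barrier(n, barrier_seq);

    while (!multicast_test(barrier_seq))
        ;                              /* in hardware: wait and retry  */

    printf("barrier %d completed: EOP token delivered to all nodes\n",
           barrier_seq);
    return 0;
}
```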

Broadcast

QsNET implements two broadcast primitives:

  • Software-based: uses a balanced tree and point-to-point messages
      • elan_bcast()
  • Hardware-based: uses the hardware multicast support
      • elan_hbcast()

Both implementations perform an initial barrier to guarantee resource allocation.
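For the software-based variant, a balanced tree of point-to-point messages works like the barrier gather in reverse: each process receives the data from its parent and forwards it to its children. The sketch below illustrates that idea with standard MPI calls; it is not the elan_bcast() implementation, and the 4-ary tree shape is an assumption.

```c
/* Sketch of a software tree broadcast over point-to-point messages,
 * with the root at rank 0.  Illustrative only: not the elan_bcast()
 * source; the 4-ary tree shape and the MPI transport are assumptions. */
#include <mpi.h>

#define ARITY 4

static void tree_bcast(void *buf, int count, MPI_Datatype type,
                       MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* every non-root process first receives the data from its parent */
    if (rank != 0) {
        int parent = (rank - 1) / ARITY;
        MPI_Recv(buf, count, type, parent, 0, comm, MPI_STATUS_IGNORE);
    }

    /* ... and then forwards it to each of its children */
    int first_child = rank * ARITY + 1;
    for (int c = first_child; c < first_child + ARITY && c < size; c++)
        MPI_Send(buf, count, type, c, 0, comm);
}
```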

Performance Analysis

  • The experimental results are obtained on a 64-node cluster of Compaq AlphaServer ES40s running Tru64 Unix
  • Each AlphaServer is attached to a quaternary fat-tree of dimension three through a 64-bit, 33 MHz PCI bus using the Elan3 card
  • In order to expose the real network performance, we place the communication buffers in Elan memory
  • We present:
      • unidirectional ping results, as a reference
      • barrier and broadcast results, analyzing the effect of additional background traffic
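The unidirectional ping figures on the next two slides come from a ping test between a pair of nodes. As an illustration of how such latency and bandwidth curves are typically produced, here is a generic MPI ping-pong timing loop; it is not the actual Quadrics benchmark and it keeps the buffers in main memory.

```c
/* Generic MPI ping-pong timing loop, shown as an illustration of how
 * per-message-size latency/bandwidth curves are usually measured.
 * Not the actual Quadrics benchmark; buffers here are in main memory. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define REPS 1000

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(1 << 22);          /* up to 4 MB messages */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int size = 1; size <= (1 << 22); size *= 4) {
        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double half_rtt = (MPI_Wtime() - t0) / (2.0 * REPS);
        if (rank == 0)
            printf("%8d bytes: latency %.2f us, bandwidth %.1f MB/s\n",
                   size, half_rtt * 1e6, size / half_rtt / 1e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```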

Unidirectional Ping

[Figure: ping bandwidth (MB/s) vs. message size (1 byte to 4 MB) for MPI, Elan3 Elan-to-Elan, and Elan3 Main-to-Main]

  • Peak data bandwidth (Elan to Elan) of 335 MB/s, which corresponds to 396 MB/s (99% of the nominal bandwidth) when the packet overhead is taken into account
  • Main-to-main asymptotic bandwidth of 200 MB/s

Unidirectional Ping

[Figure: ping latency (µs) vs. message size (1 byte to 4 KB) for MPI, Elan3 Elan-to-Elan, and Elan3 Main-to-Main]

  • Latency of 2.4 µs up to 64-byte messages (Elan to Elan memory)
  • Higher MPI latency due to message tag matching

Barrier Synchronization

[Figure: Barrier Test, 1 CPU per node. Latency (µs) vs. number of nodes (4 to 64) for elan_gsync(), elan_hgsync() and elan_hgsyncEvent()]

  • Good hardware barrier scalability

Barrier Synchronization with Background Traffic

[Figure: Barrier Test, 1 CPU per node, with complement traffic (left) and uniform traffic (right). Latency (µs, log scale) vs. number of nodes (4 to 64) for elan_gsync(), elan_hgsync() and elan_hgsyncEvent()]

  • The software barrier is significantly affected (the slowdown is a factor of 40 in the worst case)
  • Little impact on the hardware barriers, whose average latency is only doubled

Hardware Barrier with Background Traffic

[Figure: Barrier Test, 64 nodes, 1 CPU per node. Latency distribution for elan_hgsync() with no traffic, complement traffic, and uniform traffic]

  • 94% of the operations take less than 9 µs with no background traffic
  • 93% of the tests take less than 20 µs with uniform traffic

Software Barrier with Background Traffic

[Figure: Barrier Test, 64 nodes, 1 CPU per node. Latency distribution for elan_gsync() with no traffic, complement traffic, and uniform traffic]

  • 99% of the barriers take less than 30 µs with no background traffic
  • 93% of the synchronizations complete in less than 605 µs with uniform traffic

Broadcast Bandwidth

[Figure: Broadcast Test, 64 nodes, 1 CPU per node. Bandwidth (MB/s) vs. message size (1 byte to 1 MB) for elan_bcast() and elan_hbcast(), with buffers in main and in Elan memory]

  • Asymptotic bandwidth of 288 MB/s when using Elan memory, for both implementations

Broadcast Latency

[Figure: Broadcast Test, 64 nodes, 1 CPU per node. Latency (µs) vs. message size (1 byte to 4 KB) for elan_bcast() and elan_hbcast(), with buffers in main and in Elan memory]

  • Hardware latency with Elan buffers is below 13 µs for messages up to 256 bytes
  • Software latencies are 3.5 µs higher than hardware latencies

Broadcast Scalability

[Figure: Broadcast Test, 1 CPU per node, 256 KB messages. Bandwidth (MB/s) and latency (µs) vs. number of nodes (4 to 64) for elan_bcast() and elan_hbcast(), with buffers in main and in Elan memory]

  • No significant effect when using buffers in main memory
  • With buffers in Elan memory, performance depends on the number of switch layers traversed

Broadcast with Background Traffic

[Figure: Broadcast Test, 64 nodes, 1 CPU per node, with complement traffic. Bandwidth (MB/s) and latency (µs) vs. message size for elan_bcast() and elan_hbcast(), with buffers in main and in Elan memory]

Broadcast with Background Traffic

[Figure: Broadcast Test, 64 nodes, 1 CPU per node, with uniform traffic. Bandwidth (MB/s) and latency (µs) vs. message size for elan_bcast() and elan_hbcast(), with buffers in main and in Elan memory]

  • Latency differences between the hardware and software implementations increase
  • Better performance with buffers in main memory (due to the background traffic application)

Broadcast with Background Traffic

[Figure: Broadcast Test, 1 CPU per node, 256 KB messages, with uniform traffic. Bandwidth (MB/s) and latency (µs) vs. number of nodes (4 to 64) for elan_bcast() and elan_hbcast(), with buffers in main and in Elan memory]

  • Significant performance degradation for all the alternatives

Conclusions

  • Hardware-based synchronization takes as little as 6 µs on a 64-node AlphaServer cluster, with very good scalability.
  • Good latency and scalability are achieved with the software-based synchronization too, which takes about 15 µs.
  • The hardware barrier is almost insensitive to background traffic, with 93% of the synchronizations completed in less than 20 µs.
  • With the broadcast, both implementations can deliver a sustained bandwidth of 288 MB/s Elan memory to Elan memory and 200 MB/s main memory to main memory.
