QsNetIII: an Adaptively Routed Network for High Performance Computing
Duncan Roweth, Quadrics Ltd
Hot Interconnects, August 2008

28/8/2008

Quadrics Ltd

1

Quadrics Background

• Develops interconnect products for the HPC market
  – HPC Linux systems
  – AlphaServer SC systems
• Quadrics is owned by the Finmeccanica group
• Quadrics was 12 years old in July


QsNet Networks

• Multi-stage switch network
• Components
  – Adapter: Elan
  – Router: Elite
  – Switches, cables
  – Firmware, drivers, libraries
  – Diagnostics, documentation
• HPC specific features
  – Adaptive routing
  – Hardware barrier & broadcast


Communication Model

[Figure: communication model diagram; surviving labels: Virtual Address, Process]


Quadrics Networks

• Elan1 / Elite1, 1994, Meiko Computing Surface 2
  – Source chooses between pre-defined routes
• Elan3 / Elite3, 2000, first Quadrics product, QsNet
  – First use of packet-by-packet adaptive routing
  – Crosspoint router, x8
• Elan4 / Elite4, 2004, QsNetII
  – Reduced latency, increased bandwidth
  – Increased support for offloading collectives
• Elan5 / Elite5, 2008, QsNetIII
  – General purpose crosspoint router, increased radix, x32
  – Highly programmable adapter


What is Adaptive Routing?

• Switch networks typically provide many paths between any two points
• In an adaptively routed network routers make packet by packet decisions on the route to use based on
  – Queue occupancy
  – Channel usage
  – Error rates and state
  – Class of traffic


Why is Adaptive Routing Important?

• Most HPC networks are statically routed
  – They use pre-determined paths between nodes
• Static routing can work well
  – If traffic pattern is known in advance
  – If traffic pattern is persistent
  – If traffic pattern is uniform (i.e. application is load balanced)
  – If there are no errors
• These conditions are not met by real codes on production HPC systems (see LLNL and Sandia results)
• Adaptive routing solves these problems
  – Delivering significantly better aggregate bandwidths and worst case latencies on real systems running real codes


Benefits of Adaptive Routing

• Bandwidth achieved when 1024 nodes all communicate at the same time
• Plots show the distribution of measured bandwidths

System    Interconnect   Min   Max   Average
Atlas     Infiniband     95    762   263
Thunder   QsNetII        248   403   369

Data from Lawrence Livermore National Lab, published at the Sonoma OpenFabrics workshop April 2007
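
One way to read these figures is to compare each system's worst case with its average and its best case. The sketch below uses only the three values per system quoted on this slide; the source does not state units, so only ratios are computed, and the function name `spread` is an illustrative choice.

```python
# Figures copied from the slide (units are not stated in the source).
results = {
    "Atlas (Infiniband)": {"min": 95, "max": 762, "avg": 263},
    "Thunder (QsNetII)": {"min": 248, "max": 403, "avg": 369},
}

def spread(stats):
    """Return (min/avg, max/min) as a rough measure of how uneven the bandwidths are."""
    return stats["min"] / stats["avg"], stats["max"] / stats["min"]

for name, stats in results.items():
    min_vs_avg, max_vs_min = spread(stats)
    print(f"{name}: min/avg = {min_vs_avg:.2f}, max/min = {max_vs_min:.2f}")
```

The max/min ratio comes out at roughly 8 for the statically routed system and roughly 1.6 for the adaptively routed one.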


Benefits of Adaptive Routing

• Classic QsNetII all-to-all bandwidth scaling graph


Ordering Considerations

• Adaptively routed packets can arrive out of order
  – Problems for stream devices, e.g. multipath Ethernet
• Message ordering is required in HPC
  – But within a message we are free to deliver the bulk data in arbitrary order
  – Get it there as fast as possible, then tell me that it is done
• QsNet ordering
  – Packets contain the destination virtual address at which to write the data
  – Bulk data transfers can arrive out of order and can be replayed
  – Atomic transactions are sequenced
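
Why address-carrying packets tolerate reordering and replay can be shown with a small simulation. This is a sketch only, not the Elan implementation: the helper names, the 256 byte packet size and the use of a `bytearray` as the destination address space are all assumptions made for illustration.

```python
import random

def split_into_packets(base_addr, data, packet_size):
    """Cut a message into (destination_address, payload) packets."""
    return [(base_addr + off, data[off:off + packet_size])
            for off in range(0, len(data), packet_size)]

def deliver(memory, packets):
    """Write each packet at the address it carries; arrival order is irrelevant."""
    for addr, payload in packets:
        memory[addr:addr + len(payload)] = payload

message = bytes(range(256)) * 4            # 1024 byte bulk transfer
memory = bytearray(2048)                   # stand-in for the destination address space
packets = split_into_packets(512, message, packet_size=256)

random.shuffle(packets)                    # adaptive routing may reorder packets
packets.append(packets[0])                 # a replayed packet rewrites the same bytes
deliver(memory, packets)

# Completion would be signalled only once all of the bulk data has landed.
done = bytes(memory[512:512 + len(message)]) == message
print(done)
```

Because every write is idempotent and self-addressed, neither the shuffle nor the duplicate changes the final contents of the buffer.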


Adaptive Routing in QsNetIII

• More flexible than QsNetII
  – Operates over arbitrary sets of links
  – More opportunities to use the technique
  – Higher radix switches
• Select a subset of lightly loaded output ports based on:
  – Destination
  – Link state, errors etc
  – Number of pending acks (programmable threshold)
• Programmable algorithm for selecting from this subset:
  – First free, next free, random
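
The two-stage selection described on this slide can be sketched as a filter followed by a policy. Everything here is illustrative: the `Port` fields, the route table shape and the exact meaning given to "first free" and "next free" are assumptions, not the Elite5 hardware algorithm.

```python
import random
from dataclasses import dataclass

@dataclass
class Port:
    index: int
    link_up: bool = True
    pending_acks: int = 0

def candidate_ports(ports, routes, destination, ack_threshold):
    """Ports that reach the destination, are healthy, and are lightly loaded."""
    return [p for p in ports
            if p.index in routes[destination]
            and p.link_up
            and p.pending_acks < ack_threshold]

def select_port(candidates, policy, last_used=-1, rng=random):
    """Pick one port from the lightly loaded subset."""
    if not candidates:
        return None                       # caller must wait or fall back
    if policy == "first_free":
        return candidates[0]
    if policy == "next_free":             # first candidate after the last one used, wrapping round
        later = [p for p in candidates if p.index > last_used]
        return (later or candidates)[0]
    if policy == "random":
        return rng.choice(candidates)
    raise ValueError(f"unknown policy: {policy}")

ports = [Port(0, pending_acks=5), Port(1), Port(2, link_up=False), Port(3)]
routes = {"B": {0, 1, 2, 3}}              # all four ports lead towards destination B
subset = candidate_ports(ports, routes, "B", ack_threshold=4)
print([p.index for p in subset])                              # [1, 3]
print(select_port(subset, "next_free", last_used=1).index)    # 3
```

Port 0 is excluded because it is over the pending-ack threshold and port 2 because its link is down, so the policy only ever chooses between ports 1 and 3.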


Adaptive Routing: standard case

• All top switches are equivalent, select one
• Adaptive routing selects a lightly loaded path


Implementation of Fat Tree Networks

• Connect M × N-way node switches by N × M-way top switches
• In this case M = 16, N = 4, i.e. sixteen 4-way node switches connected by four 16-way top switches
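
The wiring rule can be checked with a short sketch. It assumes a two-level tree in which every node switch has N links down and one link up to each of the N top switches; the function name `fat_tree_links` is hypothetical.

```python
def fat_tree_links(m, n):
    """Two-level fat tree: m node switches (n-way) joined by n top switches (m-way).

    Returns a list of (node_switch, top_switch) links, one per pair.
    """
    return [(node_sw, top_sw) for node_sw in range(m) for top_sw in range(n)]

m, n = 16, 4
links = fat_tree_links(m, n)
node_ports = m * n                        # n down-links on each of m node switches
paths = n                                 # one route via each top switch between two node switches
print(node_ports, len(links), paths)      # 64 64 4
```

With these assumptions the number of up-links equals the number of node ports, and any pair of node switches has N alternative routes for adaptive routing to choose from.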


Adaptive Routing in the Top Switch

• If top switch radix ≤ router radix / 2
  – i.e. 16 for Elite5, 2048-way networks
• Router provides multiple top switches
  – Select which to use based on load
• Example:
  – Traffic from A to B via routers 210 and 300 is blocked by traffic between 300 and 200
  – The router providing 300, 301, 302 and 303 can select a different path


Adaptive Routing on the Final Hop

• Multiple connections to a node
• Switch can select a free path
• Reduces end-point contention
• Simple case is not optimal
• Spreading the connections
  – Improves fault tolerance
  – Reduces network contention
• Routing decision is made higher in the network


Adaptive routing in the presence of errors



• In a production system with 1000s of links it is not uncommon for a small number to be broken until the next maintenance slot
• Adaptive routing minimises the impact
• Example:
  – Link between routers 10 and 20 is broken
  – Router 10 dynamically selects paths via 21, 22, 23, spreading the load
  – Reverse case: avoid sending to 10 via 20. Reset 20’s links or update switches 11, 12, 13.


Small Packet Support

• Aim to get as close to line rate as possible with small packets
• For example:
  – Small put
  – 32 byte packet
• Adapter has multiple packet engines
• Adapters support up to 64 outstanding packets per link
  – Doubles if we use both links
• Switches provide 32 virtual channels per output link
• Prioritisation
  – Buffering on input to the router


Barrier & Broadcast Support

• Switches broadcast over a range of output links
• Combine Acks / Nacks
• Contiguous in QsNetII
• Sparse in QsNetIII
• Barrier implementation
  – Network conditional
  – Broadcast release
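
The barrier bullets can be read as a two-step protocol: broadcast a test to every node, combine the Acks and Nacks in the switches on the way back, then broadcast a release once every node has Acked. The simulation below is a sketch under that reading; the tree representation and all function names are assumptions, and a real implementation runs in the switch and adapter hardware.

```python
def combine(responses):
    """A switch forwards an Ack only if every downstream response was an Ack."""
    return all(responses)

def network_conditional(tree, arrived):
    """Broadcast a test down the switch tree and combine the Acks/Nacks on the way back.

    tree is either a node id (leaf) or a list of subtrees (a switch).
    """
    if isinstance(tree, list):
        return combine([network_conditional(sub, arrived) for sub in tree])
    return tree in arrived                # leaf: Ack if this node has reached the barrier

def try_barrier(tree, arrived, released):
    """One barrier attempt: test all nodes, then broadcast the release if all Acked."""
    if network_conditional(tree, arrived):
        released.update(arrived)          # broadcast release
        return True
    return False                          # at least one Nack: retry later

tree = [[0, 1], [2, 3]]                   # two switches with two nodes each, under one root
released = set()
print(try_barrier(tree, {0, 1, 3}, released))       # False, node 2 has not arrived
print(try_barrier(tree, {0, 1, 2, 3}, released))    # True
```

Combining responses in the switches means the initiator sees a single Ack or Nack regardless of how many nodes take part.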


Elan5 – Device Overview

• 2× QsNetIII links
  – 20Gbit/s/direction after protocol
• PCIe, PCIe2 host interface
• Multiple packet engines
• 512KB of high bandwidth on chip local memory
• SDRAM interface to optional local memory
• Buffer manager, object cache
• Details in ISC Dresden Paper

[Figure: Elan5 adapter block diagram. Labels: 2 × CX4/QsNetIII link; packet engines, each with 16K inst cache and 9K data buffers; Fabric x8; Host I/F with PCIe SERDES, PCIe 16 lanes; Bridge; Local Functions; TLB; Cmd Launch; Free List; Object Cache Tags; Buffer Manager; Local Memory, 16K x 8 x 8 banks = 1MB ECC RAM; External cache SDRAM i/f (DDRII); Ext i/f; External EEPROM; PLL; Clocks]

Elite5 – Device Overview

• 64 × 32 crosspoint router
  – Direct & buffered input from each link
  – 8K of input buffering per link
• 32 virtual channels per link
• Physical layer DDR XAUI (6.25GHz)
• Adaptive routing
• Hardware barrier and broadcast
• Memory mapped stats & error counters accessed out-of-band


QsNetIII Device Overview

           Elan       Elite
Clock      500 MHz    312 MHz
Package    672 pin    982 pin
Power      < 17W      < 18W

• Semi custom ASIC
• Manufacturing partners LSI / TSMC, G90 process
• High performance BGA package

QsNetIII Implementation

• Node switch chassis
  – 128 links down to the nodes
  – 128 links up to the top switches
  – Backplane connects 2 sets of cards
• Top switches
  – 256 links down to the node switches
  – Range of system sizes:

Ports   Radix   Per Chassis
512     4       64
1024    8       32
2048    16      16
4096    32      8

[Figures: QsNetIII switch logical design; QsNetIII switch implementation]
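
The system size figures on this slide appear to follow from two numbers it quotes: 128 links down per node switch chassis and 256 links down per top switch chassis. The sketch below reproduces the rows under that reading, which is an inference from the figures and is not stated explicitly in the source; the constant and function names are illustrative.

```python
NODE_SWITCH_DOWN_LINKS = 128     # links down to the nodes, per node switch chassis
TOP_CHASSIS_DOWN_LINKS = 256     # links down to the node switches, per top switch chassis

def system_sizes(radices=(4, 8, 16, 32)):
    """Return (ports, top switch radix, top switches per chassis) for each radix."""
    rows = []
    for radix in radices:
        ports = NODE_SWITCH_DOWN_LINKS * radix          # one node switch chassis per top switch port
        per_chassis = TOP_CHASSIS_DOWN_LINKS // radix   # top switches that fit in one chassis
        rows.append((ports, radix, per_chassis))
    return rows

for ports, radix, per_chassis in system_sizes():
    print(f"{ports:5d} {radix:3d} {per_chassis:3d}")
```

Doubling the top switch radix doubles the network size and halves the number of top switches that one chassis can provide.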


QsNetIII Network 1024–way

• Fat tree, constructed from 8 × 128-way node switches connected by 128 × 8-way top switches


QsNetIII Implementation – Cables

• QSFP connectors throughout
• Copper cables (e.g. Gore), 1-10m
• Active copper cables (e.g. Gore), 8-20m
• Optical cables (e.g. Luxtera), 5-300m
  – PVDF Plenum rated
  – LSZH available as an option
• No longer Quadrics proprietary
• Likely usage:
  – Short copper cables from nodes
  – Optical cables between switches


QsNetIII Fault Tolerance

• All of the QsNetII features
  – CRCs on every packet
  – Automatic retransmission
  – Redundant routes
  – Adaptive routing avoids failed links
  – Redundant, hot pluggable PSUs and fans
• Plus line rate testing of each link as it comes up
  – Switches generate CRPAT, CJPAT or PRBS packets
  – Links are only added to the route tables when they are (a) up, (b) connect to the right place, and (c) can transfer data at full line rate without error
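
The three admission conditions for a link can be written as a single predicate. This is a sketch of the rule as stated on the slide; the `LinkTest` record and its field names are hypothetical and do not come from the QsNetIII software.

```python
from dataclasses import dataclass

@dataclass
class LinkTest:
    up: bool                  # (a) link has come up
    peer: str                 # what the link is actually connected to
    expected_peer: str        # what the topology says it should be connected to
    test_packets: int         # packets sent during line rate testing
    errors: int               # packets that failed during the test
    achieved_fraction: float  # measured rate as a fraction of line rate

def admit_to_route_table(link):
    """Apply conditions (a), (b) and (c) before a link may carry traffic."""
    return (link.up
            and link.peer == link.expected_peer
            and link.test_packets > 0
            and link.errors == 0
            and link.achieved_fraction >= 1.0)

good = LinkTest(True, "switch-20", "switch-20", 10_000, 0, 1.0)
miswired = LinkTest(True, "switch-21", "switch-20", 10_000, 0, 1.0)
print(admit_to_route_table(good), admit_to_route_table(miswired))
```

A link that is up but miswired or marginal never enters the route tables, so adaptive routing only ever chooses among links that have passed the test.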


QsNetIII Implementation – HP BladeSystem

• Elan5 mezzanine adapter
  – 2 QsNet links, PCI-E x8 Gen2
  – 128 MB of memory
• Elite5 switch module
  – Full bandwidth
  – 16 links to the blades (via backplane)
  – 16 links to back of the module


Current Status

• Elite5 silicon in Bristol
• Elan5 at TSMC, first parts expected in 3-4 weeks
• Switch PCBs, chassis, backplane, controllers are working
• First adapter PCBs are ready
  – PCI-Express x16, HP Blade, ExpressModule (Sun Blade)
• We are porting the QsNetII software
• Components at SC08 in Austin
• First customer shipment in Q1 of 2009


Future Work

• QsNetIII hardware
  – Low cost 32-way switch
  – 1024-way single chassis switch
• QsNetIII software
  – General framework for optimised collectives
  – Support for “multiport” networks: “fat” nodes have multiple connections to the same rail
  – Ethernet firmware for the network adapter


Conclusions

• Adaptive routing underwrites the scalability of HPC systems designed to run a single large application
• Adaptive routing has been a feature of QsNet systems since 2000
• QsNetIII offers significant enhancements over both QsNetII and competing products


Thank you for listening


Additional Material


Packet Format

• Packet size of up to 4K made up of 256 byte packet segment and continuations, 8 byte ACK
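
As a worked example of the sizes quoted here, the sketch below splits one packet into 256 byte segments. It assumes that "4K" means 4096 bytes of payload and that every segment carries up to 256 bytes of it; headers, CRCs and the 8 byte ACK are ignored, and the function name `segments` is illustrative.

```python
MAX_PACKET = 4096     # "up to 4K" (assumed to mean 4096 bytes)
SEGMENT = 256         # "256 byte packet segment"

def segments(payload_len):
    """Return segment sizes for one packet: a first segment plus continuations."""
    if not 0 < payload_len <= MAX_PACKET:
        raise ValueError("payload must be 1..4096 bytes")
    full, rest = divmod(payload_len, SEGMENT)
    return [SEGMENT] * full + ([rest] if rest else [])

print(len(segments(4096)))    # 16
print(segments(600))          # [256, 256, 88]
```

Under these assumptions a maximum size packet is one segment followed by 15 continuations.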


Impact of static routing on latency

• Data from Thunderbird cluster, Sandia National Lab
• Big increases in worst case latency with number of nodes


Impact of static routing on latency

• Data from Thunderbird cluster, Sandia National Lab
• Big variation in worst case latency across a large job


Software Model – Firmware & Drivers

• Base firmware in the ROMs
• Firmware modules loadable with the device driver
  – Elan, OpenFabrics, 10GE Ethernet, …
• Kernel modules
  – elan5, elan, rms
• Device dependent library (libelan5)
• Device independent library (libelan)
• User libraries


Software Model – Elan Libraries

• Point-to-point message passing
• One-sided put/get
• Transparent rail striping
• Optimised collectives
• Locks and atomic ops
• Global memory allocation


QsNetIII Performance Summary

• Similar latencies to QsNetII
  – The 1.3 to 2 microseconds of latency is mostly in the host PCI and memory system
• Higher issue rates
  – Improved link utilisation on small transfers
• Higher bandwidths
  – 1.5 to 2.25 GB/sec/link depending on host interface
• Bi-directional host interface
  – 2× improvement over QsNetII
• Broadcast and barrier in hardware
• Continued development of adaptive routing underwrites scaling to high node counts

