Status of the QPACE Project

We give an overview of the QPACE project, which is pursuing the development of a massively parallel, scalable supercomputer for LQCD. The machine is a three-dimensional torus of identical processing nodes, based on the PowerXCell 8i processor. The nodes are connected by an FPGA-based, application-optimized network processor attached to the PowerXCell 8i processor. We also compare hardware benchmarks to a previously presented performance analysis of lattice QCD codes on QPACE.


QPACE Architecture

o QPACE: Massively parallel architecture based on
  - Commodity processor: PowerXCell 8i
    • 200 GFlop/s single precision (peak)
    • 100 GFlop/s double precision (peak)
    • DDR2 memory controller (4 GB per node)
    • High-performance I/O interface (Rambus FlexIO)
  - Custom network controller directly attached to processor

QPACE System

o System
  - 32 node cards and 2 root cards per backplane
  - Booting and monitoring controlled by root card with Ethernet interface
  - 8 backplanes per rack
  - Liquid cooling solution: closed node card housing acts as heat conductor to liquid-cooled “cold plate” between two rows of node cards
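The packaging counts and per-node peak figures can be combined into per-rack and whole-system numbers. The following is a minimal sketch of that arithmetic; the function and constant names are illustrative only, and the 2048-node example corresponds to the planned infrastructure.

```python
# Packaging and peak figures quoted in the text.
NODE_CARDS_PER_BACKPLANE = 32
BACKPLANES_PER_RACK = 8
PEAK_SP_GFLOPS_PER_NODE = 200   # single precision, peak
PEAK_DP_GFLOPS_PER_NODE = 100   # double precision, peak
MEMORY_GB_PER_NODE = 4


def nodes_per_rack():
    """Number of node cards in one fully populated rack."""
    return NODE_CARDS_PER_BACKPLANE * BACKPLANES_PER_RACK


def aggregate(nodes):
    """Aggregate figures for a system built from `nodes` node cards."""
    return {
        "racks": nodes / nodes_per_rack(),
        "peak_sp_tflops": nodes * PEAK_SP_GFLOPS_PER_NODE / 1000,
        "peak_dp_tflops": nodes * PEAK_DP_GFLOPS_PER_NODE / 1000,
        "memory_gb": nodes * MEMORY_GB_PER_NODE,
    }


if __name__ == "__main__":
    print("nodes per rack:", nodes_per_rack())
    for key, value in aggregate(2048).items():
        print(f"{key}: {value}")
```

With the rounded per-node peaks, 2048 nodes fill 8 racks and give 409.6 TFlop/s (single precision) and 204.8 TFlop/s (double precision), in line with the roughly 400 and 200 TFlop/s quoted for the planned infrastructure.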

o Planned infrastructure consists of 2048 QPACE nodes
  - 400 TFlop/s single precision, 200 TFlop/s double precision, dedicated to lattice QCD simulations
  - About 1.3 W per GFlop/s (peak, double precision)
o Project planning
  - June 2008: bring-up of node card prototype
  - October 2008: bring-up of small prototype
  - February 2009: start of manufacturing of 2048 node cards
  - June 2009: deployment of large systems

[Figure labels: cold plate, node card housing]

o Torus network
  - Nearest-neighbour communication
  - 3-dimensional torus topology
  - 1 GByte/s per link per direction
  - Simultaneous send and receive along all 6 links
  - LS-to-LS DMA communications, low latency of 1 µs
o Tree network
  - Evaluation of global conditions and synchronization
  - Global exceptions
o Ethernet tree network
  - 1 Gigabit Ethernet link per node card to rack-level switches
  - ≥ 12 Gigabit Ethernet up-links (per rack) to file server
  - Linux network boot

Network Processor

o Implementation on FPGA (Xilinx Virtex5 LX110T)

o Network processor features
  - 2 FlexIO links (6 GBytes/s bandwidth)
  - 6 high-speed torus network links connected to external 10GbE PHYs
  - 1 Ethernet interface connected to external Gigabit Ethernet PHY
  - Interface to flash memory
  - Link to control processor
  - Global signal handling

Communication Model

o Communication mechanism
o User-level API
  - torus_recv_xm(...): provide credits for receiving data from the x- direction
  - torus_send_xp(...): send data in the x+ direction, with notification

Link Technology

o 10GbE PHY component (PMC Sierra PM8358)
  - Redundant links
  - On-chip 8-bit/10-bit encoding/decoding
  - 32-bit, 250 MHz parallel interface
o Test of physical links (2.5 Gbit/s per lane)
o Redundant PHY links allow for quick, software-controlled change of machine partitioning/topology, e.g., connection of 8 or 2 × 4 nodes in one dimension
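The link figures quoted in this document (2.5 Gbit/s per lane, 8-bit/10-bit encoding, a 32-bit 250 MHz parallel interface, 1 GByte/s per torus link and direction, 6 torus links, 6 GBytes/s FlexIO bandwidth) can be cross-checked with a short calculation. The sketch below assumes 4 lanes per link, as in a XAUI-style 10GbE PHY; the lane count is an assumption and is not stated in the text. All names are illustrative.

```python
import math

# Figures quoted in the text.
LANE_RATE_GBIT = 2.5           # Gbit/s per lane (physical link test)
ENCODING_EFFICIENCY = 8 / 10   # 8-bit/10-bit: 8 payload bits per 10 line bits
INTERFACE_WIDTH_BITS = 32      # parallel interface width
INTERFACE_CLOCK_MHZ = 250      # parallel interface clock
TORUS_LINKS = 6
FLEXIO_GBYTE_PER_S = 6.0

# ASSUMPTION: 4 lanes per link (XAUI-style PHY); not stated in the text.
LANES_PER_LINK = 4


def link_payload_gbyte_per_s(lanes=LANES_PER_LINK):
    """Payload bandwidth of one link in one direction, in GByte/s."""
    return lanes * LANE_RATE_GBIT * ENCODING_EFFICIENCY / 8


def parallel_interface_gbyte_per_s():
    """Bandwidth of the PHY's parallel interface, in GByte/s."""
    return INTERFACE_WIDTH_BITS * INTERFACE_CLOCK_MHZ * 1e6 / 8 / 1e9


def torus_aggregate_gbyte_per_s(lanes=LANES_PER_LINK):
    """Bandwidth of all torus links of one node in one direction, in GByte/s."""
    return TORUS_LINKS * link_payload_gbyte_per_s(lanes)


if __name__ == "__main__":
    print("payload per link :", link_payload_gbyte_per_s(), "GByte/s")
    print("parallel interface:", parallel_interface_gbyte_per_s(), "GByte/s")
    print("all torus links   :", torus_aggregate_gbyte_per_s(), "GByte/s")
    print("FlexIO            :", FLEXIO_GBYTE_PER_S, "GByte/s")
```

Under the 4-lane assumption, the serial payload rate and the parallel interface both come out at 1 GByte/s per link and direction, and six such links add up to the 6 GBytes/s quoted for the FlexIO connection.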

o Example topologies

  nodes      8      32     64     128    256    512    1024    2048
  topology   2×2×2  2×4×4  4×4×4  4×4×8  4×8×8  8×8×8  8×16×8  16×16×8

Software

o System software
  - Compilers available (GCC and XLC)
  - Communication library
  - Customized Linux kernel
o Application code optimization on SPE challenging
  - Instruction match
  - Data layout, LS management
o Leverage USQCD/SciDAC concept
  - Separation of high-level and low-level optimized kernels
  - Highly optimized kernels to be executed on SPE
    • Code written using intrinsics
    • Possibility to use assembly generators

Linear Algebra Benchmarks

o Implementation
  - Vectorized complex arithmetic (4-way SIMD)
  - Double buffering
  - Performance limited by load/store/shuffle pipeline if data is already in LS
  - Severely limited by main memory bandwidth if data must be transferred from main memory
o Single precision benchmarks (LS)

  version    caxpy   cdot   rdot   norm (a)
  xlc/asm    47%     42%    38%    49%
  model      50%     50%    50%    100%

o Single precision benchmarks (main memory)

  version    caxpy   cdot   rdot   norm
  xlc        3.2%    5.1%   2.7%   5.3%
  model      4.1%    6.3%   3.1%   6.2%

Wilson-Dirac Operator Benchmarks

o Standard Wilson-Dirac operator
  - Optimal data reuse
  - Multi-buffering scheme
  - LS-LS communications
  - Double precision lattice size up to L0 · 8^3 per node
  - Single precision lattice size up to L0 · 10^3 per node
o Performance estimates (in units of 1000 cycles)


                 single precision           double precision
  L1 × L2 × L3   8^3     4^3     2^3        8^3     4^3     2^3
  T_peak/L0      10.5    1.3     0.16       21      2.6     0.33
  T_FP/L0        14      1.7     0.21       27      3.4     0.42
  T_RF/L0        6       0.7     0.1        12      1.5     0.19
  T_mem/L0       30      3.8     0.35       61      7.7     0.69
  T_int/L0       2.5     0.6     0.15       5       1.2     0.29
  T_ext/L0       10      2.5     0.6        20      4.9     1.23
  T_EIB/L0       20      3       0.5        40      6.1     1.06
  ε_FP           34%     34%     27%        34%     34%     27%
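The tabulated efficiencies are numerically consistent with ε_FP = T_peak / max_i T_i, that is, with the largest of the individual time estimates setting the execution time while the others overlap with it. This relation is inferred from the numbers in the table and is not stated in the text, so the sketch below should be read as a consistency check under that assumption. The data layout and function names are illustrative.

```python
ROWS = ("FP", "RF", "mem", "int", "ext", "EIB")

# (precision, local lattice): (T_peak/L0, {row: T/L0}, tabulated eps_FP).
# All times are in units of 1000 cycles, copied from the table.
ESTIMATES = {
    ("single", "8^3"): (10.5, dict(zip(ROWS, (14, 6, 30, 2.5, 10, 20))), 0.34),
    ("single", "4^3"): (1.3, dict(zip(ROWS, (1.7, 0.7, 3.8, 0.6, 2.5, 3))), 0.34),
    ("single", "2^3"): (0.16, dict(zip(ROWS, (0.21, 0.1, 0.35, 0.15, 0.6, 0.5))), 0.27),
    ("double", "8^3"): (21, dict(zip(ROWS, (27, 12, 61, 5, 20, 40))), 0.34),
    ("double", "4^3"): (2.6, dict(zip(ROWS, (3.4, 1.5, 7.7, 1.2, 4.9, 6.1))), 0.34),
    ("double", "2^3"): (0.33, dict(zip(ROWS, (0.42, 0.19, 0.69, 0.29, 1.23, 1.06))), 0.27),
}


def bottleneck(times):
    """Name of the row with the largest time estimate."""
    return max(times, key=times.get)


def efficiency(t_peak, times):
    """ASSUMED relation: eps_FP = T_peak / max_i T_i (perfect overlap)."""
    return t_peak / max(times.values())


if __name__ == "__main__":
    for (precision, lattice), (t_peak, times, tabulated) in ESTIMATES.items():
        eps = efficiency(t_peak, times)
        print(f"{precision:6s} {lattice}: bottleneck={bottleneck(times):3s} "
              f"eps={eps:.3f} tabulated={tabulated:.2f}")
```

Under this reading, main memory access (T_mem) is the largest term for the 8^3 and 4^3 local lattices, and external communication (T_ext) is the largest term for 2^3. The recomputed values agree with the tabulated ones to within about one percentage point, which is the precision of the rounded table entries.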

o Benchmarks (one node, single precision)

  Lattice      L0 = 32   L0 = 64   L0 = 128   model
  L0 · 8^3     25%       26%       26%        34%
  L0 · 10^3    24%       25%       26%        34%

o Funded by the German Research Foundation (SFB/TR-55)

(a) Footnote to the norm column of the single precision linear algebra benchmarks (LS): the difference between benchmark and model is due to limited loop unrolling and sub-optimal instruction scheduling.
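For the linear algebra kernels run from main memory, the text states that performance is severely limited by main memory bandwidth. A rough bandwidth-bound estimate illustrates why the fractions of peak are only a few percent. This is a sketch and not the authors' performance model. It rests on assumptions that do not appear in the text: a memory bandwidth of 25.6 GByte/s, an unrounded single precision peak of 204.8 GFlop/s (the text quotes 200), and the per-element flop and byte counts given in the code comments.

```python
# ASSUMPTIONS (not from the text): peak memory bandwidth and unrounded peak.
MEM_BANDWIDTH_GBYTE_PER_S = 25.6
PEAK_SP_GFLOPS = 204.8

# ASSUMED per-element costs in single precision (a complex number is 8 bytes):
#   name: (floating-point operations, bytes moved to or from main memory)
KERNELS = {
    "caxpy": (8, 24),  # y = a*x + y: load x and y, store y
    "cdot": (8, 16),   # complex dot product: load x and y
    "rdot": (4, 16),   # real part of the dot product: load x and y
    "norm": (4, 8),    # squared norm: load x
}


def bandwidth_bound_fraction(flops, bytes_moved,
                             bandwidth=MEM_BANDWIDTH_GBYTE_PER_S,
                             peak=PEAK_SP_GFLOPS):
    """Fraction of peak reachable if memory traffic is the only limit."""
    sustained_gflops = flops / bytes_moved * bandwidth
    return sustained_gflops / peak


if __name__ == "__main__":
    for name, (flops, bytes_moved) in KERNELS.items():
        fraction = bandwidth_bound_fraction(flops, bytes_moved)
        print(f"{name:5s}: {100 * fraction:.2f}% of peak")
```

Under these assumptions the estimate gives about 4.2%, 6.3%, 3.1% and 6.3% of peak for caxpy, cdot, rdot and norm, close to the model row of the main-memory benchmark table (4.1%, 6.3%, 3.1%, 6.2%). The small differences indicate that the model used for the table differs in detail from this simplified estimate.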

