Status of the QPACE Project

H. Baier1, H. Boettiger1, M. Drochner2, N. Eicker2,3, U. Fischer1, Z. Fodor3, G. Goldrian1, S. Heybrock4, D. Hierl4, T. Huth1, B. Krill1, J. Lauritsen1, T. Lippert2,3, T. Maurer4, J. McFadden1, N. Meyer4, A. Nobile5,6, I. Ouda7, D. Pleiter8, A. Schäfer4, H. Schick1, F. Schifano9, H. Simma5,8, S. Solbrig4, T. Streuer4, K.-H. Sulanke8, R. Tripiccione9, T. Wettig4, F. Winter8

1IBM Böblingen, 2FZ Jülich, 3Wuppertal Univ., 4Regensburg Univ., 5Milano Univ., 6ECT∗ Trento, 7IBM Rochester, 8DESY Zeuthen, 9Ferrara Univ.
We give an overview of the QPACE project, which is pursuing the development of a massively parallel, scalable supercomputer for LQCD. The machine is a three-dimensional torus of identical processing nodes, based on the PowerXCell 8i processor. The nodes are connected by an FPGA-based, application-optimized network processor attached to the PowerXCell 8i processor. We also compare hardware benchmarks to a previously presented performance analysis of lattice QCD codes on QPACE.
Introduction
QPACE Architecture
QPACE System
o QPACE: Massively parallel architecture based on
  F Commodity processor: PowerXCell 8i
    • 200 GFlop/s single precision (peak)
    • 100 GFlop/s double precision (peak)
    • DDR2 memory controller (4 GB per node)
    • High-performance I/O interface (Rambus FlexIO)
  F Custom network controller directly attached to the processor
o System
  F 32 node cards and 2 root cards per backplane; booting and monitoring are controlled by the root card via an Ethernet interface
  F 8 backplanes per rack
  F Liquid cooling solution: the closed node card housing acts as a heat conductor to a liquid-cooled “cold plate” between two rows of node cards
o Torus Network
  F Nearest-neighbour communication
  F 3-dimensional torus topology
  F 1 GByte/s per link and direction
  F Simultaneous send and receive along all 6 links
  F LS-to-LS DMA communication with a low latency of 1 µs
o Tree network
  F Evaluation of global conditions and synchronization
  F Global exceptions
o Ethernet tree network
  F 1 Gigabit Ethernet link per node card to rack-level switches
  F ≥ 12 Gigabit Ethernet up-links per rack to the file server
  F Linux network boot
o Planned infrastructure consists of 2048 QPACE nodes
  F 400 TFlop/s single precision, 200 TFlop/s double precision, dedicated to lattice QCD simulations
  F About 1.3 Watt per GFlop/s (peak, double precision)
o Project planning
  F June 2008: bring-up of node card prototype
  F October 2008: bring-up of small prototype
  F February 2009: start of manufacturing of 2048 node cards
  F June 2009: deployment of large systems

Network Processor

o Implementation on FPGA (Xilinx Virtex-5 LX110T)
o Features
  F 2 FlexIO links (6 GByte/s bandwidth)
  F 6 high-speed torus network links connected to external 10GbE PHYs
  F 1 Ethernet interface connected to an external Gigabit Ethernet PHY
  F Interface to flash memory
  F Link to control processor
  F Global signal handling

Link Technology

o 10GbE PHY component (PMC-Sierra PM8358)
  F Redundant links
  F On-chip 8-bit/10-bit encoding/decoding
  F 32-bit, 250 MHz parallel interface
o Test of physical links (2.5 Gbit/s per lane)

Communication Model

o Communication mechanism: LS-to-LS DMA transfers between neighbouring nodes (cf. the torus network above)
o User level API (usage sketched below):
  F torus_recv_xm(...); provide credits for receiving data from the x− direction
  F torus_send_xp(...); send data in the x+ direction and a notification
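A minimal usage sketch of this credit-based scheme for a boundary exchange in the x direction is given below. Only the two call names appear above, so the signatures, the buffer handling, and the omitted wait for the arrival notification are assumptions rather than the actual QPACE communication library.

    /* Hedged sketch: credit-based nearest-neighbour exchange along x.
     * Prototypes are assumed -- the real QPACE API may differ. */
    #include <stddef.h>

    void torus_recv_xm(void *ls_buf, size_t bytes);        /* grant credits for data arriving from x- */
    void torus_send_xp(const void *ls_buf, size_t bytes);  /* send LS data towards x+, with notification */

    void halo_exchange_x(void *recv_buf, const void *send_buf, size_t bytes)
    {
        /* 1. Grant credits first, so the neighbour's network processor is allowed
         *    to deposit the incoming data directly into this node's local store
         *    (LS-to-LS DMA, about 1 microsecond latency). */
        torus_recv_xm(recv_buf, bytes);

        /* 2. Send the local boundary data towards x+; arrival is signalled to the
         *    receiver by a notification (waiting for it would be a separate call,
         *    not listed above and therefore omitted here). */
        torus_send_xp(send_buf, bytes);
    }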
Topologies

o Redundant PHY links allow for a quick, software-controlled change of the machine partitioning/topology, e.g., connection of 8 or 2 × 4 nodes in one dimension
o Example topologies:

    nodes   topology
    8       2 × 2 × 2
    32      2 × 4 × 4
    64      4 × 4 × 4
    128     4 × 4 × 8
    256     4 × 8 × 8
    512     8 × 8 × 8
    1024    8 × 16 × 8
    2048    16 × 16 × 8

Software

o System software
  F Compilers available (GCC and XLC)
  F Communication library
  F Customized Linux kernel
o Application code optimization on the SPEs is challenging
  F Instruction match
  F Data layout, LS management
o Leverage USQCD/SciDAC concept
  F Separation of high-level and low-level optimized kernels (see the sketch below)
  F Highly optimised kernels to be executed on the SPEs
    • Code written using intrinsics
    • Possibility to use assembly generators
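The layering idea can be pictured with the following hedged sketch: high-level code is written against a generic kernel interface, while the performance-critical body is supplied either by a portable reference implementation or by an SPE-optimized version (intrinsics or generated assembly). All names are illustrative and not the actual USQCD/QPACE interfaces.

    /* Hedged sketch of the high-level / low-level separation. */
    #include <stddef.h>

    /* Low-level kernel interface: y <- a*x + y on n_complex complex numbers
     * stored as interleaved (re, im) pairs. */
    typedef void (*caxpy_kernel_t)(float *y, const float *x,
                                   float a_re, float a_im, size_t n_complex);

    /* Portable reference kernel; an SPE-optimized kernel with the same
     * signature can be substituted without touching the high-level code. */
    static void caxpy_ref(float *y, const float *x,
                          float a_re, float a_im, size_t n_complex)
    {
        for (size_t i = 0; i < n_complex; i++) {
            float xr = x[2*i], xi = x[2*i + 1];
            y[2*i]     += a_re * xr - a_im * xi;
            y[2*i + 1] += a_re * xi + a_im * xr;
        }
    }

    /* High-level code only sees the kernel interface, not the implementation. */
    void field_caxpy(caxpy_kernel_t kernel, float *y, const float *x,
                     float a_re, float a_im, size_t n_complex)
    {
        if (kernel == NULL)
            kernel = caxpy_ref;   /* fall back to the reference implementation */
        kernel(y, x, a_re, a_im, n_complex);
    }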
Linear Algebra Benchmarks

o Implementation
  F Vectorized complex arithmetic (4-way SIMD); an intrinsics-level sketch is given below
  F Double buffering
  F Performance limited by the load/store/shuffle pipeline if the data is already in the LS
  F Severely limited by the main memory bandwidth if the data must be transferred from main memory
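To illustrate what the vectorized complex arithmetic looks like at the intrinsics level, here is a hedged single-precision caxpy sketch for the SPE (the SPE-optimized counterpart of the reference kernel sketched in the Software section). The data layout (two complex numbers per vector float), the alignment assumptions, and all identifiers are illustrative and not the actual QPACE kernels.

    /* Hedged sketch: y <- a*x + y with 4-way SIMD SPU intrinsics.
     * Each vector float holds two complexes as (re0, im0, re1, im1);
     * n_complex must be even and the pointers 16-byte aligned. */
    #include <spu_intrinsics.h>

    void caxpy_spu(vector float *y, const vector float *x,
                   float a_re, float a_im, int n_complex)
    {
        /* byte pattern that swaps real and imaginary parts within each complex */
        const vector unsigned char swap_re_im = {
             4, 5, 6, 7,  0, 1, 2, 3, 12, 13, 14, 15,  8, 9, 10, 11 };

        vector float va_re = spu_splats(a_re);   /* ( a_re,  a_re,  a_re,  a_re) */
        vector float va_im = spu_splats(a_im);   /* build (-a_im, a_im, -a_im, a_im) */
        va_im = spu_insert(-a_im, va_im, 0);
        va_im = spu_insert(-a_im, va_im, 2);

        for (int i = 0; i < n_complex / 2; i++) {
            vector float xv  = x[i];                            /* (re0, im0, re1, im1) */
            vector float xsw = spu_shuffle(xv, xv, swap_re_im); /* (im0, re0, im1, re1) */
            vector float yv  = y[i];
            yv   = spu_madd(va_re, xv,  yv);   /* += ( a_re*re,  a_re*im, ...) */
            yv   = spu_madd(va_im, xsw, yv);   /* += (-a_im*im,  a_im*re, ...) */
            y[i] = yv;
        }
    }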
o Single Precision Benchmarks (LS)

    version    caxpy   cdot   rdot   norm (a)
    xlc/asm    47%     42%    38%    49%
    model      50%     50%    50%    100%

  (a) The difference between benchmark and model is due to limited loop unrolling and sub-optimal instruction scheduling.

o Single Precision Benchmarks (main memory)

    version    caxpy   cdot   rdot   norm
    xlc        3.2%    5.1%   2.7%   5.3%
    model      4.1%    6.3%   3.1%   6.2%

Wilson-Dirac Operator Benchmarks

o Standard Wilson-Dirac operator
  F Optimal data reuse
  F Multi-buffering scheme (see the double-buffering sketch below)
  F LS-to-LS communications
  F Double precision: lattice size up to L0 · 8³ per node
  F Single precision: lattice size up to L0 · 10³ per node
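The multi-buffering scheme can be illustrated by the following hedged sketch of double-buffered DMA streaming on one SPE, written with the standard MFC calls from spu_mfcio.h. The block size, function names, and the stub compute_block() are placeholders rather than the actual QPACE implementation.

    /* Hedged sketch: double-buffered streaming of data blocks from main memory
     * into the local store, overlapping DMA with computation. */
    #include <spu_mfcio.h>
    #include <stdint.h>

    #define BLOCK_BYTES 16384   /* illustrative block size, multiple of 128 */

    static char buf[2][BLOCK_BYTES] __attribute__((aligned(128)));

    /* Stand-in for the Wilson-Dirac kernel acting on one block. */
    static void compute_block(void *block) { (void)block; }

    void stream_blocks(uint64_t ea, int nblocks)
    {
        int cur = 0;

        /* prefetch the first block (DMA tag = buffer index) */
        mfc_get(buf[cur], ea, BLOCK_BYTES, cur, 0, 0);

        for (int i = 0; i < nblocks; i++) {
            int nxt = cur ^ 1;

            /* start fetching the next block while the current one is processed */
            if (i + 1 < nblocks)
                mfc_get(buf[nxt], ea + (uint64_t)(i + 1) * BLOCK_BYTES,
                        BLOCK_BYTES, nxt, 0, 0);

            /* wait only for the DMA of the current block */
            mfc_write_tag_mask(1u << cur);
            mfc_read_tag_status_all();

            compute_block(buf[cur]);   /* computation overlaps the DMA of nxt */
            cur = nxt;
        }
    }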
o Performance estimates (in units of 1000 cycles)

                    single precision           double precision
    L1 × L2 × L3    8³      4³      2³         8³      4³      2³
    Tpeak/L0        10.5    1.3     0.16       21      2.6     0.33
    TFP/L0          14      1.7     0.21       27      3.4     0.42
    TRF/L0          6       0.7     0.1        12      1.5     0.19
    Tmem/L0         30      3.8     0.35       61      7.7     0.69
    Tint/L0         2.5     0.6     0.15       5       1.2     0.29
    Text/L0         10      2.5     0.6        20      4.9     1.23
    TEIB/L0         20      3       0.5        40      6.1     1.06
    εFP             34%     34%     27%        34%     34%     27%
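The tabulated values of εFP are consistent with reading the model as a hard limit, εFP ≈ Tpeak / max(TFP, TRF, Tmem, Tint, Text, TEIB); this reading is our assumption (the model itself is defined in the previously presented performance analysis cited in the abstract). The snippet below merely evaluates it for the single-precision 8³ column.

    /* Illustration only: hard-limit reading of the performance model. */
    #include <stdio.h>

    int main(void)
    {
        double T_peak = 10.5;                                  /* kilo-cycles per L0 */
        double T[6]   = { 14.0, 6.0, 30.0, 2.5, 10.0, 20.0 };  /* TFP, TRF, Tmem, Tint, Text, TEIB */

        double T_exe = T[0];
        for (int i = 1; i < 6; i++)
            if (T[i] > T_exe)
                T_exe = T[i];                                  /* Tmem dominates in this column */

        printf("predicted efficiency ~ %.0f%%\n", 100.0 * T_peak / T_exe);  /* ~35%, table: 34% */
        return 0;
    }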
o Benchmarks (one node, single precision)

    Lattice     L0 · 8³   L0 · 10³
    L0 = 32     25%       24%
    L0 = 64     26%       25%
    L0 = 128    26%       26%
    model       34%       34%

Funding

o Funded by the German Research Foundation (SFB/TR-55)