The QPACE Network Processor

QPACE Collaboration: H. Baier1, H. Boettiger1, M. Drochner2, N. Eicker2,3, U. Fischer1, Z. Fodor3, A. Frommer3, C. Gomez10, G. Goldrian1, S. Heybrock4, M. Hüsken3, D. Hierl4, T. Huth1, B. Krill1, J. Lauritsen1, T. Lippert2,3, J. McFadden1, T. Maurer4, N. Meyer4, A. Nobile4, I. Ouda6, M. Pivanti4,5, D. Pleiter7, A. Schäfer4, H. Schick1, F. Schifano8, H. Simma7,9, S. Solbrig4, T. Streuer4, K.-H. Sulanke7, R. Tripiccione8, J. S. Vogt1, T. Wettig4, F. Winter7

1IBM Böblingen, 2FZ Jülich, 3Univ. Wuppertal, 4Univ. Regensburg, 5INFN Trento, 6IBM Rochester, 7DESY Zeuthen, 8Univ. Ferrara, 9Univ. Milano Bicocca, 10IBM La Gaude
We present an overview of the design and implementation of the QPACE Network Processor. The Network Processor implements a standard Ethernet network and a high-speed communication network that allows for a tight coupling of the processing nodes. By using an FPGA we have the flexibility to further optimize our design and to adapt it to different application requirements.
QPACE

QPACE (QCD Parallel Computing on the Cell Broadband Engine) is a massively parallel supercomputer optimized for Lattice QCD calculations, providing a tight coupling of the processing nodes by a custom network:
[Node-card block diagram: PowerXCell 8i processor, memory, Network Processor (FPGA), torus network PHYs]

• Node Cards with IBM PowerXCell 8i and custom-designed Network Processor FPGA
• Nearest-neighbor 3D-torus network, 6 GByte/s communication bandwidth per node, remote LS-to-LS DMA communication, low latency, partitionable (see the bandwidth sketch after this list)
• Gigabit Ethernet network
• Global Signal Tree: evaluation of global conditions and synchronization
• 256 Node Cards per rack, 4 GByte memory per node
• 25.6 (51.2) TFlops peak per rack in double (single) precision
• Efficient, low-cost water-cooling solution, max. 33 kW per rack
• Capability machine (scalable architecture)
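The headline bandwidth and peak-performance figures follow directly from the per-link and per-node numbers quoted on this poster. The sketch below is a back-of-the-envelope check in C; the ~100/200 GFlops per-node peak values (double/single precision) are rounded assumptions for the PowerXCell 8i, not measured figures.

```c
#include <stdio.h>

int main(void)
{
    /* Torus link width and clock as quoted on this poster: XGMII, 32 bit at 250 MHz */
    double link_bw = 32.0 / 8.0 * 250e6;      /* = 1 GByte/s per link and direction */
    double node_bw = 6.0 * link_bw;           /* 6 nearest-neighbor links per node  */

    /* Per-rack peak, assuming the rounded values of ~100 GFlops (double precision)
     * and ~200 GFlops (single precision) per PowerXCell 8i node.                   */
    int    nodes   = 256;
    double dp_peak = nodes * 100e9;           /* 25.6 TFlops double precision */
    double sp_peak = nodes * 200e9;           /* 51.2 TFlops single precision */

    printf("torus bandwidth per node: %.1f GByte/s\n", node_bw / 1e9);
    printf("rack peak: %.1f TFlops (DP), %.1f TFlops (SP)\n",
           dp_peak / 1e12, sp_peak / 1e12);
    return 0;
}
```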
FPGAs

FPGAs are user-programmable hardware chips:
• Configured with a logic circuit diagram to specify how the chip will work
• Ability to update the functionality at any time offers vast advantages in development and operation

They are built up from basic elements called "slices", interconnected using "switch matrices". A slice is made up of a number of Flip-Flops, Look-Up Tables and Multiplexers:

• Construct desired logic by setting up a number of these elements (a conceptual sketch follows below)
• Trade-off between performance and resource usage

They also provide other primitives like Block RAMs, Ethernet MACs, processor cores, high-speed transceivers, etc.
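To make the slice description concrete, here is a purely conceptual model in C (FPGAs are of course configured in a hardware description language, not in C): a 4-input LUT is just a 16-entry truth table indexed by its inputs, and the flip-flop registers the LUT output on a clock edge. The struct and function names are illustrative only.

```c
#include <stdint.h>
#include <stdbool.h>

/* Conceptual model of one slice element: a 4-input look-up table (LUT)
 * plus a flip-flop. The 16-bit truth table is the "configuration";
 * choosing its contents selects which logic function the element computes. */
struct slice_element {
    uint16_t lut_truth_table;   /* one output bit per input combination */
    bool     ff_state;          /* registered output                    */
};

/* Combinational output: index the truth table with the 4 input bits. */
static bool lut_eval(const struct slice_element *e,
                     bool a, bool b, bool c, bool d)
{
    unsigned idx = (a << 0) | (b << 1) | (c << 2) | (d << 3);
    return (e->lut_truth_table >> idx) & 1u;
}

/* On a clock edge the flip-flop captures the LUT output. */
static void clock_edge(struct slice_element *e,
                       bool a, bool b, bool c, bool d)
{
    e->ff_state = lut_eval(e, a, b, c, d);
}
```

For example, lut_truth_table = 0x8000 makes the element a 4-input AND gate (only input combination 15 yields a 1); configuring an FPGA amounts to setting many such tables and the switch matrices that wire the elements together.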
Network Processor

The FPGA acts as Southbridge to the Cell processor. Its logic is designed to work as a fast network fabric:

[Network Processor block diagram: FlexIO (2 byte, 2.5 GHz) via RocketIO SerDes to the Cell BE; IBM logic with master and slave interfaces on the 128-bit, 208 MHz internal bus; inbound/outbound read and write controllers serving the 6 torus links (XGMII, 32 bit, 250 MHz) and the Ethernet transceiver (RGMII, 4 bit, 250 MHz, MDIO); DCR bus with configuration/status/version registers, UART, SPI master, flash reader, and global signals to the global signal tree; connections to the service processor and flash; logic blocks contributed partly by IBM and partly by the SFB]

• 2 20G links to the Cell BE: Rambus FlexIO, 16 bit at 2.5 GHz
• IBM-proprietary internal bus interface, 128 bit at 208 MHz
• 6 10G links to the torus network: XGMII, 32 bit at 250 MHz
• Gigabit Ethernet link for booting and disk I/O: RGMII, 4 bit at 250 MHz
• Serial interfaces: 2x UART, SPI, Global Signals
• Most logic controlled through the Device Control Register (DCR) Bus (see the register-access sketch below)
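As an illustration of register-level control of such a device, the following C sketch configures and polls a network-processor register through memory-mapped I/O. The base address, register offsets, bit masks, and the assumption of memory-mapped access are all hypothetical placeholders, not the actual QPACE DCR register map.

```c
#include <stdint.h>

/* Hypothetical base address and register offsets of the network
 * processor's control space; the real QPACE register map is not
 * reproduced here.                                                */
#define NWP_BASE          0xF0000000UL  /* assumed MMIO base       */
#define NWP_REG_VERSION   0x00          /* version register        */
#define NWP_REG_CONFIG    0x04          /* configuration register  */
#define NWP_REG_STATUS    0x08          /* status register         */
#define NWP_CFG_TORUS_EN  (1u << 0)     /* assumed enable bit      */
#define NWP_STAT_LINK_UP  (1u << 0)     /* assumed link-up flag    */

static volatile uint32_t *nwp_reg(uint32_t offset)
{
    return (volatile uint32_t *)(uintptr_t)(NWP_BASE + offset);
}

/* Enable the torus links and wait until the status register
 * reports that the links are up.                                  */
static int nwp_bring_up_torus(void)
{
    uint32_t version = *nwp_reg(NWP_REG_VERSION);
    (void)version;                       /* e.g. log the bitstream version */

    *nwp_reg(NWP_REG_CONFIG) |= NWP_CFG_TORUS_EN;

    for (int i = 0; i < 1000000; i++) {
        if (*nwp_reg(NWP_REG_STATUS) & NWP_STAT_LINK_UP)
            return 0;                    /* links up */
    }
    return -1;                           /* timeout  */
}
```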
We chose a Xilinx Virtex-5 LX110T-FF1738-3:

• just enough High Speed Serial Transceivers
• just enough pins to connect all 6 torus links
• highest speed grade and sufficient capacity to hold our logic
Major Challenges

• Implementing the FlexIO on an FPGA was (is) a major challenge; only possible due to special features of the Xilinx V5 RocketIO(TM) GTP Low-Power Transceivers [1]. However:
  – No 100% compatibility of Rambus FlexIO and Xilinx GTPs
  – Training of the interface has proven difficult
  – Designed logic cannot be re-used for other processors
• Designing at the edge of FPGA capabilities:
  – Complexity of logic limits internal bandwidths
  – On-chip debugging difficult due to high logic utilization rates
  – Routing of signals difficult due to large number of clocks
  – Most package pins (of largest package) used
Evaluation

FPGA resource utilization:

Resource        Used     out of   Utilization
Slices          16,029   17,280   92%
PINs            656      680      96%
LUT-FF Pairs    51,018   69,120   73%
Registers       38,212   69,120   55%
LUTs            36,939   69,120   53%
BlockRAM/FIFO   53       148      35%

This splits up into:
             Total    IBM Logic   Torus    Ethernet   IWC    OWC
FlipFlops    38,212   20,225      13,672   1,537      583    446
LUTs         36,939   16,915      14,252   894        132    642
Percent FF   100%     53%         36%      4%         1.5%   1.2%

(IWC/OWC = Inbound-/Outbound-Write Controller)
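The "Percent FF" row is each block's flip-flop count relative to the 38,212 flip-flops used in total, as this quick check in C reproduces:

```c
#include <stdio.h>

int main(void)
{
    /* Flip-flop counts from the table above: IBM Logic, Torus, Ethernet, IWC, OWC */
    const int total = 38212;
    const int ff[]  = { 20225, 13672, 1537, 583, 446 };

    for (int i = 0; i < 5; i++)
        printf("%.1f%%\n", 100.0 * ff[i] / total);   /* 52.9, 35.8, 4.0, 1.5, 1.2 */
    return 0;
}
```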
• Reaching target clock frequencies becomes very difficult as the FPGA fill rate increases. Current status:
  – FlexIO at 2 GHz, goal is 2.5 GHz (verified plain link at 3 GHz)
  – IBM Bus at 166 MHz, goal is 208 MHz (verified without torus logic at 208 MHz)
• Bandwidth currently up to 0.8 GByte/s per LS-to-LS link (bottleneck is development time and effort)
• Latency goal of 1 µs missed; current SPE-to-SPE latency is about 3 µs (long time between the start of the data-move operation in the processor and the data entering the links)

Further Reading

[1] I. Ouda and K. Schleupen, Application Note: FPGA to IBM Power Processor Interface Setup (2008).
[2] G. Goldrian et al., QPACE: Quantum Chromodynamics Parallel Computing on the Cell Broadband Engine, Comput. Sci. Eng. 10 (2008) 26.