The QPACE Network Processor

QPACE Collaboration: H. Baier1, H. Boettiger1, M. Drochner2, N. Eicker2,3, U. Fischer1, Z. Fodor3, A. Frommer3, C. Gomez10, G. Goldrian1, S. Heybrock4, M. Hüsken3, D. Hierl4, T. Huth1, B. Krill1, J. Lauritsen1, T. Lippert2,3, J. McFadden1, T. Maurer4, N. Meyer4, A. Nobile4, I. Ouda6, M. Pivanti4,5, D. Pleiter7, A. Schäfer4, H. Schick1, F. Schifano8, H. Simma7,9, S. Solbrig4, T. Streuer4, K.-H. Sulanke7, R. Tripiccione8, J. S. Vogt1, T. Wettig4, F. Winter7

1IBM Böblingen, 2FZ Jülich, 3Univ. Wuppertal, 4Univ. Regensburg, 5INFN Trento, 6IBM Rochester, 7DESY Zeuthen, 8Univ. Ferrara, 9Univ. Milano Bicocca, 10IBM La Gaude

We present an overview of the design and implementation of the QPACE Network Processor. The Network Processor implements a standard Ethernet network and a high-speed communication network that allows for a tight coupling of the processing nodes. By using an FPGA we have the flexibility to further optimize our design and to adapt it to different application requirements.

QPACE

QPACE (QCD Parallel Computing on the Cell Broadband Engine) is a massively parallel supercomputer optimized for Lattice QCD calculations, providing a tight coupling of processing nodes by a custom network:

• Node cards with an IBM PowerXCell 8i processor and a custom-designed Network Processor FPGA
• Nearest-neighbor 3D-torus network, 6 GByte/s communication bandwidth per node, remote LS-to-LS DMA communication, low latency, partitionable (see the addressing sketch below)
• Gigabit Ethernet network
• Global Signal Tree: evaluation of global conditions and synchronization
• 256 node cards per rack, 4 GByte memory per node
• 25.6 (51.2) TFlops single (double) precision per rack
• Efficient, low-cost water-cooling solution, max. 33 kW per rack
• Capability machine (scalable architecture)

[Node card block diagram: PowerXCell 8i processor, memory, Network Processor (FPGA), torus network PHYs]
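As a minimal illustration of nearest-neighbor addressing on a 3D torus (a sketch only, not QPACE software; the 8 x 8 x 4 node grid and the rank linearization are assumptions made for the example), the following C snippet computes the six neighbor ranks of a node with periodic boundary conditions:

    #include <stdio.h>

    /* Illustrative torus dimensions; a QPACE rack has 256 node cards,
     * e.g. an 8 x 8 x 4 grid (assumed here purely for the example). */
    #define NX 8
    #define NY 8
    #define NZ 4

    /* Linearize (x,y,z) coordinates into a node rank. */
    static int rank_of(int x, int y, int z) {
        return (z * NY + y) * NX + x;
    }

    /* Wrap a coordinate into [0, n) to get periodic (torus) boundaries. */
    static int wrap(int c, int n) {
        return (c % n + n) % n;
    }

    int main(void) {
        int x = 0, y = 3, z = 2;   /* some node in the grid */

        /* The six nearest neighbors along +/-x, +/-y, +/-z. */
        int neighbors[6] = {
            rank_of(wrap(x + 1, NX), y, z), rank_of(wrap(x - 1, NX), y, z),
            rank_of(x, wrap(y + 1, NY), z), rank_of(x, wrap(y - 1, NY), z),
            rank_of(x, y, wrap(z + 1, NZ)), rank_of(x, y, wrap(z - 1, NZ)),
        };

        for (int i = 0; i < 6; i++)
            printf("neighbor %d: rank %d\n", i, neighbors[i]);
        return 0;
    }

On QPACE the neighbor connectivity is fixed by the torus cabling and the network processor's six links; a sketch like this only shows the periodic nearest-neighbor structure those links implement.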

FPGAs

FPGAs are user-programmable hardware chips:

• Configured with a logic circuit diagram to specify how the chip will work
• Ability to update the functionality at any time offers vast advantages in development and operation

They are built up from basic elements called "slices", interconnected using "switch matrices". A slice is made up of a number of Flip-Flops, Look-Up Tables and Multiplexers (a Look-Up Table is sketched below):

• Construct the desired logic by setting up a number of these elements
• Trade-off between performance and resource usage

They also provide other primitives like Block RAMs, Ethernet MACs, processor cores, high-speed transceivers, etc.
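To make the Look-Up Table idea concrete: an n-input LUT is a small truth table filled in at configuration time, so it can realize any Boolean function of its inputs. The following C snippet is only a software analogy (not FPGA configuration code); it models a 4-input LUT as a 16-bit table and "configures" it as a 4-input AND:

    #include <stdint.h>
    #include <stdio.h>

    /* Software analogy of a 4-input LUT: a 16-entry truth table packed
     * into a 16-bit word. Bit i of `config` is the output for input
     * pattern i, where i = (d<<3) | (c<<2) | (b<<1) | a. */
    static int lut4(uint16_t config, int a, int b, int c, int d) {
        int index = (d << 3) | (c << 2) | (b << 1) | a;
        return (config >> index) & 1;
    }

    int main(void) {
        /* "Configure" the LUT as a 4-input AND: only input pattern 1111
         * (index 15) yields 1, so the truth table is 0x8000. */
        uint16_t and4 = 0x8000;

        printf("AND(1,1,1,1) = %d\n", lut4(and4, 1, 1, 1, 1)); /* 1 */
        printf("AND(1,0,1,1) = %d\n", lut4(and4, 1, 0, 1, 1)); /* 0 */
        return 0;
    }

On the Virtex-5 the LUTs have six inputs, and a slice combines several of them with flip-flops and multiplexers; the configuration bitstream fills in exactly these truth tables.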

Network Processor

The FPGA acts as a southbridge to the Cell processor. Its logic is designed to work as a fast network fabric:

• 2 20G links to the Cell BE: Rambus FlexIO, 16b at 2.5GHz
• 6 10G links to the torus network: XGMII, 32b at 250MHz
• Gigabit Ethernet link for booting and disk I/O: RGMII, 4b at 250MHz
• Serial interfaces: 2x UART, SPI, Global Signals
• IBM-proprietary internal bus interface, 128b at 208MHz
• Most logic controlled through the Device Control Register (DCR) bus (a register-access sketch follows below)

We chose a Xilinx Virtex-5 LX110T-FF1738-3:

• just enough High-Speed Serial Transceivers
• just enough pins to connect all 6 torus links
• highest speed grade and sufficient capacity to hold our logic

[Network Processor block diagram: RocketIO SerDes implementing the FlexIO interface (2 byte, 2.5GHz) to the Cell BE; IBM logic with Master and Slave Interfaces and Inbound/Outbound Read paths; Inbound-Write and Outbound-Write Controllers on the 128-bit, 208MHz internal bus; DCR Master, Flash Reader, UART, SPI Master, and Configuration/Status/Version registers; 6 Torus Links (XGMII, 32bit, 250MHz) to the torus transceivers; Ethernet (RGMII, 4bit, 250MHz; MDIO) to the Ethernet transceiver; Global Signals to the Global Signal Tree; connections to the service processor and to flash. Blocks are marked as implemented by IBM or by the SFB.]
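Purely as an illustration of this style of register-based control (the offsets, bit masks, function names and accessor helpers below are invented for the example and are not the actual QPACE register map or DCR access mechanism), a C sketch of a read-modify-write on such configuration and status registers might look like this:

    #include <stdint.h>

    /* Hypothetical register offsets and bits -- invented for illustration,
     * not the real QPACE network-processor register map. */
    #define NWP_REG_VERSION   0x00u      /* version register        */
    #define NWP_REG_STATUS    0x04u      /* link/status register    */
    #define NWP_REG_CONFIG    0x08u      /* configuration register  */
    #define NWP_CFG_TORUS_EN  (1u << 0)  /* enable torus links      */

    /* Volatile accessors over an already-mapped register window. */
    static inline uint32_t reg_read32(volatile uint32_t *base, uint32_t off) {
        return base[off / 4];
    }
    static inline void reg_write32(volatile uint32_t *base, uint32_t off,
                                   uint32_t val) {
        base[off / 4] = val;
    }

    /* Enable the torus links and return the design version. */
    uint32_t nwp_bring_up(volatile uint32_t *base) {
        uint32_t cfg = reg_read32(base, NWP_REG_CONFIG);
        reg_write32(base, NWP_REG_CONFIG, cfg | NWP_CFG_TORUS_EN);
        (void)reg_read32(base, NWP_REG_STATUS);   /* e.g. poll link state */
        return reg_read32(base, NWP_REG_VERSION);
    }

On PowerPC-family chips the DCR bus is a separate control interconnect, typically reached through dedicated instructions or a bridge rather than plain loads and stores, but the read-modify-write pattern for configuration, status and version registers is the same.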

Major Challenges

• Implementing the FlexIO on an FPGA was (is) a major challenge; it was only possible due to special features of the Xilinx V5 RocketIO(TM) GTP Low-Power Transceivers [1]. However:
  – No 100% compatibility between Rambus FlexIO and Xilinx GTPs
  – Training of the interface has proven difficult
  – The designed logic cannot be re-used for other processors
• Designing at the edge of FPGA capabilities:
  – Complexity of the logic limits internal bandwidths
  – On-chip debugging is difficult due to high logic utilization rates
  – Routing of signals is difficult due to the large number of clocks
  – Most package pins (of the largest package) are used

Evaluation

FPGA resource utilization:

  Resource        Used    Available  Utilization
  Slices          16,029  17,280     92%
  PINs            656     680        86%
  LUT-FF Pairs    51,018  69,120     73%
  Registers       38,212  69,120     55%
  LUTs            36,939  69,120     53%
  BlockRAM/FIFO   53      148        35%

This splits up into:

  Block                             FlipFlops  LUTs    Percent FF
  Total                             38,212     36,939  100%
  IBM Logic                         20,225     16,915  53%
  Torus                             13,672     14,252  36%
  Ethernet                          1,537      894     4%
  IWC (Inbound-Write Controller)    583        132     1.5%
  OWC (Outbound-Write Controller)   446        642     1.2%

• Reaching target clock frequencies becomes very difficult as the FPGA fill rate increases. Current status:
  – FlexIO at 2GHz, goal is 2.5GHz (verified plain link at 3GHz)
  – IBM Bus at 166MHz, goal is 208MHz (verified without torus logic at 208MHz)
• Bandwidth currently up to 0.8 GByte/s per LS-to-LS link (the bottleneck is development time and effort)
• Latency goal of 1µs missed; current SPE-to-SPE latency is about 3µs, due to the long time between the start of a data-move operation in the processor and data entering the links (see the transfer-time sketch below)
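As a back-of-the-envelope reading of these numbers (a sketch only, under a simple linear latency-plus-bandwidth model; the message sizes are arbitrary examples, while the 3 µs latency and 0.8 GByte/s per-link bandwidth are the status figures above):

    #include <stdio.h>

    int main(void) {
        /* Current per-link status figures from the evaluation above. */
        double latency_s  = 3e-6;    /* SPE-to-SPE latency, ~3 us    */
        double bw_bytes_s = 0.8e9;   /* LS-to-LS bandwidth, 0.8 GB/s */

        /* Simple linear model: T(n) = latency + n / bandwidth.
         * Message sizes are illustrative only. */
        double sizes[] = { 1e3, 16e3, 128e3 };

        for (int i = 0; i < 3; i++) {
            double t = latency_s + sizes[i] / bw_bytes_s;
            printf("%8.0f bytes: %6.2f us, effective %5.2f GB/s\n",
                   sizes[i], t * 1e6, sizes[i] / t / 1e9);
        }
        return 0;
    }

Under this model the latency term dominates small nearest-neighbor messages, which is why the 1 µs latency goal matters at least as much as the peak link bandwidth.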

Further Reading

[1] I. Ouda and K. Schleupen, Application Note: FPGA to IBM Power Processor Interface Setup (2008).
[2] G. Goldrian et al., QPACE: Quantum Chromodynamics Parallel Computing on the Cell Broadband Engine, Comput. Sci. Eng. 10 (2008) 26.
