The Elan5 Network Processor

Jon Beecroft, David Hewson, Fred Homewood, Duncan Roweth and Ed Turner.
To contact Quadrics send e-mail to [email protected].

Abstract—The Elan5 is a single-chip network processor that acts as a host adapter for high-speed network protocols. It handles both 10Gb Ethernet and the proprietary Quadrics protocols developed for ultra-low-latency communication in High Performance Computing applications. To provide flexibility in the choice of protocols, the device is implemented as an array of identical RISC processors, which can be dedicated to tasks such as input packet handling and host memory DMA handling.
The Elan5 has seven Packet Processing Engines: one assigned to device management, two dedicated to link input, and four available for output packet generation and for processing requests from remote Elans. The PPEs are identical except for the two connected to the network links, which have additional input and output buffers. Although each input buffer is owned by its respective PPE, the output buffers can be accessed by any PPE.
Index Terms—10GbE, HPC, Ethernet, QsNet
INTRODUCTION
The Elan5 network processor marks a significant departure from the preceding generations of high-speed network interface devices [1] in several respects. The device implements a standards-based protocol, IEEE 802.3ae [2], in addition to the ultra-low-latency protocols developed for supercomputing applications. This choice was dictated by the requirement to address wider markets to offset the increasing development costs of the latest generations of CMOS technology, while at the same time maintaining a performance advantage when used in supercomputer-class systems.
Device Architecture

The requirement to implement multiple protocols necessitates a much greater degree of programmability in the underlying architecture than in preceding generations of the device. For example, the only form of DMA supported by Elan4 is a simple RDMA operation in which a contiguous memory block is transferred between nodes. By implementing the DMA function in a processor instead of a dedicated state machine, it is possible to support more complex functions, such as Ethernet RDMA and various forms of scatter-gather operation. The Elan5's processing power is provided by a pool of Packet Processing Engines (PPEs) designed specifically to support high-speed streaming of data. Multiple PPEs are provided so as to maximize the packet rate and minimize the packet size necessary to saturate link bandwidth. The PPEs share a multi-bank on-chip memory system; communication between the PPEs, the memory system and the host interface is provided by a high-speed internal fabric. Each PPE consists of a dual-issue 500MHz RISC processor core optimized for data communication tasks, a 16Kbyte instruction cache, a 9Kbyte DMA buffer, and a port connecting it to the on-chip internal fabric.
Figure 1 - Elan5 Network Processor Architecture

All seven PPEs are connected through the on-chip fabric to a multi-bank memory system with built-in hardware buffer management. A dedicated host address path routes host memory references via the on-chip MMU to the host bus interface, a PCI Express interface supporting version 1 and version 2 protocols. The on-board MMU allows the Elan5 to replicate the virtual memory translations of the host processor. The MMU supports variable page sizes; up to 8 different page sizes can be in use at any time. Its TLB includes 64 fully associative tags, with each tag referencing up to 16 consecutive page table entries (PTEs), for a total of 1024 translations. Other functional units include the Command Launch Unit (CLU), the Object Cache, the external EEPROM interfaces, and the external SDRAM interface and cache.
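The TLB organization described above (64 fully associative tags, each covering up to 16 consecutive pages) can be illustrated with a minimal lookup sketch. Field and function names here are hypothetical, not the device's own; the intent is only to show how one tag can serve 16 translations of a per-tag page size.

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_TAGS      64
#define PTES_PER_TAG  16   /* 64 x 16 = 1024 cached translations */

typedef struct {
    bool     valid;
    uint64_t vbase;              /* virtual address of first page covered   */
    unsigned page_shift;         /* per-tag page size, e.g. 12 for 4K pages */
    uint64_t pte[PTES_PER_TAG];  /* physical page frames, one per page      */
} tlb_tag_t;

/* Return the physical address for 'va', or 0 on a miss (a miss on the real
 * device would trigger a walk of the page tables held in Elan memory). */
static uint64_t tlb_translate(const tlb_tag_t tags[TLB_TAGS], uint64_t va)
{
    for (int i = 0; i < TLB_TAGS; i++) {
        const tlb_tag_t *t = &tags[i];
        if (!t->valid)
            continue;
        uint64_t page_size = 1ULL << t->page_shift;
        uint64_t span      = page_size * PTES_PER_TAG;
        if (va >= t->vbase && va < t->vbase + span) {
            unsigned idx    = (unsigned)((va - t->vbase) >> t->page_shift);
            uint64_t offset = va & (page_size - 1);
            return t->pte[idx] | offset;
        }
    }
    return 0; /* miss */
}
```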
The CLU is responsible for taking commands from the host processor and directing them to an appropriate PPE. The CLU supports up to 2048 separate command queues. Commands are issued by PIO write operations from the host processor across the PCI Express bus. The CLU assigns each command to a queue based on its physical address and signals the associated PPE, scheduling the task that processes the queue. The Object Cache is a software-controlled content addressable memory (CAM) that provides efficient support for looking up objects in local memory based on a key supplied as part of a command or network packet. The cache is a resource shared amongst the PPEs. We use it for looking up packet sequencers, queues and protocol control blocks. The CAM supports up to 256 entries that can be rapidly validated against a 32 or 64 bit key presented by the PPE.
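The following is a minimal sketch of the kind of keyed lookup the Object Cache performs: up to 256 entries compared against a 32- or 64-bit key, with a hit yielding the location of the cached object. Names and field layout are illustrative assumptions only; the hardware compares all entries in parallel rather than iterating.

```c
#include <stdint.h>
#include <stdbool.h>

#define OBJ_CACHE_ENTRIES 256

typedef struct {
    bool     valid;
    uint64_t key;        /* e.g. a queue or protocol control block identifier */
    uint32_t local_addr; /* offset of the object in on-chip local memory      */
} obj_cache_entry_t;

static bool obj_cache_lookup(const obj_cache_entry_t cam[OBJ_CACHE_ENTRIES],
                             uint64_t key, uint32_t *local_addr)
{
    for (int i = 0; i < OBJ_CACHE_ENTRIES; i++) {
        if (cam[i].valid && cam[i].key == key) {
            *local_addr = cam[i].local_addr;
            return true;   /* hit: object already resident in local memory  */
        }
    }
    return false;          /* miss: firmware fetches or creates the object  */
}
```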
Packet Processing Engine

The packet processing engines are based around a proprietary RISC processor optimized for use in communication applications. The basic design is a 32-bit processor capable of supporting up to 8 outstanding loads. This high number of outstanding loads is required to tolerate the long latency of host memory accesses. Load and store operations can be for any power-of-two number of bytes between one and 128. The processor can address up to 64 of its 128 registers, with flexible register windowing providing efficient subroutine call support. Each PPE can execute two instructions per cycle for a single thread, or one instruction per cycle from each of two threads. The ability to support two simultaneous threads is useful for code segments where data has to be streamed in and out of a FIFO, with one thread acting as the producer and the other as the consumer. Fork and Join instructions allow the processor to rapidly switch in and out of dual-thread mode.

The primary role of the PPE is high-speed streaming and re-alignment of data. Even with a relatively large number of registers, the host memory latency is such that it is not possible to code a host memory copy routine that transfers data through registers and sustains full bandwidth on the host bus interface. Host memory latencies vary greatly, from 250ns for a fast dual-CPU node to 1500ns for a high CPU count SMP. Assuming a host memory latency in the region of a microsecond and a host memory bandwidth of around 2.8Gbytes/s (16-lane PCI Express at 70% efficiency), at any time at least 2.8Kbytes of memory or registers must be reserved for data returning from outstanding read requests. The PPE therefore provides a 9Kbyte DMA buffer. Up to 8 outstanding DMA loads and 3 outstanding store operations can be issued by each PPE; they complete by transferring data to or from this buffer rather than the PPE's registers. Operations may be any number of bytes up to 4Kbytes, the maximum transaction size permitted by PCI Express. The buffer has associated scoreboarding logic, which ensures that if a load is issued to bring data into the DMA buffer, followed by a dependent store (i.e. one which overlaps that region of buffer memory), the store will not be scheduled until the load has completed. The PPEs support on-the-fly checksum and CRC generation for data being streamed through the DMA buffers. Support is also provided for generating packets from the PPE's registers, allowing it to modify data in the DMA buffer before transmitting it to the link.
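The buffering requirement quoted above is simply the bandwidth-delay product of the host interface. A quick worked calculation, using the figures from the text:

```c
#include <stdio.h>

int main(void)
{
    double bandwidth = 2.8e9;   /* bytes/s: 16-lane PCIe at ~70% efficiency */
    double latency   = 1.0e-6;  /* seconds: ~1us host memory read latency   */

    double in_flight = bandwidth * latency;   /* bytes outstanding at once  */
    printf("required buffering: %.1f Kbytes\n", in_flight / 1024.0);
    /* ~2.7 Kbytes, comfortably within the 9 Kbyte per-PPE DMA buffer */
    return 0;
}
```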
Internal Fabric & Buffer Memory

The internal fabric allows each of the seven PPEs, and the host bus interface, to communicate with each other and access the on-chip memory. Each port onto the fabric has separate 64-bit read and write paths, for a per-port bandwidth of 8Gbytes/s and a total fabric bandwidth of 64Gbytes/s at 500MHz. The protocols and clocking strategy for the fabric were designed to be tolerant of delays in crossing the chip, simplifying the task of meeting timing closure, which can be a significant challenge in large sub-micron chip designs. The memory system connected to the fabric consists of 512Kbytes of ECC-corrected SRAM, arranged as 8 banks each 64 bits wide. The individual banks support simultaneous reads and writes. Data is striped across the banks in 64-bit words, so that multiple block accesses to the memory proceed without contention once the first access has been scheduled. A hardware buffer manager is implemented to ensure efficient utilization of the on-chip memory. It supports up to 1024 separately controlled buffer structures, each of which can be either a simple FIFO that takes streamed data and is emptied as it is read, or an object buffer, which can be read and written multiple times and then returned to a free list when no longer in use.
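A minimal sketch of the word-level striping described above: consecutive 64-bit words of a block land in successive banks, so back-to-back accesses of a streamed transfer do not contend once the first access has been issued. The decode below is an assumption about how such interleaving is typically arranged, not a statement of the device's exact address map.

```c
#include <stdint.h>

#define NUM_BANKS  8
#define WORD_BYTES 8              /* banks are 64 bits wide              */

typedef struct {
    unsigned bank;                /* which of the 8 banks holds the word */
    uint32_t row;                 /* word index within that bank         */
} bank_addr_t;

static bank_addr_t stripe_decode(uint32_t byte_addr)
{
    uint32_t word = byte_addr / WORD_BYTES;
    bank_addr_t a = {
        .bank = word % NUM_BANKS, /* low word bits select the bank       */
        .row  = word / NUM_BANKS, /* remaining bits index within a bank  */
    };
    return a;
}
```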
Network Interface

The Elan5 has two DDR XAUI link interfaces, each consisting of four high-speed SerDes operating at 6.25Gbit/s for the QsNet low-latency remote memory protocols, or at 3.125Gbit/s for CX4 Ethernet operation. Each link has a 32Kbyte input buffer and a 32Kbyte output buffer. Each link is controlled by a PPE, and only that PPE can access the input buffer. The output buffers can be accessed by any of the PPEs writing across the fabric, allowing any PPE to generate output packets. Each link supports up to 64 pending packets.
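As a rough guide to the data rates these SerDes speeds imply, the arithmetic below assumes the standard XAUI 8b/10b line encoding (an assumption; the text does not state the encoding used in QsNet mode):

```c
#include <stdio.h>

int main(void)
{
    const int lanes       = 4;
    const double encoding = 8.0 / 10.0;           /* 8b/10b overhead       */

    double enet  = lanes * 3.125e9 * encoding;    /* CX4 Ethernet mode     */
    double qsnet = lanes * 6.25e9  * encoding;    /* QsNet (DDR XAUI) mode */

    printf("Ethernet mode: %.0f Gbit/s\n", enet / 1e9);          /* 10     */
    printf("QsNet mode:    %.0f Gbit/s (%.1f Gbytes/s)\n",
           qsnet / 1e9, qsnet / 8 / 1e9);                        /* 20, 2.5 */
    return 0;
}
```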
PCI Express Interface

The host interface supports PCI Express version 1.0 with up to 16 lanes, or PCI Express version 2.0 with up to 8 lanes. The interface has been designed to be as simple as possible, with much of the management and configuration functionality implemented in software running on the PPEs rather than as state machines within the interface.
The interface is designed to support many outstanding reads to service the requirements of the multiple PPEs. Up to 32 concurrent register load operations can be supported. For register loads, the data is not transferred to the processor until the PCI Express bus CRC has been checked; this requires local buffering in the PCI Express interface. To avoid having to buffer the larger DMA loads in the PCI Express interface, they are allowed to complete directly to the DMA buffers, but are tagged as unchecked until the final CRC is validated. The DMA buffer scoreboarding uses this tag to block store operations that use the data until it has been both loaded and checked.
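A minimal sketch of the scoreboard rule just described: a store that reads a region of the DMA buffer is held back until every load targeting that region has both completed and passed its PCI Express CRC check. Structure and names are illustrative assumptions, not the actual scoreboard implementation.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     pending;    /* load issued, data not yet in the buffer       */
    bool     unchecked;  /* data arrived but final CRC not yet validated  */
    uint32_t offset;     /* region of the 9K DMA buffer being filled      */
    uint32_t length;
} dma_load_t;

static bool regions_overlap(uint32_t a_off, uint32_t a_len,
                            uint32_t b_off, uint32_t b_len)
{
    return a_off < b_off + b_len && b_off < a_off + a_len;
}

/* A dependent store may be scheduled only when no overlapping load is still
 * pending or unchecked. */
static bool store_may_issue(const dma_load_t *loads, int nloads,
                            uint32_t store_off, uint32_t store_len)
{
    for (int i = 0; i < nloads; i++) {
        if ((loads[i].pending || loads[i].unchecked) &&
            regions_overlap(loads[i].offset, loads[i].length,
                            store_off, store_len))
            return false;
    }
    return true;
}
```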
Elan5 supports secure multi-user access to QsNet through job-specific capabilities that describe the rights of each user, and through network context numbers assigned to each packet by the outputting PPE.
External Memory

The Elan5 provides a 32-bit DDR2 SDRAM interface for applications requiring larger amounts of adapter memory. This approach offers significantly lower (and more predictable) latency than accessing host memory, which may have to wait for lengthy DMA load operations to complete. An external EEPROM interface is also provided for a ROM that stores the Elan5 boot code.
Fault Tolerance

Elan5 is designed for High Performance Computing (HPC) systems consisting of thousands of commodity servers. Fault tolerance of the system is of critical importance; the sheer number of components makes errors in the memory systems and packet corruption a frequent occurrence. For example, with 1024 nodes each driving a single link at 2Gbytes/s, we would expect to see between 10 and 20 corrupt packets per second with a bit error rate (BER) of 10^-15. The approach taken on Elan5 is to use ECC to protect all memories containing state that is difficult to recreate (the SRAM, the DMA buffers, the cache and the local memory interface). Parity is sufficient for the instruction caches. Data transfers (and the input/output buffers) are protected by an end-to-end 32-bit CRC on each packet. PCI Express has its own CRCs.
Software Model

The allocation of processing tasks to the pool of available PPEs is controlled by the device firmware. Common firmware is loaded from ROM as the Elan boots; it initializes the PCI Express interface and local memory. Application-specific firmware modules (QsNet or Ethernet) are then loaded by the device driver. The libelan5 library provides direct user-space access to the device. The libelan library provides a device-independent binary interface, with optimized communications primitives (message passing, put/get, collectives), to a variety of widely used parallel programming models; their implementation is common to Elan5 and the existing Elan3 and Elan4 adapters.

Figure 2: Elan5 software stack

Software Model – QsNet

When used as a QsNet device, two PPEs are dedicated to input handling, taking packets from the links and either executing them directly or (typically when they contain requests from other Elans) dispatching them to one of the other PPEs. Two PPEs are dedicated to handling large DMA requests, processing DMA descriptors received from the host (put) or from remote Elans (get). Two PPEs are assigned to short put/get requests, where the address and data are supplied with the command. The management PPE performs configuration and status monitoring tasks. Use of the device is illustrated with several of the tasks common to HPC applications. Our first example illustrates how a short put is executed, a common operation in Partitioned Global Address Space (PGAS) programming models such as Shmem [3], ARMCI [4] and UPC [5]. The user process formats a command block containing the put command itself, the destination virtual address, the destination virtual process number (sometimes called the rank) and the data, as sketched below. The command block is then written to an Elan5 command queue as a PIO write. The Command Launch Unit assigns the command to a PPE, and the PPE builds a packet and forwards it to the link.
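The sketch below illustrates the short-put sequence just described: a command block carrying the destination rank, destination virtual address and inline data, issued to a mapped command queue with a PIO write. The command layout, opcode value and names are hypothetical; the real format is defined by the Elan5 firmware.

```c
#include <stdint.h>
#include <string.h>

#define ELAN5_CMD_SHORT_PUT 0x01   /* illustrative opcode value */

typedef struct {
    uint32_t opcode;        /* short put                                   */
    uint32_t dest_vp;       /* destination virtual process number (rank)   */
    uint64_t dest_vaddr;    /* destination virtual address                 */
    uint32_t length;        /* number of payload bytes                     */
    uint8_t  data[64];      /* inline payload carried with the command     */
} elan5_short_put_t;

/* 'cmdq' is the user's command queue, mapped into the process address space.
 * The CLU identifies the queue from the physical address of the PIO write and
 * schedules a PPE to build and send the packet. */
static void short_put(volatile elan5_short_put_t *cmdq, uint32_t dest_vp,
                      uint64_t dest_vaddr, const void *buf, uint32_t len)
{
    elan5_short_put_t cmd = {
        .opcode     = ELAN5_CMD_SHORT_PUT,
        .dest_vp    = dest_vp,
        .dest_vaddr = dest_vaddr,
        .length     = len,
    };
    memcpy(cmd.data, buf, len);             /* assumes len <= sizeof cmd.data */
    memcpy((void *)cmdq, &cmd, sizeof cmd); /* PIO write across PCI Express   */
}
```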
Figure 3: Execution of a short put on Elan5

On the input side the PPE takes the packet from the link and writes the data directly to host memory. The destination virtual address is translated to a physical address by the MMU. An acknowledgement (ACK) is returned to the source PPE on successful completion. In the event of an error or trap a negative acknowledgement (NACK) is returned and the source retransmits. On Elan5, put operations are purely one-sided: they involve the source process and the Elans; the destination process does not participate and there is no need to run a thread on the main CPU. It is inefficient to transfer large amounts of data using PIO writes, so above a certain (programmable) threshold we write a DMA descriptor to the source Elan instead of the data, as sketched below.
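This is a minimal sketch of that size-based choice: small puts carry their data inline with the command, while larger transfers hand the Elan a DMA descriptor instead. The threshold value, descriptor layout and names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

#define PIO_THRESHOLD 128          /* illustrative crossover point, bytes    */

typedef struct {                   /* descriptor for an Elan-driven DMA      */
    uint64_t src_vaddr;            /* source buffer in host virtual memory   */
    uint64_t dest_vaddr;           /* destination virtual address            */
    uint32_t dest_vp;              /* destination rank                       */
    uint32_t length;
} elan5_dma_desc_t;

/* Stand-in for the PIO write to the mapped command queue. */
static void write_command(const void *cmd, size_t len) { (void)cmd; (void)len; }

void elan5_put(uint32_t dest_vp, uint64_t dest_vaddr, const void *buf, size_t len)
{
    if (len <= PIO_THRESHOLD) {
        /* short put: data travels with the command (see the earlier sketch) */
        write_command(buf, len);
    } else {
        /* large put: only the descriptor is written; a general-purpose PPE
         * then issues DMA reads from host memory and generates the packets  */
        elan5_dma_desc_t desc = {
            .src_vaddr  = (uint64_t)(uintptr_t)buf,
            .dest_vaddr = dest_vaddr,
            .dest_vp    = dest_vp,
            .length     = (uint32_t)len,
        };
        write_command(&desc, sizeof desc);
    }
}
```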
Figure 4: Execution of a DMA transfer on Elan5

The CLU schedules the command on a general-purpose PPE, which issues DMA reads to the host interface and packet writes to the network. The input PPE on the destination node consumes the packets, writing the data to memory and returning ACKs to the source.

The PGAS models also require support for fetching data from a remote process (get). Elan5 implements get operations by sending a request to the destination Elan, which performs the load (or DMA read) and puts the resulting data back to the source. Again the operation is purely one-sided, being executed by the initiating process and the Elans.

The Elan5 libraries provide a highly optimized implementation of MPI message passing [6], supporting both RDMA and independent progression of MPI messages. In an MPI application the source processes send messages and the destination processes receive them; a transfer completes when a send is matched to a receive, using either the source rank or an integer tag. In a well-written program, non-blocking receives will have been posted early so as to hide the time taken to transfer data. On Elan5 the MPI tag matching is performed by one or more PPEs. The receiving process writes MPI receive descriptors to a queue in the Elan. The sender generates a command block containing the message header and (optionally) a small amount of data. This header is transmitted to the destination (step 1 in Figure 5) as in the short put example, except that it is a put to a remote queue rather than a put to memory. The input PPE recognizes the packet type and assigns the header to the user's queue.
Figure 5: Execution of an MPI message on Elan5

Arrival of data in the packet queue schedules the matcher thread on one of the general-purpose PPEs, which determines whether there is a matching receive (step 2). In the short message case, data supplied with the header can then be written directly to memory (the receive descriptor contains the user virtual address), completing the transfer. In the large message case, the matcher thread sends a get request to the source (step 3), which then completes the transfer with a DMA (step 4). If no matching receive is available, the header is buffered until a receive is posted. The QsNet firmware implements these features together with support for executing collectives in the adapter.
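A minimal sketch of the matching rule the matcher thread applies: an incoming header matches a posted receive when both the source rank and the tag agree, with wildcards allowed as in MPI. The queue layout, field names and wildcard values are illustrative assumptions, not the firmware's actual data structures.

```c
#include <stdint.h>
#include <stdbool.h>

#define ANY_SOURCE  (-1)
#define ANY_TAG     (-1)

typedef struct {
    int32_t  source;       /* sender rank, or ANY_SOURCE                    */
    int32_t  tag;          /* message tag, or ANY_TAG                       */
    uint64_t user_vaddr;   /* where matched data should land in host memory */
    uint32_t length;
    bool     posted;
} recv_desc_t;

typedef struct {
    int32_t  source;
    int32_t  tag;
    uint32_t length;       /* payload length; small payloads arrive inline  */
} msg_header_t;

/* Return the index of the first posted receive matching the header,
 * or -1 so the header can be buffered until a receive is posted. */
static int match_receive(const recv_desc_t *q, int nposted,
                         const msg_header_t *hdr)
{
    for (int i = 0; i < nposted; i++) {
        if (!q[i].posted)
            continue;
        bool src_ok = (q[i].source == ANY_SOURCE || q[i].source == hdr->source);
        bool tag_ok = (q[i].tag    == ANY_TAG    || q[i].tag    == hdr->tag);
        if (src_ok && tag_ok)
            return i;  /* short case: copy inline data to q[i].user_vaddr;
                          long case: send a get request back to the source  */
    }
    return -1;
}
```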
Software Model – Ethernet

When used as an Ethernet device, two PPEs are dedicated to output packet preparation. This entails fetching the data from host memory, realigning it from the host memory layout to the packet alignment, preconditioning it (for example, preparing RDMA framing data), and generating the TCP/IP headers. The two input processes associated with the links handle packet recognition and classification, header checksum processing and RDMA CRC checking. The task of working out where to place the packets in memory, and of writing the packets to host memory, is performed by two protocol processes, which also generate the TCP/IP acknowledgements. The management PPE performs housekeeping tasks such as configuration, retransmission scheduling, and status monitoring.
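For reference, the header checksum the input PPEs verify is the standard Internet checksum (RFC 1071); a straightforward software rendering is shown below. The PPEs perform the equivalent computation on the fly as data streams through their DMA buffers, so this is only an illustration of the arithmetic, not of the firmware.

```c
#include <stdint.h>
#include <stddef.h>

static uint16_t internet_checksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t sum = 0;

    while (len > 1) {               /* sum successive 16-bit words            */
        sum += (uint32_t)(p[0] << 8 | p[1]);
        p   += 2;
        len -= 2;
    }
    if (len)                        /* pad a trailing odd byte with zero      */
        sum += (uint32_t)(p[0] << 8);

    while (sum >> 16)               /* fold carries back into the low 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;          /* one's-complement of the sum            */
}
```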
Implementation

The Elan5 has a number of different clock domains, determined by the speeds of the various external interfaces. Eight on-chip phase-locked loops allow the domains to be clocked asynchronously, easing timing closure and removing complex dependencies. The PPEs, fabric and local memory are clocked at 500MHz, giving a total instruction throughput of 7Gops (seven dual-issue PPEs at 500MHz). The device is implemented on a standard TSMC 90nm, 8-layer metal process and measures 10mm x 10mm. The total gate count for the functional logic is 4.5 million gates, and the total on-chip memory exceeds one Mbyte. The memories include Built-In Self Repair (BISR) to improve manufacturing yield. A full-scan methodology is used for silicon manufacturing test, and JTAG boundary scan is included for manufactured-board connectivity testing. The Elan5 is packaged in a 672-ball high-performance flip-chip ball grid array. Flip-chip packaging was selected to meet the signal integrity requirements of the high-speed serial links; Elan5 uses a semi-custom package, with particular attention paid to the routing of the differential pairs for the links. Worst-case power dissipation for the device is 17 Watts, requiring the use of a thermally enhanced package with a copper heat spreader. Typical power consumption is significantly lower, around 12W for an x8 PCI Express interface and a single link, and lower still when the SDRAM interface is not required.
Conclusion

The architecture of the Elan5 enables a single device to support a range of different communication protocols. Performing the protocol handling in firmware allows the device to provide a high level of communications processing off-load, without the complexity and verification challenges of a custom hardware implementation. The architecture has been designed so that future variants can scale the number of links and the number of packet processing engines, and can utilize higher-bandwidth host interfaces.
REFERENCES
[1] Jon Beecroft, David Addison, Fabrizio Petrini and Moray McLaren, "Quadrics QsNetII: A Network for Supercomputing Applications", in Proceedings of Hot Chips 15, Palo Alto, California, 2003. See http://www.quadrics.com/
[2] IEEE 802.3ae network standard, IEEE, Piscataway, NY.
[3] Shmem, first referenced in the Cray T3E programmer's guide.
[4] Jarek Nieplocha and Bryan Carpenter, "ARMCI: A Portable Remote Memory Copy Library for Distributed Array Libraries and Compiler Run-time Systems", Lecture Notes in Computer Science, Vol. 1586. See http://www.emsl.pnl.gov/docs/parsoft/armci/
[5] UPC Consortium, "UPC Language Specifications, v1.2", Lawrence Berkeley National Lab Tech Report LBNL-59208, 2005. See http://upc.gwu.edu/
[6] The MPI Forum, "MPI: A Message-Passing Interface Standard". See http://www.mcs.anl.gov/mpi/