directCell: Hybrid systems with tightly coupled accelerators
H. Penner, U. Bacher, J. Kunigk, C. Rund, and H. J. Schick
The Cell Broadband Engine (Cell/B.E.) processor is a hybrid IBM PowerPC processor. In blade servers and PCI Express card systems, it has been used primarily in a server context, with Linux as the operating system. Because neither Linux as an operating system nor a PowerPC processor-based architecture is the preferred choice for all applications, some installations use the Cell/B.E. processor in a coupled hybrid environment, which has implications for the complexity of systems management, the programming model, and performance. In the directCell approach, we use the Cell/B.E. processor as a processing device connected to a host via a PCI Express link using direct memory access and memory-mapped I/O (input/output). The Cell/B.E. processor functions as a processor and is perceived by the host as a device while maintaining the native Cell/B.E. processor programming approach. We describe the problems with current practice that led us to the directCell approach. We explain the challenges in programming, execution control, and operation of the accelerator that we faced during the design and implementation of a prototype and present solutions to overcome them. We also provide an outlook on where the directCell approach promises to better solve customer problems.
Motivation and overview Multicore technology is a predominant trend in the information technology (IT) industry: many-core and heterogeneous multicore systems are becoming common. This is true not only for the Cell Broadband Engine*** (Cell/B.E.) processor, which initiated this trend, but also for concepts such as NVIDIA CUDA** [1] and Intel Larrabee [2]. The Cell/B.E. processor and its successor, the IBM PowerXCell* 8i processor, are known for their superior computing performance and power efficiency. They provide more flexibility and easier programmability compared with graphics processing devices or field-programmable gate arrays, alleviating the challenge of adapting software to accelerator technology. Yet, it is still difficult in some scenarios to fully leverage these performance capabilities with Cell/B.E. processor chips
alone. The main inhibitors are lack of applications for the IBM Power Architecture* and the performance characteristics of the IBM PowerPC* core implementation in the Cell/B.E. processor. For mixed workloads in environments for which the Cell/B.E. processor is not suitable, such as Microsoft Windows**, various hybrid topologies consisting of Cell/B.E. processors and non-Cell/B.E. processor-based components can be appropriate, depending on the workload. Adequate system structures such as the integration of accelerators into other architectures have to be employed to reduce management and software overhead. The most significant factors to be considered in the design of Cell/B.E. processor-based system complexes are the memory and communication architectures. The following sections describe the properties of the Cell/B.E. Architecture (CBEA) and their consequences for system design.
Memory hierarchy of the CBEA The CBEA [3] contains various types of processor cores on one chip: The IBM PowerPC processor element (PPE) is compatible with the POWER* instruction set architecture (ISA) but employs a lightweight implementation with limited branch prediction and in-order execution. The eight synergistic processor elements (SPEs) are independent cores with a different ISA optimized for computation-intensive workloads. The SPE implementation on the chip targets efficiency, using in-order execution and no dynamic branch prediction. Code running on the PPE can access main memory through the on-chip memory controller. In contrast, the SPE instruction set allows access only to a very fast but small memory area, the local store (LS), that is local to the core. The LSs on the SPEs are not coherent with any other entity on the chip or in the system, but the SPEs can request atomicity of memory operations with respect to concurrent access to the same memory location. For communication and dataflow, the SPEs are able to access main memory and I/O (input/output) memory using direct memory access (DMA). Loosely coupled systems Loosely coupled systems cannot access the memories of other nodes. Most clusters in the area of high-performance computing (HPC) follow this system design, including the first petaflops supercomputer [4]. If the application permits, these systems scale very well. Interconnectivity is usually attained by a high-speed network infrastructure such as InfiniBand**. The data distribution across the nodes is coarse grained to keep the network overhead for synchronization and data exchange low compared with the data processing tasks. The predominant programming model for loosely coupled systems is the Message Passing Interface (MPI) [5], potentially enriched by the hybrid IBM Accelerator Library Framework (ALF) [6] and hybrid Data Communication and Synchronization (DaCS) [7] libraries to enable the Cell/B.E. processor in mixed-architecture nodes. Another model based on remote procedure calls is IBM Dynamic Application Virtualization [8]. Tightly coupled systems Tightly coupled systems are usually smaller-scale systems consisting of few computing entities. In our context, a single host component, such as an x86 processor-based system, exploits one or several accelerator devices. Accelerators can range from field-programmable gate arrays (FPGAs) through graphics processing units (GPUs) to Cell/B.E. processor-based accelerator boards, each offering different performance and programmability attributes. Accelerators can be connected in a memory-coherent or noncoherent fashion. Interconnects for
memory-coherent complexes include Elastic Interface or Common Systems Interconnect. HyperTransport** can be used in memory-coherent setups, and PCI Express** (PCIe**) can be used in noncoherent memory setups. It is a common feature of tightly coupled systems that they are able to access the single main memory. Applications for tightly coupled systems usually run on one host part of the system (e.g., an x86 processor), while the other parts of the system (accelerators such as PowerXCell 8i processor-based entities or GPUs) perform computation-intensive tasks on data in main memory. The tightly coupled approach makes deployment easier because only one host system with accelerator add-ons has to be maintained rather than a cluster of nodes. Depending on the type of the tightly coupled accelerator, the approach can be applied well to non-HPC applications and allows for efficient and fine-grained dataflow between components, as latency and bandwidth are better than in loosely coupled systems. Programming models include OpenMP** [9], the ALF [6] and DaCS [7] libraries, CUDA, the Apple OpenCL** stack [10], and FPGA programming kits. Compared with FPGA and GPU approaches, Cell/B.E. processor accelerators, with their fast and large memories, offer the highest flexibility and generality while maintaining an easy-to-use programming model.
High-level system description This section describes the directCell architecture. It reflects on properties of acceleration cores and gives insight into the design principles of efficiently attaching the accelerator node. General-purpose processor compared with SPEs on the Cell/B.E. processor The remarkable performance of the Cell/B.E. processor is achieved mainly through the SPEs and their design principles. The lack of a supervisor mode, interrupts, and virtual memory addressing keeps the cores small, resulting in a high number of cores per chip. High computational performance is achieved through single-instruction, multiple-data (SIMD) capabilities. In directCell, the SPE cores are available for application code only, while the PPE is not used for user applications. This effectively makes the directCell accelerator an SPE multicore entity. Unlike limited acceleration concepts such as GPUs or FPGAs, the SPE cores are still able to execute general-purpose code, for example, that generated by an ordinary C compiler. In standalone Cell/B.E. processor-based systems, OS (operating system)-related tasks are handled on the PPE. The task of setting up SPEs and starting programs on them is controlled from outside, using a set of memory-mapped registers. All of these registers are
accessible through memory accesses, and no special instructions have to be used on the PowerPC core of the Cell/B.E. processor. Therefore, control tasks can be performed by any entity having memory access to the Cell/B.E. processor-side memory bus. Coupling as device and not as system In most cases, porting Windows applications to the Power Architecture is not a viable option. Instead, identifying computational hotspots and porting these to the Cell/B.E. processor-based platform offers a way to speed up the application. As the SPE cores are used for acceleration, the port has to target the SPE instruction set. Cell/B.E. processor-based systems use PCIe as a high-speed I/O link. PCIe is designed for point-to-point short-distance links. This communication model does not scale to many endpoints. However, it provides best-of-breed bandwidth and latency characteristics and supports memory-mapped I/O (MMIO) and DMA operations. Because the Cell/B.E. processor resources are accessible via MMIO, it is possible to integrate the processor directly into the address space of other systems through PCIe, following the tightly coupled accelerator concept. Communication overhead in this PCIe-based setup can be kept minimal. The directCell approach takes advantage of the low latencies of PCIe and makes the SPEs available to existing applications as an acceleration device instead of a separate system. To export a Cell/B.E. processor-based system as an acceleration device on the PCIe level, it must be configured as a PCIe endpoint device, an approach currently used by all hybrid setups involving the Cell/B.E. processor, including the first petaflops computer [4]. The directCell solution evolves from the endpoint mode and uses the Cell/B.E. processor as a device at the host system software level. We propose to relocate all control from the PPE to the host system processor except for the bare minimum needed for accelerator operation. For directCell, we further propose to shift all application code executed on the PPE in a standalone configuration over to the host processor. This includes scheduling of SPE threads and SPE memory management and also applies to the PPE application code. Thus, the host controls all program flow on the accelerator device. None of the Cell/B.E. processor resources are hidden in our model. More importantly, nothing on the accelerator device obstructs the host from controlling it. In contrast to a distributed hybrid system approach that relies on an accelerator-resident OS to control all of its memory, we consider an accelerator-resident OS superfluous. The host OS runs the main application and is, therefore, in charge of all of the hardware resources that are needed for its execution.
In distributed hybrid systems, the host OS needs to coordinate data-transfer operations with the accelerator OS. MPI [5], which was designed for distributed systems, features a sophisticated protocol to establish zero-copy remote DMA (RDMA) transfers that are vital to hybrid system performance. Without using an OS on the accelerator, the host can act independently, giving much more flexibility to the host system programmer and saving additional runtime overhead. In order to support this execution model from the host system software, the accelerator hardware should provide the following mechanisms:

- Endpoint functionality.
- MMIO control of processor cores, including the next instruction pointer, translation lookaside buffer (TLB), machine state register (MSR), and interrupt acknowledgment.
- Interrupt facilities.
- Optionally, PCIe coherency.

Providing these mechanisms in a processor design yields enormous rewards with regard to system configurations, especially in hybrid configurations. The directCell architecture applies to Cell/B.E. processor-based blades such as the IBM BladeCenter* QS22 server [11], with specific firmware code to operate in PCIe endpoint mode, and to the IBM PowerXCell accelerator board (PXCAB) offering [12].
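To make the MMIO-control requirement concrete, the following sketch shows how a host entity might start an SPE purely through memory-mapped registers. The CBEA register names (SPU_NPC, SPU_RunCntl, SPU_Status) are real; the offsets and bit values below are illustrative placeholders rather than the actual problem-state layout, and byte ordering between the little-endian host and the big-endian register file is ignored for brevity.

    /* Hedged sketch: starting an SPE purely through MMIO, as directCell
     * requires.  `spe_base` is the host-virtual address of one SPE's
     * problem-state area mapped through a PCIe BAR.  Offsets and bit
     * values are placeholders, not the real CBEA layout. */
    #include <stdint.h>

    #define SPU_NPC_OFFSET      0x0034   /* placeholder offset */
    #define SPU_RUNCNTL_OFFSET  0x001C   /* placeholder offset */
    #define SPU_STATUS_OFFSET   0x0024   /* placeholder offset */

    #define SPU_RUNCNTL_RUN     0x1      /* request run (illustrative) */
    #define SPU_STATUS_RUNNING  0x1      /* SPU running bit (illustrative) */

    static inline void mmio_write32(volatile uint8_t *base, uint32_t off, uint32_t val)
    {
        *(volatile uint32_t *)(base + off) = val;
    }

    static inline uint32_t mmio_read32(volatile uint8_t *base, uint32_t off)
    {
        return *(volatile uint32_t *)(base + off);
    }

    /* Start SPE execution at local-store address `entry` and wait for it
     * to stop. */
    static void spe_start_and_wait(volatile uint8_t *spe_base, uint32_t entry)
    {
        mmio_write32(spe_base, SPU_NPC_OFFSET, entry);     /* next program counter */
        mmio_write32(spe_base, SPU_RUNCNTL_OFFSET, SPU_RUNCNTL_RUN);
        while (mmio_read32(spe_base, SPU_STATUS_OFFSET) & SPU_STATUS_RUNNING)
            ;   /* in directCell the stop is signaled by an interrupt instead of polling */
    }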
Applications and programming Performance and flexibility are two essential and generally opposing criteria when evaluating computer architectures. On the performance side, specially designed circuits, digital signal processors, and FPGAs provide extremely high computing performance but are lacking generality and programmability. Regarding flexibility, general-purpose processors allow for a wide application spectrum but lack the performance of highly optimized systems. Cell/B.E. processors follow a tradeoff and provide high computing performance while still offering a good degree of flexibility and ease of programming. The tightly coupled hybrid approach proposed with directCell in this paper promises to capture the advantages of both extremes by combining Cell/B.E. processor-based accelerators for performance and x86 systems for wide applicability. Two aspects apply to programmability for these hybrid systems: first, the complexity to program the accelerator (i.e., the SPEs), and second, the integration of the tightly coupled accelerator attachment into the main application running on the host system (i.e., the model to exploit the SPEs from the x86 system).
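To illustrate the second aspect, the following minimal sketch shows the standard libspe2 calling sequence (introduced in the next section) that a host application uses to exploit an SPE; directCell keeps this sequence intact on the x86 host. The embedded program handle spe_kernel is a hypothetical name, and error handling is omitted for brevity.

    /* Minimal sketch of the standard libspe2 flow that directCell preserves
     * for the host application. */
    #include <libspe2.h>

    extern spe_program_handle_t spe_kernel;   /* embedded SPE ELF image (assumed name) */

    int run_on_one_spe(void *argp)
    {
        spe_context_ptr_t ctx;
        spe_stop_info_t stop_info;
        unsigned int entry = SPE_DEFAULT_ENTRY;

        ctx = spe_context_create(0, NULL);            /* allocate an SPE context */
        spe_program_load(ctx, &spe_kernel);           /* load the SPE ELF image  */
        spe_context_run(ctx, &entry, 0, argp, NULL,   /* run until the SPE stops */
                        &stop_info);
        spe_context_destroy(ctx);                     /* release the context     */
        return 0;
    }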
Standard programming framework of the Cell/B.E. processor The SPE Runtime Management Library (libspe2) [13], developed by IBM, Sony, and Toshiba, defines an OS-independent interface to run code on the SPEs. It provides basic functions to control dataflow and communication from the PPE. Cell/B.E. processor applications use this library to start the execution of code on the SPEs using a threading model while the PPE runs control and I/O code. The code on the PPE is usually responsible for shuffling data in and out of the system and for controlling execution on the SPEs. In most cases, code on the SPEs pulls data from main memory into the LSs so that the SPEs can work on this data. Synchronization between the SPEs and between the PPE and SPEs can take place through a mailbox mechanism. Programming tightly coupled accelerators The programming model and software framework for tightly coupled accelerators such as the Cell/B.E. processor can be structured in several ways. One way, which resembles the loosely coupled systems approach [14], is to add a communication layer between the two standalone systems, the host and the Cell/B.E. processor-based system. These additional layers increase the burden on software developers, who have to deal with an explicit communication model instead of a programming library. A different model, one that replaces parts of the PPE functionality with host processor code, is used for directCell. The PPE does not run application-specific code; instead, code on the host processor assumes the functions that were previously performed by the PPE. This allows applications to fully leverage the advantages of a tightly coupled accelerator model. In our implementation of directCell, the libspe2 library is ported to the x86 architecture, which allows control of SPE code execution and synchronization between an x86 system and SPE entities. This reduces the complexity of programming the directCell system to the level of complexity required to natively program the Cell/B.E. processor. Existing Cell/B.E. processor-aware code can continue to run unchanged or with limited changes in this environment, making a large body of existing SPE kernels available to x86 environments. Furthermore, existing Windows or Linux** applications can be extended by libspe2 calls to leverage SPE acceleration, opening up the x86 environment and other application environments to Cell/B.E. processor acceleration. Implications of the tightly coupled accelerator structure The directCell approach has several implications for programming and applications:
- Most of the application code runs on x86. Any code that previously ran on the PPE now has to be built for and run on the x86 platform.
- The libspe2 application programming interface (API) to control SPEs is moved to x86, i.e., libspe2 is an x86 library.
- SPEs can perform DMA transfers between the SPE LS and the x86 main memory as used by the main application. DMA to the northbound memory that is local to the Cell/B.E. processor accelerator system is still possible for SPEs.
- For transfers between the x86 memory and SPEs, the different endianness characteristics of the architectures need to be considered. (Endianness refers to whether bytes are stored in the order of most significant to least significant or the reverse.) SIMD shuffle-byte instructions can be used for efficient data conversion on the SPEs and may have to be added as a conversion layer, depending on the type of data exchanged between host and accelerator; see the sketch after this list.
- The PCIe interface between the Cell/B.E. processor and the x86 components does not permit atomic DMA operations, a restriction that applies to all accelerators attached through PCIe. Atomic DMA operations on the Cell/B.E. processor local memory are still possible. This constraint requires code changes to existing applications that use atomic DMA operations for synchronization.
- Memory local to the Cell/B.E. processors can be used by SPE code for buffering; because of its proximity to the Cell/B.E. processor chips, it has better performance characteristics than the remote x86 memory. However, adding this mid-layer of memory between the x86 main memory and the SPE LSs adds to application complexity. A good way of using the memory local to the Cell/B.E. processors might be to prefetch and cache large chunks of host system data in it. The complexity of prefetching and loading data on demand from the Cell/B.E. processor local memory can typically be encapsulated in a few functions.
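The conversion layer mentioned in the endianness item above can be as small as the following SPE-side sketch: one spu_shuffle with a byte-permute pattern swaps the byte order of every 32-bit word in a quadword. The helper names and the fixed 32-bit granularity are illustrative; real code would choose patterns that match the exchanged data structures.

    /* SPE-side sketch of the endianness conversion: byte-swap every 32-bit
     * word of a quadword with a single spu_shuffle, e.g. after DMAing a
     * little-endian control block from x86 host memory into the local store. */
    #include <spu_intrinsics.h>

    static inline vector unsigned int bswap32_qword(vector unsigned int v)
    {
        /* Shuffle pattern: reverse the byte order inside each 4-byte group. */
        const vector unsigned char swap32 = {
             3,  2,  1,  0,   7,  6,  5,  4,
            11, 10,  9,  8,  15, 14, 13, 12
        };
        return spu_shuffle(v, v, swap32);
    }

    /* Convert `n` quadwords in place. */
    static void bswap32_buffer(vector unsigned int *buf, unsigned int n)
    {
        for (unsigned int i = 0; i < n; i++)
            buf[i] = bswap32_qword(buf[i]);
    }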
Structure of directCell The directCell model can be applied to Cell/B.E. processor-based systems only by means of software. Several software components are needed on the host and accelerator system side. In the following sections, we introduce the components and how they interface with each other. Libspe2.dll The basic interface for controlling SPE applications in standalone Cell/B.E. processor-based systems is libspe2
[13]. To conform to this widely adopted programming interface, and to be able to reuse a large amount of existing code, directCell provides a Windows port via a Windows DLL (dynamic-link library) that provides such functions as spe_context_create() and spe_context_run(). The libspe2 routines create a directCell device instance and send I/O control commands processed by the device driver. The libspe2 used in standalone system configurations interfaces with the SPE hardware resources through the synergistic processor unit (SPU) file system (spufs) [15] in the Linux kernel. Spufs offers two system calls and a virtual file system for creating SPE threads that the Linux kernel schedules to run on the SPE. In the Windows implementation, the spe_context_create() libspe2 call allocates a user-space representation of the SPE program and also a kernel-space device instance that is accessed by a device handle in user space. The function spe_context_run() invokes a corresponding I/O control (ioctl) that passes the device handle and a pointer to the user-space buffer as parameters. SPE binaries are embedded in a given Windows application as executable and linking format (ELF) files. In our Windows implementation, the ELF files are integrated into accelerated applications as ELF resources. The Windows API provides standard resource-handling functions that are used to locate and extract a previously embedded file at runtime. The libspe2 port also implements ELF loading functionality for processing the embedded SPU program to pass it to the spe_context_run() call. Device driver Our implementation of directCell features a sophisticated device driver that provides a software representation of all of the Cell/B.E. processor facilities on a host system. Accelerated application code is loaded, executed, and unloaded by operating the device driver interfaces. From a conceptual level, this closely corresponds to the well-established spufs. The idea of spanning the spufs interface across a high-speed interconnect has been pursued by Heinig et al. as rspufs [16]. Regarding the user-space representation, spufs is based on the UNIX** file concept by representing SPE hardware through a virtual file system. In the case of Windows, the representation of hardware in user space is dominated by a different model. In Windows, files, volumes, and devices are represented by file objects, but the concept of imposing a directory structure for device control is not widely supported. Windows allows the creation of device interface classes that are associated with a globally unique identifier (GUID) and a set of system services that it implements. A device interface class is instantiated by creating a file handle with the GUID. This handle is used to invoke
system services. The granularity of this scheme centers around the idea of a single device node as found in a Linux /dev tree. The fine-grain control gained by subdividing the device details into a file system hierarchy such as spufs contrasts with a large variety of system services, each tailored for a specific purpose on a distinct device class. Device drivers that support I/O operations implement a relatively small set of I/O functions via an I/O queue. The driver I/O queue registers callbacks, most importantly read, write, I/O-control, stop, and resume. The actual communication between a user-space application and a device takes place through an appropriate device service that invokes the Windows I/O manager, which forms an I/O request packet, puts it into the I/O queue of the correct driver, and notifies the driver via the previously registered callbacks. The device driver can satisfy some of the function requirements of higher-level libraries such as libspe2 by using existing system services. The call spe_context_create() generates an SPE context by invoking the CreateFile() system service using the Windows Cell/B.E. processor device handle. This returns a handle to a kernel representation of an SPE context that is similar to the representation returned by an spu_create() system call in spufs. The spu_run() functionality is implemented as an ioctl that operates on the previously created SPE handle. The ioctl is nonblocking, although upper-level software should wait for completion via designated wait routines for asynchronous I/O. The ioctl initiates a direct DMA transfer of the SPE context from host user-space memory to the LS of the SPU. All data transfers in directCell follow this zero-copy paradigm. Regarding the mechanisms to access the Cell/B.E. processor hardware facilities, the device driver must perform additional steps compared with spufs. These additional steps are due mainly to the conversion between the Cell/B.E. processor address space and the host system address space, as described in the following sections. Cell/B.E. processor firmware layer The standard product firmware of a Cell/B.E. processor-based system is supplemented by a minimal software layer that takes care of additional initialization tasks after the endpoint-aware standard firmware has finished the basic boot. Most importantly, this layer provides accelerator runtime management and control. The initialization tasks comprise the setup of the address translation mechanism of the SPEs, the reporting of available SPEs to the device driver before accelerator usage, and the setup of the I/O memory management unit (I/O MMU). The runtime management and control functionality centers around the handling of SPE events such as page faults and completion signals. This handling is
vital for the execution of SPE binaries. Additional runtime debug and monitoring facilities for the SPEs complete the firmware layer functionality. The event handler and the debug monitor each run on separate PPE hardware threads. In our current implementation, the debug monitor provides a simple facility for probing memory locations, memory-mapped resources, and interfaces via the serial console. In future implementations, the debug console I/O could be redirected via PCIe to a console application that runs on the host.

Accelerator operation While directCell maintains compatibility with the common programming models for the CBEA [17], it introduces a new runtime model. Figure 1 shows, and the following paragraphs describe in detail, how directCell delivers efficient workload acceleration and makes the additional overhead imposed by a Cell/B.E. processor-resident OS superfluous. The numbers in parentheses below refer to the numbered steps in the figure.

[Figure 1: In directCell, an SPE program is started on the host processor by directly programming an SPE via the PCIe link between the host processor southbridge and that of the Cell/B.E. processor. The figure shows the most important directCell software components and their lifecycle in the directCell hybrid system configuration.]
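Seen from host user space, steps (1) and (2) of the flow described below might look roughly like the following hedged sketch. CreateFile() and DeviceIoControl() are the standard Windows services the libspe2 port builds on; the device path, the ioctl code, and the request layout are hypothetical.

    /* Hedged host-side sketch: open the directCell device and ask the
     * driver to run a previously prepared SPE context.  Only the Win32
     * calls are real; everything device-specific is a placeholder. */
    #include <windows.h>
    #include <winioctl.h>

    #define IOCTL_DIRECTCELL_RUN_CONTEXT \
        CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)

    struct run_request {            /* hypothetical request layout */
        void *context_image;        /* user-space SPE context (LS image, registers) */
        unsigned int entry;         /* LS entry point */
    };

    static int directcell_run(const wchar_t *dev_path, struct run_request *req)
    {
        DWORD returned = 0;
        HANDLE dev = CreateFileW(dev_path, GENERIC_READ | GENERIC_WRITE,
                                 0, NULL, OPEN_EXISTING, 0, NULL);
        if (dev == INVALID_HANDLE_VALUE)
            return -1;

        /* The driver DMAs the context to a free SPE and starts it; completion
         * is reported asynchronously in the real implementation. */
        BOOL ok = DeviceIoControl(dev, IOCTL_DIRECTCELL_RUN_CONTEXT,
                                  req, sizeof(*req), NULL, 0, &returned, NULL);
        CloseHandle(dev);
        return ok ? 0 : -1;
    }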
Consider a user-space application on x86 that needs to run an SPE program via a PCIe-attached directCell device. Through our libspe2 implementation, it interfaces with the device driver by using the standard Windows API functions to allocate an SPE context (representation of LS, general-purpose registers, and problem or privileged state). The SPE binary is retrieved from the executable main application and loaded into the SPE context. The device driver implements a standard Windows function to support opening a device representation of the directCell system and running the previously created SPE context (1). The spe_context_run() method is implemented as an ioctl. Upon invocation, the driver picks a free physical SPE on the remote system and orchestrates the transfer of the SPE context to that system via MMIO operations (2). For the actual transfer of data, it is crucial that only DMA is used to maximize throughput and minimize latency because the data has to traverse the PCIe bus. MMIO is used only as a control channel of the host system into the Cell/B.E. processor-based system. This concept of loading comprises a small bootstrap loader as described
in Reference [18] to achieve this. The loader is transferred to the SPE in a single 16-KB DMA operation that also carries as parameters the origin address of the target SPE binary and its size. The loader transfers the contents of the SPE register array into an internal data structure and initializes the registers before filling the LS with the binary SPE code via DMA (3). The loader then stops and signals completion to the PPE via an external interrupt. This interrupt is forwarded to the device driver by the PPE (4), which runs a tiny firmware layer as part of the only software prevailing on the PPE (as introduced in the previous section). While the SPE binary code executes, it generates mapping faults (4) whenever a reference to an unmapped host-resident address is encountered. This applies, for instance, to parameter data that may reside in a control block on the host. The corresponding memory mapping is established by the device driver via MMIO (5), and the host-resident data is then transferred via DMA (6). When the SPE stops execution, the host is signaled once again via interrupt and initiates a context switch and unload procedure, as described in Reference [9]. Similar to the previous steps, this phase involves unloader binary code that facilitates a swift transfer of the context data to the host memory via the SPE DMA engines. Memory addressing model One of the concepts of the directCell approach is to define a global address mapping that allows the SPEs to access host memory and the host to access the MMIO-mapped registers of the Cell/B.E. processor, the LS, and local main memory. The mechanism that achieves this is based on using the memory management unit (MMU) of the host, the MMU and I/O MMU of the Cell/B.E. processor, and the PCIe implementation features, which we show in the rest of this section. The BAR setup section introduces how MMIO requests from the host pass the PCIe bus, and the section on inbound mappings shows how those are translated to Cell/B.E. internal addresses. These internal addresses are further translated by the I/O MMU to PowerPC real addresses (RAs). The I/O MMU defines two translation ranges:
1. Requests directed to the SPE MMIO space are furnished with a predefined displacement by the southbridge (I/O chipset). The I/O MMU translates these displaced addresses to the base address of the SPE MMIO spaces within the Cell/B.E. processor addressing domain and preserves offsets to navigate within this range. 2. Requests directed to Cell/B.E. processor memory are translated 1:1 by the I/O MMU, also preserving the offset for internal navigation.
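From the SPE program's point of view, these mappings make a read from host memory an ordinary MFC DMA to an effective address, as in the following sketch using the standard spu_mfcio.h interface; the address, buffer size, and tag are illustrative.

    /* SPE-side sketch: pull a buffer into the local store.  Whether the
     * effective address resolves to accelerator DDR2 or to x86 host memory
     * across PCIe is invisible to this code once the mappings described
     * above are in place. */
    #include <spu_mfcio.h>
    #include <stdint.h>

    #define DMA_TAG 3

    static char ls_buf[16384] __attribute__((aligned(128)));

    static void fetch_from_host(uint64_t host_ea, unsigned int size)
    {
        /* size must be a multiple of 16 bytes and at most 16 KB per request */
        mfc_get(ls_buf, host_ea, size, DMA_TAG, 0, 0);

        /* wait for this tag group to complete */
        mfc_write_tag_mask(1 << DMA_TAG);
        mfc_read_tag_status_all();
    }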
Figure 2 shows a simplified model for MMIO communication. The MMIO transfer is triggered by writing to virtual addresses (VAs) on the host that are translated by the host MMU to RAs that, in turn, map to the host PCI bus. The MMIO requests are delivered as PCI memory cycles to the accelerator PCIe core where they are further redirected to the interface between the southbridge and Cell/B.E. processor interface, as described below in the section on inbound mappings. At that point, the I/O MMU maps the requests to the corresponding units on the Cell/B.E. processor. The SPEs transfer data via DMA to or from the accelerator system upon remote initiation by the host via MMIO. The target of the data-transfer operations may be local Cell/B.E. processor addresses (LS or local main memory) or remote host memory. The host addresses must be mapped to the I/O range of the Cell/B.E. processor; hence, accesses to these regions do not satisfy the coherency requirements that apply for local main memory. The device driver is responsible for locking the host memory regions to prevent swapping. For future enhancements of the directCell model, we encourage the introduction of a mechanism on the host system to explicitly request local main memory buffering to improve performance and coherency considerations. Further enhancements could provide a software-managed mechanism to prefetch and cache host-resident application buffers into local main memory [19]. The memory mappings are established through direct updates of the SPE TLBs whenever the SPE encounters an unmapped address. This handling is initiated by a page fault, as described in the section on interrupt management. In our current implementation of directCell, the host device driver performs this management via MMIO. The current implementation features a page size of 1 MB; hence, a single SPE can work on 256 MB of application data buffered in local memory or host-resident memory at once without the need of a page table. This area can be increased if larger common page sizes are available on the host and the accelerator. To introduce hardware management of the TLBs to directCell, the firmware layer would need to be augmented to manage a page table in the local main memory. The flow of DMA operations is shown in Figure 2. The SPE views a DMA operation as a transfer from one VA to another. When reading data from the host system (blue arrows), the target address translates to an LS address, and the source address translates to an RA in the I/O domain, which points to the outbound memory region of the accelerator PCIe core. The mapping of PCIe bus addresses to host RAs is likely to involve another inbound translation step; however, this is transparent to
the accelerator system and entirely dependent on the host OS, which directCell does not interfere with.

[Figure 2: The various levels of address translation and indirection between the host and the accelerator in directCell: (a) a control channel implemented by means of MMIO; (b) a data-transfer channel implemented via DMA. (VA: virtual address; RA: real address.)]

PCIe PCIe was developed as the next-generation bus system that replaces the older PCI and PCI-X** standards. In contrast to PCI and PCI-X, PCIe uses a serial communication protocol. The electrical and bit layer is similar to that of the InfiniBand protocol, which also uses 8b10b coding [20]. The entire PCIe protocol stack is handled in hardware. The programmer does not need to build any queues or messages to transfer actual data. Everything is done directly with the processor load and store instructions, which move data from and to the bus. It is possible to map a memory region from the remote side into the local memory map. This mapping is software compatible with the older PCI and PCI-X standards. Because of this, a mapping can be created so that the local processor can access remote memory directly via load or store operations, and effectively no software libraries are needed to use this high-speed communication channel.
PCIe is used as a point-to-point interconnection between devices and a system. This allows centralizing the traffic routing and resource management and enables quality of service. Thus, a PCIe switch can prioritize packets. For a multimedia system, this could result in fewer dropped frames and lower audio latency. PCIe itself is based on serial links. A link is a dual-simplex connection using two pairs of wires. One pair is used for transmitting data and one is used for receiving data. The two pairs are known as a PCIe lane. A PCIe link consists of multiple lanes, for example, x1, x2, x4, x8, x16, or x32. PCIe is a good communication protocol for high-bandwidth and low-latency devices. For example, a PCIe 2.0 lane can transmit 500 MB/s per direction, which amounts to 16 GB/s per direction for an x32 link configuration. A very compelling possibility of the PCIe external cabling specification is the option to interconnect an external accelerator module via a cable to a corresponding host system. PCIe endpoint configuration PCIe defines point-to-point links between an endpoint and a root-complex device. In a standalone server
configuration such as the BladeCenter QS22 server, the southbridge of the Cell/B.E. processor-based system is configured as a root complex to support add-in daughter cards. For directCell, the host system southbridge configuration is unchanged and runs as a root complex, whereas the Cell/B.E. processor-based system southbridge is configured to run in endpoint mode. The configuration is done by the service processors of the two systems prior to booting. From the host system, the Cell/B.E. processor-based system is seen as a PCIe device with its own device identification, which allows the BIOS (basic I/O system) and OS to configure the MMIO ranges and load the appropriate device drivers. The scalability of this model of endpoint coupling is limited only by how much I/O memory the BIOS can map.
PCIe inbound mapping In order to remotely access the Cell/B.E. processor-based system, the device driver and the firmware layer configure the Cell/B.E. processor-based system southbridge to correctly route inbound PCIe requests to the Cell/B.E. processor and the local main memory. This inbound mapping involves PCI base address registers (BARs) on the bus level and PCI inbound mappings (PIMs) on the southbridge level. Both concepts are introduced in the following sections.
[Figure 3: Overview of the inbound addressing mechanisms of the Cell/B.E. southbridge. Inbound PCIe requests are mapped to distinct ranges within the southbridge internal address space, which in turn are mapped to specific units of the Cell/B.E. processor. The address mappings can be changed at runtime.]
BAR setup The PCI Standard [21] defines the BARs for device configuration by firmware or software as a means to assign PCI bus address ranges to RAs of the host system. The BARs of the PCIe core in the southbridge are configured by PCIe config cycles of the host system BIOS during boot. After the configuration, it is possible to access internal resources of the accelerator system through simple accesses to host RAs. This makes it possible to remotely control any Cell/B.E. processor-related resources such as the MMIO register space, SPU resources, and all available local main memory. However, to achieve full control of the accelerator system, an additional step is required, which is introduced in the section on inbound mappings. The directCell implementation uses three BAR configurations to create a memory map of the device resources into the address space of the host system:
1. BARs 0 and 1 forming one 64-bit BAR provide a 128-MB region of prefetchable memory used to access all of the accelerator main memory (Figure 3). 2. A 32-bit BAR 2 is mapped to the internal resources of the end-point configuration itself, making it possible to change mappings from the host. This BAR is 32 KB in size. 3. BARs 5 and 6 forming one 64-bit BAR are used to map the LS, the privileged state, and problem state registers of all SPEs of the Cell/B.E. processor. Figure 3 is an overview of this concept. We can see how the driver can conveniently target different BARs for directly reaching the corresponding hardware resources.
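For reference, the three windows can be summarized as the following constants a host driver might keep; the names are ours, while the sizes and purposes are taken from the list above.

    /* Summary of the three BAR windows described above, expressed as
     * constants a host driver might keep.  Names are ours; the actual
     * register-level layout lives in the endpoint's configuration space. */
    enum directcell_bar {
        BAR_ACCEL_MEMORY  = 0,  /* BARs 0+1: 64-bit, 128-MB prefetchable window
                                 * into accelerator main memory (movable; see
                                 * the PIM discussion below) */
        BAR_ENDPOINT_REGS = 2,  /* BAR 2: 32-bit, 32 KB, endpoint-internal
                                 * resources, used to change mappings from
                                 * the host */
        BAR_SPE_RESOURCES = 5,  /* BARs 5+6: 64-bit, local stores plus
                                 * problem-state and privileged-state
                                 * registers of all SPEs */
    };

    #define ACCEL_MEMORY_WINDOW_BYTES  (128u * 1024 * 1024)
    #define ENDPOINT_REGS_BYTES        (32u * 1024)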
Inbound mappings The Cell/B.E. processor companion chip (the southbridge) introduces PIM registers to provide internal address translation capabilities. PIMs apply a displacement within the southbridge internal bus address space, as shown in Figure 3. For all BARs, corresponding PIM registers exist to map addresses on the PCIe bus to addresses on the southbridge internal bus [22], which is connected to the southbridge-to-Cell/B.E. processor interface. The PIM register translation for BAR 2 is set up by the accelerator firmware to refer to the MMIO-accessible configuration area of the PCIe core. The PIMs for BARs 0 and 1 and for BARs 5 and 6 are adjusted by the host device driver during runtime. In order to do so, the device driver uses the inbound mapping for BAR 2. The resources of the Cell/B.E. processor are made accessible within the PPE address space by the I/O MMU setup (described previously and depicted in the lower part of Figure 3). The PIM register translation for BARs 0 and 1 is changed during the runtime of host applications according to the needs of the driver. The driver provides a movable window across the complete main memory of the accelerator, though at a given point in time, only 128 MB of contiguous accelerator memory space is available (also shown in Figure 3).
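A hedged sketch of how the driver might move this window: align the requested accelerator address down to the 128-MB window size and write the new base into the PIM register for BARs 0 and 1 through the BAR 2 mapping. The register offset is a placeholder, not the actual southbridge layout, and byte ordering of the register write is ignored for brevity.

    /* Hedged sketch of "moving" the 128-MB accelerator-memory window.
     * `bar2_regs` is the host mapping of BAR 2 (the endpoint's internal
     * registers); PIM_BAR01_BASE_OFFSET is a placeholder. */
    #include <stdint.h>

    #define WINDOW_SIZE             (128ull * 1024 * 1024)
    #define PIM_BAR01_BASE_OFFSET   0x0100   /* placeholder register offset */

    /* Re-target the BAR 0/1 window so that `accel_phys` (an address in
     * accelerator main memory) becomes visible to the host.  Returns the
     * offset of `accel_phys` within the freshly positioned window. */
    static uint64_t move_accel_window(volatile uint8_t *bar2_regs, uint64_t accel_phys)
    {
        uint64_t window_base = accel_phys & ~(WINDOW_SIZE - 1);   /* 128-MB aligned */

        *(volatile uint64_t *)(bar2_regs + PIM_BAR01_BASE_OFFSET) = window_base;

        return accel_phys - window_base;   /* offset to add to the BAR 0/1 mapping */
    }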
Interrupt management The asynchronous execution of SPE programs requires an interrupt-handling infrastructure for directCell. SPE programs use interrupts to signal such events as mapping faults, errors, or program completion. Several layers are involved in handling SPE interrupts. An external interrupt is routed to the interrupt controller on the PPE, where it is handled by an interrupt handler in the PPE-resident firmware layer. This handler determines which SPU triggered the interrupt and directly triggers an interrupt to the PCIe core on the southbridge, which is in turn handled by the host device driver. Three types of SPE interrupts are handled by directCell:

1. SPU external interrupt: page-fault exception handling, DMA error, program completion.
2. Internal mailbox synchronization interrupt.
3. Southbridge DMA engine interrupt.

The inclusion of the PPE in the interrupt-handling scheme is necessary because the provided hardware does not allow the direct routing of SPE interrupts to the PCIe bus. We propose to include the ability to freely reroute interrupts to different targets in future processors and I/O devices.
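The forwarding step in the PPE-resident firmware layer might look roughly like the following sketch. The low-level accessors are hypothetical stand-ins; the real register names and status decoding come from the CBEA and southbridge documentation.

    /* Hedged sketch of the PPE firmware forwarding loop: find the SPE that
     * raised the external interrupt and turn it into an interrupt toward
     * the host by poking the southbridge PCIe core. */
    #include <stdint.h>

    #define NUM_SPES 8

    /* hypothetical low-level accessors provided elsewhere in the firmware */
    extern uint64_t read_spe_int_status(int spe);        /* pending cause bits    */
    extern void     ack_spe_interrupt(int spe, uint64_t cause);
    extern void     raise_host_interrupt(uint32_t msg);  /* doorbell on PCIe core */

    void handle_external_interrupt(void)
    {
        for (int spe = 0; spe < NUM_SPES; spe++) {
            uint64_t cause = read_spe_int_status(spe);
            if (cause == 0)
                continue;

            ack_spe_interrupt(spe, cause);               /* acknowledge locally */

            /* Encode SPE number and cause so the host device driver can run
             * the matching page-fault / completion / mailbox handler. */
            raise_host_interrupt(((uint32_t)spe << 16) | (uint32_t)(cause & 0xffff));
        }
    }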
Table 1: Throughput from SPE local store to x86-based host memory.

    Configuration                              Throughput (MB/s)
    QS22 at 3.2 GHz, SMP, PCIe x4, reads                  719.22
    QS22 at 3.2 GHz, SMP, PCIe x4, writes                 859.44
    PXCAB at 2.8 GHz, PCIe x16, reads                   2,333.95
    PXCAB at 2.8 GHz, PCIe x16, writes                  2,658.60
Performance evaluation In this section, first results of the directCell implementation are presented. The results are sectioned into a discussion on general throughput and latency measurements for specific directCell hardware configurations and a discussion of an exemplary application that we apply to our implementation of directCell. The latter section elaborates on the effort of porting the application to directCell as well as application performance data. General measurements These tests aim to determine the throughput and latency characteristics of the host and accelerator interconnect in two distinct directCell hardware configurations:
1. A BladeCenter server-based configuration: an IBM QS22 blade with an IBM HS21 blade. The interconnect is PCIe Gen1, x4.
2. A PowerXCell processor-based accelerator board (PXCAB): a PCIe add-in card connected to a standard x86-based advanced technology extended (ATX) personal computer. The interconnect is PCIe Gen1, x16.

All measurements are performed by a small firmware kernel running on the PPE of the Cell/B.E. processor-based system. The kernel runs on the PowerXCell PPE in real addressing mode, programs the SPE memory flow controller DMA facilities via MMIO, and transfers LS contents to the host memory and back. Because the MMIO operations are significantly slower than programming the DMA directly from the SPE, the methodology is not ideal; however, each test determines the additional latency of all involved MMIO operations and subtracts it from the overall duration. Table 1 lists the throughput from the SPE LS to x86-based host memory. The values shown were measured with two SPEs submitting requests; however, a single SPE will saturate the PCIe bus in both system configurations. With regard to latency, we conducted several measurements to determine the overhead of the I/O path. We regard the hop from the PowerXCell processor to the
southbridge separately from the overall access to the host memory. This is possible by providing a PowerXCell processor address that routes the request to the southbridge in such a way that it is immediately routed back to the PowerXCell processor by the southbridge. One of these loop-back tests refers back to the LS; a second accesses the Cell/B.E. processor-resident DDR2 (double-data-rate 2) memory. The results for both tests are shown in Figure 4, along with results for end-to-end data transfers from the LS to the x86 host memory.

[Figure 4: Latency of 16-KB SPE transfers to various targets, using the processor timebase: southbridge loop back to local memory, southbridge loop back to the SPE, and southbridge PCIe to host memory, for PXCAB 2.8-GHz and QS22 3.2-GHz SMP reads and writes. Latency in nanoseconds.]

[Figure 5: Overall application performance in simulations per second (millions) for standalone QS22 and for directCell QS22 and PXCAB configurations, with 1 SPE and with 8 SPEs (average).]

Application porting As an exemplary use case of directCell, we have ported a Cell/B.E. processor-based application from the financial services domain for European option trading. The underlying software is a straightforward implementation of a Monte Carlo pricing simulation. A derivative of this application has been part of an IBM showcase at the Securities Industry and Financial Markets Association Technology Management Conference and Exhibit [23]. The computing-intensive tasks of the application lie in the acquisition of random integer numbers and the application of Moro's inversion to them [24]. The standalone Linux-based version of the application transfers and starts one or more SPE threads that in turn read a control block from main memory before starting the simulation. The control block contains such parameters as the number of simulations that should be performed, the simulated span of time, and the space to return result data. After
completion, the control block is transferred back to main memory. The development of the application port involved porting the PPE code to the x86 architecture and Windows as well as providing for endianness conversion of the application control block in the SPE program. Of the PPE code, 16% of all lines are reused verbatim. The remaining code underwent mostly syntax changes to conform to the data types and library function signatures of Windows. The control flow of the application was not changed; all lines of SPE code are reused verbatim. Some additional 85 lines of code were needed to perform endianness conversion of the control block data. The conversion code makes use of the spu_shuffle instruction with a set of predefined bit-swap patterns. With adequate programming language support, such a code block can be easily generated by a compiler, even for complex data structures. The application port was performed on a Microsoft Windows Server** 2008-based implementation of directCell. Application results We compare the application performance on a standalone native Cell/B.E. processor running Linux on a BladeCenter QS22 server as well as a directCell configuration of the same system with a BladeCenter HS21 server and a directCell PXCAB configuration connected to an AMD64. The most important application parameters are set to 10 timesteps and 1 million simulations per run per SPE. Figure 5 shows that the number of simulations per second for the QS22-based directCell configuration is
only marginally lower than the number achieved in the nonhybrid standalone system. The overall performance of the PXCAB-based directCell configuration scales linearly with the lower processor frequency of PXCAB compared with QS22. We also evaluated the duration of distinct libspe2 function calls. Under Windows Server 2008, these measurements were conducted by using the QueryPerformanceCounter function. Under PowerPC Linux, the gettimeofday function was used to acquire the durations. The results shown for host-resident function calls in Figure 6 reveal that the host processors used in the directCell model perform certain tasks faster than the PPE of the PowerXCell processor. Also, the frequency of the function calls has a significant impact on their average duration.

[Figure 6: Duration of host-resident libspe2 functions (spe_context_create, spe_program_load, spe_context_destroy) in microseconds, for standalone QS22 and for directCell QS22 and PXCAB configurations, with 1 SPE and with 8 SPEs (average).]

Finally, we measured the duration of handling SPE page and segment faults in directCell. For this, we used a PCIe analyzer that triggers first upon interrupt assertion on the bus and triggers again upon the bus cycle that updates the TLB of the SPEs with a new memory mapping. This method omits the time that the interrupt and the resulting MMIO cycle need to propagate through the Cell/B.E. processor-based system, but the orders of magnitude of the fault handling observed on the bus and of the host system handling allow us to neglect this, as shown in Table 2.

Table 2: SPE memory fault measurements.

    Interrupt type        Duration (μs)
    SPE segment fault            15.984
    SPE page fault               20.56

Conclusion and outlook We have presented directCell, an approach to attach a Cell/B.E. processor-based accelerator to a Windows-
based x86 host system using PCIe connectivity. This configuration taps today’s software environment while the native Cell/B.E. processor programming model is maintained, leveraging existing code for the CBEA. The tight coupling of the accelerator and its low communication latency allows for a broad range of applications that go beyond HPC and streaming workload while offering an integrated and manageable system setup. The viability of this approach has been demonstrated in a prototype. The PPE portion of the Cell/B.E. processor runs a slim firmware layer implementing initialization, memory management, and interrupt handling code, granting the x86 host system full access to the SPEs. Further research should be undertaken to optimize memory management and the work split between host control software and the Cell/B.E. processor PPE firmware components. Exploitation of the memory local to the Cell/B.E. processor-based accelerator can improve application performance transparently. Efforts to evaluate scalability across several accelerator engines as well as building an environment for efficient development can lead to broader use of directCell. *Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both. **Trademark, service mark, or registered trademark of NVIDIA Corporation, Microsoft Corporation, InfiniBand Trade Association, HyperTransport Technology Consortium, PCI Special Interest Group, OpenMP Architecture Review Board Corporation, Apple, Inc., Linus Torvalds, or The Open Group in the United States, other countries, or both. ***Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc., in the United States, other countries, or both and is used under license therefrom.
References 1. CUDA Zone, NVIDIA Corporation; see http:// www.nvidia.com/object/cuda_home.html#. 2. L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, et al., ‘‘Larrabee: A Many-Core x86 Architecture for Visual Computing,’’ ACM Trans. Graph. 27, No. 3, 1–15 (2008). 3. IBM Corporation, IBM Cell Broadband Engine Technology; see http://www-03.ibm.com/technology/cell/index.html. 4. IBM Corporation (June 9, 2008).Roadrunner Smashes the Petaflop Barrier. Press release; see http://www-03.ibm.com/ press/us/en/pressrelease/24405.wss.
5. ‘‘MPI: A Message-Passing Interface Standard Version 2.1,’’ Message Passing Interface Forum, 2008; see http://www. mpi-forum.org/docs/mpi21-report.pdf. 6. IBM Corporation, Accelerated Library Framework Programmer’s Guide and API Reference; see http:// publib.boulder.ibm.com/infocenter/systems/scope/syssw/topic/ eiccn/alf/ALF_Prog_Guide_API_v3.1.pdf. 7. IBM Corporation, Data Communication and Synchronization Library Programmer’s Guide and API Reference; see http:// publib.boulder.ibm.com/infocenter/systems/scope/syssw/topic/ eicck/dacs/DaCS_Prog_Guide_API_v3.1.pdf. 8. IBM Corporation, IBM Dynamic Application Virtualization; see http://www.alphaworks.ibm.com/tech/dav. 9. OpenMP Architecture Review Board, OpenMP Application Programming Interface Version 3.0, May 2008; see http:// www.openmp.org/mp-documents/spec30.pdf. 10. N. Trevett, ‘‘OpenCL Heterogeneous Parallel Programming,’’ Khronos Group; see http://www.khronos.org/developers/ library/2008_siggraph_bof_opengl/OpenCL%20and% 20OpenGL%20SIGGRAPH%20BOF%20Aug08.pdf. 11. IBM Corporation, IBM BladeCenter QS22; see ftp:// ftp.software.ibm.com/common/ssi/pm/sp/n/bld03019usen/ BLD03019USEN.PDF/. 12. IBM Corporation, Cell/B.E. Technology-Based Systems; see http://www-03.ibm.com/technology/cell/systems.html. 13. IBM Corporation, SPE Runtime Management Library Version 2.2; see http://www-01.ibm.com/chips/techlib/ techlib.nsf/techdocs/1DFEF31B3211112587257242007883F3/ $file/SPE_Runtime_Management_API_v2.2.pdf. 14. K. Koch, ‘‘Roadrunner Platform Overview,’’ Los Alamos National Laboratory; see http://www.lanl.gov/orgs/hpc/ roadrunner/pdfs/Koch%20-%20Roadrunner%20Overview/ RR%20Seminar%20-%20System%20Overview.pdf. 15. A. Bergmann, ‘‘Spufs: The Cell Synergistic Processing Unit as a Virtual File System’’; see http://www-128.ibm.com/ developerworks/power/library/pa-cell/. 16. A. Heinig, R. Oertel, J. Strunk, W. Rehm, and H. J. Schick, ‘‘Generalizing the SPUFS Concept–A Case Study Towards a Common Accelerator Interface’’; see http://private.ecit. qub.ac.uk/MSRC/Wednesday_Abstracts/Heinig_Chemnitz.pdf. 17. IBM Corporation, Software Development Kit for Multicore Acceleration Version 3.1, SDK Quick Start Guide; see http:// publib.boulder.ibm.com/infocenter/systems/scope/syssw/topic/ eicce/eiccesdkquickstart.pdf. 18. IBM Corporation, Cell Broadband Engine Programming Handbook; see http://www-01.ibm.com/chips/techlib/techlib.nsf/ techdocs/9F820A5FFA3ECE8C8725716A0062585F. 19. IBM Corporation, Cell Broadband Engine SDK Example Libraries: Example Library API Reference, Version 3.1; see http://publib.boulder.ibm.com/infocenter/systems/scope/syssw/ topic/eicce/SDK_Example_Library_API_v3.1.pdf. 20. PCI-SIG, PCI Express Base 2.0 Specification Revision 1.1, March 28, 2005; see http://www.pcisig.com/specifications/ pciexpress/base2/. 21. PCI-SIG, PCI Conventional Specification 3.0 & 2.3: An Evolution of the Conventional PCI Local Bus Specification, February 4, 2004; see http://www.pcisig.com/specifications/ conventional/. 22. IBM Corporation, CoreConnect Bus Architecture; see http:// www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/ 852569B20050FF7785256991004DB5D9/$file/crcon_pb.pdf. 23. J. Easton, I. Meents, O. Stephan, H. Zisgen, and S. Kato, ‘‘Porting Financial Markets Applications to the Cell Broadband Engine Architecture,’’ IBM Corporation; see http://www-03.ibm.com/industries/financialservices/doc/content/ bin/fss_applications_cell_broadband.pdf. 24. 
IBM Corporation, Moro’s Inversion Example; see http:// publib.boulder.ibm.com/infocenter/systems/scope/syssw/topic/ eiccr/mc/examples/moro_inversion.html.
Received September 22, 2008; accepted for publication January 14, 2009 Hartmut Penner IBM Systems and Technology Group, IBM Boeblingen Research and Development GmbH, Schoenaicher Strasse 220, 71032 Boeblingen, Germany (
[email protected]). Mr. Penner is a Senior Technical Staff Member responsible for the firmware architecture of Cell/B.E. processor-based systems. He holds an M.S. degree in computer science from the University of Kaiserslautern, Germany. During his career at IBM, he has been involved in many different fields, working on Linux, the GNU compiler collection, and various firmware stacks of IBM servers. Utz Bacher IBM Systems and Technology Group, IBM Boeblingen Research and Development GmbH, Schoenaicher Strasse 220, 71032 Boeblingen, Germany (
[email protected]). Mr. Bacher is an architect for hybrid systems. He earned his B.S. degree in information technology from the University of Cooperative Education in Stuttgart, Germany. He developed highspeed networking code for Linux on System z*. From 2004 to 2007, he led the Linux on Cell/B.E. processor kernel development groups in Germany, Australia, and Brazil. Since 2007, he has been responsible for designing the system software structure of future IBM machines.
Jan Kunigk IBM Systems and Technology Group, IBM Boeblingen Research and Development GmbH, Schoenaicher Strasse 220, 71032 Boeblingen, Germany (
[email protected]). Mr. Kunigk joined IBM in 2005 as a software engineer after receiving his B.S. degree in applied computer science from the University of Cooperative Education in Mannheim, Germany. Prior to his involvement in hybrid systems and the directCell prototype, he worked on firmware and memory initialization of various Cell/B.E. processor-based systems. Christian Rund IBM Systems and Technology Group, IBM Boeblingen Research and Development GmbH, Schoenaicher Strasse 220, 71032 Boeblingen, Germany (
[email protected]). Mr. Rund received his M.S. degree in computer science from the University of Stuttgart, Germany. He joined IBM in 2001 as a Research and Development Engineer for the IBM zSeries* Fibre Channel Protocol channel. He has recently been involved in firmware and software development for the Cell/B.E. processor and the directCell prototype hybrid system. Heiko Joerg Schick IBM Systems and Technology Group, IBM Boeblingen Research and Development GmbH, Schoenaicher Strasse 220, 71032 Boeblingen, Germany (
[email protected]). Mr. Schick earned his M.S. degree in communications and software engineering at the University of Applied Sciences in Albstadt-Sigmaringen and graduated in 2004. Currently, he is the firmware lead of the Quantum Chromodynamics Parallel Computing on the Cell Broadband Engine (QPACE) project, which is a supercomputer mission of IBM and European universities and research centers to do quantum chromodynamics parallel computing on the Cell Broadband Engine Architecture.