directCell: Hybrid systems with tightly coupled accelerators
H. Penner, U. Bacher, J. Kunigk, C. Rund, and H. J. Schick
The Cell Broadband Engine (Cell/B.E.) processor is a hybrid IBM PowerPC processor. In blade servers and PCI Express card systems, it has been used primarily in a server context, with Linux as the operating system. Because neither Linux as an operating system nor a PowerPC processor-based architecture is the preferred choice for all applications, some installations use the Cell/B.E. processor in a coupled hybrid environment, which has implications for the complexity of systems management, the programming model, and performance. In the directCell approach, we use the Cell/B.E. processor as a processing device connected to a host via a PCI Express link using direct memory access and memory-mapped I/O (input/output). The Cell/B.E. processor functions as a processor and is perceived by the host as a device while maintaining the native Cell/B.E. processor programming approach. We describe the problems with current practice that led us to the directCell approach. We explain the challenges in programming, execution control, and operation of the accelerator that we faced during the design and implementation of a prototype and present solutions to overcome them. We also provide an outlook on where the directCell approach promises to better solve customer problems.
Motivation and overview Multicore technology is a predominant trend in the information technology (IT) industry: many-core and heterogeneous multicore systems are becoming common. This is true not only for the Cell Broadband Engine*** (Cell/B.E.) processor, which initiated this trend, but also for concepts such as NVIDIA CUDA** [1] and Intel Larrabee [2]. The Cell/B.E. processor and its successor, the IBM PowerXCell* 8i processor, are known for their superior computing performance and power efficiency. They provide more flexibility and easier programmability compared with graphics processing devices or field-programmable gate arrays, alleviating the challenge of adapting software to accelerator technology. Yet, it is still difficult in some scenarios to fully leverage these performance capabilities with Cell/B.E. processor chips
alone. The main inhibitors are lack of applications for the IBM Power Architecture* and the performance characteristics of the IBM PowerPC* core implementation in the Cell/B.E. processor. For mixed workloads in environments for which the Cell/B.E. processor is not suitable, such as Microsoft Windows**, various hybrid topologies consisting of Cell/B.E. processors and non-Cell/B.E. processor-based components can be appropriate, depending on the workload. Adequate system structures such as the integration of accelerators into other architectures have to be employed to reduce management and software overhead. The most significant factors to be considered in the design of Cell/B.E. processor-based system complexes are the memory and communication architectures. The following sections describe the properties of the Cell/B.E. Architecture (CBEA) and their consequences for system design.
Memory hierarchy of the CBEA The CBEA [3] contains various types of processor cores on one chip: The IBM PowerPC processor element (PPE) is compatible with the POWER* instruction set architecture (ISA) but employs a lightweight implementation with limited branch prediction and in-order execution. The eight synergistic processor elements (SPEs) are independent cores with a different ISA optimized for computation-intensive workloads. The SPE implementation on the chip targets efficiency, using in-order execution and no dynamic branch prediction. Code running on the PPE can access main memory through the on-chip memory controller. In contrast, the SPE instruction set allows access only to a very fast but small memory area, the local store (LS), that is local to the core. The LSs on the SPEs are not coherent with any other entity on the chip or in the system, but the SPEs can request atomicity of memory operations with respect to concurrent access to the same memory location. For communication and dataflow, the SPEs are able to access main memory and I/O (input/output) memory using direct memory access (DMA). Loosely coupled systems Loosely coupled systems cannot access the memories of other nodes. Most clusters in the area of high-performance computing (HPC) follow this system design, including the first petaflops supercomputer [4]. If the application permits, these systems scale very well. Interconnectivity is usually attained by a high-speed network infrastructure such as InfiniBand**. The data distribution across the nodes is coarse grained to keep the network overhead for synchronization and data exchange low compared with the data processing tasks. The predominant programming model for loosely coupled systems is the Message Passing Interface (MPI) [5], potentially enriched by the hybrid IBM Accelerator Library Framework (ALF) [6] and hybrid Data Communication and Synchronization (DaCS) [7] libraries to enable the Cell/B.E. processor in mixed-architecture nodes. Another model based on remote procedure calls is IBM Dynamic Application Virtualization [8]. Tightly coupled systems Tightly coupled systems are usually smaller-scale systems consisting of few computing entities. In our context, a single host component, such as an x86 processor-based system, exploits one or several accelerator devices. Accelerators can range from field-programmable gate arrays (FPGAs) through graphics processing units (GPUs) to Cell/B.E. processor-based accelerator boards, each offering different performance and programmability attributes. Accelerators can be connected in a memory-coherent or noncoherent fashion. Interconnects for
memory-coherent complexes include Elastic Interface or Common Systems Interconnect. HyperTransport** can be used in memory-coherent setups, and PCI Express** (PCIe**) can be used in noncoherent memory setups. It is a common feature of tightly coupled systems that they are able to access the single main memory. Applications for tightly coupled systems usually run on one host part of the system (e.g., an x86 processor), while the other parts of the system (accelerators such as PowerXCell 8i processor-based entities or GPUs) perform computation-intensive tasks on data in main memory. The tightly coupled approach makes deployment easier because only one host system with accelerator add-ons has to be maintained rather than a cluster of nodes. Depending on the type of the tightly coupled accelerator, the approach can be applied well to non-HPC applications and allows for efficient and fine-grained dataflow between components, as latency and bandwidth are better than in loosely coupled systems. Programming models include OpenMP** [9], the ALF [6] and DaCS [7] libraries, CUDA, the Apple OpenCL** stack [10], and FPGA programming kits. Compared with FPGA and GPU approaches, Cell/B.E. processor accelerators, with their fast and large memories, offer the highest flexibility and generality while maintaining an easy-to-use programming model.
High-level system description This section describes the directCell architecture. It reflects on properties of acceleration cores and gives insight into the design principles of efficiently attaching the accelerator node. General-purpose processor compared with SPEs on the Cell/B.E. processor The remarkable performance of the Cell/B.E. processor is achieved mainly through the SPEs and their design principles. The lack of a supervisor mode, interrupts, and virtual memory addressing keeps the cores small, resulting in a high number of cores per chip. High computational performance is achieved through single-instruction, multiple-data (SIMD) capabilities. In directCell, the SPE cores are available for application code only, while the PPE is not used for user applications. This effectively makes the directCell accelerator an SPE multicore entity. Unlike limited acceleration concepts such as GPUs or FPGAs, the SPE cores are still able to execute general-purpose code, for example, that generated by an ordinary C compiler. In standalone Cell/B.E. processor-based systems, OS (operating system)-related tasks are handled on the PPE. The task of setting up SPEs and starting programs on them is controlled from outside, using a set of memory-mapped registers. All of these registers are
accessible through memory accesses, and no special instructions have to be used on the PowerPC core of the Cell/B.E. processor. Therefore, control tasks can be performed by any entity having memory access to the Cell/B.E. processor-side memory bus. Coupling as device and not as system In most cases, porting Windows applications to the Power Architecture is not a viable option. Instead, identifying computational hotspots and porting these to the Cell/B.E. processor-based platform offers a way to speed up the application. As the SPE cores are used for acceleration, the port has to target the SPE instruction set. Cell/B.E. processor-based systems use PCIe as a high-speed I/O link. PCIe is designed for point-to-point short-distance links. This communication model does not scale to many endpoints. However, it provides best-of-breed bandwidth and latency characteristics and supports memory-mapped I/O (MMIO) and DMA operations. Because the Cell/B.E. processor resources are accessible via MMIO, it is possible to integrate the processor directly into the address space of other systems through PCIe, following the tightly coupled accelerator concept. Communication overhead in this PCIe-based setup can be kept minimal. The directCell approach takes advantage of the low latencies of PCIe and makes the SPEs available to existing applications as an acceleration device instead of a separate system. To export a Cell/B.E. processor-based system as an acceleration device on the PCIe level, it must be configured as a PCIe endpoint device, an approach currently used by all hybrid setups involving the Cell/B.E. processor, including the first petaflops computer [4]. The directCell solution evolves from the endpoint mode and uses the Cell/B.E. processor as a device at the host system software level. We propose to relocate all control from the PPE to the host system processor except for the bare minimum needed for accelerator operation. For directCell, we further propose to shift all application code executed on the PPE in a standalone configuration over to the host processor. This includes scheduling of SPE threads and SPE memory management and also applies to the PPE application code. Thus, the host controls all program flow on the accelerator device. None of the Cell/B.E. processor resources are hidden in our model. More importantly, nothing on the accelerator device obstructs the host from controlling it. In contrast to a distributed hybrid system approach that relies on an accelerator-resident OS to control all of its memory, we consider an accelerator-resident OS superfluous. The host OS runs the main application and is, therefore, in charge of all of the hardware resources that are needed for its execution.
In distributed hybrid systems, the host OS needs to coordinate data-transfer operations with the accelerator OS. MPI [5], which was designed for distributed systems, features a sophisticated protocol to establish zero-copy remote DMA (RDMA) transfers that are vital to hybrid system performance. Without using an OS on the accelerator, the host can act independently, giving much more flexibility to the host system programmer and saving additional runtime overhead. In order to support this execution model from the host system software, the accelerator hardware should provide the following mechanisms:

- Endpoint functionality.
- MMIO control of processor cores, including the next instruction pointer, translation lookaside buffer (TLB), machine state register (MSR), and interrupt acknowledgment.
- Interrupt facilities.
- Optionally, PCIe coherency.

Providing these mechanisms in a processor design yields enormous rewards with regard to system configurations, especially in hybrid configurations. The directCell architecture applies to Cell/B.E. processor-based blades such as the IBM BladeCenter* QS22 server [11], with specific firmware code to operate in PCIe endpoint mode, and to the IBM PowerXCell accelerator board (PXCAB) offering [12].
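To make the MMIO-control requirement concrete, the following sketch shows how a host entity might start an SPE purely through memory-mapped registers. The CBEA register names (SPU_NPC, SPU_RunCntl, SPU_Status) are real; the offsets and bit values below are illustrative placeholders rather than the actual problem-state layout, and byte ordering between the little-endian host and the big-endian register file is ignored for brevity.

    /* Hedged sketch: starting an SPE purely through MMIO, as directCell
     * requires.  `spe_base` is the host-virtual address of one SPE's
     * problem-state area mapped through a PCIe BAR.  Offsets and bit
     * values are placeholders, not the real CBEA layout. */
    #include <stdint.h>

    #define SPU_NPC_OFFSET      0x0034   /* placeholder offset */
    #define SPU_RUNCNTL_OFFSET  0x001C   /* placeholder offset */
    #define SPU_STATUS_OFFSET   0x0024   /* placeholder offset */

    #define SPU_RUNCNTL_RUN     0x1      /* request run (illustrative) */
    #define SPU_STATUS_RUNNING  0x1      /* SPU running bit (illustrative) */

    static inline void mmio_write32(volatile uint8_t *base, uint32_t off, uint32_t val)
    {
        *(volatile uint32_t *)(base + off) = val;
    }

    static inline uint32_t mmio_read32(volatile uint8_t *base, uint32_t off)
    {
        return *(volatile uint32_t *)(base + off);
    }

    /* Start SPE execution at local-store address `entry` and wait for it
     * to stop. */
    static void spe_start_and_wait(volatile uint8_t *spe_base, uint32_t entry)
    {
        mmio_write32(spe_base, SPU_NPC_OFFSET, entry);     /* next program counter */
        mmio_write32(spe_base, SPU_RUNCNTL_OFFSET, SPU_RUNCNTL_RUN);
        while (mmio_read32(spe_base, SPU_STATUS_OFFSET) & SPU_STATUS_RUNNING)
            ;   /* in directCell the stop is signaled by an interrupt instead of polling */
    }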
Applications and programming Performance and flexibility are two essential and generally opposing criteria when evaluating computer architectures. On the performance side, specially designed circuits, digital signal processors, and FPGAs provide extremely high computing performance but are lacking generality and programmability. Regarding flexibility, general-purpose processors allow for a wide application spectrum but lack the performance of highly optimized systems. Cell/B.E. processors follow a tradeoff and provide high computing performance while still offering a good degree of flexibility and ease of programming. The tightly coupled hybrid approach proposed with directCell in this paper promises to capture the advantages of both extremes by combining Cell/B.E. processor-based accelerators for performance and x86 systems for wide applicability. Two aspects apply to programmability for these hybrid systems: first, the complexity to program the accelerator (i.e., the SPEs), and second, the integration of the tightly coupled accelerator attachment into the main application running on the host system (i.e., the model to exploit the SPEs from the x86 system).
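To illustrate the second aspect, the following minimal sketch shows the standard libspe2 calling sequence (introduced in the next section) that a host application uses to exploit an SPE; directCell keeps this sequence intact on the x86 host. The embedded program handle spe_kernel is a hypothetical name, and error handling is omitted for brevity.

    /* Minimal sketch of the standard libspe2 flow that directCell preserves
     * for the host application. */
    #include <libspe2.h>

    extern spe_program_handle_t spe_kernel;   /* embedded SPE ELF image (assumed name) */

    int run_on_one_spe(void *argp)
    {
        spe_context_ptr_t ctx;
        spe_stop_info_t stop_info;
        unsigned int entry = SPE_DEFAULT_ENTRY;

        ctx = spe_context_create(0, NULL);            /* allocate an SPE context */
        spe_program_load(ctx, &spe_kernel);           /* load the SPE ELF image  */
        spe_context_run(ctx, &entry, 0, argp, NULL,   /* run until the SPE stops */
                        &stop_info);
        spe_context_destroy(ctx);                     /* release the context     */
        return 0;
    }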
Standard programming framework of the Cell/B.E. processor The SPE Runtime Management Library (libspe2) [13], developed by IBM, Sony, and Toshiba, defines an OS-independent interface to run code on the SPEs. It provides basic functions to control dataflow and communication from the PPE. Cell/B.E. processor applications use this library to start the execution of code on the SPEs using a threading model while the PPE runs control and I/O code. The code on the PPE is usually responsible for shuffling data in and out of the system and for controlling execution on the SPEs. In most cases, code on the SPEs pulls data from main memory into the LSs so that the SPEs can work on this data. Synchronization between the SPEs and between the PPE and SPEs can take place through a mailbox mechanism. Programming tightly coupled accelerators The programming model and software framework for tightly coupled accelerators such as the Cell/B.E. processor can be structured in several ways. One way, which resembles the loosely coupled systems approach [14], is to add a communication layer between the two standalone systems, the host and the Cell/B.E. processor-based system. These additional layers increase the burden on software developers, who have to deal with an explicit communication model instead of a programming library. A different model, one that replaces parts of the PPE functionality with host processor code, is used for directCell. The PPE does not run application-specific code; instead, code on the host processor assumes the functions that were previously performed by the PPE. This allows applications to fully leverage the advantages of a tightly coupled accelerator model. In our implementation of directCell, the libspe2 library is ported to the x86 architecture, which allows control of SPE code execution and synchronization between an x86 system and SPE entities. This reduces the complexity of programming the directCell system to the level of complexity required to natively program the Cell/B.E. processor. Existing Cell/B.E. processor-aware code can continue to run unchanged or with limited changes in this environment, making a large body of existing SPE kernels available to x86 environments. Furthermore, existing Windows or Linux** applications can be extended by libspe2 calls to leverage SPE acceleration, opening up the x86 environment and other application environments to Cell/B.E. processor acceleration. Implications of the tightly coupled accelerator structure The directCell approach has several implications for programming and applications:
- Most of the application code runs on x86. Any code that previously ran on the PPE now has to be built for and run on the x86 platform.
- The libspe2 application programming interface (API) to control SPEs is moved to x86, i.e., libspe2 is an x86 library.
- SPEs can perform DMA transfers between the SPE LS and the x86 main memory as used by the main application. DMA to the northbound memory that is local to the Cell/B.E. processor accelerator system is still possible for SPEs.
- For transfers between the x86 memory and SPEs, the different endianness characteristics of the architectures need to be considered. (Endianness refers to whether bytes are stored in the order of most significant to least significant or the reverse.) SIMD shuffle-byte instructions can be used for efficient data conversion on the SPEs and may have to be added as a conversion layer, depending on the type of data exchanged between host and accelerator; see the sketch after this list.
- The PCIe interface between the Cell/B.E. processor and the x86 components does not permit atomic DMA operations, a restriction that applies to all accelerators attached through PCIe. Atomic DMA operations on the Cell/B.E. processor local memory are still possible. This constraint requires code changes to existing applications that use atomic DMA operations for synchronization.
- Memory local to the Cell/B.E. processors can be used by SPE code for buffering; because of its proximity to the Cell/B.E. processor chips, it has better performance characteristics than the remote x86 memory. However, adding this mid-layer of memory between the x86 main memory and the SPE LSs adds to application complexity. A good way of using the memory local to the Cell/B.E. processors might be to prefetch and cache large chunks of host system data in it. The complexity of prefetching and loading data on demand from the Cell/B.E. processor local memory can typically be encapsulated in a few functions.
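The conversion layer mentioned in the endianness item above can be as small as the following SPE-side sketch: one spu_shuffle with a byte-permute pattern swaps the byte order of every 32-bit word in a quadword. The helper names and the fixed 32-bit granularity are illustrative; real code would choose patterns that match the exchanged data structures.

    /* SPE-side sketch of the endianness conversion: byte-swap every 32-bit
     * word of a quadword with a single spu_shuffle, e.g. after DMAing a
     * little-endian control block from x86 host memory into the local store. */
    #include <spu_intrinsics.h>

    static inline vector unsigned int bswap32_qword(vector unsigned int v)
    {
        /* Shuffle pattern: reverse the byte order inside each 4-byte group. */
        const vector unsigned char swap32 = {
             3,  2,  1,  0,   7,  6,  5,  4,
            11, 10,  9,  8,  15, 14, 13, 12
        };
        return spu_shuffle(v, v, swap32);
    }

    /* Convert `n` quadwords in place. */
    static void bswap32_buffer(vector unsigned int *buf, unsigned int n)
    {
        for (unsigned int i = 0; i < n; i++)
            buf[i] = bswap32_qword(buf[i]);
    }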
Structure of directCell The directCell model can be applied to Cell/B.E. processor-based systems only by means of software. Several software components are needed on the host and accelerator system side. In the following sections, we introduce the components and how they interface with each other. Libspe2.dll The basic interface for controlling SPE applications in standalone Cell/B.E. processor-based systems is libspe2
[13]. To conform to this widely adopted programming interface, and to be able to reuse a large amount of existing code, directCell provides a Windows port via a Windows DLL (dynamic-link library) that provides such functions as spe_context_create() and spe_context_run(). The libspe2 routines create a directCell device instance and send I/O control commands processed by the device driver. The libspe2 used in standalone system configurations interfaces with the SPE hardware resources through the synergistic processor unit (SPU) file system (spufs) [15] in the Linux kernel. Spufs offers two system calls and a virtual file system for creating SPE threads that the Linux kernel schedules to run on the SPE. In the Windows implementation, the spe_context_create() libspe2 call allocates a user-space representation of the SPE program and also a kernel-space device instance that is accessed by a device handle in user space. The function spe_context_run() invokes a corresponding I/O control (ioctl) that passes the device handle and a pointer to the user-space buffer as parameters. SPE binaries are embedded in a given Windows application as executable and linking format (ELF) files. In our Windows implementation, the ELF files are integrated into accelerated applications as ELF resources. The Windows API provides standard resource-handling functions that are used to locate and extract a previously embedded file at runtime. The libspe2 port also implements ELF loading functionality for processing the embedded SPU program to pass it to the spe_context_run() call. Device driver Our implementation of directCell features a sophisticated device driver that provides a software representation of all of the Cell/B.E. processor facilities on a host system. Accelerated application code is loaded, executed, and unloaded by operating the device driver interfaces. From a conceptual level, this closely corresponds to the well-established spufs. The idea of spanning the spufs interface across a high-speed interconnect has been pursued by Heinig et al. as rspufs [16]. Regarding the user-space representation, spufs is based on the UNIX** file concept by representing SPE hardware through a virtual file system. In the case of Windows, the representation of hardware in user space is dominated by a different model. In Windows, files, volumes, and devices are represented by file objects, but the concept of imposing a directory structure for device control is not widely supported. Windows allows the creation of device interface classes that are associated with a globally unique identifier (GUID) and a set of system services that it implements. A device interface class is instantiated by creating a file handle with the GUID. This handle is used to invoke
system services. The granularity of this scheme centers around the idea of a single device node as found in a Linux /dev tree. The fine-grain control gained by subdividing the device details into a file system hierarchy such as spufs contrasts with a large variety of system services, each tailored for a specific purpose on a distinct device class. Device drivers that support I/O operations implement a relatively small set of I/O functions via an I/O queue. The driver I/O queue registers callbacks, most importantly read, write, I/O-control, stop, and resume. The actual communication between a user-space application and a device takes place through an appropriate device service that invokes the Windows I/O manager, which forms an I/O request packet, puts it into the I/O queue of the correct driver, and notifies the driver via the previously registered callbacks. The device driver can satisfy some of the function requirements of higher-level libraries such as libspe2 by using existing system services. The call spe_context_create() generates an SPE context by invoking the CreateFile() system service using the Windows Cell/B.E. processor device handle. This returns a handle to a kernel representation of an SPE context that is similar to the representation returned by an spu_create() system call in spufs. The spu_run() functionality is implemented as an ioctl that operates on the previously created SPE handle. The ioctl is nonblocking, although upper-level software should wait for completion via designated wait routines for asynchronous I/O. The ioctl initiates a direct DMA transfer of the SPE context from host user-space memory to the LS of the SPU. All data transfers in directCell follow this zero-copy paradigm. Regarding the mechanisms to access the Cell/B.E. processor hardware facilities, the device driver must perform additional steps compared with spufs. These additional steps are due mainly to the conversion between the Cell/B.E. processor address space and the host system address space, as described in the following sections. Cell/B.E. processor firmware layer The standard product firmware of a Cell/B.E. processor-based system is supplemented by a minimal software layer that takes care of additional initialization tasks after the endpoint-aware standard firmware has finished the basic boot. Most importantly, this layer provides accelerator runtime management and control. The initialization tasks comprise the setup of the address translation mechanism of the SPEs, the reporting of available SPEs to the device driver before accelerator usage, and the setup of the I/O memory management unit (I/O MMU). The runtime management and control functionality centers around the handling of SPE events such as page faults and completion signals. This handling is
vital for the execution of SPE binaries. Additional runtime debug and monitoring facilities for the SPEs complete the firmware layer functionality. The event handler and the debug monitor each run on separate PPE hardware threads. In our current implementation, the debug monitor provides a simple facility for probing memory locations, memory-mapped resources, and interfaces via the serial console. In future implementations, the debug console I/O could be redirected via PCIe to a console application that runs on the host.

Accelerator operation While directCell maintains compatibility with the common programming models for the CBEA [17], it introduces a new runtime model. Figure 1 shows, and the following paragraphs describe in detail, how directCell delivers efficient workload acceleration and makes the additional overhead imposed by a Cell/B.E. processor-resident OS superfluous. The numbers in parentheses below refer to the numbered steps in the figure.

[Figure 1: In directCell, an SPE program is started on the host processor by directly programming an SPE via the PCIe link between the host processor southbridge and that of the Cell/B.E. processor. The figure shows the most important directCell software components and their lifecycle in the directCell hybrid system configuration.]
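Seen from host user space, steps (1) and (2) of the flow described below might look roughly like the following hedged sketch. CreateFile() and DeviceIoControl() are the standard Windows services the libspe2 port builds on; the device path, the ioctl code, and the request layout are hypothetical.

    /* Hedged host-side sketch: open the directCell device and ask the
     * driver to run a previously prepared SPE context.  Only the Win32
     * calls are real; everything device-specific is a placeholder. */
    #include <windows.h>
    #include <winioctl.h>

    #define IOCTL_DIRECTCELL_RUN_CONTEXT \
        CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)

    struct run_request {            /* hypothetical request layout */
        void *context_image;        /* user-space SPE context (LS image, registers) */
        unsigned int entry;         /* LS entry point */
    };

    static int directcell_run(const wchar_t *dev_path, struct run_request *req)
    {
        DWORD returned = 0;
        HANDLE dev = CreateFileW(dev_path, GENERIC_READ | GENERIC_WRITE,
                                 0, NULL, OPEN_EXISTING, 0, NULL);
        if (dev == INVALID_HANDLE_VALUE)
            return -1;

        /* The driver DMAs the context to a free SPE and starts it; completion
         * is reported asynchronously in the real implementation. */
        BOOL ok = DeviceIoControl(dev, IOCTL_DIRECTCELL_RUN_CONTEXT,
                                  req, sizeof(*req), NULL, 0, &returned, NULL);
        CloseHandle(dev);
        return ok ? 0 : -1;
    }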
Consider a user-space application on x86 that needs to run an SPE program via a PCIe-attached directCell device. Through our libspe2 implementation, it interfaces with the device driver by using the standard Windows API functions to allocate an SPE context (representation of LS, general-purpose registers, and problem or privileged state). The SPE binary is retrieved from the executable main application and loaded into the SPE context. The device driver implements a standard Windows function to support opening a device representation of the directCell system and running the previously created SPE context (1). The spe_context_run() method is implemented as an ioctl. Upon invocation, the driver picks a free physical SPE on the remote system and orchestrates the transfer of the SPE context to that system via MMIO operations (2). For the actual transfer of data, it is crucial that only DMA is used to maximize throughput and minimize latency because the data has to traverse the PCIe bus. MMIO is used only as a control channel of the host system into the Cell/B.E. processor-based system. This concept of loading comprises a small bootstrap loader as described
in Reference [18] to achieve this. The loader is transferred to the SPE in a single 16-KB DMA operation that also carries as parameters the origin address of the target SPE binary and its size. The loader transfers the contents of the SPE register array into an internal data structure and initializes the registers before filling the LS with the binary SPE code via DMA (3). The loader then stops and signals completion to the PPE via an external interrupt. This interrupt is forwarded to the device driver by the PPE (4), which runs a tiny firmware layer as part of the only software prevailing on the PPE (as introduced in the previous section). While the SPE binary code executes, it generates mapping faults (4) whenever a reference to an unmapped host-resident address is encountered. This applies, for instance, to parameter data that may reside in a control block on the host. The corresponding memory mapping is established by the device driver via MMIO (5), and the host-resident data is then transferred via DMA (6). When the SPE stops execution, the host is signaled once again via interrupt and initiates a context switch and unload procedure, as described in Reference [9]. Similar to the previous steps, this phase involves unloader binary code that facilitates a swift transfer of the context data to the host memory via the SPE DMA engines. Memory addressing model One of the concepts of the directCell approach is to define a global address mapping that allows the SPEs to access host memory and the host to access the MMIO-mapped registers of the Cell/B.E. processor, the LS, and local main memory. The mechanism that achieves this is based on using the memory management unit (MMU) of the host, the MMU and I/O MMU of the Cell/B.E. processor, and the PCIe implementation features, which we show in the rest of this section. The BAR setup section introduces how MMIO requests from the host pass the PCIe bus, and the section on inbound mappings shows how those are translated to Cell/B.E. internal addresses. These internal addresses are further translated by the I/O MMU to PowerPC real addresses (RAs). The I/O MMU defines two translation ranges:
1. Requests directed to the SPE MMIO space are furnished with a predefined displacement by the southbridge (I/O chipset). The I/O MMU translates these displaced addresses to the base address of the SPE MMIO spaces within the Cell/B.E. processor addressing domain and preserves offsets to navigate within this range. 2. Requests directed to Cell/B.E. processor memory are translated 1:1 by the I/O MMU, also preserving the offset for internal navigation.
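From the SPE program's point of view, these mappings make a read from host memory an ordinary MFC DMA to an effective address, as in the following sketch using the standard spu_mfcio.h interface; the address, buffer size, and tag are illustrative.

    /* SPE-side sketch: pull a buffer into the local store.  Whether the
     * effective address resolves to accelerator DDR2 or to x86 host memory
     * across PCIe is invisible to this code once the mappings described
     * above are in place. */
    #include <spu_mfcio.h>
    #include <stdint.h>

    #define DMA_TAG 3

    static char ls_buf[16384] __attribute__((aligned(128)));

    static void fetch_from_host(uint64_t host_ea, unsigned int size)
    {
        /* size must be a multiple of 16 bytes and at most 16 KB per request */
        mfc_get(ls_buf, host_ea, size, DMA_TAG, 0, 0);

        /* wait for this tag group to complete */
        mfc_write_tag_mask(1 << DMA_TAG);
        mfc_read_tag_status_all();
    }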
Figure 2 shows a simplified model for MMIO communication. The MMIO transfer is triggered by writing to virtual addresses (VAs) on the host that are translated by the host MMU to RAs that, in turn, map to the host PCI bus. The MMIO requests are delivered as PCI memory cycles to the accelerator PCIe core where they are further redirected to the interface between the southbridge and Cell/B.E. processor interface, as described below in the section on inbound mappings. At that point, the I/O MMU maps the requests to the corresponding units on the Cell/B.E. processor. The SPEs transfer data via DMA to or from the accelerator system upon remote initiation by the host via MMIO. The target of the data-transfer operations may be local Cell/B.E. processor addresses (LS or local main memory) or remote host memory. The host addresses must be mapped to the I/O range of the Cell/B.E. processor; hence, accesses to these regions do not satisfy the coherency requirements that apply for local main memory. The device driver is responsible for locking the host memory regions to prevent swapping. For future enhancements of the directCell model, we encourage the introduction of a mechanism on the host system to explicitly request local main memory buffering to improve performance and coherency considerations. Further enhancements could provide a software-managed mechanism to prefetch and cache host-resident application buffers into local main memory [19]. The memory mappings are established through direct updates of the SPE TLBs whenever the SPE encounters an unmapped address. This handling is initiated by a page fault, as described in the section on interrupt management. In our current implementation of directCell, the host device driver performs this management via MMIO. The current implementation features a page size of 1 MB; hence, a single SPE can work on 256 MB of application data buffered in local memory or host-resident memory at once without the need of a page table. This area can be increased if larger common page sizes are available on the host and the accelerator. To introduce hardware management of the TLBs to directCell, the firmware layer would need to be augmented to manage a page table in the local main memory. The flow of DMA operations is shown in Figure 2. The SPE views a DMA operation as a transfer from one VA to another. When reading data from the host system (blue arrows), the target address translates to an LS address, and the source address translates to an RA in the I/O domain, which points to the outbound memory region of the accelerator PCIe core. The mapping of PCIe bus addresses to host RAs is likely to involve another inbound translation step; however, this is transparent to
the accelerator system and entirely dependent on the host OS, which directCell does not interfere with.

[Figure 2: The various levels of address translation and indirection between the host and the accelerator in directCell: (a) a control channel implemented by means of MMIO; (b) a data-transfer channel implemented via DMA. (VA: virtual address; RA: real address.)]

PCIe PCIe was developed as the next-generation bus system that replaces the older PCI and PCI-X** standards. In contrast to PCI and PCI-X, PCIe uses a serial communication protocol. The electrical and bit layer is similar to that of the InfiniBand protocol, which also uses 8b10b coding [20]. The entire PCIe protocol stack is handled in hardware. The programmer does not need to build any queues or messages to transfer actual data. Everything is done directly with the processor load and store instructions, which move data from and to the bus. It is possible to map a memory region from the remote side into the local memory map. This mapping is software compatible with the older PCI and PCI-X standards. Because of this, a mapping can be created so that the local processor can access remote memory directly via load or store operations, and effectively no software libraries are needed to use this high-speed communication channel.
PCIe is used as a point-to-point interconnection between devices and a system. This allows centralizing the traffic routing and resource management and enables quality of service. Thus, a PCIe switch can prioritize packets. For a multimedia system, this could result in fewer dropped frames and lower audio latency. PCIe itself is based on serial links. A link is a dual-simplex connection using two pairs of wires. One pair is used for transmitting data and one is used for receiving data. The two pairs are known as a PCIe lane. A PCIe link consists of multiple lanes, for example, x1, x2, x4, x8, x16, or x32. PCIe is a good communication protocol for high-bandwidth and low-latency devices. For example, a PCIe 2.0 lane can transmit 500 MB/s per direction, which amounts to 16 GB/s per direction for an x32 link configuration. A very compelling possibility of the PCIe external cabling specification is the option to interconnect an external accelerator module via a cable to a corresponding host system. PCIe endpoint configuration PCIe defines point-to-point links between an endpoint and a root-complex device. In a standalone server
configuration such as the BladeCenter QS22 server, the southbridge of the Cell/B.E. processor-based system is configured as a root complex to support add-in daughter cards. For directCell, the host system southbridge configuration is unchanged and runs as a root complex, whereas the Cell/B.E. processor-based system southbridge is configured to run in endpoint mode. The configuration is done by the service processors of the two systems prior to booting. From the host system, the Cell/B.E. processor-based system is seen as a PCIe device with its own device identification, which allows the BIOS (basic I/O system) and OS to configure the MMIO ranges and load the appropriate device drivers. The scalability of this model of endpoint coupling is limited only by how much I/O memory the BIOS can map.
PCIe inbound mapping In order to remotely access the Cell/B.E. processor-based system, the device driver and the firmware layer configure the Cell/B.E. processor-based system southbridge to correctly route inbound PCIe requests to the Cell/B.E. processor and the local main memory. This inbound mapping involves PCI base address registers (BARs) on the bus level and PCI inbound mappings (PIMs) on the southbridge level. Both concepts are introduced in the following sections.
[Figure 3: Overview of the inbound addressing mechanisms of the Cell/B.E. southbridge. Inbound PCIe requests are mapped to distinct ranges within the southbridge internal address space, which in turn are mapped to specific units of the Cell/B.E. processor. The address mappings can be changed at runtime.]
BAR setup The PCI Standard [21] defines the BARs for device configuration by firmware or software as a means to assign PCI bus address ranges to RAs of the host system. The BARs of the PCIe core in the southbridge are configured by PCIe config cycles of the host system BIOS during boot. After the configuration, it is possible to access internal resources of the accelerator system through simple accesses to host RAs. This makes it possible to remotely control any Cell/B.E. processor-related resources such as the MMIO register space, SPU resources, and all available local main memory. However, to achieve full control of the accelerator system, an additional step is required, which is introduced in the section on inbound mappings. The directCell implementation uses three BAR configurations to create a memory map of the device resources into the address space of the host system:
1. BARs 0 and 1 forming one 64-bit BAR provide a 128-MB region of prefetchable memory used to access all of the accelerator main memory (Figure 3). 2. A 32-bit BAR 2 is mapped to the internal resources of the end-point configuration itself, making it possible to change mappings from the host. This BAR is 32 KB in size. 3. BARs 5 and 6 forming one 64-bit BAR are used to map the LS, the privileged state, and problem state registers of all SPEs of the Cell/B.E. processor. Figure 3 is an overview of this concept. We can see how the driver can conveniently target different BARs for directly reaching the corresponding hardware resources.
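For reference, the three windows can be summarized as the following constants a host driver might keep; the names are ours, while the sizes and purposes are taken from the list above.

    /* Summary of the three BAR windows described above, expressed as
     * constants a host driver might keep.  Names are ours; the actual
     * register-level layout lives in the endpoint's configuration space. */
    enum directcell_bar {
        BAR_ACCEL_MEMORY  = 0,  /* BARs 0+1: 64-bit, 128-MB prefetchable window
                                 * into accelerator main memory (movable; see
                                 * the PIM discussion below) */
        BAR_ENDPOINT_REGS = 2,  /* BAR 2: 32-bit, 32 KB, endpoint-internal
                                 * resources, used to change mappings from
                                 * the host */
        BAR_SPE_RESOURCES = 5,  /* BARs 5+6: 64-bit, local stores plus
                                 * problem-state and privileged-state
                                 * registers of all SPEs */
    };

    #define ACCEL_MEMORY_WINDOW_BYTES  (128u * 1024 * 1024)
    #define ENDPOINT_REGS_BYTES        (32u * 1024)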
Inbound mappings The Cell/B.E. processor companion chip (the southbridge) introduces PIM registers to provide internal address translation capabilities. PIMs apply a displacement within the southbridge internal bus address space, as shown in Figure 3. For all BARs, corresponding PIM registers exist to map addresses on the PCIe bus to addresses on the southbridge internal bus [22], which is connected to the southbridge-to-Cell/B.E. processor interface. The PIM register translation for BAR 2 is set up by the accelerator firmware to refer to the MMIO-accessible configuration area of the PCIe core. The PIMs for BARs 0 and 1 and for BARs 5 and 6 are adjusted by the host device driver during runtime. In order to do so, the device driver uses the inbound mapping for BAR 2. The resources of the Cell/B.E. processor are made accessible within the PPE address space by the I/O MMU setup (described previously and depicted in the lower part of Figure 3). The PIM register translation for BARs 0 and 1 is changed during the runtime of host applications according to the needs of the driver. The driver provides a movable window across the complete main memory of the accelerator, though at a given point in time, only 128 MB of contiguous accelerator memory space is available (also shown in Figure 3).
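A hedged sketch of how the driver might move this window: align the requested accelerator address down to the 128-MB window size and write the new base into the PIM register for BARs 0 and 1 through the BAR 2 mapping. The register offset is a placeholder, not the actual southbridge layout, and byte ordering of the register write is ignored for brevity.

    /* Hedged sketch of "moving" the 128-MB accelerator-memory window.
     * `bar2_regs` is the host mapping of BAR 2 (the endpoint's internal
     * registers); PIM_BAR01_BASE_OFFSET is a placeholder. */
    #include <stdint.h>

    #define WINDOW_SIZE             (128ull * 1024 * 1024)
    #define PIM_BAR01_BASE_OFFSET   0x0100   /* placeholder register offset */

    /* Re-target the BAR 0/1 window so that `accel_phys` (an address in
     * accelerator main memory) becomes visible to the host.  Returns the
     * offset of `accel_phys` within the freshly positioned window. */
    static uint64_t move_accel_window(volatile uint8_t *bar2_regs, uint64_t accel_phys)
    {
        uint64_t window_base = accel_phys & ~(WINDOW_SIZE - 1);   /* 128-MB aligned */

        *(volatile uint64_t *)(bar2_regs + PIM_BAR01_BASE_OFFSET) = window_base;

        return accel_phys - window_base;   /* offset to add to the BAR 0/1 mapping */
    }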
Interrupt management The asynchronous execution of SPE programs requires an interrupt-handling infrastructure for directCell. SPE programs use interrupts to signal such events as mapping faults, errors, or program completion. Several layers are involved in handling SPE interrupts. An external interrupt is routed to the interrupt controller on the PPE, where it is handled by an interrupt handler in the PPE-resident firmware layer. This handler determines which SPU triggered the interrupt and directly triggers an interrupt to the PCIe core on the southbridge, which is in turn handled by the host device driver. Three types of SPE interrupts are handled by directCell:

1. SPU external interrupt: page-fault exception handling, DMA error, program completion.
2. Internal mailbox synchronization interrupt.
3. Southbridge DMA engine interrupt.

The inclusion of the PPE in the interrupt-handling scheme is necessary because the provided hardware does not allow the direct routing of SPE interrupts to the PCIe bus. We propose to include the ability to freely reroute interrupts to different targets in future processors and I/O devices.
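The forwarding step in the PPE-resident firmware layer might look roughly like the following sketch. The low-level accessors are hypothetical stand-ins; the real register names and status decoding come from the CBEA and southbridge documentation.

    /* Hedged sketch of the PPE firmware forwarding loop: find the SPE that
     * raised the external interrupt and turn it into an interrupt toward
     * the host by poking the southbridge PCIe core. */
    #include <stdint.h>

    #define NUM_SPES 8

    /* hypothetical low-level accessors provided elsewhere in the firmware */
    extern uint64_t read_spe_int_status(int spe);        /* pending cause bits    */
    extern void     ack_spe_interrupt(int spe, uint64_t cause);
    extern void     raise_host_interrupt(uint32_t msg);  /* doorbell on PCIe core */

    void handle_external_interrupt(void)
    {
        for (int spe = 0; spe < NUM_SPES; spe++) {
            uint64_t cause = read_spe_int_status(spe);
            if (cause == 0)
                continue;

            ack_spe_interrupt(spe, cause);               /* acknowledge locally */

            /* Encode SPE number and cause so the host device driver can run
             * the matching page-fault / completion / mailbox handler. */
            raise_host_interrupt(((uint32_t)spe << 16) | (uint32_t)(cause & 0xffff));
        }
    }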
Table 1: Throughput from SPE local store to x86-based host memory.

    Configuration                              Throughput (MB/s)
    QS22 at 3.2 GHz, SMP, PCIe x4, reads                  719.22
    QS22 at 3.2 GHz, SMP, PCIe x4, writes                 859.44
    PXCAB at 2.8 GHz, PCIe x16, reads                   2,333.95
    PXCAB at 2.8 GHz, PCIe x16, writes                  2,658.60
Performance evaluation In this section, first results of the directCell implementation are presented. The results are sectioned into a discussion on general throughput and latency measurements for specific directCell hardware configurations and a discussion of an exemplary application that we apply to our implementation of directCell. The latter section elaborates on the effort of porting the application to directCell as well as application performance data. General measurements These tests aim to determine the throughput and latency characteristics of the host and accelerator interconnect in two distinct directCell hardware configurations:
1. A BladeCenter server-based configuration: an IBM QS22 blade with an IBM HS21 blade. The interconnect is PCIe Gen1, x4.
2. A PowerXCell processor-based accelerator board (PXCAB): a PCIe add-in card connected to a standard x86-based advanced technology extended (ATX) personal computer. The interconnect is PCIe Gen1, x16.

All measurements are performed by a small firmware kernel running on the PPE of the Cell/B.E. processor-based system. The kernel runs on the PowerXCell PPE in real addressing mode, programs the SPE memory flow controller DMA facilities via MMIO, and transfers LS contents to the host memory and back. Because the MMIO operations are significantly slower than programming the DMA directly from the SPE, the methodology is not ideal; however, each test determines the additional latency of all involved MMIO operations and subtracts it from the overall duration. Table 1 lists the throughput from the SPE LS to x86-based host memory. The values shown were measured with two SPEs submitting requests; however, a single SPE will saturate the PCIe bus in both system configurations. With regard to latency, we conducted several measurements to determine the overhead of the I/O path. We regard the hop from the PowerXCell processor to the
southbridge separately from the overall access to the host memory. This is possible by providing a PowerXCell processor address that routes the request to the southbridge in such a way that it is immediately routed back to the PowerXCell processor by the southbridge. One of these loop-back tests refers back to the LS; a second accesses the Cell/B.E. processor-resident DDR2 (double-data-rate 2) memory. The results for both tests are shown in Figure 4, along with results for end-to-end data transfers from the LS to the x86 host memory.

[Figure 4: Latency of 16-KB SPE transfers to various targets, using the processor timebase: southbridge loop back to local memory, southbridge loop back to the SPE, and southbridge PCIe to host memory, for PXCAB 2.8-GHz and QS22 3.2-GHz SMP reads and writes. Latency in nanoseconds.]

[Figure 5: Overall application performance in simulations per second (millions) for standalone QS22 and for directCell QS22 and PXCAB configurations, with 1 SPE and with 8 SPEs (average).]

Application porting As an exemplary use case of directCell, we have ported a Cell/B.E. processor-based application from the financial services domain for European option trading. The underlying software is a straightforward implementation of a Monte Carlo pricing simulation. A derivative of this application has been part of an IBM showcase at the Securities Industry and Financial Markets Association Technology Management Conference and Exhibit [23]. The computing-intensive tasks of the application lie in the acquisition of random integer numbers and the application of Moro's inversion to them [24]. The standalone Linux-based version of the application transfers and starts one or more SPE threads that in turn read a control block from main memory before starting the simulation. The control block contains such parameters as the number of simulations that should be performed, the simulated span of time, and the space to return result data. After
completion, the control block is transferred back to main memory. The development of the application port involved porting the PPE code to the x86 architecture and Windows as well as providing for endianness conversion of the application control block in the SPE program. Of the PPE code, 16% of all lines are reused verbatim. The remaining code underwent mostly syntax changes to conform to the data types and library function signatures of Windows. The control flow of the application was not changed; all lines of SPE code are reused verbatim. Some additional 85 lines of code were needed to perform endianness conversion of the control block data. The conversion code makes use of the spu_shuffle instruction with a set of predefined bit-swap patterns. With adequate programming language support, such a code block can be easily generated by a compiler, even for complex data structures. The application port was performed on a Microsoft Windows Server** 2008-based implementation of directCell. Application results We compare the application performance on a standalone native Cell/B.E. processor running Linux on a BladeCenter QS22 server as well as a directCell configuration of the same system with a BladeCenter HS21 server and a directCell PXCAB configuration connected to an AMD64. The most important application parameters are set to 10 timesteps and 1 million simulations per run per SPE. Figure 5 shows that the number of simulations per second for the QS22-based directCell configuration is
only marginally lower than the number achieved in the nonhybrid standalone system. The overall performance of the PXCAB-based directCell configuration scales linearly with the lower processor frequency of PXCAB compared with QS22. We also evaluated the duration of distinct libspe2 function calls. Under Windows Server 2008, these measurements were conducted by using the QueryPerformanceCounter function. Under PowerPC Linux, the gettimeofday function was used to acquire the durations. The results shown for host-resident function calls in Figure 6 reveal that the host processors used in the directCell model perform certain tasks faster than the PPE of the PowerXCell processor. Also, the frequency of the function calls has a significant impact on their average duration.

[Figure 6: Duration of host-resident libspe2 functions (spe_context_create, spe_program_load, spe_context_destroy) in microseconds, for standalone QS22 and for directCell QS22 and PXCAB configurations, with 1 SPE and with 8 SPEs (average).]

Finally, we measured the duration of handling SPE page and segment faults in directCell. For this, we used a PCIe analyzer that triggers first upon interrupt assertion on the bus and triggers again upon the bus cycle that updates the TLB of the SPEs with a new memory mapping. This method omits the time that the interrupt and the resulting MMIO cycle need to propagate through the Cell/B.E. processor-based system, but the orders of magnitude of the fault handling observed on the bus and of the host system handling allow us to neglect this, as shown in Table 2.

Table 2: SPE memory fault measurements.

    Interrupt type        Duration (μs)
    SPE segment fault            15.984
    SPE page fault               20.56

Conclusion and outlook We have presented directCell, an approach to attach a Cell/B.E. processor-based accelerator to a Windows-
based x86 host system using PCIe connectivity. This configuration taps today’s software environment while the native Cell/B.E. processor programming model is maintained, leveraging existing code for the CBEA. The tight coupling of the accelerator and its low communication latency allows for a broad range of applications that go beyond HPC and streaming workload while offering an integrated and manageable system setup. The viability of this approach has been demonstrated in a prototype. The PPE portion of the Cell/B.E. processor runs a slim firmware layer implementing initialization, memory management, and interrupt handling code, granting the x86 host system full access to the SPEs. Further research should be undertaken to optimize memory management and the work split between host control software and the Cell/B.E. processor PPE firmware components. Exploitation of the memory local to the Cell/B.E. processor-based accelerator can improve application performance transparently. Efforts to evaluate scalability across several accelerator engines as well as building an environment for efficient development can lead to broader use of directCell. *Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both. **Trademark, service mark, or registered trademark of NVIDIA Corporation, Microsoft Corporation, InfiniBand Trade Association, HyperTransport Technology Consortium, PCI Special Interest Group, OpenMP Architecture Review Board Corporation, Apple, Inc., Linus Torvalds, or The Open Group in the United States, other countries, or both. ***Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc., in the United States, other countries, or both and is used under license therefrom.
References 1. CUDA Zone, NVIDIA Corporation; see http:// www.nvidia.com/object/cuda_home.html#. 2. L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, et al., ‘‘Larrabee: A Many-Core x86 Architecture for Visual Computing,’’ ACM Trans. Graph. 27, No. 3, 1–15 (2008). 3. IBM Corporation, IBM Cell Broadband Engine Technology; see http://www-03.ibm.com/technology/cell/index.html. 4. IBM Corporation (June 9, 2008).Roadrunner Smashes the Petaflop Barrier. Press release; see http://www-03.ibm.com/ press/us/en/pressrelease/24405.wss.
5. ‘‘MPI: A Message-Passing Interface Standard Version 2.1,’’ Message Passing Interface Forum, 2008; see http://www. mpi-forum.org/docs/mpi21-report.pdf. 6. IBM Corporation, Accelerated Library Framework Programmer’s Guide and API Reference; see http:// publib.boulder.ibm.com/infocenter/systems/scope/syssw/topic/ eiccn/alf/ALF_Prog_Guide_API_v3.1.pdf. 7. IBM Corporation, Data Communication and Synchronization Library Programmer’s Guide and API Reference; see http:// publib.boulder.ibm.com/infocenter/systems/scope/syssw/topic/ eicck/dacs/DaCS_Prog_Guide_API_v3.1.pdf. 8. IBM Corporation, IBM Dynamic Application Virtualization; see http://www.alphaworks.ibm.com/tech/dav. 9. OpenMP Architecture Review Board, OpenMP Application Programming Interface Version 3.0, May 2008; see http:// www.openmp.org/mp-documents/spec30.pdf. 10. N. Trevett, ‘‘OpenCL Heterogeneous Parallel Programming,’’ Khronos Group; see http://www.khronos.org/developers/ library/2008_siggraph_bof_opengl/OpenCL%20and% 20OpenGL%20SIGGRAPH%20BOF%20Aug08.pdf. 11. IBM Corporation, IBM BladeCenter QS22; see ftp:// ftp.software.ibm.com/common/ssi/pm/sp/n/bld03019usen/ BLD03019USEN.PDF/. 12. IBM Corporation, Cell/B.E. Technology-Based Systems; see http://www-03.ibm.com/technology/cell/systems.html. 13. IBM Corporation, SPE Runtime Management Library Version 2.2; see http://www-01.ibm.com/chips/techlib/ techlib.nsf/techdocs/1DFEF31B3211112587257242007883F3/ $file/SPE_Runtime_Management_API_v2.2.pdf. 14. K. Koch, ‘‘Roadrunner Platform Overview,’’ Los Alamos National Laboratory; see http://www.lanl.gov/orgs/hpc/ roadrunner/pdfs/Koch%20-%20Roadrunner%20Overview/ RR%20Seminar%20-%20System%20Overview.pdf. 15. A. Bergmann, ‘‘Spufs: The Cell Synergistic Processing Unit as a Virtual File System’’; see http://www-128.ibm.com/ developerworks/power/library/pa-cell/. 16. A. Heinig, R. Oertel, J. Strunk, W. Rehm, and H. J. Schick, ‘‘Generalizing the SPUFS Concept–A Case Study Towards a Common Accelerator Interface’’; see http://private.ecit. qub.ac.uk/MSRC/Wednesday_Abstracts/Heinig_Chemnitz.pdf. 17. IBM Corporation, Software Development Kit for Multicore Acceleration Version 3.1, SDK Quick Start Guide; see http:// publib.boulder.ibm.com/infocenter/systems/scope/syssw/topic/ eicce/eiccesdkquickstart.pdf. 18. IBM Corporation, Cell Broadband Engine Programming Handbook; see http://www-01.ibm.com/chips/techlib/techlib.nsf/ techdocs/9F820A5FFA3ECE8C8725716A0062585F. 19. IBM Corporation, Cell Broadband Engine SDK Example Libraries: Example Library API Reference, Version 3.1; see http://publib.boulder.ibm.com/infocenter/systems/scope/syssw/ topic/eicce/SDK_Example_Library_API_v3.1.pdf. 20. PCI-SIG, PCI Express Base 2.0 Specification Revision 1.1, March 28, 2005; see http://www.pcisig.com/specifications/ pciexpress/base2/. 21. PCI-SIG, PCI Conventional Specification 3.0 & 2.3: An Evolution of the Conventional PCI Local Bus Specification, February 4, 2004; see http://www.pcisig.com/specifications/ conventional/. 22. IBM Corporation, CoreConnect Bus Architecture; see http:// www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/ 852569B20050FF7785256991004DB5D9/$file/crcon_pb.pdf. 23. J. Easton, I. Meents, O. Stephan, H. Zisgen, and S. Kato, ‘‘Porting Financial Markets Applications to the Cell Broadband Engine Architecture,’’ IBM Corporation; see http://www-03.ibm.com/industries/financialservices/doc/content/ bin/fss_applications_cell_broadband.pdf. 24. 
IBM Corporation, Moro’s Inversion Example; see http:// publib.boulder.ibm.com/infocenter/systems/scope/syssw/topic/ eiccr/mc/examples/moro_inversion.html.
Received September 22, 2008; accepted for publication January 14, 2009 Hartmut Penner IBM Systems and Technology Group, IBM Boeblingen Research and Development GmbH, Schoenaicher Strasse 220, 71032 Boeblingen, Germany (
[email protected]). Mr. Penner is a Senior Technical Staff Member responsible for the firmware architecture of Cell/B.E. processor-based systems. He holds an M.S. degree in computer science from the University of Kaiserslautern, Germany. During his career at IBM, he has been involved in many different fields, working on Linux, the GNU compiler collection, and various firmware stacks of IBM servers. Utz Bacher IBM Systems and Technology Group, IBM Boeblingen Research and Development GmbH, Schoenaicher Strasse 220, 71032 Boeblingen, Germany (
[email protected]). Mr. Bacher is an architect for hybrid systems. He earned his B.S. degree in information technology from the University of Cooperative Education in Stuttgart, Germany. He developed highspeed networking code for Linux on System z*. From 2004 to 2007, he led the Linux on Cell/B.E. processor kernel development groups in Germany, Australia, and Brazil. Since 2007, he has been responsible for designing the system software structure of future IBM machines.
Jan Kunigk IBM Systems and Technology Group, IBM Boeblingen Research and Development GmbH, Schoenaicher Strasse 220, 71032 Boeblingen, Germany (
[email protected]). Mr. Kunigk joined IBM in 2005 as a software engineer after receiving his B.S. degree in applied computer science from the University of Cooperative Education in Mannheim, Germany. Prior to his involvement in hybrid systems and the directCell prototype, he worked on firmware and memory initialization of various Cell/B.E. processor-based systems. Christian Rund IBM Systems and Technology Group, IBM Boeblingen Research and Development GmbH, Schoenaicher Strasse 220, 71032 Boeblingen, Germany (
[email protected]). Mr. Rund received his M.S. degree in computer science from the University of Stuttgart, Germany. He joined IBM in 2001 as a Research and Development Engineer for the IBM zSeries* Fibre Channel Protocol channel. He has recently been involved in firmware and software development for the Cell/B.E. processor and the directCell prototype hybrid system. Heiko Joerg Schick IBM Systems and Technology Group, IBM Boeblingen Research and Development GmbH, Schoenaicher Strasse 220, 71032 Boeblingen, Germany (
[email protected]). Mr. Schick earned his M.S. degree in communications and software engineering at the University of Applied Sciences in Albstadt-Sigmaringen and graduated in 2004. Currently, he is the firmware lead of the Quantum Chromodynamics Parallel Computing on the Cell Broadband Engine (QPACE) project, which is a supercomputer mission of IBM and European universities and research centers to do quantum chromodynamics parallel computing on the Cell Broadband Engine Architecture.