QPACE: QCD parallel computing on the Cell

H. Baier,¹ M. Drochner,² N. Eicker,³ G. Goldrian,¹ U. Fischer,¹ Z. Fodor,⁴ D. Hierl,⁵ S. Heybrock,⁵ B. Krill,¹ T. Lippert,³ T. Maurer,⁵ N. Meyer,⁵ A. Nobile,⁶,⁷ I. Ouda,⁸ H. Penner,¹ D. Pleiter,⁹ A. Schäfer,⁵ H. Schick,¹ F. Schifano,¹⁰ H. Simma,⁶,⁹ S. Solbrig,⁵ T. Streuer,⁹ K.-H. Sulanke,⁹ R. Tripiccione,¹⁰ T. Wettig,⁵ and F. Winter⁹

¹ IBM Development Laboratory, 71032 Böblingen, Germany
² ZEL, Research Center Jülich, 52425 Jülich, Germany
³ Jülich Supercomputing Centre, 52425 Jülich, Germany
⁴ Department of Physics, University of Wuppertal, 42119 Wuppertal, Germany
⁵ Department of Physics, University of Regensburg, 93040 Regensburg, Germany
⁶ Department of Physics, University of Milano-Bicocca, 20126 Milano, Italy
⁷ European Centre for Theoretical Studies ECT* and INFN, Sezione di Trento, 13050 Villazzano, Italy
⁸ IBM Systems & Technology Group, Rochester, MN 55901, USA
⁹ Deutsches Elektronen-Synchrotron DESY, 15738 Zeuthen, Germany
¹⁰ Dipartimento di Fisica, Università di Ferrara and INFN, Sezione di Ferrara, 43100 Ferrara, Italy

(Dated: February 18, 2007)

We give an overview of the QPACE project, which is pursuing the development of a massively parallel, scalable supercomputer for applications in lattice quantum chromodynamics (QCD). The machine will be a three-dimensional torus of identical processing nodes, based on the PowerXCell 8i processor. These nodes will be tightly coupled by an FPGA-based, application-optimized network processor attached to the PowerXCell 8i. We first present a performance analysis of lattice QCD codes on QPACE and corresponding hardware benchmarks. We then describe the architecture of QPACE in some detail. In particular, we discuss the challenges arising from the special multi-core nature of the PowerXCell 8i and from the use of an FPGA for the network processor.

I. INTRODUCTION

The properties and interactions of quarks and gluons, which are the building blocks of particles such as the proton and the neutron, can be studied in the framework of quantum chromodynamics (QCD), which is very well established both experimentally and theoretically. In some physically interesting dynamical regions, QCD can be studied perturbatively, i.e., event amplitudes and probabilities can be worked out as a systematic expansion in the so-called strong coupling constant α_s. In other (often more interesting) regions, such an expansion is not possible, and other approaches have to be found. The most systematic and most widely used non-perturbative approach is based on lattice gauge theory (LGT), a discretized and computer-friendly version of the theory proposed more than 30 years ago by Wilson [1]. In the LGT framework, the problem can be reformulated as a problem in statistical mechanics, which can be studied with Monte Carlo techniques. LGT Monte Carlo simulations have been performed by a large community of physicists that have, over the years, developed very efficient simulation algorithms and sophisticated analysis techniques. LGT has been pioneering the use of massively parallel computers for large scientific applications since the early 1980s, using virtually all available computing systems such as traditional mainframes, large PC clusters, and high-performance systems including several dedicated architectures.

In the context of this paper, we are only interested in parallel architectures on which LGT codes scale to a large number (thousands) of nodes. If several architectures fulfill this requirement, a decision between them can be made on the basis of price-performance and power-performance ratios. The most competitive systems with respect to both of these metrics are custom-designed LGT architectures (QCDOC [2] and apeNEXT [3], which have been in operation since 2004/2005) and IBM's BlueGene [4] series of machines. These architectures are based on system-on-chip (SoC) designs, but in modern chip technologies the development costs of a VLSI chip are so high that an SoC approach is no longer sensible for an academic project. An alternative approach is the use of a compute node based on a commercial multi-core processor that is tightly coupled to a custom-designed network processor, whose implementation is facilitated by the adoption of Field Programmable Gate Arrays (FPGAs). This approach will be presented here. Concretely, we are developing a massively parallel machine based on IBM's PowerXCell 8i and a Xilinx Virtex-5 FPGA, in which the nodes are connected in a three-dimensional torus with nearest-neighbor connections. The name of this project is QPACE (QCD PArallel Computing on the CEll). This approach is competitive with the BlueGene approach in both price-performance and power-performance, at much lower development costs. There are some unique challenges to our approach that will be discussed below. QPACE is a collaboration of several academic institutions with the IBM development lab in Böblingen, Germany.¹

¹ The funding proposal for QPACE has received a positive evaluation, and the final funding decision will be made in May 2008 by the German Research Foundation (DFG).

The paper is structured as follows. In Sec. II we present an overview of the application (LGT) for which the machine is to be optimized. In Sec. III we very briefly describe the PowerXCell 8i processor. A detailed performance analysis of LGT codes on a parallel machine based on this processor, as well as micro-benchmarks of the relevant memory operations obtained on Cell hardware, are presented in Sec. IV. The QPACE architecture is presented in detail in Sec. V, and the software environment is described in Sec. VI. We close with a summary and outlook in Sec. VII.

II. LATTICE GAUGE THEORY COMPUTING

LGT simulations are among the grand challenges of scientific computing. Today, a group active in high-end LGT simulations must have access to computing resources on the order of tens of sustained TFlops-years. Computing resources used for LGT simulations have grown over the years at a faster pace than predicted by Moore's law.

An accurate description of the theory and its algorithms is beyond the scope of the present paper. We only stress that LGT simulations have an algorithmic structure exhibiting a large degree of parallelism and several additional features that make it simple to extract a large fraction of this parallelism. Continuous 4-d space-time is replaced by a discrete and finite lattice (e.g., of linear size L, containing N = L^4 sites). All compute-intensive tasks involve repeated execution of just one basic step, the product of the lattice Dirac operator and a quark field ψ. A quark field ψ_xia is defined at lattice site x and carries so-called color indices a = 1, ..., 3 and spinor indices i = 1, ..., 4. Thus, ψ is a vector with 12N complex entries. The so-called hopping term D_h of the Dirac operator acts on ψ as follows,²

\psi'_x = (D_h \psi)_x = \sum_{\mu=1}^{4} \left\{ U_{x,\mu}\,(1+\gamma_\mu)\,\psi_{x+\hat\mu} + U^\dagger_{x-\hat\mu,\mu}\,(1-\gamma_\mu)\,\psi_{x-\hat\mu} \right\} .    (1)

Here, μ labels the four space-time directions, and the U_{x,μ} are the SU(3) gauge matrices associated with the links between nearest-neighbor lattice sites. The gauge matrices are themselves dynamical degrees of freedom of the theory and carry color indices (3 × 3 complex entries for each of the 4N different U_{x,μ}). The γ_μ are the (constant) Dirac matrices, carrying spinor indices. Today's state-of-the-art simulations use lattices with a linear size of at least 32 sites.

² Equation (1) represents the so-called Wilson Dirac operator. Other discretization schemes exist, but the implementation details are all very similar.
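To make the data volumes concrete, the following C sketch (an illustration written for this text, not code from the QPACE project) declares the basic lattice objects that Eq. (1) operates on: an SU(3) gauge link with 3 × 3 complex entries and a spinor with 4 × 3 complex entries, which in double precision occupy 144 and 192 bytes, respectively, as used in the storage estimates of Sec. IV.

    #include <complex.h>
    #include <stdio.h>

    /* One SU(3) gauge link U_{x,mu}: 3x3 complex matrix (144 bytes in DP). */
    typedef struct { double complex c[3][3]; } su3_matrix;

    /* One quark spinor psi_x: 4 spinor x 3 color complex components (192 bytes in DP). */
    typedef struct { double complex s[4][3]; } spinor;

    int main(void) {
        printf("gauge link: %zu bytes, spinor: %zu bytes\n",
               sizeof(su3_matrix), sizeof(spinor));   /* expect 144 and 192 */
        return 0;
    }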

Roughly speaking, the ideal LGT simulation engine is a system able to keep in local storage the degrees of freedom defined above and to repeatedly apply the Dirac operator with very high efficiency. Explicit parallelism is straightforward. Inspection of (1) shows that D_h has non-zero entries only between nearest-neighbor lattice sites. This suggests an obvious parallel structure for an LGT computer: a d-dimensional grid of processing nodes. Each node contains local data storage and a compute engine. The physical lattice is mapped onto the processor grid as regular tiles (if d < 4, then 4 - d dimensions of the physical lattice are fully mapped onto each processing node). In the conceptually simple case of a 4-d square grid of p^4 processors, each processing node would store and process a 4-d subset of (L/p)^4 lattice sites. Each processing node handles its sub-lattice, using local data, and accesses data corresponding to lattice sites just outside the surface of the sub-lattice, stored on nearest-neighbor processors. Processing proceeds in parallel, since there are no direct data dependencies for lattice sites that are not nearest neighbors, and the same sequence of operations must be applied to all processing nodes. Parallel performance will be linear in the number of nodes, as long as the lattice can be evenly partitioned on the processor grid and as long as the interconnection bandwidth is large enough to sustain the performance on each node. The first constraint is met up to a very large number of processors, while the latter is less trivial since the required inter-node bandwidth increases linearly with p. Accurate figures on bandwidth requirements will be provided below.

The discussion outlined above conceptually favors an implementation in which a desired total processing power is sustained by the most powerful processor available, since this reduces the processor count and the node-to-node bandwidth. Hence the obvious suggestion to consider the Cell Broadband Engine (BE) as the target processor.
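The linear growth of the bandwidth requirement with p can be made explicit with a small sketch (illustrative only, not QPACE code): for a local tile of (L/p)^4 sites, the work per node scales with the local volume, while the data exchanged with neighbors scales with the tile surface, so their ratio grows like p.

    #include <stdio.h>

    int main(void) {
        const int L = 32;                    /* global lattice extent (example value)  */
        for (int p = 1; p <= 8; p *= 2) {    /* p^4 processing nodes                   */
            int l = L / p;                   /* local extent per node                  */
            long volume  = (long)l * l * l * l;   /* sites computed per node           */
            long surface = 8L * l * l * l;        /* boundary sites exchanged: 2 faces
                                                     in each of 4 dimensions           */
            printf("p=%d  local volume=%ld  surface=%ld  surface/volume=%.3f\n",
                   p, volume, surface, (double)surface / volume);
        }
        return 0;
    }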

III. THE PowerXCell 8i PROCESSOR

The Cell BE is described in detail in Refs. [5, 6]. The processor contains one PowerPC Processor Element (PPE) and 8 Synergistic Processor Elements (SPEs). Each of the SPEs runs a single thread and has its own 256 kB on-chip memory (local store, LS), which is accessible by direct memory access (DMA) or by local load/store operations to/from 128 general-purpose 128-bit registers. An SPE can execute two instructions per cycle, performing up to 8 single precision (SP) floating point (FP) operations. Thus, the total SP peak performance of all 8 SPEs on a Cell BE is 204.8 GFlops at 3.2 GHz. The PowerXCell 8i is an enhanced version of the Cell BE with the same SP performance and a peak double precision (DP) performance of 102.4 GFlops with IEEE-compliant rounding. It has an on-chip memory controller supporting a memory bandwidth of 25.6 GB/s and a configurable I/O interface (Rambus FlexIO) supporting a coherent as well as a non-coherent protocol with a total bidirectional bandwidth of 25.6 GB/s. Internally, all units of the processor are connected to the coherent element interconnect bus (EIB) by DMA controllers. The power consumption of the processor will be discussed below.
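As a quick cross-check of these peak numbers, the arithmetic can be spelled out as follows (an illustration only; the 8 SP Flops per SPE per cycle are quoted above, and the 4 DP Flops per SPE per cycle follow from the quoted 102.4 GFlops total):

    #include <stdio.h>

    int main(void) {
        const double clock_ghz  = 3.2;  /* SPE clock frequency                  */
        const int    num_spes   = 8;    /* SPEs per PowerXCell 8i               */
        const int    sp_per_cyc = 8;    /* SP Flops per SPE per cycle           */
        const int    dp_per_cyc = 4;    /* DP Flops per SPE per cycle (derived) */

        printf("SP peak: %.1f GFlops\n", clock_ghz * num_spes * sp_per_cyc);  /* 204.8 */
        printf("DP peak: %.1f GFlops\n", clock_ghz * num_spes * dp_per_cyc);  /* 102.4 */
        return 0;
    }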

IV. PERFORMANCE ANALYSIS OF THE PowerXCell 8i

A. Performance model

The Cell BE was developed for the PlayStation 3, but it is obviously very attractive for scientific applications as well [7-9]. This is even more true for the PowerXCell 8i. We therefore investigate the performance of this processor in a theoretical model along the lines of Ref. [10]. The results of this section have been reported at the Lattice 2007 conference [9].

We consider two classes of hardware devices: (i) Storage devices (e.g., registers or LS) store data and/or instructions and are characterized by their storage size. (ii) Processing devices act on data (e.g., FP units, characterized by their performance) or transfer data/instructions from one storage device to another (e.g., DMA controllers or buses, characterized by their bandwidths β_i and startup latencies λ_i).

An algorithm implemented on a specific architecture can be divided into micro-tasks performed by the processing devices of our model. The execution time T_i of each task i is estimated by a linear ansatz,

T_i \simeq I_i/\beta_i + O(\lambda_i) ,    (2)

where I_i is the size of the processed data. In the following, we assume that all tasks can be run concurrently at maximal throughput and that all dependencies and latencies can be hidden by suitable scheduling. The total execution time is then

T_{\rm exe} \simeq \max_i T_i .    (3)

If T_peak is the minimal compute time for the FP operations of an application that can be achieved with an "ideal implementation", the floating point efficiency ε_FP is defined as ε_FP = T_peak/T_exe. Fig. 1 shows the data-flow paths and associated execution times T_i that enter our analysis, in particular:

• floating-point operations, T_FP
• load/store operations between register file (RF) and LS, T_RF
• off-chip memory access, T_mem
• internal communications between SPEs on the same processor, T_int
• external communications between different processors, T_ext
• transfers via the EIB (memory access, internal and external communications), T_EIB

FIG. 1: The data-flow paths and their execution times T_i for a single SPE. (The figure shows the register file (RF), local store (LS), main memory (MM), EIB, ILB, and network interface, with the associated times T_FP, T_RF, T_LS, T_mem, T_EIB, T_ILB, T_link, and T_ext.)

We consider a communication network with the topology of a 3-d torus in which each of the 6 links between a given site and its nearest neighbors has a bidirectional bandwidth of 1 GB/s so that a bidirectional bandwidth of β_ext = 6 GB/s is available between each Cell BE and the network.³ All other hardware parameters β_i are taken from the Cell BE manuals [6]. Our performance model was applied to various linear algebra kernels and tested successfully by benchmarks on several hardware systems (see [9] for details). In the following, we report on results relevant for the main LGT kernels. These results have influenced the design of the QPACE architecture.
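As an illustration of how Eqs. (2) and (3) are used (a sketch with made-up example numbers, not the actual model parameters of the paper), the following code estimates the time of each micro-task from its data volume and bandwidth, takes the maximum as T_exe, and derives the efficiency ε_FP = T_peak/T_exe:

    #include <stdio.h>

    /* One micro-task of the model: data volume I_i (bytes) and bandwidth beta_i (bytes/cycle). */
    typedef struct { const char *name; double bytes; double bw; } task_t;

    int main(void) {
        const double t_peak = 330.0;           /* ideal FP time in cycles (example value) */
        task_t tasks[] = {                     /* illustrative numbers only               */
            { "RF <-> LS",        4096.0, 16.0 },
            { "off-chip memory",  3072.0,  8.0 },
            { "external network", 1536.0,  2.0 },
        };
        double t_exe = t_peak;                 /* Eq. (3): T_exe = max_i T_i              */
        for (unsigned i = 0; i < sizeof(tasks)/sizeof(tasks[0]); i++) {
            double t_i = tasks[i].bytes / tasks[i].bw;   /* Eq. (2), latencies neglected  */
            if (t_i > t_exe) t_exe = t_i;
            printf("%-18s T_i = %6.1f cycles\n", tasks[i].name, t_i);
        }
        printf("T_exe = %.1f cycles, eps_FP = %.0f%%\n", t_exe, 100.0 * t_peak / t_exe);
        return 0;
    }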

B. Lattice QCD kernel

The computation of Eq. (1) on a single lattice site amounts to 1320 Flops (not counting sign flips and complex conjugation) and thus yields T_peak = 330 cycles per site (in DP). However, the implementation of Eq. (1) requires at least 840 multiply-add operations and T_FP ≈ 420 cycles per lattice site to execute. Thus, any implementation of Eq. (1) on an SPE cannot perform better than 78% of peak.
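The 78% bound follows directly from these cycle counts; assuming each SPE sustains 4 DP Flops per cycle, i.e., two double precision multiply-add results per cycle (consistent with the 102.4 GFlops peak of Sec. III), the arithmetic is

    T_{\rm peak} = \frac{1320\ \mathrm{Flops}}{4\ \mathrm{Flops/cycle}} = 330\ \mathrm{cycles},
    \qquad
    T_{\rm FP} \ge \frac{840\ \mathrm{multiply\text{-}adds}}{2\ \mathrm{per\ cycle}} = 420\ \mathrm{cycles},
    \qquad
    \frac{T_{\rm peak}}{T_{\rm FP}} = \frac{330}{420} \approx 0.78 .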

³ Note that the dimensionality of the torus network as well as the bandwidth figures are strongly constrained by the technological capabilities of present-day FPGAs, i.e., the number of high-speed serial transceivers and the total pin count. See also Sec. V.

The lattice data layout greatly influences the time spent on load/store operations and on remote communications for the operands (9 × 12 + 8 × 9 complex numbers) of the hopping term (1). We assign to each Cell BE a local lattice with V_Cell = L_1 × L_2 × L_3 × L_4 sites and arrange the 8 SPEs logically as s_1 × s_2 × s_3 × s_4 = 8. A single SPE thus holds a subvolume of V_SPE = (L_1/s_1) × (L_2/s_2) × (L_3/s_3) × (L_4/s_4) = V_Cell/8 sites. Each SPE on average has A_int neighboring sites on other SPEs within and A_ext neighboring sites outside a Cell BE. In the following we investigate two different strategies for the data layout: either all data are kept in the on-chip local store of the SPEs, or the data reside in off-chip main memory.

Data in on-chip memory (LS)

In this case we require that all data for a compute task can be held in the local store of the SPEs. In addition, the local store must hold a minimal program kernel, the run-time environment, and intermediate results. Therefore, the storage requirements strongly constrain the local lattice volumes V_SPE and V_Cell. A spinor field ψ_x needs 24 real words (192 Bytes in DP) per site, while a gauge field U_{x,μ} needs 18 words (144 Bytes) per link. If for a solver we need storage for 8 spinors and 34 links per site, the subvolume carried by a single SPE is restricted to about V_SPE = 79 sites. In a 3-d network, the fourth lattice dimension must be distributed locally within the same Cell BE across the SPEs (logically arranged as a 1^3 × 8 grid). L_4 is then a global lattice extension and may be as large as L_4 = 64. This yields a very asymmetric local lattice with V_Cell = 2^3 × 64 and V_SPE = 2^3 × 8.⁴

Data in off-chip main memory (MM)

When all data are stored in main memory, there are no a-priori restrictions on V_Cell. However, we have to avoid redundant loads of the operands of Eq. (1) from main memory into the local store when sweeping through the lattice. To also allow for concurrent computation and data transfers (to/from main memory or remote SPEs), we consider a multiple buffering scheme.⁵ A possible implementation of such a scheme is to compute the hopping term (1) on a 3-d slice of the local lattice and then move the slice along the 4-direction. Each SPE stores all sites along the 4-direction, and the SPEs are logically arranged as a 2^3 × 1 grid to minimize internal communications between SPEs and to balance external ones. To have all operands of Eq. (1) available in the local store, we must be able to keep the U- and ψ-fields associated with all sites of three 3-d slices in the LS at the same time. This optimization requirement again constrains the local lattice size, now to V_Cell ≲ 800 × L_4 sites.

⁴ When distributed over 4096 nodes, this gives a global lattice size of 32^3 × 64.
⁵ In multiple buffering schemes several buffers are used in an alternating fashion to either process or load/store data. This allows for concurrent computation and data transfer at the price of additional storage (here in the LS).
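The multiple buffering scheme of footnote 5 can be sketched as follows. This is a purely illustrative, simplified double-buffered sweep in plain C; the helpers load_slice_async, wait_for_slice and compute_hopping_slice are hypothetical stand-ins (not QPACE or Cell SDK functions), and on the SPE the asynchronous loads would be DMA transfers from main memory into the local store.

    #include <stdio.h>

    /* Hypothetical stand-ins for loading a 3-d slice into an LS buffer and for
       applying the hopping term (1) to it. */
    static void load_slice_async(int buf, int t)      { printf("load slice %d into buffer %d\n", t, buf); }
    static void wait_for_slice(int buf)               { (void)buf; /* wait for DMA completion */ }
    static void compute_hopping_slice(int buf, int t) { printf("apply D_h on slice %d (buffer %d)\n", t, buf); }

    int main(void)
    {
        const int L4 = 4;                            /* local extent in the 4-direction (example)  */
        enum { NBUF = 2 };                           /* double buffering; more buffers are possible */

        load_slice_async(0, 0);                      /* prefetch the first slice                   */
        for (int t = 0; t < L4; t++) {
            int cur = t % NBUF, nxt = (t + 1) % NBUF;
            if (t + 1 < L4) load_slice_async(nxt, t + 1);  /* overlap: start loading the next slice */
            wait_for_slice(cur);                     /* ensure slice t has arrived in the LS        */
            compute_hopping_slice(cur, t);           /* Eq. (1) on the 3-d slice at position t      */
        }
        return 0;
    }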

  data in on-chip LS                 data in off-chip MM
  V_Cell     2^3 × 64                L_1 × L_2 × L_3     8^3      4^3      2^3
  A_int      16                      A_int/L_4           48       12       3
  A_ext      192                     A_ext/L_4           48       12       3
  T_peak     21                      T_peak/L_4          21       2.6      0.33
  T_FP       27                      T_FP/L_4            27       3.4      0.42
  T_RF       12                      T_RF/L_4            12       1.5      0.19
  T_mem      -                       T_mem/L_4           61       7.7      0.96
  T_int      2                       T_int/L_4           5        1.2      0.29
  T_ext      79                      T_ext/L_4           20       4.9      1.23
  T_EIB      20                      T_EIB/L_4           40       6.1      1.06
  ε_FP       27%                     ε_FP                34%      34%      27%

TABLE I: Theoretical time estimates T_i (in 1000 SPE cycles) for some micro-tasks arising in the computation of Eq. (1) for the LS case (left) and the MM case (right). A_int and A_ext are the numbers of neighboring sites per SPE. All other symbols are defined in Secs. IV A and IV B.

In Table I we display the predicted execution times for some of the micro-tasks considered in our model for both data layouts and reasonable choices of the local lattice size. In the LS case, the theoretical efficiency of about 27% is limited by the communication bandwidth (T_exe ≈ T_ext). This is also the limiting factor for the smallest local lattice in the MM case, while for larger local lattices the memory bandwidth is the limiting factor (T_exe ≈ T_mem). We have not yet benchmarked a representative QCD kernel such as Eq. (1), since in all relevant cases T_FP is far from being the limiting factor. Rather, we have performed hardware benchmarks with the same memory access pattern as (1), using the above-mentioned multiple buffering scheme for the MM case. We found that the execution times were at most 20% higher than the theoretical predictions for T_mem. (The freely available full-system simulator is not useful in this respect since it does not model memory transfers accurately.)
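To see where a number like T_mem/L_4 ≈ 61 in Table I comes from, the sketch below reproduces it under assumptions of our own, made for illustration: every spinor and gauge link is loaded from main memory only once per sweep, the result spinor is written back once, and the full 25.6 GB/s memory bandwidth (8 bytes per cycle at 3.2 GHz) is sustained.

    #include <stdio.h>

    int main(void) {
        const double bytes_spinor = 192.0;      /* one spinor in DP (24 real words)       */
        const double bytes_link   = 144.0;      /* one SU(3) link in DP (18 real words)   */
        const double bw           = 8.0;        /* memory bandwidth in bytes/cycle
                                                   (25.6 GB/s at 3.2 GHz)                 */
        const long   sites        = 8 * 8 * 8;  /* one 3-d slice of the 8^3 x L_4 lattice */

        /* Per site: read one spinor and 4 forward links, write one result spinor. */
        double bytes_per_site = bytes_spinor + 4.0 * bytes_link + bytes_spinor;
        double t_mem = sites * bytes_per_site / bw;   /* cycles per slice, i.e. per unit of L_4 */

        printf("T_mem/L_4 = %.0f cycles (about 61k, cf. Table I)\n", t_mem);
        return 0;
    }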

C. LS-to-LS DMA transfers

Since DMA transfer speeds determine T_mem, T_int, and T_ext, their optimization is crucial to exploit the Cell BE performance. Our analysis of detailed micro-benchmarks for LS-to-LS transfers shows that the linear model (2) does not accurately describe the execution time of DMA operations with arbitrary size I and arbitrary address alignment. We refined our model to take into account the fragmentation of data transfers, as well as the source and destination addresses, A_s and A_d, of the buffers:

T_{\rm DMA}(I, A_s, A_d) = \lambda_0 + \lambda_a \cdot N_a(I, A_s, A_d) + N_b(I, A_s) \cdot \frac{128\ {\rm Bytes}}{\beta} .    (4)
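A direct transcription of Eq. (4) is shown below. It assumes (our reading of the benchmark discussion that follows) that N_b counts the 128-Byte local store lines touched by the transfer, that N_a = N_b when source and destination alignments differ modulo 128 and N_a = 0 otherwise, and that the peak LS-to-LS bandwidth β is 8 bytes per cycle; λ_0 ≈ 200 and λ_a ≈ 16 cycles are the fitted values quoted from the benchmarks below.

    #include <stdio.h>

    /* Estimate of Eq. (4) for an LS-to-LS DMA transfer of I bytes (illustrative). */
    static double t_dma(unsigned I, unsigned As, unsigned Ad)
    {
        const double lambda0 = 200.0;   /* zero-size transfer latency (cycles)          */
        const double lambdaa = 16.0;    /* extra latency per block when misaligned      */
        const double beta    = 8.0;     /* assumed peak bandwidth (bytes/cycle)         */

        unsigned Nb = ((As % 128) + I + 127) / 128;      /* 128-Byte blocks touched      */
        unsigned Na = (As % 128 == Ad % 128) ? 0 : Nb;   /* misalignment penalty blocks  */

        return lambda0 + lambdaa * Na + Nb * 128.0 / beta;
    }

    int main(void)
    {
        printf("aligned    1 kB: %.0f cycles\n", t_dma(1024, 0, 0));
        printf("misaligned 1 kB: %.0f cycles\n", t_dma(1024, 32, 16));
        return 0;
    }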

FIG. 2: Execution time of LS-to-LS DMA transfers as a function of the transfer size I, with aligned (top panel, A_s = A_d = 0 mod 128) and misaligned (bottom panel, A_s = 32, A_d = 16 mod 128) source and destination addresses. The data points show the measured values on an IBM QS20 system. The dashed and solid lines correspond to the theoretical predictions of Eq. (2) and Eq. (4), respectively.

Our hardware benchmarks, fitted to Eq. (4), indicate that each LS-to-LS DMA transfer has a (zero-size transfer) latency of λ_0 ≈ 200 cycles. The DMA controllers fragment all transfers into N_b 128-Byte blocks aligned at local store lines (and corresponding to single EIB transactions). When δA = A_s - A_d is a multiple of 128, the source local store lines can be directly mapped onto the destination lines. Then we have N_a = 0, and the effective bandwidth β_eff = I/(T_DMA - λ_0) is approximately the peak value. Otherwise, if the alignments do not match (δA not a multiple of 128), an additional latency of λ_a ≈ 16 cycles is introduced for each transferred 128-Byte block, reducing β_eff by about a factor of two. Fig. 2 illustrates how clearly these effects are observed in our benchmarks and how accurately they are described by Eq. (4).

V. THE QPACE ARCHITECTURE

Our performance model and hardware benchmarks indicate that the PowerXCell 8i processor is a promising option for lattice QCD (LQCD). We expect that a sustained performance above 20% can be obtained on large machines. Parallel systems with O(2000) PowerXCell 8i processors add up to approximately 200 TFlops (DP peak), corresponding to about 50 TFlops sustained for typical LQCD applications. As discussed above, a simple nearest-neighbor d-dimensional interconnection among these processors is all we need to support the data exchange patterns associated with our algorithms. This simple structure helps to make the design and construction of the system fast and cost-effective, making it a competitive option for a QCD-oriented number cruncher to be deployed in 2009.

In a collaboration of academic and industrial partners we embarked on the QPACE project to design, implement, and deploy a next generation of massively parallel and scalable computer architectures optimized for LQCD. While the primary goal of this project is to make an additional vast amount of computing power available for LQCD research, the project is also driven by technical goals:

• Unlike similar previous projects aiming at massively parallel architectures optimized for LQCD applications, QPACE uses commodity processors, interconnected by a custom network.

• For the implementation of the network we leverage the potential of Field Programmable Gate Arrays.

• The QPACE design aims at an unprecedentedly small ratio of power consumption versus floating point performance.

The building blocks of the QPACE architecture are the node cards. These processing nodes, which run independently of each other, include as main components one PowerXCell 8i processor, which provides the computing power, and a network processor (NWP), which implements a dedicated interface to connect the processor to a 3-d high-speed torus network used for communications between the nodes and to an Ethernet network for I/O. Additional logic needed to boot and control the machine is kept to the bare minimum. The node card furthermore contains 4 GBytes of private memory, sufficient for all the data structures, including auxiliary variables, of present-day local lattice sizes. The NWP is implemented using an FPGA (Xilinx Virtex-5 LX110T). The advantage of using FPGAs is that they allow us to develop and test logic within a reasonably short amount of time and that development costs can be kept low. However, the devices themselves tend to be expensive.⁶ The main task of the NWP is to route data between the Cell processor, the torus network links, and the Ethernet I/O interface. The bandwidth of a torus network link is on the order of 1 GBytes/s in each of the two directions.

⁶ For us this issue is less severe since Xilinx is supporting QPACE by providing the FPGAs at a substantial discount.

The interface between the NWP and the Cell processor has a bandwidth of 6 GBytes/s,⁷ in balance with the overall bandwidth of the 6 torus links attached to each NWP. Unlike in other Cell-based parallel machines, in QPACE node-to-node communications will proceed from the local store (LS) of an SPE on one processor to the LS of a nearest-neighbor processor. For communication the data do not have to be moved through main memory (recall that the bandwidth of the interface to main memory is particularly performance-critical). They will rather be routed from an LS via the EIB directly to the I/O interface of the Cell processor. The PPE is not needed to control these communications. The latency for LS-to-LS copy operations is expected to be on the order of 1 μs.

To start a communication, the sending device (e.g., an SPE) has to initiate a DMA transfer of the data from its local store to a buffer attached to any of the link modules. Once the data arrives in such a buffer, the NWP will take care of moving the data across the network without intervention of the processor. On the receiving side, the receiving device has to post a receive request to trigger the DMA transfer of the data from the receive buffer in the NWP to the final destination.
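From the SPE's point of view, the send/receive flow just described could look roughly as follows. This is a purely illustrative sketch: tnw_put, tnw_post_recv and tnw_wait are hypothetical names for the low-level primitives (no such API is defined in the paper), and on real hardware the first step would be an ordinary SPE DMA into a buffer of the NWP link module.

    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical low-level primitives of the torus network interface
       (no-op stand-ins so the sketch is self-contained). */
    static void tnw_put(int link, const void *src, size_t n)      /* LS -> link send buffer */
    { printf("send %zu bytes on link %d\n", n, link); (void)src; }
    static void tnw_post_recv(int link, void *dst, size_t n)      /* arm the receive DMA    */
    { printf("post receive of %zu bytes on link %d\n", n, link); (void)dst; }
    static void tnw_wait(int link)                                /* wait for completion    */
    { (void)link; }

    /* Exchange one boundary face with a nearest neighbor over the torus. */
    static void exchange_boundary(int link_fwd, int link_bwd,
                                  const void *send_buf, void *recv_buf, size_t nbytes)
    {
        tnw_post_recv(link_bwd, recv_buf, nbytes);  /* be ready to receive                 */
        tnw_put(link_fwd, send_buf, nbytes);        /* DMA our data to the link module;
                                                       the NWP forwards it on its own      */
        tnw_wait(link_bwd);                         /* block until the neighbor's data
                                                       has landed in our local store       */
    }

    int main(void)
    {
        static double halo_out[144], halo_in[144];  /* example boundary buffers */
        exchange_boundary(0, 1, halo_out, halo_in, sizeof(halo_out));
        return 0;
    }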

The physical layer of the torus network links relies on commercial standards for which well-tested and cheap communication hardware is available. This allows us to move the most timing-critical logic out of the FPGA. Specifically, we are using the 10 Gbit/s transceiver PMC-Sierra PM8358 (in XAUI mode), which provides redundant link interfaces that can be used to select among several topologies of the torus network (see below). A much bigger challenge is the implementation of the link to the I/O interface of the Cell processor, a Rambus FlexIO processor bus interface. By making use of special features of the RocketIO transceivers available in Xilinx Virtex-5 FPGAs, it has actually been possible to connect an NWP prototype and a Cell processor at a speed of 3 GBytes/s per link and direction. (At the time of this writing only a single 8-bit link has been tested, but in the final design two links will be used.)

The node cards are attached to a backplane through which all network signals are routed. One backplane hosts 32 node cards plus 2 root cards. Each root card controls 16 node cards via a microprocessor which can be accessed via Ethernet. One QPACE cabinet has room for 8 backplanes, i.e., 256 node cards. Each cabinet therefore has a peak double precision performance of about 25 TFlops. On the backplane, subsets of nodes are interconnected in one dimension in a ring topology. By selecting the primary or redundant XAUI links it will be possible to select a ring size of 2, 4, or 8 nodes.

⁷ Existing southbridges do not provide this bandwidth to the Cell, and thus a commodity network solution is ruled out.

By the same mechanism the number of nodes is configurable in the second dimension, in which the nodes are connected by a combination of backplane connections and cables. In the third dimension cables are used. A large system of N QPACE cabinets could be operated as a single partition of 2N × 16 × 8 nodes.

Input and output operations are implemented by a Gigabit Ethernet tree network. Each node card is an end-point of this tree and connected to one of six cabinet-level switches, each of which has one or more 10-Gigabit Ethernet uplinks depending on bandwidth requirements. When QPACE is deployed we expect that lattices of size 48^3 × 96 will be typical. A gauge field configuration for such a lattice size is about 6 GBytes, so the available I/O bandwidth should allow us to read or write the database in O(10) seconds.

The power consumption of a single node card is expected to be less than 150 Watts. A single QPACE cabinet would therefore consume about 35 kWatts. This translates into a power efficiency of about 1.5 Watts/GFlops. A liquid cooling system is being developed in order to reach the planned packaging density.
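The cabinet-level numbers quoted above follow from the per-node figures; the short sketch below redoes the arithmetic, using 102.4 GFlops DP peak per node from Sec. III, 256 nodes per cabinet, and the quoted bound of 150 Watts per node card (which is why the 35 kWatts and 1.5 Watts/GFlops figures are rounded values rather than exact products).

    #include <stdio.h>

    int main(void) {
        const int    nodes_per_cabinet = 256;     /* 8 backplanes x 32 node cards        */
        const double dp_peak_per_node  = 102.4;   /* GFlops, PowerXCell 8i (Sec. III)    */
        const double watts_per_node    = 150.0;   /* upper bound quoted for a node card  */

        double cabinet_tflops = nodes_per_cabinet * dp_peak_per_node / 1000.0;  /* ~26 TFlops */
        double cabinet_kwatts = nodes_per_cabinet * watts_per_node   / 1000.0;  /* ~38 kW     */

        printf("cabinet peak:  %.1f TFlops (paper: about 25 TFlops)\n", cabinet_tflops);
        printf("cabinet power: <= %.1f kW (paper: about 35 kW)\n", cabinet_kwatts);
        printf("power efficiency: %.2f Watts/GFlops\n",
               watts_per_node / dp_peak_per_node);                              /* ~1.5       */
        return 0;
    }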

VI. QPACE SOFTWARE

The QPACE nodes will be operated using Linux, which runs on the PPE. As on most other processor platforms, the operating system cannot be started directly after system start. Instead the hardware is first initialized using a host firmware. The QPACE firmware will be based on Slimline Open Firmware [11]. Start-up of the system is controlled by the microprocessor on the root card.

Efficient implementation of applications on the Cell processor is more difficult than on more standard processors. For optimizing large application codes on a Cell processor the programmer has to face a number of challenges. For instance, the data layout has to be chosen carefully to maximize utilization of the memory interface. Optimal use of the on-chip memory to minimize external memory accesses is mandatory. The overall performance of the program will furthermore depend on the fraction of the code that is parallelized on-chip.

To relieve the programmer from the burden of these porting efforts we apply two strategies. In typical LQCD applications almost all cycles are spent in just a few kernel routines such as Eq. (1). For a number of such kernels we will therefore provide highly optimized assembly implementations, possibly with the aid of an assembly generator. Secondly, we will leverage the work of the USQCD collaboration [12].

This collaboration has performed pioneering work in defining and implementing software layers with the goal of hiding hardware details. In such a framework LQCD applications can be built in a portable way on top of these software layers. For QPACE our goal is to implement the QMP (QCD Message Passing) and at least parts of the QDP (QCD Data-Parallel) application programming interfaces.

The QMP interface comprises all communication operations required for LQCD applications. It relies on the fact that in LQCD applications communication patterns are typically regular and repetitive, with data being sent between adjacent nodes in a torus grid.⁸ QMP would be implemented on top of a small number of low-level communication primitives which, e.g., trigger the transmission of data via one particular link, initiate the receive operation, and allow to wait for completion. The QDP interface includes operations on distributed data objects.

⁸ An API as general as MPI is therefore not required.
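As an illustration of how a QMP-like layer could sit on top of such link-level primitives, the sketch below wraps per-link send and receive triggers in a message handle with start/wait semantics. All names here (the link_* helpers and the qmp_like_* wrappers) are invented for illustration and are not the actual USQCD QMP API.

    #include <stdio.h>
    #include <stddef.h>

    /* Hypothetical link-level primitives (illustration only, cf. Sec. V). */
    static void link_send(int link, const void *buf, size_t n)
    { printf("link %d: send %zu bytes\n", link, n); (void)buf; }
    static void link_recv_start(int link, void *buf, size_t n)
    { printf("link %d: receive %zu bytes\n", link, n); (void)buf; }
    static void link_wait(int link) { (void)link; }

    /* A QMP-like message handle for one nearest-neighbor transfer. */
    typedef struct { int link; void *buf; size_t n; int is_send; } qmp_like_msg;

    static void qmp_like_start(qmp_like_msg *m)
    {
        if (m->is_send) link_send(m->link, m->buf, m->n);
        else            link_recv_start(m->link, m->buf, m->n);
    }

    static void qmp_like_wait(qmp_like_msg *m) { link_wait(m->link); }

    int main(void)
    {
        static double halo[128];
        /* Regular, repetitive pattern: the message is declared once and reused. */
        qmp_like_msg send = { /*link*/ 0, halo, sizeof(halo), /*is_send*/ 1 };
        qmp_like_start(&send);
        qmp_like_wait(&send);
        return 0;
    }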

VII. CONCLUSION AND OUTLOOK

We have presented an overview of the architecture of a novel massively parallel computer that is optimized for applications in lattice QCD. This machine is based on the PowerXCell 8i processor, an enhanced version of the Cell Broadband Engine which has recently become available and provides support for efficient double precision calculations. Based on a detailed analysis of the requirements of lattice QCD applications we developed a set of performance models. These have allowed us to investigate the expected performance of these applications on the target processor. A relevant subset of these models has been tested on real hardware using the standard Cell Broadband Engine processor. Our conclusion is that for the most relevant application kernels a sustained performance of more than 20% can be achieved.

While the PowerXCell 8i processor turned out to be a suitable device for our target application, a custom network is needed to interconnect a larger number of processors. In QPACE we will use FPGAs to implement a dedicated interface to 6 high-speed network links which connect each processor to its nearest neighbors in a 3-d mesh. While this approach turned out to be very promising in the case of the QPACE architecture, we would like to add some cautious remarks. For almost all relevant (multi-core) processors, the I/O interface to which an external network processor can be attached is highly non-trivial. The implementation of such an interface on an FPGA is likely to be a significant technical challenge. With respect to future developments it also remains to be seen whether FPGAs will be able to cope with increasing bandwidth requirements.

Another advantage of using the PowerXCell 8i processor is its low ratio of power consumption to floating point performance. For QPACE we expect a figure as low as 1.5 Watts/GFlops.

The ambitious goal of the QPACE project is to complete hardware development by the end of 2008 and to start manufacturing and deployment of larger systems beginning in 2009. By the middle of 2009 the machines should be fully available for research in lattice QCD.

[1] K. G. Wilson, Confinement of quarks, Phys. Rev. D 10 (1974) 2445
[2] P. A. Boyle et al., Overview of the QCDSP and QCDOC computers, IBM J. Res. & Dev. 49 (2005) 351
[3] F. Belletti et al., Computing for LQCD: apeNEXT, Computing in Science & Engineering 8 (2006) 18
[4] A. Gara et al., Overview of the Blue Gene/L system architecture, IBM J. Res. & Dev. 49 (2005) 195
[5] H. P. Hofstee et al., Cell Broadband Engine technology and systems, IBM J. Res. & Dev. 51 (2007) 501
[6] http://www.ibm.com/developerworks/power/cell
[7] S. Williams et al., The Potential of the Cell Processor for Scientific Computing, Proc. of the 3rd Conference on Computing Frontiers (2006) 9
[8] A. Nakamura, Development of QCD-code on a Cell machine, PoS (LAT2007) 040
[9] F. Belletti et al., QCD on the Cell Broadband Engine, PoS (LAT2007) 039
[10] G. Bilardi et al., The Potential of On-Chip Multiprocessing for QCD Machines, Springer Lecture Notes in Computer Science 3769 (2005) 386
[11] http://www.openbios.info/SLOF
[12] http://www.usqcd.org/software.html
