Code optimization for Cell BE - Opportunities for ABINIT

Timo Schneider, Toni Volkmer and Wolfgang Rehm
Technical University of Chemnitz, Germany
{timos,tovo,rehm}@cs.tu-chemnitz.de

Torsten Hoefler
Open Systems Laboratory, Indiana University, Bloomington, IN 47404, USA
[email protected]

Heiko Schick
IBM Deutschland Entwicklung GmbH, Germany
[email protected]
Abstract
The Cell BE is a new processor architecture. In order to obtain maximum performance, it is necessary to adapt the code to the underlying hardware architecture. This means addressing issues of vectorization, memory alignment and communication between main memory and the Local Stores. The goal of this work is to study the performance potential of the Cell BE by implementing selected parallel compute kernels of ABINIT, a widely used open source code for ab initio electronic structure calculations. We identified through profiling that only 2% of ABINIT's code accounts for over 80% of its run time. The most important parts turned out to be double precision dense matrix multiplication and double precision Fourier transformations. We therefore implemented a new high-performance double precision dense matrix multiplication routine for the Cell BE; thus, our work may serve as a guideline for implementing similar algorithms. With FFTW3 there is a very efficient FFT library available which supports the SPEs in Cell systems. We managed to make use of this library in ABINIT, which also led to a reduced runtime.1

1 This research is supported by the Center for Advanced Studies (CAS) of the IBM Böblingen Laboratory as part of the NICOLL Project.
1 Introduction

The Cell BE architecture is, due to its distributed-memory parallelism, a challenge for programmers and algorithm designers. Different programming models have been proposed for this architecture. The simplest of those models is a library-based accelerator approach where compute-intensive application kernels have to be identified and replaced by Cell-optimized implementations. This methodology is suitable for optimizing legacy applications for the architecture. In this work, we demonstrate the library-based optimization of a large and feature-rich quantum mechanical code for the Cell BE architecture. We identify two main kernels and replace them with library calls to Cell-optimized implementations. In order to do this, we propose a new implementation for dense matrix multiplication and leverage a Fast Fourier Transformation library for the Cell BE. The following section describes key properties of the Cell architecture, followed by an introduction of the optimized application ABINIT.

1.1 The Cell Broadband Engine Architecture

The Cell BE architecture [21] is a multicore microprocessor with a general purpose core, called Power Processing Element (PPE), and multiple vector co-processing elements, called Synergistic Processing Elements (SPEs). The PPE is a stripped-down general purpose core that administers the SPEs, which handle the computational workload. It is easy to run conventional software on the PPE due to its compatibility with the PPC64 architecture. The PPE is connected to the system memory and the SPEs via the Element Interconnect Bus (EIB), a high-bandwidth circular data bus. Each SPE hosts some local memory, called Local Store (LS), a Synergistic Execution Unit (SXU) and a Memory Flow Controller (MFC) which connects the SPE to the EIB. The MFC operates independently of the SPU, so memory transactions can be overlapped with computations. Figure 1 provides an overview of the Cell BE architecture.
Figure 1. Cell BE Architecture

Let us take a closer look at the SPEs now, as the performance of our implementation depends solely on our ability to use their full potential. SPEs are special cores: they are meant to fill the gap between standard desktop PC cores and special number-crunching devices like the Graphics Processing Units (GPUs) in graphics cards. The SPEs cannot access the main memory directly; they can only operate on their Local Store, which is capable of holding 256 KiB of data. One should not think of the LS as of a cache in a standard CPU, because it is not updated automatically or transparently to the running process. To get new data into the LS, one has to use the Memory Flow Controller to issue a DMA PUT or GET transfer. The DMA transfers from and to SPUs have to be 16 byte aligned; however, they yield the best performance when multiples of 128 byte are transferred.
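To make this mechanism concrete, the following SPE-side sketch issues a DMA GET through the MFC and blocks until the transfer has completed. It is only an illustration of the idea described above: the buffer size, the tag number and the effective address ea handed over by the PPE are invented for the example.

    #include <spu_mfcio.h>

    /* Local Store target buffer; DMA transfers must be at least 16-byte
     * aligned, and 128-byte alignment gives the best performance. */
    static double buf[512] __attribute__((aligned(128)));

    void fetch_block(unsigned long long ea)  /* main-memory address passed in by the PPE */
    {
        const unsigned int tag = 1;          /* arbitrary MFC tag group (0..31) */

        mfc_get((void *)buf, ea, sizeof(buf), tag, 0, 0);  /* enqueue the DMA GET       */
        mfc_write_tag_mask(1 << tag);                      /* select this tag group     */
        mfc_read_tag_status_all();                         /* block until it completed  */

        /* buf[] now holds 4 KiB of data and can be processed by the SXU. */
    }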
The Synergistic Execution Unit is a vector processor which operates on 128 registers, each 128 bit wide. That means that when working with double precision floating point numbers (which are 8 byte wide), we can do two identical operations simultaneously if we manage to pack our input data into a single vector. Unfortunately, the first generation of Cell BE processors is very slow at double precision arithmetic (1.83 GFlop/s per SPE [22], which is 14 times lower than the single precision performance), but this improves with future generations of this chip. The performance cited above can only be reached when fused multiply-add instructions are used. These instructions perform the operation c := a * b + c or similar and therefore count as two floating point operations (FLOP). As all double precision arithmetic instructions need the same number of clock cycles, these instructions yield the best floating point operations per second (Flop/s) ratio.

1.2 ABINIT

The software package ABINIT [12] is used to perform ab initio electronic structure calculations by solving the time-independent Schrödinger equation

Ĥtot φ = Etot φ

with a conjugate gradient method, using the density functional theory (DFT) of Hohenberg and Kohn [15] and Kohn and Sham [18] to construct the Hamiltonian operator Ĥtot. The solution φ, the wavefunction of the system, describes the state of all electrons and nuclei in the system, while Etot is the total energy of this state. The code is the object of an ongoing open software project of the Université Catholique de Louvain, Corning Incorporated, and other contributors. ABINIT mostly aims at solid state research. Periodic boundary conditions are applied and the majority of the integrals that have to be calculated are represented in reciprocal space (k-space). The interactions between valence electrons and ionic cores are modeled by pseudopotentials, and the electronic wave-functions are expanded in a set of plane waves.
ABINIT is free software which can be redistributed under the terms of the GPL license. It is mainly written in FORTRAN and consists of 240,000 lines of code.
2 Related Work

Others have already successfully ported a wide range of applications to the Cell BE architecture, for example a real-time ray-caster [20] or financial market applications [8]. ABINIT's performance and scalability on cluster systems was evaluated in [13]. One of ABINIT's performance-critical routines, the three-dimensional Fourier transformation, has been analyzed and optimized for cluster systems by leveraging non-blocking collective operations in [14].
3 Optimization Principles

Since ABINIT has a very large code base and relies on quite complex theories from the field of physics, we knew from the beginning of the project that it would be impossible to rewrite the whole program for the Cell BE architecture. Instead we had to take a different approach: we focused on small and understandable compute kernels and optimized those.
3.1 Profiling ABINIT
Profiling the application shows that only 2% of ABINIT's code accounts for over 80% of its run time. The most important parts turned out to be ZGEMM, a double precision complex matrix multiplication routine from BLAS, which contributed about 25% of the total run time in our test runs, and the Fourier transformations, whose routines contribute another 25% of the total run time. Another important function is the ABINIT-internal application of the non-local Hamiltonian operator, which is realized in the routines opernl4*. Together these contribute about 35% of ABINIT's run time. Similar profiling results have been obtained in [13] with an older version of ABINIT, with the exception that ZGEMM was not used in that version, since it is leveraged by the LOBPCG eigensolver [17], which ABINIT did not use back then.
3.2 Our Acceleration Model
As ABINIT can run unmodified on the PPE, we can simply substitute individual performance-critical functions with optimized ones that are able to leverage the SPEs for the computation. This approach is quite simple, as it does not require any background knowledge about the relationship of the different parts of ABINIT. But it also has a serious drawback: some actions, like creating a PPE pthread for each used SPE and the creation and destruction of the SPE contexts, have to be performed every time an optimized function is executed. This overhead can slow down the measured overall performance if the input dataset is very small and the function is called very often.
Figure 2. Fundamentals of our acceleration model: the original code allocates its data and calls FUNCTION(A) unchanged; our contribution, function.o on the PPE, copies the parameters, starts the SPE tasks and waits for their completion, while each SPE gets its input, does the calculations and writes back its part of the result.
To measure this startup overhead, we implemented a function which executes an SPE program in the same way our optimized functions do. The only difference is that the executed SPE program does not do anything besides waiting until all used SPEs are running (which is ensured via fast SPE-to-SPE communication) and handing the control flow back to the PPE, which just destroys the SPE contexts after all SPEs have completed their little task. This is done 10^5 times. The overall time for all iterations is then divided by 10^5, which gives a good lower bound for the minimal overhead induced by SPE context creation and destruction on the PPE, including the overhead of the PPE function call. We found that this overhead increases almost linearly as the number of SPEs involved in this micro-benchmark rises. See Figure 3 for the detailed results.
Figure 3. SPE context creation/destruction overhead benchmark
On the one hand, this benchmark clearly shows that our acceleration model has its limits, especially for compute kernels working on a small set of input data or consisting of algorithms of low computational complexity. Therefore this overhead should be tackled, especially when it comes to future generations of Cell CPUs, which will have more SPEs.
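A minimal sketch of such a micro-benchmark, based on the libspe2 interface, is shown below. It is not the original benchmark code: the embedded SPE program handle noop_spu (any SPE program that returns immediately) and the iteration count are assumptions, and error handling is omitted.

    #include <libspe2.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    extern spe_program_handle_t noop_spu;   /* hypothetical "do nothing" SPE program */

    static void *run_spe(void *arg)
    {
        (void)arg;
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_stop_info_t stop;

        spe_context_ptr_t ctx = spe_context_create(0, NULL); /* context creation ...  */
        spe_program_load(ctx, &noop_spu);
        spe_context_run(ctx, &entry, 0, NULL, NULL, &stop);  /* ... run ...           */
        spe_context_destroy(ctx);                            /* ... and destruction   */
        return NULL;
    }

    int main(void)
    {
        enum { NSPE = 8, ITER = 1000 };
        pthread_t t[NSPE];
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITER; i++) {
            for (int s = 0; s < NSPE; s++) pthread_create(&t[s], NULL, run_spe, NULL);
            for (int s = 0; s < NSPE; s++) pthread_join(t[s], NULL);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
        printf("avg. startup overhead: %f ms\n", ms / ITER);
        return 0;
    }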
4 Dense Matrix Multiplication
Basic Linear Algebra Subprograms (BLAS) is a widely used application programming interface for libraries that perform basic dense linear algebra operations such as matrix multiplication. It was first published in 1979 [19]. Highly optimized implementations of the BLAS interface have been developed by different vendors or groups for many architectures.
ZGEMM performs the following matrix-matrix operation on complex input matrices:
C := α · op(A) · op(B) + β · C
where op(A) specifies whether the normal, transposed or conjugate transposed version of the matrix is to be used. A, B and C are matrices of complex numbers and α and β are complex scalars. The FORTRAN interface is:

SUBROUTINE ZGEMM(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)

TRANSA and TRANSB contain the operator to be applied to matrix A and B as a single character, which can be n (normal), t (transposed) or c (transposed and conjugated). op(A) is an M by K matrix, op(B) is a K by N matrix, and C is an M by N matrix. Note that M, N and K refer to the matrices after the operators are applied, not to the original input matrices. ALPHA and BETA correspond to α and β in the equation above. LDA, LDB and LDC specify the leading dimension of the respective input matrices, so it is possible to apply ZGEMM to the top-left part of the input matrices only.
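For illustration, calling the FORTRAN ZGEMM routine from C could look like the following sketch. The symbol name zgemm_ and the pass-by-reference convention are assumptions about the usual FORTRAN name mangling (hidden string-length arguments are ignored here), so this is not a verbatim excerpt from ABINIT or our library.

    #include <complex.h>

    /* Fortran BLAS symbol; trailing underscore and pass-by-reference are
       typical for gfortran-style name mangling (an assumption here). */
    extern void zgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double complex *alpha, const double complex *a, const int *lda,
                       const double complex *b, const int *ldb,
                       const double complex *beta, double complex *c, const int *ldc);

    int main(void)
    {
        enum { M = 2, N = 2, K = 2 };
        /* Column-major storage, as expected by the FORTRAN interface. */
        double complex A[M*K] = {1, 2, 3, 4};
        double complex B[K*N] = {1, 0, 0, 1};
        double complex C[M*N] = {0, 0, 0, 0};
        double complex alpha = 1.0, beta = 0.0;
        int m = M, n = N, k = K, lda = M, ldb = K, ldc = M;

        /* C := alpha * A * B + beta * C  (no transposition) */
        zgemm_("n", "n", &m, &n, &k, &alpha, A, &lda, B, &ldb, &beta, C, &ldc);
        return 0;
    }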
We investigated works that use Strassen's or Winograd's algorithm to reduce the asymptotic complexity of the matrix multiplication [7]. However, those optimized algorithms work only with well-conditioned matrices, which we cannot guarantee in the general case. Thus, we chose to implement a traditional O(N^3) algorithm for our ZGEMM.

4.1 Our ZGEMM implementation

We had to apply two important concepts to be able to design a well-performing ZGEMM implementation: we partitioned the input data and distributed it among the Local Stores of the available SPEs to minimize memory latencies during the calculation, and we vectorized all calculations in order to exploit the SIMD architecture.
The input matrices A, B and C are stored in column major order, as they come from a program written in FORTRAN. Figure 4 illustrates the meaning of the different ZGEMM parameters which deal with the representation of the input matrices.
4.1.1 Data Partitioning

As the Local Store of an SPE is limited to 256 KiB, the goal should be to save space and memory transfers. A first idea was to load parts of a row of op(A) and a column of op(B) and to compute exactly one element of C. There are some problems with this: depending on the operator, the rows (or columns) of the matrices are stored sequentially in memory or scattered with a displacement (of LDx), forcing us to get each element separately. This would decrease performance, as the MFC operates best with memory chunks that are multiples of 128 byte in size.

A better idea is to load blocks instead of lines, and to perform small matrix-matrix multiplications instead of scalar products. This gives us independence from the operator: the decision whether rows or columns should be used in the scalar product of the matrix multiplications on the SPEs does not affect performance, as we have random access to the Local Store. Another advantage is the number of operations: for n elements which fit in each input buffer of our Local Store, O(n) multiply and add operations can be done with the scalar product, but O(√(n³)) = O(n^1.5) operations can be achieved with small matrix multiplications. Of course, with more operations on the same amount of local data the total number of memory transfers is reduced.
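As a plain scalar sketch of this blocking idea (not the actual SPE kernel), the routine below multiplies one bs × bs block of op(A) with one bs × bs block of op(B) and accumulates into a bs × bs block of C; with n = bs² elements per input buffer it performs bs³ = n^1.5 multiply-add operations, which is exactly the effect described above.

    #include <complex.h>

    /* Accumulate C_blk += A_blk * B_blk for bs x bs blocks stored in
     * column-major order (as delivered by the FORTRAN caller). */
    static void block_multiply(int bs, const double complex *A_blk,
                               const double complex *B_blk, double complex *C_blk)
    {
        for (int j = 0; j < bs; j++)           /* column of the C block   */
            for (int i = 0; i < bs; i++)       /* row of the C block      */
                for (int k = 0; k < bs; k++)   /* inner (dot product) dim */
                    C_blk[i + j * bs] += A_blk[i + k * bs] * B_blk[k + j * bs];
    }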
Figure 4. FORTRAN/ZGEMM Matrix representation
The tricky part is the operator: depending on whether it is normal or not, elements which are stored sequentially in memory can belong to one row or one column. As one result element is computed based on one row of op(A) and one column of op(B), we always have to consider the operators for our memory accesses.

4.1.2 Work Assignment

With our partitioning approach, each part of the result matrix can be computed independently from the corresponding block row of op(A) and block column of op(B). The blocks to be computed are simply distributed in a round-robin fashion among the SPEs. Figure 5 illustrates the assignment scheme for 6 SPEs: the shaded result block is computed using the shaded row in op(A) and the shaded column in op(B).
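The round-robin assignment can be expressed with simple index arithmetic. The sketch below, with invented names such as num_block_cols, only illustrates how an SPE could determine which blocks of C it owns; it is not taken from our implementation.

    /* Round-robin ownership of result blocks: block (i, j) of C belongs to
     * SPE ((i * num_block_cols + j) % num_spes). Each SPE walks over "its"
     * blocks and multiplies the corresponding block row of op(A) with the
     * block column of op(B). */
    void list_my_blocks(int spe_id, int num_spes,
                        int num_block_rows, int num_block_cols)
    {
        for (int i = 0; i < num_block_rows; i++) {
            for (int j = 0; j < num_block_cols; j++) {
                if ((i * num_block_cols + j) % num_spes == spe_id) {
                    /* DMA the blocks of op(A) row i and op(B) column j into the
                     * Local Store, multiply them and accumulate into C(i, j). */
                }
            }
        }
    }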
Figure 5. SPE Block assignment

We investigated the use of the PPE with an experimental implementation. The PPE has a theoretical peak performance of 6.4 GFlop/s. Our code spawns N threads on the PPE, each of which computes the same chunk of op(C) as an SPE does (theoretically, N = 4 should be optimal), using a PPC970-optimized BLAS implementation to perform the computation. Despite the given peak performance, we achieved only 1.7 GFlop/s with ATLAS on the PPE, which makes this partitioning scheme suboptimal. Thus, we did not include the PPE measurements in our benchmarks.

In Figure 6 you can see how the block-wise multiplication can be implemented in C, using the SPU intrinsics. Our code and the tests that were used to obtain the presented benchmark results can be fetched from http://files.perlplexity.org/zgemm.tar.gz.
    #define VPTR (vector double *)

    vector unsigned char high_dbl = { 0, 1, 2, 3, 4, 5, 6, 7,
                                     16,17,18,19,20,21,22,23};
    vector unsigned char low_dbl  = { 8, 9,10,11,12,13,14,15,
                                     24,25,26,27,28,29,30,31};
    vector double fre, fim, gre, gim, rre, rim, tre, tim;
    vector double sre = {0, 0}, sim = {0, 0};

    for (k = 0; k < klen; k++, aa += astp, bb += bstp) {
        fim = spu_shuffle(*(VPTR aa), *(VPTR (aa + astp)), low_dbl);
        gim = spu_shuffle(*(VPTR bb), *(VPTR bb), low_dbl);
        fre = spu_shuffle(*(VPTR aa), *(VPTR (aa + astp)), high_dbl);
        gre = spu_shuffle(*(VPTR bb), *(VPTR bb), high_dbl);
        tre = spu_msub(fim, gim, sre);
        tim = spu_madd(fre, gim, sim);
        sre = spu_msub(fre, gre, tre);
        sim = spu_madd(fim, gre, tim);
    }

    rre = spu_shuffle(sre, sim, high_dbl);
    rim = spu_shuffle(sre, sim, low_dbl);
    *(VPTR cc)       = spu_add(*(VPTR cc),       rre);
    *(VPTR (cc + 1)) = spu_add(*(VPTR (cc + 1)), rim);

Figure 6. Inner loop of the blockwise matrix multiplication, implemented in C
4.1.3 Vectorization
In our matrix multiplication, each element is a 128 bit complex number, consisting of 64 bit double precision floating point values for the real and the imaginary part. We can safely assume that only fused multiply-add operations are used, as two elements of each matrix are multiplied and added to the temporary scalar product. One multiply-add operation of complex numbers a and b added to y (y = y + a · b) is split up as follows for its real and imaginary parts:
y_re := y_re + a_re · b_re − a_im · b_im
y_im := y_im + a_re · b_im + a_im · b_re
This makes 4 fused multiply-add operations with 64 bit operands. With the SIMD ability of the SPU, two complex multiply-adds can be done instead of one. To use SIMD instructions, the real parts and imaginary parts have to be split and packed into separate registers. This can be done with the SPU shuffle instruction. Now the calculation can be done as described above, and the only thing left to do is to separate the real and imaginary parts into the result registers before we write back into C. One little obstacle remains: the fused multiply-subtract operation on the SPU, spu_msub(a, b, c), calculates a · b − c, but we would need c − a · b. To achieve this without adding further instructions to change the sign, the real part can be calculated as follows:
y_re := a_re · b_re − ((a_im · b_im) − y_re)
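The identity behind this reordering can be checked in plain scalar code: both functions below accumulate the real part of a complex multiply-add, once in textbook form and once in the two multiply-subtract form that maps onto spu_msub. The function names are ours, chosen for the example.

    #include <assert.h>

    /* Real part of y += a*b for complex numbers, written once in textbook
     * form and once in the form used on the SPU (two multiply-subtracts). */
    static double acc_re_textbook(double yre, double are, double aim,
                                  double bre, double bim)
    {
        return yre + are * bre - aim * bim;
    }

    static double acc_re_msub(double yre, double are, double aim,
                              double bre, double bim)
    {
        double t = aim * bim - yre;      /* corresponds to spu_msub(aim, bim, yre) */
        return are * bre - t;            /* corresponds to spu_msub(are, bre, t)   */
    }

    int main(void)
    {
        double r1 = acc_re_textbook(0.5, 1.0, 2.0, 3.0, 4.0);
        double r2 = acc_re_msub(0.5, 1.0, 2.0, 3.0, 4.0);
        assert(r1 == r2);    /* 0.5 + 3 - 8 = -4.5 in both cases */
        return 0;
    }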
4.2 Benchmarks

This section provides a performance evaluation of our implementation and a qualitative and quantitative comparison to BLAS implementations on other modern architectures.
Even though our implementation supports arbitrary matrices, we benchmarked square matrices to enable easy comparisons to other publications. We used ppu-gcc version 4.1.1 with the flags -O3 -mabi=altivec -maltivec to compile all PPE code and spu-gcc version 4.1.1 with -O3 for the SPE code. The Cell BE specific benchmarks were run on a 3.2 GHz IBM QS20 Cell Blade, which contains two Cell BE processors with 8 SPEs per processor and two 512 MiB RAM banks, and on a Playstation 3 running at 3.2 GHz with 200 MiB memory. Both systems run Linux 2.6 (with IBM patches applied).

In our first benchmark (Figure 7), we compare the performance of netlib.org's refblas ZGEMM with the IBM DGEMM implementation and our optimized implementation for different matrix sizes. (The current IBM BLAS implements no ZGEMM; we therefore used DGEMM for the comparison because of its similarity to ZGEMM.) The results show that our implementation performs very well on Cell BE CPUs. Even though we tried to tune refblas by using different numactl configurations (numactl controls which CPU uses which memory bank), we were not able to achieve more than one GFlop/s. This is due to the fact that the current compilers do not automatically generate code for the SPUs; thus, the refblas implementation used only the rather slow PPC core. We outperform the IBM DGEMM implementation by a large margin for all matrix sizes, and our code scales very well up to 16 SPUs. We can also reproduce similar performance on the specialized Playstation 3 (PS3) hardware (only 6 SPEs are accessible under Linux).

Figure 7. Performance Comparison

The current Cell BE chip's SPEs are capable of issuing one double precision arithmetic instruction every six clock cycles. This instruction needs another seven cycles until the result is available in the target register. But if we assume the execution of a very large number of data-independent double precision operations, we get a cycles-per-instruction (CPI) value of 6. Considering FMADD operations and a vector size of two, the theoretical peak performance of a single Cell BE CPU with 8 SPEs and a clock rate of 3.2 GHz is

Rpeak = (3.2 · 10^9 Hz · 8 SPE · 4 Flop/SPE) / 6 = 17.07 GFlop/s

This is the theoretical number; in practical tests (back-to-back execution of fused multiply-add instructions with no data dependencies) we were able to measure up to 14.5 GFlop/s. This number is commonly referred to as the Cell BE double precision peak performance [22].

Another optimization technique that has been proposed [6] is to overlap memory (DMA) accesses with computation. However, this increases the code complexity significantly. To evaluate the potential benefit, we removed all memory (DMA) accesses from our implementation to simulate the overlap. This invalidates the results, but it provides an upper bound for the performance gain due to overlap. Figure 8 shows the comparison to our implementation. Our experiments show that we could gain up to one GFlop/s with this overlap technique.
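A common way to realize such an overlap is double buffering on the SPE: while the MFC fills one Local Store buffer, the SXU works on the other. The following sketch outlines the idea with two MFC tag groups; the block size and the function names are assumptions for the example, not part of our ZGEMM code.

    #include <spu_mfcio.h>

    #define BLOCK_BYTES 16384   /* size of one input block (assumption) */

    static char buf[2][BLOCK_BYTES] __attribute__((aligned(128)));

    void process_stream(unsigned long long ea, int nblocks)
    {
        int cur = 0;

        /* Prefetch the first block. */
        mfc_get((void *)buf[cur], ea, BLOCK_BYTES, cur, 0, 0);

        for (int i = 0; i < nblocks; i++) {
            int nxt = cur ^ 1;

            /* Start fetching the next block while the current one is processed. */
            if (i + 1 < nblocks)
                mfc_get((void *)buf[nxt],
                        ea + (unsigned long long)(i + 1) * BLOCK_BYTES,
                        BLOCK_BYTES, nxt, 0, 0);

            /* Wait only for the current buffer's tag group, then compute on it. */
            mfc_write_tag_mask(1 << cur);
            mfc_read_tag_status_all();
            /* process_block((void *)buf[cur]);  -- hypothetical compute kernel */

            cur = nxt;
        }
    }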
Figure 8. Effects of overlapping the memory accesses with computation
Our next benchmark compares the Cell BE and our optimized implementation (currently the fastest available) with different modern High Performance Computing (HPC) architectures. We chose a variety of different systems to be able to evaluate the suitability of the Cell BE for scientific calculations, using ZGEMM as an example. We leveraged all available processing units (CPUs/cores) that share a common system memory (i.e., are in the same physical node); thus, we compare our multi-core Cell BE implementation with other multi-core BLAS implementations. The test systems and their peak floating point performances are described in the following: a node in Big Red has two dual-core PowerPC 970 MP processors (2.5 GHz) with 8 GB RAM per node; its peak performance (with FMADD) is 40 GFlop/s and we ran the IBM ESSL library. We used the Goto BLAS [16] library 1.19 on Odin, a dual-CPU dual-core Opteron running at 2 GHz with a peak performance of 16 GFlop/s, and on Sif, a dual-CPU quad-core 1.86 GHz Intel Xeon with 59.5 GFlop/s peak. The theoretically fastest tested system, Jeltz, has two quad-core Intel Xeon 3.0 GHz processors and a peak performance of 96 GFlop/s. Jeltz runs Mac OS X Tiger and we used the vendor-supplied Veclib for our experiments. The absolute performance results for all those systems are plotted in Figure 9.

Figure 9. Absolute efficiency of different BLAS implementations
Those absolute value comparisons do not allow any qualitative comparison between the different libraries. The main problem is the high variance in peak performance. To compare our implementation to other BLAS libraries, we normalized the measured performance to the peak performance of the respective architecture to get an estimate of how efficiently the floating point units are used. We expect a fairly high efficiency on the standard superscalar and cache-based architectures, due to the high spatial and temporal locality in matrix multiplication algorithms and decades of development. However, the Cell BE represents a completely new approach with its explicitly managed cache, the Local Store. In addition, the Cell architecture introduces extra overhead for loading the code to the SPUs. The relative performance results are presented in Figure 10. The highly optimized Goto BLAS implementation delivers the best performance on the available architectures. IBM's Engineering and Scientific Subroutine Library (ESSL) delivers good performance on PowerPC. Our implementation, which explores a new CPU architecture, performs very well in comparison to the well-established ones and even better than Apple's Veclib.

Figure 10. Relative efficiency of different BLAS implementations

Due to memory and CPU time limits, not all matrix sizes could be run on all systems (e.g., the PS3 had only 200 MiB). Our benchmarks show that the current generation Cell BE is not really suited to perform double precision floating point calculations, because it is largely outperformed by systems in the same and lower price range. However, the specialized low-cost Playstation 3 makes a big difference in this price-performance game, but its limited memory might be a big obstacle to scientific use.

5 Fast Fourier Transformation on Cell BE
ABINIT 5.4.4 contains two different FFT implementations written by Stefan Goedecker in 1993 and 2002 [10, 11], called goedecker and goedecker2002 in the following text. The goedecker2002 algorithm can be used with MPI. In addition, ABINIT supports the FFTW2 library [4]; this did not work out of the box for us, but we were able to write a small patch which fixes the problem.
Only the goedecker2002 algorithm can be used together with the LOBPCG [17] solver without further modifications. This is because only goedecker2002 uses a LOBPCG-compatible data layout when used with MPI in parallel. In the sequential case all FFT algorithms work.
5.1 Benchmark of different FFT libraries
To compare the performance capabilities of different FFT algorithms on the Cell architecture we employed benchFFT [4], an FFT benchmark suite. It contains and supports benchmarking a large number of FFT implementations such as ACML (AMD Core Math Library) [2], FFTW2, FFTW 3.1 [4, 9], Intel MKL (Intel Math Kernel Library) [5] and goedecker. A list of all supported FFT libraries can be obtained from [3]. The benchmarks have been carried out on an Opteron 244 (1.8 GHz) machine and an IBM QS20 Cell Blade.
By the time of writing, only FFTW3.2-alpha3 could leverage the SPEs during a three dimensional fast Fourier transformation. Therefore we use the term FFTW3 as a synonym for FFTW3.2-alpha3 in the following text. As we are interested in finding possible candidates for FFT algorithms in ABINIT on Cell, we did not benchmark algorithms which are only available for x86 architectures (like ACML or Intel MKL) or which do not support a three dimensional FFT. As can be seen in Figure 11, FFTW3.2-alpha3 is the fastest algorithm on the Opteron. On the Cell's PPE (Figure 12), FFTW3.2-alpha3, ffte and goedecker are the fastest available algorithms, depending on the problem size. If one allows the usage of the SPEs, FFTW3.2-alpha3 is by far the fastest algorithm available (see Figure 13).

Figure 11. 3D-FFT on Opteron 244 (1.8 GHz)

Figure 12. 3D-FFT on IBM BladeCenter QS20 (PPE only)
Figure 13. 3D-FFT on IBM BladeCenter QS20, FFTW3 (with SPE support)
5.2 ABINIT and FFTW3
We have shown that FFTW3 is the fastest FFT implementation available for the Cell BE. On the other hand, FFTW3, unlike the goedecker2002 implementation, performs a full transformation for every dimension. Due to the Nyquist-Shannon sampling theorem this is not necessary for all FFTs which occur in ab initio calculations. The optimized transformations which make use of this fact are illustrated in Figure 14; note that only the shaded part of the FFT box really has to be transformed. Further information about the 3D FFT in ABINIT can be found in [14]. To evaluate the benefit of FFTW3 over the goedecker2002 implementation, we therefore performed benchmarks of ABINIT using different FFT implementations. For those benchmarks we analysed the wall clock times of ABINIT runs on an input file defining a unit cell consisting of 108 aluminum atoms.
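For reference, setting up a complex 3D transform with FFTW3 looks roughly as sketched below; whether the SPEs are actually used is decided internally by the library at planning time and depends on how FFTW3.2-alpha3 was configured for the Cell. The edge length of 54 merely mirrors the FFT box used in our benchmark.

    #include <fftw3.h>

    int main(void)
    {
        const int n = 54;                       /* FFT box edge, as in our benchmark */
        fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n * n * n);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n * n * n);

        /* Planning with FFTW_MEASURE lets FFTW pick the fastest code path
           available on this machine (on Cell this may include SPE kernels). */
        fftw_plan plan = fftw_plan_dft_3d(n, n, n, in, out,
                                          FFTW_FORWARD, FFTW_MEASURE);

        fftw_execute(plan);                     /* transform in -> out */

        fftw_destroy_plan(plan);
        fftw_free(in);
        fftw_free(out);
        return 0;
    }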
Figure 14. 3D-FFT with zero padding (before transformation; transformation in x; transformation in y; transformation in z)
5.2.1 Benchmarks
For the final benchmarks we installed FFTW3.2-alpha3 and ABINIT 5.4.4 on an IBM QS20 Blade. The following compiler flags were used:

CC="ppu-gcc"
CFLAGS="-O3 -fomit-frame-pointer -mcpu=cell"
CXXFLAGS="-O3 -fomit-frame-pointer -mcpu=cell"
FC="gfortran -m64 -O3 -fomit-frame-pointer"
FFLAGS=""
The input file which was used for the benchmarks specifies a 54^3 FFT box. All calculations have been done on one PPE with 8 SPEs. For this benchmark we varied the fftalg parameter, which means we used a number of different FFT algorithms. The wfoptalg parameter was set to 4, which means that the LOBPCG solver was used. A complete list of ABINIT parameters can be obtained from [1].
fftalg   Algorithm                      Wallclock Time
100      goedecker                      2349.6
300      FFTW3                          1950.4
400      goedecker2002                  2282.7
401      goedecker2002, zero padding    2294.5

If one compares the execution time of fftalg=400 (goedecker2002) with that of fftalg=300 (FFTW3), a performance increase of over 14% becomes visible. Please note that the possible performance improvements heavily depend on the size of the FFT box used in the calculation. We used a rather small one in this example (54^3) to demonstrate that even in that case there is a benefit over the standard algorithms.

5.3 Conclusion and Future Work

Since scientific simulations heavily rely on optimized linear algebra functions, we presented in this article an optimized ZGEMM implementation for the IBM Cell BE processor. As a part of the BLAS package, the ZGEMM routine performs a complex matrix-matrix multiplication. We discussed the strategies to distribute data and to exploit the double precision floating point elements of the SPEs. The benchmarks showed that our ZGEMM algorithm achieves up to 70% of the peak performance and scales linearly from 1 to 16 SPEs. We assume that our code will also perform well on the next generation Cell BE, which supports a fully pipelined double precision unit that does not stall for 6 cycles after every instruction. We compared the algorithm with the IBM DGEMM implementation since there is no ZGEMM implementation available for Cell.

Our ZGEMM implementation shows the best performance of all publicly available ZGEMM or DGEMM implementations for the Cell BE. Thus, our work may serve as a guideline for implementing similar algorithms. With FFTW3 there is a very efficient FFT library available which supports the SPEs in Cell systems. We made it possible to use this library together with ABINIT, which led to a reduced runtime (14% in our example). Currently we focused on the sequential variant of ABINIT, but as the FFTW3 library comes with MPI support and employs a data layout similar to the one used in the goedecker2002 implementation, it should be feasible to achieve a similar performance improvement in the parallel case. This could be evaluated in further studies. Another approach could be to port the goedecker2002 implementation to the Cell architecture. This could prove to be more difficult, as the goedecker algorithms are hardly documented and specifically crafted for ABINIT.

References
[1] ABINIT documentation . http://www.abinit.org/documentation/. [2] ACML: AMD Core Math Library . http://developer.amd.com. [3] benchFFT: Benchmarked FFT Implementations . http://www.fftw.org/benchfft/ffts.html. [4] FFTW homepage . http://www.fftw.org. [5] Intel Math Kernel Library . http://www.intel.com. [6] T. Chen, R. Raghavan, J. N. Dale, and E. Iwata. Cell broadband engine architecture and its first implementation A performance view. IBM Journal of Research and Development, 51:559–572, 2007. [7] C. C. Douglas, M. Heroux, G. Slishman, and R. M. Smith. GEMMW: a portable level 3 BLAS winograd variant of strassen’s matrix-matrix multiply algorithm. J. Comput. Phys., 110(1):1–10, 1994. [8] J. Easton, I. Meents, O. Stephan, H. Zisgen, and S. Kato. Porting financial markets applications to the cell broadband engine architecture. 6 2007. [9] M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216–231, 2005. special issue on “Program Generation, Optimization, and Platform Adaptation”. [10] S. Goedecker. Fast radix 2, 3, 4, and 5 kernels for fast Fourier transformations on computers with overlapping multiply-add instructions. SIAM Journal on Scientific Computing, 18(6):1605–1611, 1997. [11] S. Goedecker, M. Boulet, and T. Deutsch. An efficient 3-dim FFT for plane wave electronic structure calculations on massively parallel machines composed of multiprocessor nodes. Computer Physics Communications, 154(2):105–110, Aug. 2003.
[12] X. Gonze, J.-M. Beuken, R. Caracas, F. Detraux, M. Fuchs, G.-M. Rignanese, L. Sindic, M. Verstraete, G. Zerah, F. Jollet, M. Torrent, A. Roy, M. Mikami, P. Ghosez, J.-Y. Raty, and D. C. Allan. First-principles computation of material properties: the ABINIT software project. Computational Materials Science, 25:478–493, 2002. [13] T. Hoefler, R. Janisch, and W. Rehm. A Performance Analysis of ABINIT on a Cluster System. In Parallel Algorithms and Cluster Computing, pages 37–51. Springer, Lecture Notes in Computational Science and Engineering, 12 2005. [14] T. Hoefler and G. Zerah. Transforming the highperformance 3d-FFT in ABINIT to enable the use of nonblocking collective operations. Technical report, Commissariat a l’Energie Atomique - Direction des applications militaires (CEA-DAM), 2 2007. [15] P. Hohenberg and W. Kohn. Inhomogeneous Electron Gas. Physical Review, 136:B864, 1964. [16] G. K. and G. R. On reducing TLB misses in matrix multiplication. Technical report tr-2002-55, The University of Texas at Austin, Department of Computer Sciences, 2002. [17] A. KNYAZEV. Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method. SIAM journal on scientific computing(Print), 23(2):517–541, 2002. [18] W. Kohn and L. Sham. Self-Consistent Equations Including Exchange and Correlation Effects. Physical Review, 140(4A):1133–1138, 1965. [19] C. L. Lawson, R. J. Hanson, D. Kincaid, and F. T. Krogh. Basic Linear Algebra Subprograms for FORTRAN usage. In In ACM Trans. Math. Soft., 5 (1979), pp. 308-323, 1979. [20] B. Minor, G. Fossum, and V. To. Terrain rendering engine (tre): Cell broadband engine optimized real-time ray-caster. [21] J. C. R. and B. D. A. Introduction to the cell broadband engine architecture. IBM Journal of Research and Development, 51:503–519, 2007. [22] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, and K. Yelick. The potential of the cell processor for scientific computing. In CF ’06: Proceedings of the 3rd conference on Computing frontiers, pages 9–20, New York, NY, USA, 2006. ACM.