Implementing a Linux Cluster

A project report submitted in partial fulfilment of the requirements for the award of the degree of Bachelor of Technology in Computer Science and Engineering

by
Deepak Lukose
Roll No: 16, Group No: 8, S8 CSE

Department of Computer Engineering
National Institute of Technology, Calicut
Kerala - 673601
2006
Certificate
This is to certify that the project entitled “Implementing a Linux Cluster” is a bonafide record of the project presented by Deepak Lukose (Y2.025) under our supervision and guidance. The project report has been submitted to the Department of Computer Engineering of the National Institute of Technology, Calicut in partial fulfilment of the requirements for the award of the Degree of Bachelor of Technology in Computer Science and Engineering.
Dr. M.P. Sebastian Professor and Head Dept.of Computer Engineering NIT Calicut
Mr. Vinod Pathari Lecturer Dept.of Computer Engineering NIT Calicut
Abstract
A computer cluster is a group of loosely coupled computers that work together closely, so that in many respects it can be viewed as though it were a single computer. Clusters are commonly connected through fast local area networks and are usually deployed to improve speed and/or reliability over that provided by a single computer, while typically being much more cost-effective than single computers of comparable speed or reliability. Clusters built from open source software, particularly those based on the GNU/Linux operating system, are increasingly popular. Their success is not hard to explain: they can cheaply solve an ever-widening range of number-crunching applications, and a wealth of open source or free software has emerged to make it easy to set up, administer, and program these clusters. This work aims at implementing such a free and open source cluster for performing scientific computations at a faster pace.
Acknowledgements
I express my sincere thanks to Mr. Vinod Pathari for his constant backing and support. I would like to extend my gratitude to the entire faculty and staff of the CSE Department of NITC, who stood by me in all the difficulties I had to face during the completion of this project. Last but not least, I thank God Almighty for being the guiding light all throughout.
Deepak Lukose
Contents

1 Introduction
  1.1 Problem Definition
  1.2 Cluster history
  1.3 High-performance (HPC) clusters
  1.4 Cluster technologies
2 Motivation
3 Design
  3.1 Design decisions
    3.1.1 Mission
    3.1.2 Operating System
    3.1.3 Cluster Software
    3.1.4 Hardware
4 Cluster Installation
  4.1 OSCAR
  4.2 Packages
  4.3 Installation Strategy
5 Parallel Programming
  5.1 Hello World
    5.1.1 MPI_Init
    5.1.2 MPI_Finalize
    5.1.3 MPI_Comm_size
    5.1.4 MPI_Comm_rank
    5.1.5 MPI_Get_processor_name
  5.2 Numerical Integration
6 Molecular Dynamics Modeling of Thermal Conductivity of Engineering Fluids and its Enhancement due to Nanoparticle Inclusion
  6.1 Algorithm
  6.2 Profiling
7 Testing and Benchmarking
8 Conclusion
Bibliography
Chapter 1 Introduction
1.1 Problem Definition

To set up a cluster which can increase the performance of CPU-intensive tasks like scientific computations, rendering of graphics, etc. In short, to create a high-performance cluster for general purpose computing with special regard to the modeling and simulation of molecular systems.

1.2 Cluster history

The first commodity clustering product was ARCnet, developed by Datapoint in 1977. ARCnet was not a commercial success, and clustering did not really take off until DEC released its VAXcluster product in the 1980s for the VAX/VMS operating system. The ARCnet and VAXcluster products not only supported parallel computing, but also shared file systems and peripheral devices. They were supposed to give the advantage of parallel processing while maintaining data reliability and uniqueness. VAXcluster, now VMScluster, is still available on OpenVMS systems from HP running on Alpha and Itanium systems. The history of cluster computing is intimately tied to the evolution of networking technology. As networking technology has become cheaper and faster, cluster computers have become significantly more attractive.
1.3 High-performance (HPC) clusters

High-performance clusters are implemented primarily to provide increased performance by splitting a computational task across many different nodes in the cluster, and are most commonly used in scientific computing. One of the more popular HPC implementations is a cluster with nodes running Linux as the OS and free software to implement the parallelism. This configuration is often referred to as a Beowulf cluster. Such clusters commonly run custom programs which have been designed to exploit the parallelism available on HPC clusters. Many such programs use libraries such as the Message Passing Interface (MPI), which are specially designed for writing scientific applications for HPC computers.

1.4 Cluster technologies

The Message Passing Interface (MPI) is a computer communications protocol. It is a de facto standard for communication among the nodes running a parallel program on a distributed memory system. MPI is a library of routines that can be called from Fortran, C, C++ and Ada programs. MPI's advantage over older message passing libraries is that it is both portable (because MPI has been implemented for almost every distributed memory architecture) and fast (because each implementation is optimized for the hardware it runs on).

The GNU/Linux world sports various cluster software, such as:

Specialized application clusters - clusters built using specialized applications, e.g. Beowulf, distcc, MPICH and others. distcc provides parallel compilation when using GCC.

Director-based clusters - clusters that allow incoming requests for services to be distributed across multiple cluster nodes, e.g. Linux Virtual Server and Linux-HA.

Full-blown clusters - clusters that integrate into the kernel a mechanism for automatic process migration among homogeneous nodes, e.g. Mosix, openMosix, Kerrighed and OpenSSI. OpenSSI and Kerrighed are single-system image implementations.

DragonFly BSD, a recent fork of FreeBSD 4.8, is being redesigned at its core to enable native clustering capabilities. It also aims to achieve single-system image capabilities. MSCS is Microsoft's high-availability cluster service for Windows, based on technology developed by Digital Equipment Corporation. The current version supports up to eight nodes in a single cluster, typically connected to a SAN. A set of APIs supports cluster-aware applications, and generic templates provide support for applications that are not cluster-aware.

Grid computing is a technology closely related to cluster computing. The key difference between grids and traditional clusters is that grids connect collections of computers which do not fully trust each other, and hence operate more like a computing utility than like a single computer. In addition, grids typically support more heterogeneous collections than are commonly supported in clusters.
Chapter 2 Motivation
Most of the time, the computer is idle. Start a program like xload or top that monitors system use, and one will probably find that the processor's load is not even hitting the 1.0 mark. If one has two or more computers, chances are that at any given time at least one of them is doing nothing. Unfortunately, when we really do need CPU power - during a C++ compile, or while encoding Ogg Vorbis music files - we need a lot of it at once. The idea behind clustering is to spread these loads among all available computers, using the resources that are free on other machines. The basic unit of a cluster is a single computer, also called a “node”. Clusters can grow in size - they “scale” - by adding more machines. The power of the cluster as a whole depends on the speed of the individual computers and the speed of their interconnections. In addition, the operating system of the cluster must make the best use of the available hardware in response to changing conditions. This becomes more of a challenge if the cluster is composed of different hardware types (a “heterogeneous” cluster), if the configuration of the cluster changes unpredictably (machines joining and leaving the cluster), and if the loads cannot be predicted ahead of time.
Chapter 3 Design
Designing a cluster entails four sets of design decisions:
(1) Determine the overall mission of the cluster.
(2) Select a general architecture for the cluster.
(3) Select the operating system, cluster software, and other system software that will be used.
(4) Select the hardware for the cluster.

While each of these tasks, in part, depends on the others, the first step is crucial. If at all possible, the cluster's mission should drive all other design decisions. At the very least, the other design decisions must be made in the context of the cluster's mission and be consistent with it. Selecting the hardware should be the final step in the design, but often we won't have as much choice as we would like. A number of constraints may force us to select the hardware early in the design process; the most obvious is the budget. Defining what we want to do with the cluster is really the first step in designing it. For many clusters, the mission will be clearly understood in advance. This is particularly true if the cluster has a single use or a few clearly defined uses.
But it should be noted that clusters have a way of evolving. What may be a reasonable assessment of needs today may not be tomorrow. Good design is often the art of balancing today's resources with tomorrow's needs.

3.1 Design decisions

3.1.1 Mission
To set up a cluster which can increase the performance of CPU-intensive tasks like scientific computations, rendering of graphics, etc. In short, to create a high-performance cluster for general purpose computing with special regard to the modeling and simulation of molecular systems.
3.1.2 Operating System

Red Hat Enterprise Linux 4 (RHEL4) was selected because it is one of the platforms most widely supported by cluster software.
3.1.3 Cluster Software

OSCAR (Open Source Cluster Application Resources) is a software package for building high-performance Linux clusters for parallel/distributed computing. OSCAR works across several Linux distributions. For greater control over how the cluster is configured, one will be happier with OSCAR in the long run. OSCAR typically provides better documentation than other cluster kits like Rocks. OSCAR was chosen over a traditional Beowulf setup because of its ease of installation as well as its comprehensive package set, which includes many compilers and application programs. One of the main packages used with OSCAR is LAM/MPI, a popular implementation of the MPI parallel programming paradigm. LAM/MPI can be used for developing parallel programs and is one of the leading implementations available.
3.1.4 Hardware

The preferred hardware is P4 3.0 GHz machines with 2 GB DDR RAM on each node and 1 Gbps/100 Mbps NICs that support PXE (Preboot Execution Environment) for network booting. The head node must have at least 7 GB of hard disk capacity, and all other client nodes need a minimum of 5 GB.
Chapter 4 Cluster Installation
One of the more important developments in the short life of high performance clusters has been the creation of cluster installation kits such as OSCAR (Open Source Cluster Application Resources) and Rocks. With software packages like these, it is possible to install everything one needs and very quickly have a fully functional cluster. A fully functional cluster will have a number of software packages, each addressing a different need, such as programming, management, and scheduling.

4.1 OSCAR

OSCAR is a software package that is designed to simplify cluster installation.
A collection of open source cluster software, OSCAR includes everything that one is likely to need for a dedicated, high-performance cluster. OSCAR uses a best-in-category approach, selecting the best available software for each type of cluster-related task. One will often have several products to choose from for any given need. The design goals for OSCAR include using the best-of-class software, eliminating the downloading, installation, and configuration of individual components, and moving toward the standardization of clusters. OSCAR, it is said, reduces the need for expertise in setting up a cluster because OSCAR takes us completely through the installation of a cluster. In practice, it might be more fitting to say that OSCAR delays the need for expertise and allows us to create a fully functional cluster before mastering all the skills one will eventually need. In the long run, one will want to master those packages in OSCAR. OSCAR makes it very easy to experiment with packages and dramatically lowers the barrier to getting started. OSCAR is designed with high-performance computing in mind. So, unless one customizes the installation, the computer nodes are meant to be dedicated to the cluster.
4.2 Packages

OSCAR brings together a number of software packages for clustering. Most of the packages are also available as standalone packages. The main packages in OSCAR are:

Core: This is the core OSCAR package.

C3: The Cluster Command and Control tool suite provides a command-line administration interface. “cexec mkdir /opt/c3-4” will create the directory on all clients. Similarly, we can use commands like cget, ckill, cpush, crm and cshutdown.

Environmental Switcher: This is based on Modules, a Perl script that allows the user to make changes to the environment of future shells. For example, Switcher allows a user to change between MPICH and LAM/MPI. “switcher mpi = mpich-ch-p4-gcc-1.2.5.10” can be used to set the execution environment needed for the MPICH library.

SIS: The System Installation Suite is used to install the operating system on the clients.

Monitoring systems: Clumon, a web-based performance-monitoring system, and Ganglia, a real-time monitoring system and execution environment, are the packages included in this category.

MAUI: This job scheduler is used with openPBS.

openPBS: The Portable Batch System is a workload management system.

PVFS: The Parallel Virtual File System is a high-performance, scalable, parallel virtual file system.
Any high-performance cluster would be incomplete without programming tools. The OSCAR distribution includes:

LAM/MPI: This is one implementation of the Message Passing Interface (MPI) libraries.

MPICH: This is another implementation of the Message Passing Interface (MPI) libraries.

PVM: This package provides the Parallel Virtual Machine system, another message passing library.
4.3 Installation Strategy

With OSCAR, one first installs Linux (but only on the head node) and then installs OSCAR. The installations of the two are separate. This makes the installation more involved, but it gives us more control over the configuration of the system, and it is somewhat easier to recover when we encounter installation problems. And because the OSCAR installation is separate from the Linux installation, we are not tied to a single Linux distribution. OSCAR uses a system image cloning strategy to distribute the disk image to the compute nodes. With OSCAR it is best to use the same hardware throughout the cluster. OSCAR's thin client model is designed for diskless systems.
Chapter 5 Parallel Programming
5.1 Hello World

It is customary to start learning a new programming language with a “Hello world” program, so here also we will start with a “Hello World” program. The parallel version of the program (using LAM/MPI) is as follows:

#include "mpi.h"
#include <stdio.h>
int main(int argc, char *argv[])
{
    int  processId;                             /* rank of process     */
    int  noProcesses;                           /* number of processes */
    int  nameSize;                              /* length of name      */
    char computerName[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &noProcesses);
    MPI_Comm_rank(MPI_COMM_WORLD, &processId);
    MPI_Get_processor_name(computerName, &nameSize);

    fprintf(stderr, "Hello from process %d on %s\n", processId, computerName);

    MPI_Finalize();
    return 0;
}

This example introduces five MPI functions, declared in the header file for the MPI library, mpi.h, and provided when the MPI library is linked with the program. While this example uses C, similar libraries are available for C++ and FORTRAN. Four of these functions, MPI_Init, MPI_Comm_size, MPI_Comm_rank, and MPI_Finalize, are seen in virtually every MPI program.
5.1.1 MPI_Init

MPI_Init is used to initialize an MPI session. All MPI programs must have a call to MPI_Init. MPI_Init is called once, typically at the start of the program. One can have lots of other code before this call, or one can even call MPI_Init from a subroutine, but it must be called before any other MPI function. (There is an exception: the function MPI_Initialized can be called before MPI_Init. MPI_Initialized is used to see whether MPI_Init has previously been called.) In C, MPI_Init can be called with the addresses of argc and argv, as shown in the example. This allows the program to take advantage of command-line arguments. Alternatively, these addresses can be replaced with NULL.
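As a brief illustration of the exception mentioned above, the following minimal sketch (not part of the project code) uses MPI_Initialized to test whether MPI_Init has already been called, and passes NULL in place of the argument addresses:

#include "mpi.h"

int main(int argc, char *argv[])
{
    int flag = 0;

    MPI_Initialized(&flag);     /* the one call that is legal before MPI_Init */
    if (!flag)
        MPI_Init(NULL, NULL);   /* NULL may replace &argc and &argv */

    MPI_Finalize();
    return 0;
}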
5.1.2 MPI_Finalize

MPI_Finalize is called to shut down MPI. MPI_Finalize should be the last MPI call made in a program. It is used to free memory, etc. It is the user's responsibility to ensure that all pending communications are complete before a process calls MPI_Finalize. Every process must call MPI_Finalize.
5.1.3 MPI_Comm_size

This routine is used to determine the total number of processes running in a communicator (the communications group for the processes being used). It takes the communicator as its first argument and the address of an integer variable used to return the number of processes. For example, if one is executing a program using five processes and the default communicator, the value returned by MPI_Comm_size will be five, the total number of processes being used. This is the number of processes, but not necessarily the number of machines being used. In the hello world program, both MPI_Comm_size and MPI_Comm_rank used the default communicator, MPI_COMM_WORLD. This communicator includes all the processes available at initialization and is created automatically. Communicators are used to distinguish and group messages. As such, communicators provide a powerful encapsulation mechanism. While it is possible to create and manipulate one's own communicators, the default communicator will probably satisfy most needs.
5.1.4 MPI_Comm_rank

MPI_Comm_rank is used to determine the rank of the current process within the communicator. MPI_Comm_rank takes a communicator as its first argument and the address of an integer variable used to return the value of the rank. Basically, each process is assigned a different process number, or rank, within a communicator. Ranks range from 0 to one less than the size returned by MPI_Comm_size. For example, if one is running a set of five processes, the individual processes will be numbered 0, 1, 2, 3, and 4. By examining its rank, a process can distinguish itself from other processes. The values returned by MPI_Comm_size and MPI_Comm_rank are often used to divide up a problem among processes. Each individual process can examine its rank to determine its role in the calculation. For example, the process with rank 0 might work on the first part of the problem, the process with rank 1 on the second part, and so on. One can also divide up the problem differently. For example, the process with rank 0 might collect all the results from the other processes for the final report rather than participate in the actual calculation. It is really up to the programmer to determine how to use this information.
5.1.5 MPI_Get_processor_name

MPI_Get_processor_name is used to retrieve the host name of the node on which the individual process is running. In the sample program, we used it to display host names. The first argument is an array to store the name and the second is used to return the actual length of the name. Each of the C versions of these five functions returns an integer error code. With a few exceptions, the actual code is left up to the implementers. Error codes can be translated into meaningful messages using the MPI_Error_string function. Here is an example of compiling and running the code:

[deepak@ssl1 ~]$ mpicc hello.c -o hello
[deepak@ssl1 ~]$ mpirun -np 5 hello
Messages received by process 0 on ssl1.
Greetings from process 1 on ssl3.deltaforce!
Greetings from process 2 on ssl1!
Greetings from process 3 on ssl3.deltaforce!
Greetings from process 4 on ssl1!
There is no apparent order in the output. The order will depend on the speed of the individual machines, the loads on the machines, and the speeds of the communications links. Unless one takes explicit measures to control the order of execution among processors, one should not make assumptions about the order of execution. When running the program, the user specifies the number of processes on the command line, and MPI_Comm_size provides a way to get that information back into the program. The next time, if one wants to use a different number of processes, one just changes the command line and the code takes care of the rest.
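Before moving on to the next example, the error-code translation mentioned in the previous section can also be sketched. The fragment below is not part of the project code; it simply checks the value returned by an ordinary MPI call and, if it is not MPI_SUCCESS, converts it into a readable message with MPI_Error_string:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    char message[MPI_MAX_ERROR_STRING];
    int  messageLength, errorCode, processId;

    MPI_Init(&argc, &argv);

    /* every C binding of an MPI routine returns an integer error code */
    errorCode = MPI_Comm_rank(MPI_COMM_WORLD, &processId);
    if (errorCode != MPI_SUCCESS) {
        MPI_Error_string(errorCode, message, &messageLength);
        fprintf(stderr, "MPI_Comm_rank failed: %s\n", message);
    }

    MPI_Finalize();
    return 0;
}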
5.2 Numerical Integration

This is a fairly standard problem for introducing parallel calculations because it can be easily decomposed into parts that can be shared among the computers in a cluster. Although in most cases it can be solved quickly on a single processor, the parallel solution illustrates all the basics one needs to get started writing MPI code. The reason this area problem is both interesting and commonly used is that it is very straightforward to subdivide: we can let different computers calculate the areas of different rectangles. Basically, MPI_Comm_size and MPI_Comm_rank are used to divide the problem among processors. MPI_Send is used to send the intermediate results back to the process with rank 0, which collects the results with MPI_Recv and prints the final answer. Here is the program:

#include "mpi.h"
#include <stdio.h>
/* problem parameters */
#define f(x)          ((x) * (x))
#define numberRects   50
#define lowerLimit    2.0
#define upperLimit    5.0

int main(int argc, char *argv[])
{
    /* MPI variables */
    int dest, noProcesses, processId, src, tag;
    MPI_Status status;

    /* problem variables */
    int    i;
    double area, at, height, lower, width, total, range;

    /* MPI setup */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &noProcesses);
    MPI_Comm_rank(MPI_COMM_WORLD, &processId);

    /* adjust problem size for subproblem */
    range = (upperLimit - lowerLimit) / noProcesses;
    width = range / numberRects;
    lower = lowerLimit + range * processId;

    /* calculate area for subproblem */
    area = 0.0;
    for (i = 0; i < numberRects; i++) {
        at = lower + i * width + width / 2.0;
        height = f(at);
        area = area + width * height;
    }

    /* collect information and print results */
    tag = 0;
    if (processId == 0) {        /* if rank is 0, collect results */
        total = area;
        for (src = 1; src < noProcesses; src++) {
            MPI_Recv(&area, 1, MPI_DOUBLE, src, tag, MPI_COMM_WORLD, &status);
            total = total + area;
        }
        fprintf(stderr, "The area from %f to %f is: %f\n",
                lowerLimit, upperLimit, total);
    } else {                     /* all other processes only send */
        dest = 0;
        MPI_Send(&area, 1, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
    }

    /* finish */
    MPI_Finalize();
    return 0;
}
In this example, we are calculating the area under the curve y = x^2 between x = 2 and x = 5. Since each process only needs to do part of this calculation, we need to divide the problem among the processes so that each process gets a different part and all the parts are accounted for. MPI_Comm_size is used to determine the number of parts the problem will be broken into, noProcesses. That is, we divide the total range (2 to 5) equally among the processes and adjust the start of the range for an individual process based on its rank. In the next section of code, each process calculates the area for its part of the problem. Then we need to collect and combine all the individual results. One process acts as a collector to which the remaining processes send their results. Using the process with rank 0 as the receiver is the logical choice; the remaining processes act as senders. A fair amount of MPI code development can be done on a single-processor system and then moved to a multiprocessor environment.
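As a quick check on the decomposition described above, the exact value of the integral that the program approximates (f(x) = x^2 between x = 2 and x = 5) is

    area = [x^3/3] evaluated from x = 2 to x = 5 = (125 - 8)/3 = 39,

so the total printed by the process with rank 0 should be close to 39 for any number of processes, since each process sums rectangles over a disjoint part of the range.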
Chapter 6 Molecular Dynamics Modeling of Thermal Conductivity of Engineering Fluids and its Enhancement due to Nanoparticle Inclusion
In the present study, an attempt is made to estimate the enhancement of the thermal conductivity of water by the suspension of nanoparticles, using the method of Molecular Dynamics simulation. This involves generating the atomic trajectories of a system of a finite number of particles by direct integration of the classical Newtonian equations of motion, with appropriate specification of interatomic potentials and application of suitable initial and boundary conditions. Initially a general simulation procedure is developed for the thermal conductivity of a liquid, and the procedure is validated against standard values. A few of the plausible theoretical models for the thermal conductivity of nanofluids have been selected and studied. Algorithms are written for simulating them, following the procedural steps of the Molecular Dynamics method. The thermal conductivity enhancements in the base fluid due to the suspension of nanoparticles, estimated using simulations of the models considered, are compared among themselves and with existing experimental results, and further investigated to select the model which matches best with a practical case of interest (metal oxide - water system). Parametric studies are conducted to study the variation of thermal conductivity enhancement with temperature, and the optimal dosing levels of nanofluids are also investigated. Further, an optimization of the simulation procedure and algorithms is attempted to bring out an efficient computation strategy.
6.1 Algorithm
main()
{
    initializePositions();
    initializeVelocities();

    for (time = 0; time < totalTimeSteps; time++) {
        velocityVerlet(dt);
        instantaneousTemperature();
        if (time % 200 == 0)
            rescaleVelocities();
        if (time > equilibrationtime)
            jt();
    }
    thermalConductivity();
}

velocityVerlet()
{
    computeAccelerations();   // at time 't'
    updatePositions();        // function of a[t]
    computeAccelerations();   // at time 't+dt'
    updateVelocities();       // function of a[t] and a[t+dt]
}
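For reference, the updates performed by velocityVerlet() correspond to the standard velocity Verlet equations, where x, v and a denote position, velocity and acceleration:

    x(t + dt) = x(t) + v(t) dt + (1/2) a(t) dt^2
    v(t + dt) = v(t) + (1/2) [a(t) + a(t + dt)] dt

Positions are advanced using the current accelerations, the accelerations are recomputed at the new positions, and the velocities are then advanced using the average of the old and new accelerations, which matches the comments in the pseudocode above.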
6.2 Profiling

It is generally said that a typical program will spend over 90% of its execution
time in less than 10% of the actual code. This is just a rule of thumb or heuristic and, as such, will be wildly inaccurate or totally irrelevant for some programs. But for many, if not most, programs it is a reasonable observation. The actual numbers don't matter, since they will change from program to program. It is the idea that is important: for most programs, most of the execution time is spent in a very small portion of the code. If the application spends 95% of its time in 5% of the code, there is little to be gained by optimizing the other 95% of the code. Even if one could completely eliminate it, one would only see a 5% improvement. But if one can manage a 10% improvement in the critical 5% of the code, for example, we will see a 9.5% overall improvement in the program. Thus, the key to improving the code's performance is to identify that crucial 5%. That is the region where one should spend one's time optimizing code. There is a point of diminishing returns when optimizing code. We will need to balance the amount of time we spend optimizing code with the amount of improvement we actually get. There is a point where the code is good enough. The goals of profiling are two-fold: to decide how much optimization is worth doing and to identify which parts of the code should be optimized. For serial algorithms, one can often make reasonable estimates of how time is being spent by simply examining and analyzing the algorithm. The standard approach characterizes performance using some measure of the problem size. Since the problem size often provides a bound for algorithmic performance, this approach is sometimes called asymptotic analysis. Asymptotic analysis can be problematic with parallel programs for several reasons. First, it may be difficult to estimate the cost of the communications required
by a parallel solution. This can be further complicated by the need for additional code to coordinate communications among the processors. Second, there is often a less than perfect overlap among the communicating processes. A processor may be idle while it waits for its next task. In particular, it may be difficult to predict when a processor will be idle and what effect this will have on overall performance. For these and other reasons, an empirical approach to estimating performance is often the preferred approach for parallel programs. That is, we directly measure the performance of existing programs. Thus, with parallel programs, the most appropriate strategy is to select the best algorithm one can and then empirically verify its actual performance.

The profile of the unoptimized code is as follows:

[delta16@athena project]$ time -p ./green_without_prof
Thermal Conductivity : 0.000922671
real 378.63
user 378.61
sys 0.00
  %   cumulative    self               self     total
 time    seconds   seconds     calls  ms/call  ms/call  name
49.31     398.03    398.03     40000     9.95     9.95  Accelerations()
34.33     675.12    277.09   1620000     0.17     0.17  lj(int)
15.88     803.34    128.21     15000     8.55    27.09  jt()
 0.47     807.09      3.75     20000     0.19    20.09  Verlet(double)
 0.13     808.10      1.01   1620000     0.00     0.17  ei(int)
 0.00     808.14      0.04       201     0.20     0.20  rescaleVelocities()
<some output removed for brevity>

From this profile we can find that there is scope for improvement in functions like Accelerations, lj and jt, because they account for more than 99% of the execution time. It can be noticed that a single call to the lj function takes very little time to execute, but since it is executed a large number of times, even a slight improvement in that code will heavily affect the execution time. It can be found from the code that the functions jt and Accelerations both have a complexity of O(N^2), so the work done by this part of the code can be split among the individual nodes.
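The parallelized Accelerations() and jt() routines are not reproduced in this report, so the following is only a minimal sketch of the idea, with hypothetical names (N, computePairTerm), showing how the outer loop of an O(N^2) pair computation can be divided among MPI processes by rank and the partial results combined with MPI_Allreduce:

#include "mpi.h"
#include <stdio.h>

#define N 256                          /* hypothetical number of particles */

/* hypothetical pair term; stands in for the Lennard-Jones evaluation */
static double computePairTerm(int i, int j)
{
    return 1.0 / ((i - j) * (i - j));
}

int main(int argc, char *argv[])
{
    int    rank, size, i, j;
    double partial[N] = {0.0};         /* this process's share of the sums */
    double total[N];                   /* combined result on every process */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each process takes the outer-loop indices i = rank, rank+size, ... */
    for (i = rank; i < N; i += size)
        for (j = 0; j < N; j++)
            if (j != i)
                partial[i] += computePairTerm(i, j);

    /* combine the partial sums so every process sees the full result */
    MPI_Allreduce(partial, total, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total[0] = %f\n", total[0]);

    MPI_Finalize();
    return 0;
}

With such a decomposition each process evaluates only its share of the pair interactions, which is consistent with the drop in the number of lj(int) calls from 1620000 to 405000 when the program is run on four processes, as seen in the optimized profile below.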
The profile of the optimized code is as follows:

[delta16@athena project]$ mpirun -np 4 time -p ./mpi_green_pg
Thermal Conductivity : 0.000922359
real 155.74
user 134.28
sys 1.70
<some output removed for brevity>

  %   cumulative    self               self     total
 time    seconds   seconds     calls  ms/call  ms/call  name
54.35     182.83    182.83     40000     4.57     4.57  Accelerations()
22.37     258.11     75.27    405000     0.19     0.19  lj(int)
19.61     324.08     65.97     15000     4.40     9.45  jt()
 2.39     332.12      8.04     20000     0.40     9.54  Verlet(double)
<some output removed for brevity>

The comparison between the unoptimized and optimized code is shown below:
(Figure: bar charts comparing, for the main functions, the self ms/call, the number of calls, and the total time before and after optimization.)
With this optimized code we get a speedup of 2.43 on a single-node (4 CPUs) cluster. The error in the computation, of the order of 10^-7, is due to the use of floating point arithmetic. Maximum speedup was achieved on dual-processor machines by using shared memory instead of message passing. When the program was run on a cluster with two nodes (4 CPUs), the speedup achieved was in the range of 1.7 to 1.9.
Chapter 7 Testing and Benchmarking
Once the cluster is running, one needs to run a benchmark or two just to see how well it performs. There are three main reasons for running benchmarks. First, a benchmark will provide us with a baseline: if we make changes to the cluster or if we suspect problems with the cluster, we can rerun the benchmark to see whether performance is really any different. Second, benchmarks are useful when comparing systems or cluster configurations; they can provide a reasonable basis for selecting between alternatives. Finally, benchmarks can be helpful with planning: if we run them on several differently sized clusters, we should be able to make better estimates of the impact of scaling the cluster. High Performance Linpack, Hierarchical Integration (HINT) and the NAS Parallel Benchmarks are some common benchmarks available for clusters. As a yardstick of performance we are using the 'best' performance as measured by the LINPACK Benchmark. LINPACK was chosen because it is widely used and performance numbers are available for almost all relevant systems. LINPACK is a software library for performing numerical linear algebra on digital computers. LINPACK makes use of the BLAS (Basic Linear Algebra Subprograms) libraries for performing basic vector and matrix operations. The LINPACK Benchmark is based on LINPACK and represents a measure of a system's floating point computing power. It measures how fast a computer solves dense
n by n systems of linear equations Ax = b, a common task in engineering. The solution is based on Gaussian elimination with partial pivoting, which takes about (2/3)n^3 + 2n^2 floating point operations. The result is reported in millions of floating point operations per second (Mflop/s). For large-scale distributed memory systems, the performance of a portable implementation of the High-Performance Linpack benchmark is used as the performance measure for ranking supercomputers in the TOP500 list of the world's fastest computers. This performance does not reflect the overall performance of a given system, as no single number ever can. It does, however, reflect the performance of a dedicated system for solving a dense system of linear equations. Since the problem is very regular, the performance achieved is quite high, and the performance numbers give a good indication of peak performance. When the High Performance Linpack benchmark was run on the cluster it gave a peak performance of 1.380e-01 GFLOPS. The following figure shows the relationship between the execution time and the number of available processors.
(Figure: execution time versus the number of available processors.)
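As a rough illustration of the operation count given above (this is arithmetic on a hypothetical problem size, not a measurement taken on the cluster), solving a single system with n = 1000 requires about

    (2/3) * 1000^3 + 2 * 1000^2 ≈ 6.69e8 floating point operations,

so at the measured rate of 1.380e-01 GFLOPS such a solve would take roughly 6.69e8 / 1.38e8 ≈ 4.8 seconds.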
Chapter 8 Conclusion
The basic objective of this project was to set up a high performance computational cluster with special concern for molecular modelling. By using a cluster kit such as OSCAR, the first phase of the project, setting up a high performance cluster, could be completed. With the help of message passing libraries like LAM/MPI, the second phase of the project, improving the performance of a molecular modelling problem, was completed. The molecular modelling problem, which is considered to be a standard benchmark, when tested on a two-node cluster gave a remarkable speedup of the order of 2.0. This increased efficiency came at the expense of higher code complexity. Finally, the testing phase was completed using the High Performance Linpack benchmark, which showed a peak performance of 1.380e-01 GFLOPS. Future work includes the formal analysis of existing code for automated conversion to a parallel version for cluster implementation.
Bibliography
[1] Joseph D. Sloan, High Performance Linux Clusters with OSCAR, Rocks, OpenMosix, and MPI, O'Reilly & Associates, 2004.
[2] Michael J. Quinn, Parallel Programming in C with MPI and OpenMP, Tata McGraw-Hill, 2003.
[3] Zoltán Juhász, Péter Kacsuk and Dieter Kranzlmüller, Distributed and Parallel Systems: Cluster and Grid Computing, Springer, 2002.
[4] Stefan Böhringer, Building a diskless Linux Cluster for high performance computations from a standard Linux distribution, Technical Report, Institut für Humangenetik, Universitätsklinikum Essen, 2003.
[5] Linux clusters information centre, http://lcic.org/
[6] Linux Documentation Project, http://www.tldp.org/HOWTO/openMosixHOWTO/index.html
[7] openMosix development site, http://openmosix.sourceforge.net/
[8] openMosix website, http://openmosix.org
[9] OSCAR website, http://oscar.openclustergroup.org
[10] Rocks cluster website, http://www.rocksclusters.org
[11] OpenSSI website, http://www.openssi.org/