HIGH PERFORMANCE PARALLEL ALGORITHMS FOR SCIENTIFIC COMPUTING WITH APPLICATION TO A COUPLED OCEAN MODEL

by Manu Konchady

A Dissertation Submitted to the Graduate Faculty of George Mason University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy, Information Technology

Committee:
  Arun Sood, Dissertation Director
  Daniel B. Carr
  Richard H. Carver
  Ophir Frieder
  Carl Harris, Interim Associate Dean for Graduate Studies and Research
  W. Murray Black, Interim Dean, School of Information Technology and Engineering

Date:

Fall 1996 George Mason University Fairfax, Virginia.

High Performance Parallel Algorithms for Scientific Computing with application to a Coupled Ocean Model

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy at George Mason University.

By

Manu Konchady
Master of Science, Clemson University, 1985

Director: Arun Sood, Professor, Department of Computer Science

Fall Semester 1996 George Mason University Fairfax, Virginia


Acknowledgements

I am grateful to Dr. Sood for all his assistance during my studies at George Mason. His guidance in formulating a research topic, writing the proposal, and completing my dissertation was invaluable. I would like to thank Dr. Carr, Dr. Carver, and Dr. Frieder for their suggestions in improving my research and for serving on my committee. I thank Dr. Schopf and Dr. Suarez at NASA Goddard Space Flight Center for giving me the opportunity to work in an area new to me and providing me access to resources for my research. Computer time on machines used in this dissertation was possible through grants provided by NASA. I also appreciate the financial assistance from IBM and General Sciences Corporation for continuing my education. Finally, I would like to thank my wife Renuka and my family for supporting me during the many hours I spent at the computer.


Table of Contents

Abstract
1 Introduction  1
  1.1 Problem  2
  1.2 Contributions  3
  1.3 Prior Research  5
  1.4 Dissertation Outline  7
2 Ocean Model - Experimental Testbed  9
  2.1 Experimental Testbed  9
  2.2 Purpose of Ocean Model  10
  2.3 Modelling El Nino  11
  2.4 Description of Ocean Model  12
  2.5 Serial Implementation  15
  2.6 Parallel Implementation  16
  2.7 Model Equations  19
3 Parallel Computing  22
  3.1 Parallel Machines  22
  3.2 SMP Machines  25
    3.2.1 DEC Alpha  26
    3.2.2 Cray J90  29
    3.2.3 Clustering  30
  3.3 MPP Machines  31
    3.3.1 Cray T3D  32
  3.4 Programming Parallel Machines  35
  3.5 Data Parallel vs. Message Passing  37
  3.6 Programming Methods for Portability  38
4 MPP Implementation  41
  4.1 Grid Partitioning Techniques  41
    4.1.1 Striping  42
    4.1.2 Regular Blocking  43
    4.1.3 Irregular Blocking  46
  4.2 Execution Timing Analysis  50
    4.2.1 Striping vs. Regular Blocking  51
    4.2.2 Regular Blocking vs. Irregular Blocking  52
  4.3 Distribution and Collection of Data  53
    4.3.1 Hub Scheme  53
    4.3.2 Recursive Scheme  55
    4.3.3 Other Schemes  56
  4.4 Cache Trace Tool  60
  4.5 Communication Protocols  64
  4.6 Analysis of Results  67
    4.6.1 Grid Partitioning  67
    4.6.2 Recursive vs. Hub Schemes  71
    4.6.3 Cache Trace Tool  76
    4.6.4 Communication Protocol Evaluation  78
5 SMP Implementation  81
  5.1 Programming SMP machines  81
  5.2 Analysis of Results  86
    5.2.1 Cray J90  86
    5.2.2 DEC Alpha  87
6 Distributed Computing  91
  6.1 Distributed Coupled Model  92
    6.1.1 Design  93
    6.1.2 Network Infrastructure  95
    6.1.3 Implementation  98
  6.2 Performance Analysis  100
  6.3 Visualization of Performance  103
  6.4 Distributed Computing Results  105
7 Conclusions  108
  7.1 Summary of Contributions  108
  7.2 Future Work  109

List of Figures

2.1  Ocean Model Layers  13
2.2  Ocean Model Algorithm  17
3.1  Common Parallel Computer Architectures  24
3.2  Architecture for the Alpha Server 8400  26
3.3  3D Torus architecture for the Cray T3D  33
3.4  Inter-processor Communication Mechanisms for Parallel Machines  36
4.1  Regular Blocking and Striping  44
4.2  Regular Blocking for the Global Topography  45
4.3  Irregular Blocking for a Global Topography  47
4.4  Hub vs. Recursive Schemes for Striping  54
4.5  Recursive Scheme for Blocking  57
4.6  Overlapped Recursive Scheme to Distribute Data  60
4.7  Communication Time for three Grid Partitioning Techniques  68
4.8  Execution Time for three Grid Partitioning Techniques  70
4.9  Speed Up Curve for three Grid Partitioning Techniques  72
4.10 Timing Results for Regular Blocking using different Processor Grid Dimensions  73
4.11 Comparison of Hub vs. Recursive for Striping  74
4.12 Comparison of Hub vs. Recursive for Blocking  75
4.13 Average Cache Hit Rate for the Update Subroutine  77
4.14 Execution times using three Communication Protocols  79
4.15 Speed Up curves using three Communication Protocols  80
5.1  Execution Times for Autotasking and Shmem on the Cray J90  88
5.2  Speed up for Autotasking and Shmem on the Cray J90  89
6.1  Design of the Message Driven Distributed Coupled Model  94
6.2  Control and Data Flows for Distributed Coupled Model  96
6.3  Network Components for the Distributed Coupled Model  99
6.4  Collection of Trace Data for Visualization  102
6.5  Window for Visualizing Performance of Distributed Coupled Model  106

List of Tables

4.1  Optimal Processor Grids for Ocean Grid  49
4.2  Distributing a 512 × 512 grid (64 processors)  58
5.1  Execution time (secs.) using KAP  90

Abstract

HIGH PERFORMANCE PARALLEL ALGORITHMS FOR SCIENTIFIC COMPUTING WITH APPLICATION TO A COUPLED OCEAN MODEL

Manu Konchady, Ph.D.
George Mason University, 1996
Dissertation Director: Dr. Arun Sood

Parallel computers are becoming more common and are available on a range of platforms, from massively parallel machines with thousands of processors to desktop systems with two or four processors. The software needed to exploit the capabilities of these machines is still immature, and this dissertation examines some of the problems, and solutions, involved in implementing scientific applications on parallel machines. Many companies have failed to develop a general purpose parallel computer because achieving high performance on practical applications proved more difficult than anticipated. For scientific applications, we have developed efficient parallel algorithms and tools which can be used on commercial parallel machines. The algorithms improved performance by reducing communication overhead for data distribution/collection and for grid partitioning. In addition, we have developed tools to study cache performance on a parallel machine and to visualize communication performance for a distributed application. We tested different problem sizes and processor configurations to use a machine with high parallel efficiency. Communication protocols on two common parallel architectures are evaluated. We used a coupled ocean model as the experimental testbed to evaluate the algorithms and tools we developed. Ocean models are computationally intensive applications and are one of the `Grand Challenge' problems. The coupled ocean model is used to simulate the behaviour of the El Nino phenomenon. Our implementation of the ocean model on a Cray T3D parallel computer obtained a performance rate of 2.4 Gflops, which was 6 times faster than the current implementation on a vector supercomputer.

Chapter 1. Introduction

This dissertation describes the use of several novel parallel algorithms and performance tools to implement and evaluate parallel applications. Among the parallel algorithms described are an optimized grid partitioning technique for irregular grids and an efficient scheme to collect and distribute data. We have developed performance tools to study cache memory performance for a processor and to evaluate the behaviour of a distributed application. The purpose of these algorithms and tools is to run computationally intensive applications faster and at a lower cost on parallel computers. We have used a coupled ocean model developed at Goddard Space Flight Center [34] as the experimental testbed for validating the parallel algorithms and tools. Ocean models have been identified by the National Science Foundation as `Grand Challenge' problems [9]. They are among the leading problems which can be solved only with the help of fast and powerful computers. The ocean model for our experiments is used to simulate the El Nino phenomenon [4]. The effects of El Nino are felt worldwide, and therefore several research efforts are underway to model and predict the occurrence of an El Nino. No single ocean model has been successful in predicting El Nino. One of the reasons attributed to the failure to simulate El Nino is the low resolution of models. It is believed that high resolution models can simulate eddies [33], which contribute to ocean circulation and can affect climate. Eddies are not resolved in most current ocean models. The use of high resolution models increases computation and memory demand significantly (see Section 1.1). Research to provide the capability to run high resolution models on parallel computers at a reasonable cost is in progress worldwide. Our research has been directed towards the development of algorithms and tools which facilitate efficient implementations of high resolution models on parallel computers.

1.1 Problem

Finite difference methods are used to solve the partial differential equations which arise in scientific applications such as finite element analysis and computational fluid dynamics. The `state' for these applications is represented by multiple points on a grid. Due to the lower cost of memory and higher processor capabilities of today's machines, scientists would like to use larger and finer resolution grids. Accurate results can be achieved using high resolution grids. In oil exploration, high resolution seismic data can be used to pinpoint a suitable location for drilling. A high resolution grid is necessary to accurately simulate flow around a complex object. We have studied the problem of running high resolution simulations on parallel computers and have used an ocean model application to evaluate our solutions.

While the results from high resolution ocean modelling appear to be promising [33], its use has not been widespread due to the high computational demand. Increasing the resolution of a model by a factor of 4 increases the computational demand roughly 16-fold. The timesteps must also be smaller in a high resolution model for computational stability. The timestep which filters numerical noise for a given spatial resolution can be determined using the CFL criterion [42]. Although the program size remains the same, the data requirements increase significantly. For example, restart files for grids of resolution 144 × 78 × 7, 288 × 256 × 8, and 576 × 512 × 14 require about 6 Mbytes, 50 Mbytes, and 350 Mbytes respectively. Similarly, the history files for high resolution models require more storage. Due to the higher computation demand and memory requirements of high resolution models, running multiple experiments with different restart files for extended simulation periods is expensive.

The problems of parallelizing coupled models are more complex than those of parallelizing other climate models. The main problem is the additional communication overhead due to the exchange of forcing data during a coupling interval. The memory requirements of a high resolution model also pose a problem when porting to a massively parallel machine. On such machines, the memory per processor is limited, and therefore the size of the entire model code and data for a processor must not exceed the memory on that processor. This problem is most apparent during I/O, when data must be distributed to and collected from processors. If a single processor performs this task, then sufficient space must be allocated to hold multiple high resolution grids. For a 3D grid, this can be accomplished by reading and distributing a layer at a time. If the cache on a processor is small, we can expect poor performance due to a lower cache hit rate.
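The CFL constraint referred to above can be written compactly. As an illustrative sketch (with Δx the grid spacing, Δt the timestep, and c taken to be the speed of the fastest signal the grid must resolve, an assumption not spelled out in the text):

    \Delta t \le \frac{\Delta x}{c}, \qquad \text{cost per simulated day} \;\propto\; N_x N_y N_z \times \frac{1}{\Delta t}

so halving the horizontal grid spacing quadruples the number of grid points and at least halves the admissible timestep, which is why both the compute and the storage figures above grow so quickly with resolution.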

1.2 Contributions

On a uni-processor machine, the entire grid is stored and processed by one processor alone. On a multi-processor machine, the grid must be partitioned, and each processor performs computations on a sub-grid. We have developed an optimized grid partitioning technique for an irregular grid. This technique can be used in problems where a grid consists of multiple irregularly shaped objects. The new grid partitioning technique was more efficient, with a lower computation and communication overhead, than previous techniques. We implemented our techniques to partition an irregular grid representing the global oceans.

In most applications on parallel computers, the distribution and collection of data is a common function. This frequently occurs during I/O. If results are stored in a single file, then output data must be collected at a single processor. We have developed an algorithm to distribute and collect data with low communication overhead. We have compared our scheme with several other schemes and have shown the improvement in performance.

The use of cache memory is common on many parallel machines which use microprocessors. We have developed a cache trace tool to analyze the cache hit rate for a large application. Our tool does not use a large simulator, but generates code to trace cache memory accesses. Using our tool, it is possible to study the cache performance of a single function on a uni-processor or multi-processor machine. We used the tool to determine the cache hit rate for every variable of a subroutine of the ocean model. The results were used to improve the performance of the subroutine by over 33% by modifying the data layout and code.

In a distributed application consisting of multiple processes running concurrently on separate processors, it is difficult to locate performance bottlenecks. We enhanced a tool to visualize the flow of messages between processes. The enhancements to the package included the ability to identify the type of messages exchanged, the time to send/receive messages per message type on a process, and the status of any process at any time during a run. We used the tool to study the performance of a 5-process coupled ocean model running on 3 separate machines. We also designed the algorithm to distribute the coupled ocean model computation.

1.3 Prior Research

One of the first commercial parallel machines based on a microprocessor was the Intel hypercube [13]. It was developed at Caltech by a team of researchers who were seeking a low cost alternative to the vector supercomputer. Following the introduction of the Intel iPSC/1 hypercube, several other vendors such as IBM, Kendall Square Research, Silicon Graphics, and Cray built machines based on microprocessors. Many vendors have had limited success in marketing parallel machines to a wide range of users. The main reasons for this are a lack of application software and the difficulty of programming parallel machines. Some vendors attempted to build automatic parallelization tools such as KAP [22]. While the concept of parallelizing an existing application without user intervention is attractive, the practical problems in automatic parallelization are difficult: it is hard to build a pre-processor which can detect parallelism in a wide range of applications.

Vendors developed native communication protocols optimized for a particular machine or architecture. This made portability across a range of parallel machines difficult. Users unwilling to risk the development costs on a parallel machine built by a single vendor did not develop parallel applications. This problem has been addressed by the development of MPI [16], a standard for message passing which most vendors have adopted.

Despite the problems described above, a variety of scientific applications have been implemented on parallel machines. They include applications in data mining, ray tracing in computer graphics, seismic wave propagation, computational fluid dynamics, astrophysics, computational chemistry, and climate modelling [29]. Motivated by the need for higher performance at a lower cost, research on the use of parallel computers for climate models began in the late 1980s. A number of organizations successfully parallelized and validated climate models [3, 28, 36, 43]. Research in the development of algorithms and numerical methods for parallel computers is ongoing, and there are many problems to be resolved before parallel climate modelling is accepted by the meteorology community. The problems include handling large data sets and using massively parallel machines efficiently. Parallel climate models have not yet been used to develop forecasts for public use.

Much of the research in parallelizing climate models has been funded by government agencies such as the Department of Energy (DOE), National Science Foundation (NSF), National Aeronautics and Space Administration (NASA), and the National Oceanic and Atmospheric Administration (NOAA). Funds for this research have been allocated annually since 1990 through the High Performance Computing and Communications (HPCC) act passed by Congress. The goals of the HPCC program are to extend technological leadership in high performance computing and communications, improve competitiveness through the application of technologies, and develop a national information infrastructure. One of the components of the HPCC program is the development of Advanced Software Technology and Algorithms (ASTA). The objectives of the ASTA component are to enable progress towards the solution of grand challenge problems in science and engineering and to improve software development for leading edge applications. Grand Challenges are problems in science and engineering that can be solved only with the help of powerful computers. Some examples of grand challenge problems include predicting global climate change, developing and understanding new materials, building more energy efficient cars and airplanes, and improving communication networks for research and education.

1.4 Dissertation Outline

In Chapter 2, the experimental testbed, an ocean model, is described. We developed a parallel ocean model algorithm based on the current sequential algorithm. Commercial parallel machines have been used in this dissertation, and in Chapter 3 two parallel architectures for commercial parallel machines are compared. Machines based on both architectures are described, along with the development of software for a range of parallel machines. In Chapter 4, an implementation on a massively parallel processing (MPP) machine, the Cray T3D, is studied. A new grid partitioning technique, algorithms to distribute and collect global data, and a cache trace tool are described, and the results for the implementations are analyzed. We discuss implementations on two Symmetric Multiprocessing (SMP) machines, the Cray J90 and DEC Alpha, in Chapter 5, and compare the results with the implementation on the Cray T3D platform. In Chapter 6, a wide area distributed computing experiment is evaluated. We developed a performance tool to study the communication overhead of the distributed coupled model used in that experiment. The conclusions and future work are described in Chapter 7.

Chapter 2. Ocean Model - Experimental Testbed

2.1 Experimental Testbed

In this research, we have developed algorithms and tools to achieve high performance for scientific computing applications. In addition to an analysis of these algorithms (Chapter 4), we have also experimentally validated their use on a real world problem. For this purpose we have used as an experimental testbed a Grand Challenge problem: ocean models. Specifically, our testing is based on an ocean model coupled to an atmospheric model. Both models were developed at the Goddard Space Flight Center (GSFC) [37, 34]. In our experiments, we have used the ocean model coupled to either a simplified or a complete atmospheric model. The simplified atmospheric model is used to provide the ocean model with data for boundary conditions without running a complete atmospheric model. The use of the simplified atmospheric model saves CPU time and is useful for testing the behaviour of the ocean model alone.

Of the different climate models, the ocean model is the most computationally intensive. The dynamics of the ocean involve longer time scales and smaller spatial scales than the atmosphere. The higher computation demand of the ocean model is due to the higher resolution necessary to resolve eddies and the long time interval needed to model the ocean response. The computation requirements for the ocean model at GSFC went from 93.5 Mflops/simulation day for an 80 × 64 × 3 grid to 6.3 Gflops/simulation day for a 288 × 256 × 8 grid. It is for this reason that we have selected coupled ocean models as the experimental testbed in this research. Currently, there are at least three different parallel ocean models [3, 28, 9] in use at the University of Illinois, the University of California at Los Angeles, and the University of Miami.
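For a sense of scale (an illustrative calculation based on the grid sizes quoted above), the number of grid points alone grows by a factor of about 38 between the two resolutions:

    \frac{288 \times 256 \times 8}{80 \times 64 \times 3} \;=\; \frac{589{,}824}{15{,}360} \;\approx\; 38

which is of the same order as the reported growth in computational requirements; the remaining factor presumably reflects the shorter timestep needed at the higher resolution.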

2.2 Purpose of Ocean Model

Due to its impact on society, understanding and predicting climate is of great importance. Climate is the average state of the weather over a finite period of time. Models which incorporate the physics of the atmosphere and ocean have been used to develop predictions. Ocean models are not used to predict the day-to-day sequence of events even a few weeks in advance; instead, they are used to predict the statistical properties of some future state.

In its most basic form, ocean modelling is an initial value problem. An initial state is provided and the equations to advance the state in timesteps are known. After advancing the state for a single timestep, the current state becomes the previous state and the future state is determined. This process is performed iteratively until completion, and a final state is output at the end of a run. A set of governing equations is used to advance the state in discrete timesteps. The governing equations for the motions of the ocean are derived from the conservation laws of mass, momentum, and energy. In addition, an equation of state and other model-specific equations to simulate observed phenomena are included. The equation of state relates pressure, density, temperature, and salinity.

The most dramatic and best defined pattern of ocean variability is the El Nino Southern Oscillation (ENSO) [4, 41]. An ENSO event can affect the climate of more than half the planet and can have devastating consequences. The 1982-83 ENSO event, the most extreme in this century, was responsible for floods and drought worldwide. Severe drought conditions were recorded in Australia, Borneo, South Africa, and parts of Brazil. Millions of acres of rainforest were destroyed. Normally arid regions of Peru and Ecuador were inundated with rainfall. An estimated $8 billion in damage was attributed to the 1982-83 El Nino.

2.3 Modelling El Nino

Although the El Nino phenomenon is well known, an accurate prediction system has not yet been developed. El Nino events have been documented back to 1726, and an irregular cycle of about 4 years has been estimated for the occurrence of El Nino. The difficulty of modelling ENSO is due to the two-way interaction of the ocean and atmosphere, and attempts to predict and model the phenomenon have not been successful. At least 17 different coupled climate models have been developed to simulate El Nino [41]. The physics needed to reproduce the El Nino cycle are well known, and simple models have been constructed to understand the phenomenon. For a complete understanding, coupled models have been developed using an atmospheric general circulation model (AGCM) and an ocean general circulation model (OGCM). Although a coupled AGCM-OGCM model makes fewer assumptions than the simpler models, the subtle errors in each model feed on each other, leading to a climate far from reality (climate drift). In many of the coupled models, the resolution of the atmospheric and ocean grids was coarse. Running high resolution coupled models has been either prohibitively expensive or not possible on vector supercomputers due to memory limitations. The coarse resolution of the models does not resolve well known features such as the Inter-Tropical Convergence Zone (ITCZ) and equatorial wave dynamics. A high resolution ocean model coupled to an atmospheric model has been shown to reproduce a reasonable climatology [33].

2.4 Description of Ocean Model

The ocean model, Poseidon [34], was developed at the Goddard Space Flight Center by Paul Schopf to study the inter-annual and inter-decadal climate variability of the global ocean, which is a key element for forecasting El Nino. It is a 3D multi-layer general ocean circulation model (Fig. 2.1). The top layer is a mixed layer with homogeneous properties. The remaining lower layers are isopycnic, i.e., the thickness of each layer is determined by the density. Using an isopycnic vertical coordinate system makes it possible to represent the vertical dimension with just a few layers and reduces cross diffusion across layers. Some of the problems of using isopycnic coordinates include handling the disappearance of layers and the intersection of layers with land. In Poseidon, these problems have been solved by specifying a minimum and maximum thickness and depth for every layer: the thickness of a layer cannot exceed a specified limit and must be within a depth range. The distance between vertical layers is not uniform. The top mixed layer has important properties. Under conditions of strong heating and weak winds, the mixed layer becomes shallow and the sea surface temperature rises.

Figure 2.1: Ocean Model Layers (labels: Atmospheric Forcings, Mixed Layer, Intermediate Region, Ocean Interior, Abyss)

Under conditions of strong cooling and/or strong winds, the mixed layer deepens, providing a slow response in temperature and currents.

The horizontal grid points are specified in a latitude-longitude coordinate system. A staggered grid is used to locate variables, following the standard practice for incompressible finite volume algorithms [28]. The scalar variables layer thickness, temperature, and salinity are located at the center of a grid box, while the velocity variables are located at the edges of the box. To maintain linear computational stability near the pole without using an extremely short time interval, filtering is applied. The reduced gravity approximation, common to other ocean models, is used in Poseidon. The assumption is that the deep ocean flow does not affect the state of the surface layer, so the bottom topography of the ocean can be ignored. A binary mask determines land and water points: a 0 implies a land point and a 1 implies a water point. The water points are all infinitely deep. The use of this assumption saves computation time and reduces the complexity of the model.

A modular approach has been adopted in the development of coupling with the atmospheric model. With a well defined interface between the ocean and other climate models, coupling with the ocean model is simple, and knowledge of the workings of the ocean model is not needed to use it in a coupled mode. Poseidon can be coupled with a simplified or a complete atmospheric general circulation model (AGCM). An ocean model run coupled with a complete AGCM takes much longer than a run with a simplified AGCM, and the decision to run with a simplified or complete AGCM is made based on the experiment. The provision for this type of coupling is important for testing theories regarding El Nino, which is based on a complex interaction between the ocean and atmosphere. The code for the model is written in Fortran 77 (about 15,000 lines). The success of the model is determined by its ability to re-create the sea surface temperature given the wind and heat flux forcings, and the accuracy of these forcings determines the fidelity of the simulation. Through the use of satellites and other instruments, accurate observation data suitable for input to climate models can be obtained.

2.5 Serial Implementation

The fundamental purpose of Poseidon is to advance the state of the ocean. It uses the current state and the past state to determine the future state. Explicit finite difference equations are used to calculate the new state from partial derivatives. A global timekeeper maintains simulation time and alarms. Alarms are set for individual functions to run at specific time intervals during the simulation; some functions, such as the hydrodynamics functions, run hourly, while others, such as the vertical diffusion computation, run daily. The overall program flow is shown in Fig. 2.2. The initial ocean state is read from a restart file. The clock and alarms are set for the current run. External forcing data is exchanged with the atmospheric model. The coupling interface is built into the main driver routine, and coupling can be performed with ice, land, lake, and atmospheric models. The ocean state is advanced by calling the hydrodynamics and other routines. A new ocean state is determined at the end of each hydrodynamics computation. Each of the hydrodynamic routines contributes to a partial derivative which is then used by the update routine to compute a new state. The ocean state variables are layer thickness, temperature, salinity, and the zonal and meridional velocities.
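The driver structure described above and in Fig. 2.2 can be sketched as follows. This is only an illustrative outline, not the actual Poseidon code: the subroutine names (read_restart, coriolis, update, and so on) are hypothetical stand-ins, the call order is indicative, and the alarm logic is reduced to a simple modulus test.

      program ocean_driver
c     Illustrative sketch of the serial driver loop.
      integer istep, nsteps, steps_per_day
      parameter (nsteps = 720, steps_per_day = 24)

c     Read the initial ocean state and set the clock and alarms.
      call read_restart
      call set_clock_and_alarms

      do istep = 1, nsteps
c        Exchange forcing data through the coupling interface.
         call exchange_forcing
c        Hydrodynamics: each routine adds its term to the tendencies.
         call coriolis
         call pressure
         call filtering
         call zonal_advection
         call meridional_advection
         call vertical_advection
         call surface_flux
         call mixed_layer
c        Combine the tendencies into the new ocean state.
         call update
c        Daily alarm for the vertical diffusion computation.
         if (mod(istep, steps_per_day) .eq. 0) call vertical_diffusion
      end do

c     Write the final ocean state.
      call write_restart
      end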

2.6 Parallel Implementation

Our parallel implementation includes the capability to use any one of three communication protocols: Parallel Virtual Machine (PVM), the Message Passing Interface (MPI), or a native shared memory library (shmem) [16, 30, 31]. We have successfully run the model using each of the three protocols on the Cray T3D. Parallelizing the code involved three tasks: 1) developing code for inter-processor and global communication, 2) modifying the number of iterations executed for each do loop based on the number of processors used, and 3) optimizing code to run efficiently with a small cache.

The first task, developing code for inter-processor and global communication, involved implementing a grid partitioning technique and efficiently performing global computations. We have analyzed three grid partitioning techniques in Chapter 4. On the Cray T3D, the number of processors allocated for a job is always a power of two [6], so we have coded global computations assuming a complete binary tree. When running on a machine such as the Intel Paragon, additional code must be included to handle processor configurations which are not powers of 2.

Figure 2.2: Ocean Model Algorithm (flowchart steps: Start; Read restart file; Exchange forcing data; Calculate next state by looping over the hydrodynamics routines Coriolis, Pressure, Filtering, Zonal Advection, Meridional Advection, Vertical Advection, Surface Flux, Vertical Diffusion, Mixed Layer, and Update; Write restart file; End)

The second task is the tedious job of changing the iteration counts of every do loop. The number of iterations is based on values set in a pre-processor file; the start and end iterations are computed from the values for the number of processors, the shadow width, and the grid partitioning technique in the pre-processor file. The third task involves analysis of performance for a direct mapped cache. We have discussed cache performance in Chapter 4. A number of the performance recommendations for a T3D processor may not be appropriate for a vector processor, so optimizing code to run on multiple machines involves iterative changes to verify that performance on any one of the machines has not degraded significantly. The parallel algorithm is outlined below for processor 0 and for the other processors.

Processor 0:
  1. Read restart file, send sub-grid data to processors based on the processor number.
  2. Loop till complete:
     - receive forcing data for the atm. model
     - execute the atm. model for a day
     - send forcing data for the ocn. model
     - run the ocean model for a day
     - exchange shadow data
  3. Receive the final ocean state and write the restart file.

Other Processors:
  1. Receive sub-grid data from processor 0 and generate the initial ocean state.
  2. Loop till complete:
     - send forcing data for the atm. model
     - receive forcing data
     - run the ocean model for a day
     - exchange shadow data
  3. Send the final ocean state to processor 0.

The performance of the parallel algorithm above depends on the communication overhead, and minimizing that overhead while performing the required computation is a challenging task. A single program multiple data (SPMD) coding model was chosen for easier maintenance and simplicity; the SPMD model is easier to develop from uni-processor code. The implementation uses a single program on each processor, working on a section of the 3D grid, with message passing to share common grid data. Verification of results from the sequential and parallel algorithms must take into account the differences between processors. Since filtering is used in both algorithms, values will not exceed bounds. We have compared the results of the sequential and parallel algorithms to validate the parallel implementation, and successful extended runs were executed to generate higher confidence in the parallel implementation.
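The `exchange shadow data' step above can be illustrated with a small message passing sketch. This is not the dissertation's actual code: it assumes a one-dimensional (striped) decomposition over latitude rows, a shadow width of one row, omitted error handling, and the standard MPI Fortran bindings.

      subroutine exchange_shadow(field, nlon, nrows)
c     Illustrative shadow (halo) exchange for a striped decomposition:
c     each processor holds nrows latitude rows plus one shadow row on
c     each side (rows 0 and nrows+1).
      include 'mpif.h'
      integer nlon, nrows
      real field(nlon, 0:nrows+1)
      integer rank, nprocs, north, south, ierr
      integer status(MPI_STATUS_SIZE)

      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

c     Neighbours; MPI_PROC_NULL makes the exchange a no-op at the edges.
      south = rank - 1
      north = rank + 1
      if (south .lt. 0) south = MPI_PROC_NULL
      if (north .ge. nprocs) north = MPI_PROC_NULL

c     Send the last interior row north, receive the south shadow row.
      call MPI_SENDRECV(field(1,nrows), nlon, MPI_REAL, north, 1,
     &                  field(1,0), nlon, MPI_REAL, south, 1,
     &                  MPI_COMM_WORLD, status, ierr)
c     Send the first interior row south, receive the north shadow row.
      call MPI_SENDRECV(field(1,1), nlon, MPI_REAL, south, 2,
     &                  field(1,nrows+1), nlon, MPI_REAL, north, 2,
     &                  MPI_COMM_WORLD, status, ierr)
      return
      end

The same exchange can be expressed with PVM send/receive calls or with shmem put/get and a barrier; only the MPI form is shown here.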

2.7 Model Equations

The following equations describe the computation of the partial derivative for a state variable, the zonal velocity u. The computation for other state variables is similar.

Coriolis:

    \frac{\partial (uh)}{\partial t} = \cdots + \left( f + \frac{u \tan\phi}{a} \right) v h

Pressure:

    \frac{\partial (uh)}{\partial t} = \cdots - \frac{h}{a \cos\phi} \left( \frac{\partial p}{\partial \lambda} - b \, \frac{\partial z}{\partial \lambda} \right)

Zonal Advection:

    \frac{\partial (uh)}{\partial t} = \cdots + \frac{1}{a \cos\phi} \, \frac{\partial}{\partial \lambda} (u h u)

Meridional Advection:

    \frac{\partial (uh)}{\partial t} = \cdots + \frac{1}{a \cos\phi} \, \frac{\partial}{\partial \phi} (v h u \cos\phi)

Vertical Advection:

    \frac{\partial (uh)}{\partial t} = \cdots + u_k \, w_e

Surface Flux (for the top layer):

    \frac{\partial (uh)}{\partial t} = \cdots + \tau^x

Update:

    u_{next} = \frac{ u_{past} \, h_{past} + \left( \frac{\partial (uh)}{\partial t} \right)_{next} c2dt }{ h_{next} }

Vertical Diffusion:

    \frac{\partial u}{\partial t} = \cdots + \frac{1}{h} \, \frac{\partial}{\partial z} \left( \nu \, \frac{\partial u}{\partial z} \right)

Mixed Layer:

    w_e = \frac{ P_r E - h D }{ h }

where h is the layer thickness, φ is the latitude, λ is the longitude, f is the Coriolis force, a is the radius of the earth, b is the buoyancy, z is the depth, w_e is the vertical entrainment rate, τ^x is the wind stress, c2dt is the timestep interval, p is the pressure, ρ is the density, ν is the viscosity, D is the entrainment depth, and E is the kinetic energy.

Chapter 3. Parallel Computing

Many supercomputers developed in the 1960s and 1970s were used to speed up computation for applications such as nuclear research and aerospace design. To boost performance, the system architecture was designed around manipulating vector data in parallel. Pipelined logic units running at a high clock rate were used to overlap operations and obtain a higher throughput. The fast circuitry consumed large quantities of power, requiring liquid cooling systems to prevent overheating. Many of these supercomputers were sold to the government, which helped to defray the initial development cost, and proprietary operating systems were used by most vendors [39]. The high cost of these systems limited their use to a few computationally intensive applications of interest to the government and large companies.

3.1 Parallel Machines

In the early 1980s, the use of multiple low cost commercial microprocessors within a single machine to solve computationally intensive problems was demonstrated and offered a viable alternative to the vector supercomputer [13]. One of the first affordable parallel machines was the iPSC/1 hypercube [20], based on the Intel 80286 processor. A hypercube interconnection architecture was used to link up to 128 processors. The iPSC/1 was based on the Cosmic Cube and Mark II machines developed at Caltech. The iPSC/2 and iPSC/860 machines followed with faster processors (the Intel 80386 and i860). The Intel Paragon was developed in the early 1990s based on the prototype Touchstone Delta machine implemented at Caltech. The Touchstone Delta machine used a mesh interconnection architecture to link 512 i860 processors. A teraflop machine using over 9000 Intel Pentium processors is scheduled to be built by 1997 for the Department of Energy [39].

Intel is just one of the vendors who manufacture parallel machines. A number of other vendors, including Cray, Kendall Square Research, IBM, Thinking Machines, Silicon Graphics, and Digital Equipment Corporation, have developed parallel machines. All vendors used either commercial or proprietary microprocessors with a variety of interconnection schemes. Classifying the current generation of parallel machines using Flynn's taxonomy [21] is inadequate; instead, we can distinguish parallel machines based on memory access [29] (Fig. 3.1). Parallel processing can be classified as Symmetric Multiprocessing (SMP), Massively Parallel Processing (MPP), and Scalable Parallel Processing (SPP). MPP and a network of workstations (NOW) use similar concepts, with the following differences: a network of workstations uses processors which are not physically located in a single machine, the processors of an MPP are homogeneous, and each workstation can be used independently of the network.

In an SMP architecture, multiple processors share a single memory through a common bus. The SGI Power Challenge and DEC Alpha are examples of SMP designs. This design is also known as `tightly coupled' or `shared everything'. SMP machines have become increasingly popular due to their ease of programming and price/performance ratio [29]. A single memory space simplifies application and system programming.

24

Type

Vendors

SMP and Cluster of SMPs

DEC Alpha, SGI Challenge, Cray J90, Convex Exemplar

Network of Workstations

Any manufacturer of a workstation with networking capability

MPPs

Cray T3E, IBM SP2, Intel Paragon, SGI Origin 2000

Number of Processors 10

100

1000

10,000

SMPs Cluster of SMPs (Homogenous) Network of Workstations (Heterogenous) MPPs

Figure 3.1: Common Parallel Computer Architectures

Global shared memory is also the biggest problem for SMP machines. Scalability is limited by the bandwidth of the bus: with the addition of more processors, memory bus traffic increases until saturation.

An alternative is the MPP design. To avoid memory bus bottlenecks, memory is distributed across processors. This design is also known as `loosely coupled' or `shared nothing'. To access memory outside its local memory, a processor must use a message passing scheme. Each processor has a separate memory, and data is shared through message passing between processors. The MPP design provides a scalable solution for building large machines with hundreds or thousands of processors. The drawback of MPP machines is the increased complexity of programming: synchronization and distribution of tasks must be maintained by the application, and message passing code must be added when data is shared among processors. Most MPP vendors have implemented portable message passing libraries such as Parallel Virtual Machine (PVM) or the Message Passing Interface (MPI) [16, 14].

To overcome the disadvantages of SMP and MPP, a hybrid SPP design can be used. This design uses a two-tier memory hierarchy to achieve scalability: in the first tier, an SMP system is used, and the second tier is created by connecting multiple SMP systems so that memory appears logically as a global shared space. SPP offers the SMP programming model while maintaining the scalability advantages of the MPP design.

3.2 SMP Machines

There are a number of SMP machines available today, such as the SGI Power Challenge, DEC Alpha, Cray J90, Convex Exemplar, and IBM PCs. We describe the architecture of two specific SMP machines, the DEC Alpha and the Cray J90, below. In Chapter 5, we describe the implementation of the ocean model on both machines.

Figure 3.2: Architecture for the Alpha Server 8400 (CPU, memory, and I/O port modules attached to a common system bus with a 40-bit address path and a 256-bit data path)

3.2.1 DEC Alpha

The DEC Alpha 8400 server can be configured with up to twelve CPUs. Each CPU shares up to fourteen Gbytes of memory (Fig. 3.2). The machine used for our experiments had four 300 MHz processors and four Gbytes of memory. Each processor contained an eight Kbyte instruction cache and a three level data cache: the first level was an eight Kbyte cache, followed by a second level 96 Kbyte cache and a third level four Mbyte cache.

The Alpha server is a combination of PC peripherals, a common memory bus for processors, multiple I/O buses, and support for a redundant array of inexpensive disks (RAID) [8]. A variety of peripherals commonly found on PCs, including disk drives, tape systems, network interface cards, and video boards, can be used since the Alpha server uses the PCI bus. The PCI bus also allows the use of multiple generations of the Alpha processor: upgrading to a faster processor is achieved by swapping CPU module boards, and the faster processors run on the same bus. Support for Digital UNIX, LINUX, OSF/1, and Windows NT is provided. The server can be connected to a network via Ethernet or a Fiber Distributed Data Interface (FDDI).

The parallel efficiency of an SMP application depends heavily on the operating system and compiler. We used the DEC OSF/1 operating system and the KAP [22] pre-processor in our SMP implementation. The requirements for an SMP operating system are to provide equal access to all system resources from all processors. The addition of new processors should maintain the symmetric allocation of resources, and the operating system overhead for using multiple processors should be low enough to enable applications to use the additional CPU resources. The addition of a new processor can result in a performance gain of 70-80% of the last processor added, so the benefit of adding more processors diminishes at higher processor configurations.

Some of the problems in designing an SMP operating system include handling concurrency and scheduling across processors. On a multiprocessing platform, multiple processes running concurrently on separate processors can access kernel data structures simultaneously. A combination of locks and the use of a system priority level (SPL) guarantees synchronized access to system resources. A lock can be a spin lock or a blocking lock: in a spin lock, a thread repeatedly checks the lock bit, while in a blocking lock, a thread is placed in a sleep state until the lock bit is reset.

The KAP pre-processor is used to optimize and parallelize code. It optimizes code by manipulating loop statements so that the loop will execute faster. Loop unrolling, loop fusion, and elimination of dependencies [11] are some of the tasks KAP performs during optimization. KAP generates multiple threads for a loop during parallelization. Each thread of a loop can run independently, and synchronization of access to shared data is based on code generated by KAP.

The scheduler assigns threads to processors based on their position in a global run queue and the priority of the thread. In the ideal case, the top N priority threads will run on N processors at any given time. By scheduling threads which are not bound to a particular CPU from a global run queue, the N highest priority threads can be executed concurrently. The disadvantage of using a single queue is the contention between processors attempting to read the queue. By running on a single processor, the cache state can be maintained for a thread; when a thread is migrated to a different processor, its cache state must be loaded onto the new processor, and the benefit of reusing the cache state is lost. A last preference model is used for high priority threads, i.e., the scheduler attempts to run the thread on the processor where it last ran; if that processor is busy, the thread runs on the next available processor. A soft affinity model can be used for other threads: a thread will run on the processor where it last ran, provided the load across all processors remains balanced. The scheduler periodically load balances the system to keep threads evenly distributed across processors. The load balancing algorithm is run such that the processing cost is not excessive and thread movement between processors is minimal.
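As an illustration of the loop-level rewriting mentioned above for KAP, the before/after pair below shows loop fusion on a pair of generic loops (declarations omitted; this is a hand-written example, not KAP output):

c     Before: two passes over the same index range.
      do i = 1, n
         a(i) = b(i) + c(i)
      end do
      do i = 1, n
         d(i) = a(i) * 2.0
      end do

c     After fusion: a single pass, so a(i) is reused while still in cache
c     and the loop overhead is paid once.
      do i = 1, n
         a(i) = b(i) + c(i)
         d(i) = a(i) * 2.0
      end do

Fusion is legal here because d(i) depends only on the a(i) computed in the same iteration; a pre-processor must verify such dependencies before transforming the code.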

3.2.2 Cray J90

The Cray J90 is a shared memory machine which can be configured with up to 32 processors. Each processor is a custom 200 MHz RISC processor without cache memory. Up to 8 Gbytes of memory can be used on a single machine. A variety of peripherals can be connected using the FDDI, Ethernet, HIPPI, and SCSI protocols. The J90 uses the UNICOS operating system, a superset of UNIX with modifications to run on a multi-processor system. A multi-threaded kernel which can be accessed simultaneously by multiple processors is used in UNICOS, and all processors use a single global shared space. The problems of concurrency and scheduling described in the previous section for the DEC Alpha apply to the J90 as well. In the J90, concurrency is handled through the use of semaphores, and the operating system schedules threads on processors. The problem of scheduling threads such that all processors are equally busy is common to the DEC Alpha and J90; details of how concurrency and scheduling were implemented on the J90 have not been published. Since cache memory is not used on the J90, reloading the cache state is not necessary when a thread is moved to another processor.

The code to run on multiple processors is generated by a pre-processor based on directives in the code. The details of this process are described in Chapter 5. Essentially, a subroutine or function without static variables is created for every parallelized region of code, and this region can be executed concurrently on multiple processors. Synchronization with a `master' is obtained through the use of semaphores. For computationally intensive applications, the region of code most often parallelized is the loop. When the number of iterations for a loop justifies running in parallel, multiple threads of the loop are generated. Each thread runs independently and uses a stack area to store variables. The iteration number in each thread is a shared resource and is accessed through the use of semaphores. The `master' thread controls the execution of `slave' threads. After execution of the loop, the slave threads terminate and sequential execution resumes at the master thread.
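A minimal sketch of this directive-driven style is shown below. The CMIC$ DO ALL form is assumed here for Cray autotasking (the exact directives used for the ocean model appear in Chapter 5); the pre-processor turns the annotated loop into a routine executed concurrently by master and slave threads.

c     Illustrative autotasking directive (assumed syntax): loop iterations
c     may run concurrently, with i private to each thread and the arrays shared.
CMIC$ DO ALL PRIVATE(i) SHARED(n, c2dt, t_new, t_old, dtdt)
      do i = 1, n
         t_new(i) = t_old(i) + c2dt * dtdt(i)
      end do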

3.2.3 Clustering

Large processor configurations (more than 32 processors) are obtained by clustering SMP machines, with each machine connected via a high speed link. This approach has been used for machines such as the Cray J90, DEC Alpha, SGI Power Challenge, and Convex Exemplar. Memory models and interconnection schemes for clustered machines differ among vendors. In the Convex Exemplar, clusters are built by tying nodes (a machine with up to 16 processors) together through a one-way toroidal ring [39]. A crossbar switch monitors memory access: remote memory accesses are routed through the toroidal ring, while local memory accesses are directed to memory on the node. In this manner, a unified memory space can be used by a programmer, even though memory is physically located in different nodes. On the DEC Alpha, clustering is achieved by connecting FDDI links through a memory channel adapter. The memory channel adapter permits data written to one node's memory to be shared by other nodes on the memory channel bus; a write performed by a CPU on a node is reflected to other nodes. This provides the appearance of a global shared memory across a group of nodes. On the J90, a clustered configuration does not access a single shared memory space. Each machine in the cluster uses a local shared memory; however, the file space is shared using the shared file system (SFS), which allows high speed file sharing between machines in the cluster, and workload is distributed among processors in the cluster. The SGI Power Challenge can also be clustered; however, there is no provision for a global shared memory, and data must be shared using message passing between nodes.

3.3 MPP Machines

Several vendors have developed MPP machines; among the most popular are the Intel Paragon, Cray T3D, and IBM SP2. We have used the Cray T3D in our implementation; however, the code can be ported to any of the other MPP machines. All MPP machines have some form of distributed memory, a collection of homogeneous processors, and an interconnection architecture. The differences are in the individual processors, the type of interconnection architecture, and the communication protocols used. Code written with a standard communication protocol such as PVM or MPI can be easily ported among MPP machines. Most current scientific code is written in C, Fortran, or C++, and compilers for these languages have become standard on MPP machines. We will describe the architecture and programming techniques for the Cray T3D alone.

3.3.1 Cray T3D

Some of the key features of the Cray T3D are: 1) a globally addressable, physically distributed memory, 2) fast synchronization primitives, 3) a 3D torus interconnection architecture, and 4) DEC Alpha based processor elements.

The memory of the T3D is globally shared, yet physically distributed across all processors (Fig. 3.3). A memory location consists of an address with a PE number prefix in a 64-bit word. Local memory accesses are distinguished from remote memory accesses by the PE number: when the PE number of an address is not the same as the local PE number, a remote memory access is made. The communications hardware allows loads and stores from remote PEs. When a local memory location is updated by a remote PE, the corresponding cache data on the local PE is invalidated. This scheme allows PEs to update each other's memory locations while maintaining cache coherence.

Communication between PEs is achieved through one of three protocols: PVM, MPI, or shmem. We have implemented all three protocols for the ocean model and describe our results in Chapter 4. The shmem protocol exploits the global memory scheme and gives the best performance. The disadvantage of the shmem library is portability: a program with shmem calls cannot be executed on a non-Cray machine. The other two protocols, PVM and MPI, are portable but give lower performance. Shmem calls can be implemented with a put/get and a barrier. The put/get call is executed by the communication circuitry, while the barrier is executed by the processor. Since the T3D restricts the number of processors that can be used by an application to a power of two, the barrier call is executed using a binary tree scheme, which reduces the overhead of a barrier call. PVM and MPI, in contrast, use a two-way send/receive mechanism to exchange data.

The 3D torus interconnection method was chosen by Cray Research Inc. after experimenting with other interconnection methods [31]. For each method, the network capacity per node was determined for a range of 32 to 2048 processors. The network capacity per node was highest for a mesh architecture when 32 to 128 processors were used [1]. Beyond 128 processors and up to 1024 processors, the 3D torus was ideal, and a 4D interconnection scheme was best when more than 1024 processors were used. The 3D torus was chosen since it was assumed that most applications and systems would use between 128 and 1024 processors. Each processor of the 3D torus has 7 links (Fig. 3.3). Six links are used for the x, y, and z dimensions, with bi-directional links in each dimension; the seventh link is to the processor memory. Routing is ordered by dimension: a message is sent along the x dimension, then the y dimension, and finally the z dimension. The routing algorithm can also determine a new route if a link fails.

Figure 3.3: 3D Torus architecture for the Cray T3D

Each PE of the Cray T3D is a DEC Alpha processor running at 150 MHz [31]. An eight Kbyte direct mapped data cache and an eight Kbyte instruction cache are maintained on the processor. The data cache is arranged as a series of 256 lines with 4 words each. Each PE can have up to 64 Mbytes of main memory, organized in pages of 32 Kbytes each.

Each cache miss generates a cache line load from main memory, so data layout is crucial for good single PE performance. The small data cache relative to the size of main memory can cause significant performance degradation: an array that is mis-aligned with the cache can produce repeated misses. Attention must be paid to the relative offsets of consecutive memory accesses to reduce the number of misses. In particular, data should not be accessed with a stride that is a multiple of 1024, since the cache on a PE holds 1024 words and every such access would map to the same cache location and miss. Data alignment to minimize cache misses is not obvious for a large program. We have developed a cache model program which traces the cache performance of a single subroutine; its results indicate which arrays have the highest cache miss rate, and with this information data can be realigned to obtain better performance.
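To make the stride rule concrete, the following small C illustration (our own contrived example, not code from the model) shows the worst case: two arrays whose matching elements lie a multiple of 1024 words apart map to the same slot of the direct mapped cache, so alternating accesses evict each other and every load misses.

    /* Contrived illustration of cache thrashing on a 1024-word direct
     * mapped cache with 4-word lines.  If the linker places a and b
     * contiguously, a[i] and b[i] are exactly 1024 words apart and
     * therefore share a cache slot. */
    double a[1024], b[1024];

    double dot_product(void)
    {
        double s = 0.0;
        for (int i = 0; i < 1024; i++)
            s += a[i] * b[i];   /* each pair of loads evicts the other's line */
        return s;
    }

Padding one of the arrays by a few words breaks the alignment and restores normal line reuse.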

3.4 Programming Parallel Machines

Two types of programming are commonly used on parallel machines: data parallel (using shared memory constructs) and message passing (see Section 3.5). There are many ways for two programs to communicate, whether they run on the same host or on two hosts (Fig. 3.4). A message passing protocol such as PVM, MPI, Express, or P4 can be used: the id of the destination program is known, the source program sends data by calling a send routine from the protocol, and data is received similarly. The underlying transmission mechanism, which is transparent to the program, can be TCP, UDP, ATM, or threads. Another way of exchanging data between programs is to use SMP directives.



Figure 3.4: Inter-processor Communication Mechanisms for Parallel Machines

SMP directives instruct the compiler to generate parallel code, with statements describing the private and shared data variables. The code generated by the compiler can use threads or any other method to implement parallelism, and the generated code is transparent to the programmer. At a lower level, a program can use any of the following mechanisms: threads, sockets, shared memory calls, TCP, UDP, Remote Procedure Calls (RPC), or an API for Asynchronous Transfer Mode (ATM). If the destination program is on the same machine, there is no need to use the network; the operating system handles calls from any of the inter-process communication mechanisms to exchange data between programs on the same host. If the destination program is on a remote host, the network device driver and the network are used to exchange data, and the operating system on the remote host routes the data to the receiver program via one of these mechanisms.

3.5 Data Parallel vs. Message Passing

Either the data parallel scheme (using shared memory constructs) or the message passing scheme can be used to program an SMP machine. The difference between the two is the perspective the programmer takes of the whole machine: a global view (data parallel) or a local view (message passing). Data parallel programming relies on the compiler to insert the appropriate communication calls based on user directives. While this style of programming is easier than message passing, it depends on an optimized compiler for performance.

Data parallel compilers are more complex and costly to develop than message passing compilers, which typically consist of a single processor compiler and a communication library [29]. Data parallel compilers are written for a suite of applications and provide generic tools for parallelism. Their approach is to parallelize loops by dividing iterations over the number of processors and synchronizing after each loop. Compilers are conservative in parallelizing code with dependencies between iterations; unless the programmer modifies loops so that it is obvious to the compiler that iterations are independent, some inherent parallelism within an application will not be exploited. In message passing, parallelism is defined implicitly by running the same program on a number of processors, with separate blocks of iterations processed on separate processors. The productivity of a message passing programmer can be expected to be lower than that of a data parallel programmer, since the effort to convert sequential code to a message passing version is considerable relative to the effort to develop a data parallel version. The main advantage of message passing over data parallel is the ability to optimize parallel code for a particular application; the extra coding effort must be weighed against the benefit of higher performance.
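As a small illustration of the local view taken in message passing, the C sketch below computes the block of loop iterations owned by one copy of a program; the routine name and the simple blocking strategy are our own illustrative assumptions, not the ocean model's code.

    /* Minimal sketch: each program copy derives its own iteration block
     * from its processor number.  Names are illustrative. */
    void local_block(int n_iters, int pe, int npes, int *first, int *last)
    {
        int chunk = (n_iters + npes - 1) / npes;   /* ceiling division */
        *first = pe * chunk;
        *last  = (*first + chunk < n_iters) ? *first + chunk : n_iters;
        if (*first > n_iters)
            *first = n_iters;                      /* trailing PEs may get no work */
    }

Each copy then executes only the iterations from *first to *last, which is the implicit parallelism described above.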

3.6 Programming Methods for Portability

We have used the C pre-processor to generate portable code for six different machines and three communication protocols, and to run on a distributed, SMP, or MPP system. The advantages of using portable code are:

- independence from any particular vendor or machine,
- maintenance of a single source, and
- flexibility, since machines based on MPP, SMP, and vector processing will co-exist and the same code runs on any type of architecture.

We defined a pre-processor options file in our code which contains all the options necessary to generate code for a particular machine: the number of processors, the type of grid partitioning technique, the communication protocol, a flag for the use of SMP directives, a flag for a distributed or local run, and a flag for tracing communication calls. Before generating code for the target machine, the appropriate options for communication protocol, number of processors, etc. must be set in the pre-processor options file. After setting the options, a single file containing the entire ocean model code is generated with the C pre-processor and sent to the appropriate machine for compilation; a makefile on the execution machine builds the executable. In this scheme we maintain the source code on a UNIX workstation and route programs to different machines for compilation and execution. We have successfully run the ocean model on a DEC Alpha, Cray J90, Cray T3D, SGI Power Challenge, Cray C90, and a SUN workstation, tested the PVM, MPI, and shmem protocols on the Cray T3D, Cray J90, and DEC Alpha, and also ran experiments on SMP machines using directives in the code.

While portable code has a number of advantages, there are some disadvantages. When a single machine will be used for execution, the code can be optimized for that machine.

To develop portable code, we created higher levels of protocol- and machine-independent code, and machine dependent code was used at the lowest level possible. This adds overhead because of the additional function calls needed to reach the machine dependent code. Further, the communication code must be implemented so that all protocols can perform the same communication tasks, which implies that only basic calls such as send and receive can be used; a protocol's high performance broadcasts or alternate communication algorithms will not be exploited. Maintaining code for a large number of platforms and protocols is complex, and a number of ifdefs are scattered throughout the code. Reducing the number of ifdefs simplifies the code but leads to additional overhead from function calls; fewer ifdefs are required when the code is made as machine independent as possible. We conclude that while maintaining portable code through the C pre-processor is cumbersome, it is simpler than generating and maintaining separate versions of the code for each machine.
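The following C sketch shows the flavor of such a pre-processor wrapper. The macro names (USE_MPI, USE_PVM, USE_SHMEM), the routine send_block(), and the symmetric buffer remote_buf are illustrative assumptions and not the dissertation's actual identifiers; the shmem call and header follow the Cray T3D conventions as we recall them and should be treated as a sketch.

    /* One protocol-independent send routine, selected at pre-processing time. */
    #ifdef USE_MPI
    #include <mpi.h>
    #endif
    #ifdef USE_PVM
    #include <pvm3.h>
    #endif
    #ifdef USE_SHMEM
    #include <mpp/shmem.h>          /* header location on the Cray T3D */
    #endif

    void send_block(double *buf, int n, int dest, int tag)
    {
    #if defined(USE_MPI)
        MPI_Send(buf, n, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
    #elif defined(USE_PVM)
        pvm_initsend(PvmDataRaw);   /* no data conversion on homogeneous hosts */
        pvm_pkdouble(buf, n, 1);
        pvm_send(dest, tag);        /* dest is a PVM task id here */
    #elif defined(USE_SHMEM)
        /* one-sided put into a buffer assumed to be declared identically
         * (symmetrically) on every PE; 64 bit words on the T3D */
        extern double remote_buf[];
        shmem_put((long *)remote_buf, (long *)buf, n, dest);
    #endif
    }

Only the options file changes when the code is moved to a new machine; the call sites remain identical.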

Chapter 4. MPP Implementation

We have used the Cray T3D in our MPP implementation of the coupled ocean model. Several changes were made to port the ocean model from a uni-processor implementation to an MPP implementation, including grid partitioning, distribution and collection of data, and handling multiple communication protocols; these changes are common to parallel applications using 3D regular grids. Our contributions include a new algorithm to find an optimal processor grid for an irregular blocking grid partitioning technique, an algorithm to distribute and collect data efficiently, and a cache trace tool to evaluate cache performance on a single PE. We have compared the performance of our techniques with previous methods and achieved better results, as shown below. We also evaluated the performance of three communication protocols for the coupled model application. The cache trace tool we developed can be used to evaluate single PE cache performance on any parallel machine; it does not require the large simulator used in other cache performance tools [15].

4.1 Grid Partitioning Techniques

A variety of grid partitioning techniques have been proposed for numerical problems employing grids [13], each with particular advantages for a numerical solution [25]. In this section we review three grid partitioning techniques for the ocean model. The number of grid points per processor is determined during the initialization of the model; for a balanced load, approximately equal numbers of grid points must be assigned to all processors. Since the ocean model grid is regular, the domain decomposition is performed along the three axes. For a 3D grid, the possible partitionings are 1D partitioning along the x, y, or z axis, 2D partitioning along combinations of two of the three axes, and 3D partitioning into 3D blocks. Partitioning along the z axis is not efficient when vertical integrations are performed, because of the added communication overhead between layers. Each grid partitioning technique has pros and cons in terms of the volume of data communicated and the number of neighbor processors. The performance of each technique is evaluated assuming that the computation for a grid point requires data from four (north, south, east, and west) neighbors.

4.1.1 Striping

Striping can be performed along any one of the three axes. A stripe is assigned to a single processor and consists of a sub grid whose size is determined by the number of stripes (Fig. 4.1); a large number of stripes results in small sub grids. A load balanced implementation must have equal sub grid sizes on all processors. For any sub grid size, a processor must communicate with its north and south neighbor processors, and the number of bytes exchanged equals the size of the x dimension of the grid (a shadow row). The advantages of this technique are simpler coding and a neighbor count that is always 2. Dividing the grid is simple since the entire grid can be viewed as a one dimensional array split into sections. During an exchange, the shadow row is not copied into a temporary message buffer; the shadow data can be exchanged by simply providing an address and a length. A one dimensional distribution of the grid also makes the gather/scatter of global grid data simple (see Section 4.3). For striping, the maximum number of processors is limited to the size of one dimension of the grid.

4.1.2 Regular Blocking

Blocking is a two dimensional partitioning technique which reduces the volume of inter-processor data communication compared to striping (Figs. 4.1 and 4.2). The grid is divided horizontally and vertically into blocks of equal areas (or grid points). Each block requires a shadow row from the north and south processors and a shadow column from the east and west processors; the two shadow rows adjoin the bottom and top rows, and similarly the two shadow columns adjoin the leftmost and rightmost columns. As the number of processors increases, the volume of data communicated per processor decreases, but each processor must exchange data with 4 processors. The implementation is more complex than striping: the grid cannot be viewed as a one dimensional array, code must be written to build blocks of global data, and additional code must handle the gather and scatter of shadow data. This additional code degrades performance compared to striping for large grids on small processor configurations. However, the number of processors that can be used with blocking exceeds the number that can be used with striping.



Figure 4.1: Regular Blocking and Striping



Figure 4.2: Regular Blocking for the Global Topography


4.1.3 Irregular Blocking

For a global ocean model grid, approximately 40% of the grid points are land points and the remainder are water points (Fig. 4.3). No useful computation is performed on land points, so a lower execution time can be obtained with an irregular blocking technique [3]. We show how an optimal processor grid can be obtained for a given data grid and processor configuration. The benefits of irregular blocking, though not dramatic for the global topography, can be larger for topographies where land points are over 50% of all grid points.

Figure 4.3: Irregular Blocking for a Global Topography

Using the algorithm below for a given processor configuration and topography, we can calculate the optimal processor grid. The optimal processor grid should have the maximum number of processors that can be used while still covering all water points. The best solution results in a small sub grid per processor, which means fewer iterations per do loop and a correspondingly lower execution time; with lower communication overhead and computation demand, irregular blocking outperforms regular blocking. The algorithm can be used for any given topography. By altering the topography, for example by adding extra columns or rows of land along the edges of the grid, optimal solutions can be obtained which are not possible for the original topography, since an altered topography admits more factorizations of the processor grid. For example, in Fig. 4.3 we have used a 36 × 20 processor grid for a 288 × 260 data grid with 512 physical processors; the dimensions of each sub grid are 8 × 13, or 104 grid points per processor. For regular blocking, a 32 × 16 processor grid is used for a 288 × 256 data grid (Fig. 4.2), giving sub grids of 9 × 16, or 144 grid points per processor. The sub grid for irregular blocking is smaller by 40 points despite the larger data grid; a smaller sub grid implies faster computation and a lower volume of communication. On the T3D, processor allocations are restricted to powers of two, so the number of combinations of processor and data grids is limited.

Algorithm for Optimal Processor Grid

- Read topography mask
- Set procnum to a maximum number of processors
- Do while procnum >= number of real processors
-   Find all factors of procnum compatible with the data grid
-   Do for each factor pair
-     Compute the number of land processors
-     If (no. of land procs + no. of real procs >= procnum)
-       then save the factors and exit
-     Endif
-   Enddo
-   Decrement procnum
- Enddo
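A minimal C sketch of this search is given below. The 0/1 land mask layout, the requirement that candidate factors divide the grid dimensions exactly, and all names are our own simplifying assumptions for illustration; they do not reproduce the dissertation's code.

    #define NX 288
    #define NY 260

    static int land_mask[NY][NX];          /* 1 = land point, 0 = water point */

    /* count processor blocks that contain only land points for a px-by-py grid */
    static int count_land_blocks(int px, int py)
    {
        int land = 0;
        int bx = NX / px, by = NY / py;    /* assumes px, py divide NX, NY */
        for (int j = 0; j < py; j++)
            for (int i = 0; i < px; i++) {
                int water = 0;
                for (int y = j * by; y < (j + 1) * by && !water; y++)
                    for (int x = i * bx; x < (i + 1) * bx; x++)
                        if (!land_mask[y][x]) { water = 1; break; }
                if (!water) land++;
            }
        return land;
    }

    /* search downward from max_procs for the largest factorable processor grid
     * whose water blocks fit on the real processors */
    int optimal_grid(int real_procs, int max_procs, int *best_px, int *best_py)
    {
        for (int procnum = max_procs; procnum >= real_procs; procnum--)
            for (int px = 1; px <= procnum; px++) {
                if (procnum % px) continue;
                int py = procnum / px;
                if (NX % px || NY % py) continue;      /* need whole blocks */
                if (count_land_blocks(px, py) + real_procs >= procnum) {
                    *best_px = px;  *best_py = py;
                    return procnum;
                }
            }
        return real_procs;                             /* fall back to regular blocking */
    }

For the 288 × 260 grid with 512 real processors this kind of search admits the 36 × 20 solution quoted above, since 288/36 = 8 and 260/20 = 13.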

The optimal processor grid will contain more processors than the number of real processors used: after subtracting land processors, the number of real processors is sufficient to cover all water points. The differences between the two blocking schemes are in inter-processor communication and in the collection/distribution of global data. Shadow data is not exchanged with a neighboring land processor, since that processor does not exist, and processor 0 does not transmit or receive global data from a land processor.

Table 4.1: Optimal Processor Grids for Ocean Grid

  Number of    Processor Grid   Number of land   Percentage increase
  Processors   Dimensions       Processors       in Processors
  64           8 × 9            8                12.5
  128          12 × 13          33               21.9
  256          32 × 10          67               25.0
  512          36 × 20          220              40.6
  1024         72 × 20          466              40.6

Table 4.1 shows the improvement of the optimal solutions for larger processor grids for a 288 × 260 data grid. With larger processor configurations, the number of land processors increases and a higher resolution processor grid can be used. The percentage increase is the percentage of processors in excess of the real number of processors. For the 512 processor configuration we use a 36 × 20 grid, or 720 processors; the number of processors in excess of 512 is 208. The percentage increase will reach a stable value with more processors and should be approximately the same as the percentage of land in the data grid. The irregular decomposition technique yields significant benefits when high resolution processor and data grids are used.

4.2 Execution Timing Analysis

We have analyzed the execution time of the model, excluding I/O. The time for I/O is approximately 30% of the total time to complete a one month simulation; I/O time is not modelled here, and the focus is on execution time alone. The processing time tp for a timestep is

    tp = tcomm + tcomp + tidle

where tcomm and tcomp are the communication and computation times per timestep, respectively, and tidle is the time spent waiting. The computation time is an approximately linear function of the number of grid points ng; this was confirmed by measuring the computation time for several grid sizes on a single processor of the Cray T3D. ng depends on the number of processors used and the dimensions of the data grid, and for a balanced load ng must be the same on all processors. The communication time tcomm depends on the inter-processor bandwidth and the latency time; the latency time for the Cray T3D is approximately 2.5 µs [31]. The time to exchange data is a function of the bandwidth, the latency, and the number of bytes transmitted and received. Data can be exchanged using PVM or MPI [14, 16] send and receive calls; on the Cray T3D, shared memory calls provide a higher bandwidth than send and receive calls. We have used shared memory calls in the model and have included code for communication using PVM or MPI. In our experiments, we found the Cray shmem protocol to be the fastest, followed by MPI and PVM.

4.2.1 Striping vs. Regular Blocking

The three components of communication time are the latency time, the transmission time, and the gather/scatter time. The latency time cannot be changed by a program, but the transmission time can be reduced by sending data packets that are near the peak of the bandwidth curve. For example, on the Cray T3D the latency time is on the order of microseconds, while the transmission time is in tens of µs for packets of 40 Kbytes or less and in hundreds of µs for packets up to 100 Kbytes. With the striping technique, data can be sent by providing an address and the number of bytes, since shadow data are stored in contiguous locations. With the blocking technique, pre-processing must be performed before sending data and after receiving data (the gather/scatter operation) [17]; this additional overhead must be added to the latency and transmission times, and depending on the dimensions of the sub grid the gather/scatter time can be several milliseconds. The communication time tcomm to exchange shadow data for the striping and blocking techniques can be defined as

    tcomm = tlat + ttran + tgs

where tlat, ttran, and tgs are the latency time, the transmission time, and the gather/scatter time, respectively. In striping,

    tlat = 2 tl,   ttran = 2 tx,   tgs = 0

where tl is the latency time to transfer data from the application to a system message buffer and tx is the time to transmit a row of x bytes. In blocking,

    tlat = 4 tl,   ttran = 2 (txp + tyq),   tgs ≠ 0

where txp and tyq are the times to transmit xp and yq bytes, respectively. If p and q are the dimensions of the processor configuration in the x and y directions, then for a grid with dimensions x × y, xp = x/p and yq = y/q. tgs, txp, and tyq vary with the processor configuration and grid dimensions. In general, striping is better than blocking when

    2 tx < 2 (tl + txp + tyq) + tgs

4.2.2 Regular Blocking vs. Irregular Blocking

Irregular blocking is similar to regular blocking but includes preprocessing to determine land and water processors. Processor 0 reads the topography mask and builds a table consisting of real processor numbers, virtual processor numbers, and a flag indicating a water or land processor; this table is broadcast to all processors. A virtual processor number is based on the location of a processor on the processor grid, and a real processor number is the T3D PE number associated with that virtual processor number. Shadow data is exchanged with the four neighbor processors to the north, south, east, and west, but no exchange is performed when the neighbor is a land processor. The benefit of irregular blocking is a smaller sub grid for a fixed number of processors: with a smaller sub grid, tcomp and tcomm are both smaller, giving better overall performance. The experimental results show the improvement that can be expected with irregular blocking for the global topography.
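A minimal sketch of one entry of the broadcast table is shown below; the field names are assumptions chosen for illustration, not the dissertation's declarations.

    /* one entry of the virtual-to-real processor table built by processor 0 */
    struct proc_entry {
        int virtual_pe;   /* position on the logical (possibly larger) processor grid */
        int real_pe;      /* T3D PE number assigned to this block, or -1 for land */
        int land;         /* 1 if the block contains only land points */
    };

During a shadow exchange, a processor skips any neighbor whose entry is marked land.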

4.3 Distribution and Collection of Data

For coupled model runs there is a frequent exchange of forcing data between the ocean and atmospheric models. In a sequential implementation all forcing data is available on a single processor; in a parallel implementation, forcing data must be collected and distributed at each coupling interval. Distribution and collection of data also occurs during I/O: when a run is started, data from a restart file is distributed to all processors, data is collected for a restart file at the end of a run, and data is collected for history files during the run. For small processor configurations the overhead is not excessive, but with a large processor configuration (more than 64 processors) the overhead for collecting and distributing data increases rapidly. In our 256 processor implementation, approximately 10% of the execution time was spent distributing and collecting data. We have used two approaches for data distribution and collection - a hub scheme and a recursive scheme.

4.3.1 Hub Scheme

A hub scheme was used to collect and distribute global data (Fig. 4.4), with processor 0 acting as a hub for all other processors.



Figure 4.4: Hub vs. Recursive Schemes for Striping

During the distribution process, the global grid was divided into sections within processor 0 based on processor number, and a sub grid was sent to each of the other processors. This scheme is inherently sequential, since only one sub grid can be sent at a time, and all other processors are idle during the collection and distribution process. This increases the serial fraction of the parallel implementation, resulting in lower efficiency.


4.3.2 Recursive Scheme

We describe a new recursive scheme to collect and distribute global data with less communication overhead. During the collection process, a global grid is built in log2(n) steps, where n is the number of processors. For an 8 processor configuration, a global grid is assembled in 3 steps (Fig. 4.4). In step 1, odd numbered processors send their local grids to a neighboring even numbered processor one hop away. In step 2, every fourth processor starting with processor 2 sends its current composite grid to a neighbor processor two hops away. In step 3, every eighth processor starting with processor 4 sends its current composite grid to a neighbor processor four hops away. At the end of step 3, the global grid has been assembled on processor 0. The same process is applied in reverse order to distribute a global grid: in the first step processor 0 divides the global grid in half and sends the upper half to a neighbor processor 4 hops away, in the second step the sub grid is further divided and a quarter of the global grid is sent to a neighbor processor 2 hops away, and in the third step an eighth of the global grid is sent to a neighbor processor one hop away.

The recursive scheme can be used effectively when striping is the grid partitioning technique, since the global grid is built by accumulating larger sections of rows at each step. When blocking is used (Fig. 4.5), an additional step must be implemented. In the first step, the grid is recursively divided along the y dimension until every processor in the leftmost column of processors holds the sub grid for a row of processors; this step is common to both striping and blocking. The second step, for blocking alone, is the recursive division of each row's sub grid along the x dimension: blocks are built from a row and recursively divided into sub-blocks until each processor in the row receives its block. The recursive scheme is particularly efficient for a parallel computer with a hypercube architecture, where all transmissions to neighbor processors are accomplished in a single hop. On the T3D, the time to transmit and receive data between any two processors is essentially uniform [31].
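A compact sketch of the log2(n) collection for striping is given below, written with plain MPI point-to-point calls for clarity. The buffer sizing, the names, and the power-of-two assumption are illustrative; the dissertation's implementation uses its own protocol layer.

    #include <mpi.h>
    #include <string.h>

    /* Recursive collection of a striped grid onto PE 0.  Every PE owns
     * rows_per_pe rows of length rowlen; npes is assumed to be a power of
     * two (as on the T3D), and "global" must be large enough on each PE to
     * hold the blocks it accumulates. */
    void recursive_gather(const double *block, int rows_per_pe, int rowlen,
                          double *global, int pe, int npes)
    {
        int have = rows_per_pe;                        /* rows currently held */
        memcpy(global, block, (size_t)rows_per_pe * rowlen * sizeof(double));

        for (int step = 1; step < npes; step <<= 1) {
            if (pe & step) {                           /* sender at this level */
                MPI_Send(global, have * rowlen, MPI_DOUBLE,
                         pe - step, 0, MPI_COMM_WORLD);
                return;                                /* done after handing off */
            } else if (pe + step < npes) {             /* receiver at this level */
                MPI_Recv(global + (size_t)have * rowlen, have * rowlen,
                         MPI_DOUBLE, pe + step, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                have *= 2;
            }
        }
        /* on exit from the loop, PE 0 holds the assembled global grid */
    }

At step 1 the odd PEs hand off to their even neighbors, at step 2 PEs 2, 6, 10, ... hand off two hops away, and so on, exactly as described in the text; distribution applies the same pattern in reverse.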

4.3.3 Other Schemes

We conducted an experiment to see whether other schemes would outperform the recursive scheme. We distributed a grid of dimensions 512 × 512 across 64 processors 10,000 times, using trees with different numbers of branches (Table 4.2). In the recursive scheme the sub grid is recursively divided in half at every level of the tree, and a processor communicates with only one other processor at each level. We also tested schemes in which a processor communicates with multiple processors at each level; trees which communicated with 4, 8, 16, and 32 processors at a level were used. The number of transmissions increases with broader trees (fewer levels), and a tree with only one level is equivalent to the original hub scheme. For all trees used, the total volume of data transmitted was the same (63/64ths of the grid). The recursive scheme gives the best results since its number of transmissions is the lowest of all the trees. On the Cray T3D the difference in time between the hub and recursive schemes is not large, but for a system with a higher latency time the recursive scheme will give higher communication performance.



Figure 4.5: Recursive Scheme for Blocking



Table 4.2: Distributing a 512 × 512 grid (64 processors)

  Tree Type      No. of transmissions   Time for 10,000 distributions (secs.)
  2-2-2-2-2-2    6                      163.4
  4-4-4          9                      163.8
  8-8            15                     164.0
  16-4           18                     164.3
  32-2           32                     165.1
  64             63                     167.0
  Overlapped     7                      163.5


Hub vs. Recursive Analysis

If tl is the latency time, ttran is the transmission time, and x × y is the dimension of the grid, then for num processors the time to distribute the grid using the recursive scheme is

    Σ_{i=1..dim} [ ttran(x·y / 2^i) + tl ],   where 2^dim = num.

Similarly, the time to distribute the grid using the hub scheme is

    Σ_{i=1..num-1} [ ttran(x·y / num) + tl ].



The volume of data transmitted is the same in both schemes, but in the recursive scheme larger packets are transmitted at a higher bandwidth [31]. When the latency time tl is 2.5 µs, the difference between the two schemes for 10,000 distributions on 64 processors is 1.5 seconds; when the latency time is 72 µs, the difference is 45 seconds. Therefore, on machines with a higher latency than the Cray T3D, the difference in performance between the two schemes will be larger.

Overlapped Scheme

The overlapped tree type is a modification of the recursive scheme. If processor 0 is distributing the grid, then at the first level of communication 1/4 of the grid is transmitted twice instead of sending 1/2 of the grid (Fig. 4.6). The idea behind this modification was to see whether starting the distribution of data at the lower levels of the tree sooner would improve performance. The disadvantage of the scheme is the additional transmission required of processor 0.








Figure 4.6: Overlapped Recursive Scheme to Distribute Data

From the experimental results in Table 4.2, there is no benefit in splitting the communication.

4.4 Cache Trace Tool

A parallel implementation of the ocean model on the Cray T3D with 256 processors achieved a throughput equivalent to a gigaflop, or about 3.8 Mflops per processor, while the Alpha processor of the Cray T3D has a peak performance of about 150 Mflops. The reason for the lower megaflop rate of the ocean model was assumed to be the small cache size. To determine the effect of cache size on performance, a cache trace tool was built to calculate the hit percentage for every variable of a subroutine. The cache of a single PE consists of 256 lines [31], each line holding a 32 byte segment, or four words, of the cache. The cache is direct mapped, and any element maps to its address modulo 1024; an element is assumed to be an eight byte field. For every cache miss, a cache line of four words is loaded from main memory; this operation takes 21 or 30 clock periods depending on whether there is a page hit or miss in main memory. The example below shows the cache hit percentage for array elements.

Example:

      real a(37),b(37),c(37)
      integer i,j

      do j=1,5
        do i=1,37
          c(i) = a(i) * b(i)
        enddo
      enddo

  Cache Table

  Name    Hits    Misses
  a       148     37
  b       148     37

In the first iteration of the outer loop in the example above, all references to elements of a and b result in cache misses; for all other iterations of the outer loop, the references to elements of a and b are in cache. Four write buffers, each one cache line long, are used for memory writes. The cache hit rate cannot be estimated for a complex program by observation alone. Several cache performance tools have been developed [10], some with a graphical user interface for examining cache performance and profiling the cache hit percentage by variable.

The performance tool we developed simulates cache performance for multiple processor configurations and various cache sizes. Instead of using a multiprocessor simulator, we replace the subroutine to be profiled with generated code: a code generator creates code that simulates the memory references of the subroutine, the generated code is included with the rest of the model code intact, and a single timestep of the simulation is executed to produce the cache table.

  Input Subroutine -> Code Generator -> Output Profile Code
  Output Profile Code -> Cache Table

Using the above technique, a single subroutine of a complex application can be studied in isolation without profiling the entire application, and a multiprocessor simulator is not required since the generated code runs with the application to build the cache table. The code generator is a C program which scans do loops and generates memory references for the variables of the subroutine; a symbol table of the subroutine variables is maintained. The cache table is an array of 1024 unsigned long integers, each element containing a memory address. The table is maintained by updating entries as needed based on the memory references generated; a cache line is the unit of replacement, and replacements are made starting from a cache line boundary.
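The core bookkeeping is small. The C sketch below records hits and misses for a direct mapped 1024-word cache with 4-word lines, in the spirit of the table described above; the function and counter names are our own, and a per-variable breakdown would keep one pair of counters per array rather than the global pair shown here.

    #define CACHE_WORDS 1024
    #define LINE_WORDS  4

    static unsigned long cache_tag[CACHE_WORDS];  /* word address held in each slot */
    static long hits, misses;

    void reference(unsigned long word_addr)
    {
        unsigned long slot = word_addr % CACHE_WORDS;

        if (cache_tag[slot] == word_addr) {
            hits++;
        } else {
            /* miss: load the whole 4-word line, starting at a line boundary */
            unsigned long base = word_addr - (word_addr % LINE_WORDS);
            unsigned long line_slot = base % CACHE_WORDS;
            for (int w = 0; w < LINE_WORDS; w++)
                cache_tag[line_slot + w] = base + w;
            misses++;
        }
    }

Feeding every generated word address of the profiled subroutine through reference(), with separate counters per array, reproduces the per-variable hit and miss counts of the cache table.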

After running a Cray performance tool on the ocean model, one of its subroutines, update, was found to be the most time consuming; it consumed approximately 15% of the total execution time excluding I/O, and every remaining subroutine consumed less. The cache performance tool was therefore used to analyze the update subroutine alone. The function of update is to calculate the new state of the ocean after the computation of partial derivatives from the hydrodynamic calculations. It consists of a single outer loop over each layer of the ocean; the grid point values for layer thickness, horizontal velocities, and tracers are revised, and two inner loop divisions are performed. Since update uses the most CPU time in our analysis, we focused on this subroutine; similar approaches can be used to analyze other subroutines. By re-organizing the data layout and code, we obtained an improvement of about 33% in the performance of update (see Section 4.6.3).

4.5 Communication Protocols

On the Cray T3D, any one of three communication protocols - PVM, MPI, or shmem - can be used. Most parallel machines (SMP and MPP) support PVM and MPI.

PVM was introduced in the early 1990s by Oak Ridge National Laboratory [38]. It runs on most flavors of UNIX and on numerous architectures. The key feature of PVM was the ability to run on a network of heterogeneous hosts using a library and a daemon. The library includes a set of PVM calls to handle communication between processes on local and remote hosts, spawn processes on any host, asynchronously join groups of cooperating processes, and handle data conversion.

The PVM daemon controls process execution and the transmission of data on the network of hosts. A PVM daemon runs on each host in the network; one daemon is designated the master and the others slaves, and the master daemon can start or terminate slave daemons. An application communicates with local processes using TCP and with remote processes using UDP through the daemon. When the PvmRouteDirect option is used, an application can communicate directly with a remote application over TCP, bypassing the daemon; otherwise the application message buffer must be copied to a system message buffer when data is transmitted. The advantage of UDP is the ability to communicate with any number of remote UDP sockets through a single local UDP socket, since setting up a separate TCP connection for every pair of hosts on the network is expensive; TCP offers better performance than UDP but lower scalability. The main advantages of PVM are portability and affordability: it is freely available on the internet and is implemented on over 35 different architectures, ranging from the Cray T3D to PCs. Its disadvantage is poor performance, because only generic features of the UNIX operating system are used in order to preserve portability across architectures. Vendors such as Cray, DEC, and Silicon Graphics have implemented customized versions of PVM which offer higher performance, but since there is no standard for PVM, new commands were introduced for particular machines and some commands were not implemented in the custom versions. This solves the performance problem for one machine, but portability of the code can no longer be assumed.
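The PVM calls involved in a typical exchange look roughly as follows; the task-id handling and the message tag are illustrative assumptions, not the coupled model's actual code.

    #include <pvm3.h>

    void send_row(double *row, int n, int dest_tid)
    {
        pvm_setopt(PvmRoute, PvmRouteDirect);   /* bypass the daemon, use TCP */
        pvm_initsend(PvmDataRaw);               /* no conversion on homogeneous hosts */
        pvm_pkdouble(row, n, 1);
        pvm_send(dest_tid, 99);                 /* 99: an arbitrary message tag */
    }

    void recv_row(double *row, int n, int src_tid)
    {
        pvm_recv(src_tid, 99);
        pvm_upkdouble(row, n, 1);
    }

The pack/unpack steps are part of the copying overhead discussed above; with PvmDataDefault instead of PvmDataRaw, PVM would also convert the data for heterogeneous hosts.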

While PVM provided a way to implement distributed applications on a network of heterogeneous hosts, many vendors developed custom communication protocols for their machines; IBM, Intel, Thinking Machines, and Cray all developed independent high performance communication protocols. At the Supercomputing '92 conference, a committee later known as the MPI Forum was formed to define a message passing standard. The standard would provide a scheme for writing portable parallel code and would consist of a library providing the functionality specified in the MPI standard. The participants in the MPI Forum included most supercomputer vendors and a number of government agencies. The functions of the MPI library were defined by the MPI Forum, while the implementation was machine dependent. The first version of the MPI library was announced in 1994; it contained most of the functionality provided by the PVM library, but did not include the ability to run on a network of heterogeneous hosts. The benefits of using MPI are portability, standardization, and efficiency.

The shmem protocol is Cray's implementation of a message passing library. It can be used on Cray's SMP and MPP machines, and on both machine types it gives the best performance. The shmem routines allow a processor to read or write another processor's memory without interrupting the remote processor; when data is stored at a remote processor, the cache location corresponding to that address is invalidated. The shmem routines perform no error detection, and the burden of synchronization and of verifying data communication rests with the programmer.

Therefore, a misplaced shmem call will not automatically lead to an error or deadlock during execution, but will instead generate erroneous results. In the design of the shmem routines, ease of use was sacrificed for speed; the added programming effort required to verify communication between processes is the price paid for the higher communication performance.
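A minimal sketch of the put/barrier style of exchange is shown below. It uses OpenSHMEM-style names (shmem_init, shmem_double_put, shmem_barrier_all); the Cray T3D library of the time used slightly different routine names and headers, so the details should be treated as illustrative, and the row length is an assumption.

    #include <shmem.h>

    #define NX 288                          /* assumed shadow-row length */

    static double shadow_from_south[NX];    /* symmetric: exists on every PE */
    static double my_top_row[NX];

    int main(void)
    {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        /* ... fill my_top_row from the local sub grid ... */

        if (me + 1 < npes)                  /* north neighbor exists */
            shmem_double_put(shadow_from_south, my_top_row, NX, me + 1);

        /* no acknowledgement is sent: the barrier is the only guarantee that
         * every PE's shadow data is in place before it is used */
        shmem_barrier_all();

        shmem_finalize();
        return 0;
    }

Because the put is one-sided, the programmer, not the library, is responsible for making sure the target buffer is not being read or rewritten while the transfer is in flight, which is exactly the burden described above.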

4.6 Analysis of Results

Our results from experiments with grid partitioning techniques, distribution and collection of data, and the cache performance model are described in this section. The first set of results are timings for various combinations of processor configurations and grid partitioning techniques; we plot execution time excluding I/O, communication time, and speed up. The second set of results compares the hub and recursive schemes for distributing and collecting data with 1D and 2D partitioning. The final results are from the cache performance model for multiple cache sizes and processor configurations.

4.6.1 Grid Partitioning

The results for grid partitioning are based on a 1 month simulation (744 timesteps) for a 288 × 256 × 8 grid. The execution and communication times are for the ocean model alone; the time to run the atmospheric model and to perform I/O is excluded. The communication time includes the time to distribute and collect data for coupling as well as inter-processor communication. The number of processors used ranges from 64 to 512.


Figure 4.7: Communication Time for three Grid Partitioning Techniques

The 512 processor runs were executed on a Cray T3D at the Pittsburgh Supercomputing Center (PSC) and the other runs on a Cray T3D at the Jet Propulsion Laboratory (JPL). The results should be representative of the Cray T3D since both machines use the same Alpha processors. All executables, including the 512 PE executable, were built on the Cray T3D at JPL with the same compiler and operating system, and the 512 PE run at PSC used the JPL executable without changes. No results are plotted for striping with 512 processors, since the maximum number of processors that can be used with striping along the y axis of the 288 × 256 grid is 256.

Irregular blocking gives the best results and striping the worst of the three grid partitioning techniques (Figs. 4.7, 4.8, and 4.9). The reason for the poor performance of striping is its high inter-processor communication overhead: for example, with a 64 processor configuration the communication overhead is 121 secs. for striping and 62 secs. for blocking. The volume of communication affects performance, as the difference in timing results shows; striping's benefit of communicating with only two neighbors is not sufficient to overcome the lower communication volume of blocking, especially for large grids. The superior performance of irregular blocking is explained by its smaller communication overhead and smaller sub grid size - the first results in lower communication time and the second in lower computation time. With larger processor configurations, the difference in sub grid sizes between blocking and irregular blocking decreases, and the benefits of irregular blocking are correspondingly lower.


Figure 4.8: Execution Time for three Grid Partitioning Techniques

The largest execution time difference between the two techniques is for the 64 processor configuration. Irregular blocking is effective when a large grid with a significant land area is used.

In our next experiment (Fig. 4.10) we evaluated processor grid dimensions for a fixed processor configuration using regular blocking, since a processor grid shape must be chosen for a fixed number of processors. We ran 7 tests with a 256 processor configuration, with grid dimensions ranging from 128 × 2 to 2 × 128, measuring execution time excluding I/O in each case. The 128 × 2 and 8 × 32 configurations took 185 and 132 secs. respectively for a 1 month simulation, a difference of 53 secs. The poor performance of the 128 × 2 configuration is explained by the long columns of its sub grids: the time to gather/scatter data from long columns during inter-processor communication is responsible for the high execution time. Therefore, in choosing processor grid dimensions for regular blocking, it is better to use a grid that is close to square.

4.6.2 Recursive vs. Hub Schemes

In Figs. 4.11 and 4.12, the results of the recursive and hub schemes are compared for a 288 × 256 × 8 grid. The recursive scheme scales well and gives better performance than the hub scheme in all processor configurations, and for blocking the communication time grows faster with the hub scheme than with the recursive scheme. The recursive scheme is especially effective for machines with a high latency time (see Section 4.3.3).


Figure 4.9: Speed Up Curve for three Grid Partitioning Techniques


Figure 4.10: Timing Results for Regular Blocking using different Processor Grid Dimensions


Figure 4.11: Comparison of Hub vs. Recursive for Striping


Figure 4.12: Comparison of Hub vs. Recursive for Blocking


4.6.3 Cache Trace Tool

We computed the average cache hit rate for a subroutine as the cache size was varied from 8 Kbytes to 1 Mbyte and the number of processors from 64 to 256. In Fig. 4.13, the hit rate vs. cache size is plotted for three processor configurations for the update subroutine of the model; the cache size was varied by changing the size of the cache table in the code generator. The highest cache hit percentage, 61.6%, was obtained for the 256 processor configuration with a 1 Mbyte cache, and the lowest, 8.22%, for the 128 processor configuration with an 8 Kbyte cache. From the plot in Fig. 4.13 we can conclude that, for a Cray T3D processor, the cache hit percentage of the update subroutine is a function of grid size, processor configuration, and cache size. The knee of the curves in Fig. 4.13 occurs at a cache size of 128 Kbytes; increasing the cache beyond 128 Kbytes does not yield significantly better performance, so a reasonable cache hit rate can be obtained at about 128 Kbytes without the additional expense of a large cache memory.

The performance of a subroutine cannot be predicted from the cache hit rate alone; it depends on the combination of memory reference patterns, cache size, and dependencies. The update subroutine was altered to improve the cache hit percentage by grouping memory references, and shorter loops were also used. The Alpha processor has an instruction cache which holds 2K machine instructions; shorter loops speed up instruction fetches since fewer fetches result in a cache miss. The instruction cache performance has not been modelled.


Figure 4.13: Average Cache Hit Rate for the Update Subroutine

We observed that using shorter loops also improved the data cache hit rate. Some dependencies were eliminated, and an inner loop division operation was pre-computed once for each loop iteration. With these modifications, the execution time of the update subroutine fell from 15.7 secs. to 10.5 secs. for a 64 processor run, an improvement of approximately 33%. There is an obvious benefit to a higher cache hit rate, but a higher hit rate does not always result in lower execution time; other factors such as dependencies and loop optimization also affect performance.
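The kind of division hoisting involved is illustrated below, shown in C for brevity although the model itself is Fortran; the routine and variable names are invented for the example and are not the update subroutine's code.

    /* Replace a per-point division by one reciprocal computed outside the loop. */
    void scale_layer(double *t, const double *tendency, double h, int n)
    {
        double rh = 1.0 / h;              /* one division per call ...            */
        for (int i = 0; i < n; i++)
            t[i] += tendency[i] * rh;     /* ... only multiplications in the loop */
    }

Since a floating point divide costs many more cycles than a multiply, removing it from the inner loop compounds the gain from the improved cache behavior.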

4.6.4 Communication Protocol Evaluation

Three message passing protocols - PVM, MPI, and shmem - were evaluated. A grid of dimensions 144 × 78 × 7 was used with 2, 4, and 8 processors, and 240 timesteps were executed in each run. The results for each protocol are shown in Figs. 4.14 and 4.15. The shmem protocol gives the best results, followed by PVM and MPI; the reasons for the superior performance of shmem are explained in Section 4.5. For eight processors, the use of shmem gives a speed up of about 7.


Figure 4.14: Execution times using three Communication Protocols


Figure 4.15: Speed Up curves using three Communication Protocols

Chapter 5. SMP Implementation

We used the Cray J90 and DEC Alpha SMP machines in our experiments, with both message passing and SMP directives on each machine. In an SMP computer, multiple processors share memory and the system bus, and a threaded operating system assigns tasks to individual processors. Parallelism is expressed by using pre-processors such as KAP [22] and FMP [6] to generate code for the parallel sections of a program through directives. One limitation of SMP machines is low scalability; our experiments show how speed up is limited by memory bandwidth. The use of a cluster of SMP machines, which may improve scalability, has not been evaluated.

5.1 Programming SMP machines

Automatic parallelization of existing sequential code remains infeasible for the majority of applications; tools such as KAP succeed only when the code is simple enough to reveal its inherent parallelism, so developing parallel code is still a manual task. We describe some of the data parallel software tools used to program SMP machines. Most of the tools follow a standard approach of declaring parts of a program as sequential or parallel: the parallel sections are enclosed in directives describing parameters for the compiler, such as shared variables, private variables, the maximum number of processes assigned to a parallel section, and the minimum number of loop iterations required to trigger the generation of parallel code.


We show how this is accomplished on three machines - the DEC Alpha, the SGI Power Challenge, and the Cray J90.

On the DEC Alpha, either the KAP parallelizer or a message passing system such as PVM can be used to implement parallel code; we have implemented both methods for the ocean model. KAP provides three levels of parallelization. The first level is the simplest: a compiler option is set and automatic parallelization is performed, but we obtained poor results with it. The second level is implicit parallelization, adding directives for loops which may contain parallelism; the results of this scheme depend on the intelligence of the pre-processor. The third level is explicit parallelism, which involves more analysis: every loop must be analyzed as shown below. The example is for a Fortran program, and similar directives are used for C programs.

c*kap* parallel region
c*kap*& shared (a,b,c)
c*kap*& local(i)
c*kap* parallel do
      do i=1,100
        a(i) = b(i) * c(i)
      enddo
c*kap* end parallel do
c*kap* end parallel region

From the example above, we can see that a do loop must be carefully analyzed to determine its shared and local variables. Simple do loops are easily parallelized; a complex do loop with subroutine calls and dependencies requires more analysis, and the code may have to be restructured to obtain parallelism. Our experiences with KAP for the ocean model are described in Section 5.2.2.

We ran tests on the Cray J90 using a similar approach. There are several ways of implementing parallelism on the Cray J90 [Cray 93]; we studied autotasking (using SMP directives) and the use of shmem for the ocean model. Creating an executable parallel Fortran program using SMP directives is a four step process. The first step is to run the source code through a pre-processor, FPP, which generates CMIC$ directives. The second step is to run the output of FPP through the FMP translator, which interprets the CMIC$ directives and generates parallel code based on them. In the third step, the cft77 compiler generates object code from the output of the FMP translator. The final step is the execution of the segldr utility. The simplest technique is to set an FPP option and allow FPP to generate the CMIC$ directives for the FMP translator; this is the automatic parallelization approach, which does not yield the best results. The second approach is to manually scan do loops and insert CMIC$ directives, as shown below.

CMIC$ DO ALL
CMIC$1 Shared (a,b,c)
CMIC$2 Private(i)
      do i=1,100
        a(i) = b(i) * c(i)
      enddo

The scope of parallelism is limited to the do loop immediately following the CMIC$ directive. The manual approach ensures that inherent parallelism is exploited by the compiler, but adding CMIC$ directives to a large program is tedious and error prone.

The technique used on the SGI Power Challenge is similar to the previous two. The c$doacross directive instructs the compiler to generate parallel code for the following do loop.

c$doacross local(i) share(a,b,c)
      do i=1,100
        a(i) = b(i) * c(i)
      enddo

The scope of the parallel region is the do loop. The methods for the three machines clearly have many similarities: once the analysis has been completed for one machine, the code can be ported to the others by changing syntax alone.

For parallelization we selected a handful of routines from the ocean model which together accounted for about 60% of the execution time, and KAP directives were added before and after their do loops. KAP restructures parallel code by moving it into a separate subroutine and generating a separate thread for each parallel region; a call which references the new subroutine and uses the SMP support library replaces the loop, and local and shared variables are declared within the subroutine.

At the end of the subroutine, control is returned to the main program. The number of threads is determined by the number of CPUs specified to KAP through the PARALLEL environment variable. The overhead of such a system lies in the creation and scheduling of threads, thread synchronization, and inter-thread communication. Thread creation is an expensive operation and requires the SMP support library, so threads can be reused to avoid creating them repeatedly. Thread synchronization is required at barrier points and critical sections and is implemented with spin or mutex locks; spin locks are efficient on a single user system and mutex locks are efficient on multiuser systems. If most of a thread's data is local, the system bus bandwidth will not affect performance, but if data is shared there will be more traffic on the system bus, which degrades SMP performance.

On the Cray J90, CMIC$ directives are used instead of KAP directives and the source code is passed to the FMP pre-processor, which generates a separate subroutine for every parallelized loop. At run time, the iterations of each do loop are divided by the number of CPUs specified in the NCPUS environment variable, and a separate thread is generated for each set of independent iterations. There is no guarantee that each thread will run on a separate processor: the thread scheduler may schedule the threads sequentially when the machine is heavily loaded, so executing the same code at different times of the day gives different execution times. To measure performance accurately, a dedicated machine must be used.


5.2 Analysis of Results

We describe results from runs on the Cray J90 and the DEC Alpha. Our experiments used a dedicated Cray J90 with sixteen 200 MHz processors and 2 Gbytes of main memory; all computation was performed on 64 bit data. We evaluated the use of SMP directives and the shmem technique. While PVM and MPI can be used on the Cray J90, we did not obtain good performance with either protocol. The DEC Alpha we used at the Pittsburgh Supercomputing Center had four 300 MHz processors and 4 Gbytes of main memory; we used 64 bit data and compiler options for single processor optimizations such as inlining and elimination of dead code.

5.2.1 Cray J90

Large grain decomposition is recommended for good parallel performance on the Cray J90: by assigning a sizable task to each thread, the benefit of running concurrent threads offsets the overhead of thread creation, scheduling, and synchronization. Therefore, outer loop iterations are usually parallelized while the inner loop is vectorized to run efficiently on a single processor; if the outer loop contains more iterations than the inner loop, the order of the loops must be switched. We ran tests for a 144 × 78 × 7 grid. The timing results shown in Fig. 5.1 exclude the time for I/O and are for 120 timesteps. With larger processor configurations, speed up increases, peaks, and then decreases; we saw no benefit from using more than 8 processors, because beyond 8 processors the threads compete for the same memory resources over a common bus, causing a bottleneck. This limits the peak speed up to about 2 (Fig. 5.2).

A number of shared memory routines are implemented for the Cray J90, and we used them to implement a message passing version of the ocean model; the shmem routines provided are very similar to those on the Cray T3D. The shmem protocol clearly gives better results than autotasking (Fig. 5.2). With shmem, the SPMD model is used to obtain parallelism: multiple copies of the code are created at startup, and communication is achieved by message passing with the shmem protocol. This scheme is superior to autotasking because of the low communication latency and the absence of operating system overhead for handling multiple threads; each copy of the code is a separate process and is scheduled by the operating system as an independent program. Implementing a shmem version of the code made porting from the Cray MPP system to the Cray SMP system simple, with no need to add directives or scan loops to achieve parallelism. While the shmem protocol is beneficial, other protocols such as PVM and MPI gave poor performance: both use inter-process sockets for communication, giving a low bandwidth between processes. The low latency of the shmem protocol is key to the performance of the message passing version of the model.

5.2.2 DEC Alpha

On the DEC Alpha we ran tests with grid sizes of 80 × 64 × 3 and 144 × 78 × 7, using a maximum of 4 CPUs and running the model for 360 timesteps (Table 5.1). The CMIC$ directives were replaced with KAP directives.


Figure 5.1: Execution Times for Autotasking and Shmem on the Cray J90


Figure 5.2: Speed up for Autotasking and Shmem on the Cray J90


Table 5.1: Execution time (secs.) using KAP

  Grid Dimensions   1 processor   2 processors   4 processors
  80 × 64 × 3       106           77             89
  144 × 78 × 7      168           146            153

The system bus bandwidth is clearly a problem, since there is no benefit in using 4 processors over 2. All results include the time for I/O, which is approximately 2% of the total time, and runs with large and small grids gave similar results. Since KAP did not give good results, we also used the DEC version of PVM on the Alpha and obtained a speed up of about 1.41 with two processors; we were unable to use 4 processors since some of the PVM functions were not implemented. We can see that while the use of SMP directives may be simpler than message passing, better performance is achieved when message passing is used with a high performance communication protocol. Therefore, if a low latency communication protocol is available on an SMP machine, a message passing scheme should be adopted instead of data parallel programming. These results hold for the ocean model application; they may vary for other applications.

Chapter 6. Distributed Computing

A three month coupled model simulation with the current ocean model requires about one CPU hour on a single processor of a Cray C90. The computational demand will be significantly larger for high resolution ocean and atmospheric models. An associated problem is the size of the coupled model executable. The memory limitations of current MPP machines do not permit the execution of coupled high resolution models. Functional decomposition of the code into separate independent programs is a possible solution. In this experiment, we divided the coupled model into three separate programs: an atmospheric model, an ocean model, and a driver. Each of these programs runs independently on a single machine or on multiple machines. We used the PVM message passing protocol between programs to control the distributed simulation. The solutions we studied include the use of a satellite to exchange messages between machines and the asynchronous transmission of history data. Performance analysis of a distributed system is a non-trivial task compared to a uniprocessor or a multi-processor system [24]. We used and enhanced a tool to visualize the communication performance of a distributed application.



6.1 Distributed Coupled Model

In the current version of the coupled model, the atmospheric model is run for a day followed by an ocean model run for a day. In the distributed version, both models run concurrently on distinct processors on the same machine or on separate machines. The parallel execution is controlled by an independent driver program. A load balancing problem arises when both models run in parallel. When forcing data is to be exchanged between models, the model completing a simulation sooner will be idle until data is received from the other model. A crude load balancing scheme is to vary the number of processors used to run the ocean model on an MPP machine such that its execution time is approximately equal to the time to run the atmospheric model. Unfortunately, on the Cray T3D, the number of processors that can be used is always a power of 2, so fine grain control of the number of processors is not possible. We executed the ocean model on a Cray T3D at the Jet Propulsion Laboratory (JPL) and the atmospheric model on a Cray C90 at the Goddard Space Flight Center. A Cray YMP at JPL was used as the intermediate computer between the Cray C90 and the T3D. The two climate models ran concurrently on processors physically separated by over 3000 miles. We considered the feasibility of using the ACTS satellite and a terrestrial network to connect the Cray YMP and C90. Our experiences with both types of networks are described below.
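As a toy illustration of this crude balancing rule (a sketch under the optimistic assumption of linear speed up, with made-up timings, not code or measurements from the coupled model), the driver could pick the smallest power-of-2 T3D partition whose predicted ocean time does not exceed the atmospheric time:

#include <stdio.h>

/* Crude load balance: choose a power-of-2 T3D partition so that the
   ocean model's predicted time per coupled day is no longer than the
   atmospheric model's.  Assumes (optimistically) linear speed up. */
static int balance_partition(double t_ocean_1pe, double t_atmos, int max_pes)
{
    int p = 1;
    while (p < max_pes && t_ocean_1pe / p > t_atmos)
        p *= 2;                      /* T3D partitions come in powers of 2 */
    return p;
}

int main(void)
{
    /* Hypothetical timings (seconds per simulated day). */
    double t_ocean_1pe = 900.0, t_atmos = 120.0;
    printf("use %d ocean PEs\n", balance_partition(t_ocean_1pe, t_atmos, 256));
    return 0;
}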


6.1.1 Design

After modifying the code for the ocean and atmospheric models to run independently, we added a receiver function for each model. The purpose of the receiver function was to interpret messages received from the driver and to call the appropriate model functions. We created over 40 different message types and parameters for the two models (Fig. 6.1). Each message type contained an integer indicating the type of function requested from the model and accompanying parameters. This scheme allows a driver program to control the execution of the models by sending and receiving messages. The order of sending and receiving messages can be changed in the driver program. This gives the user of the distributed coupled model explicit control of the simulation by manipulating the driver program alone. An example of an instruction would be a message id and an accompanying message parameter indicating the length of the run in hours. The receiver of such a message would look up the message id in a table and then read the length-of-run parameter. After completing the run, the model would send an acknowledgement. The order in which instructions are issued is vital to running a simulation. If the order is not maintained, the models can hang or abort. The driver program must handle communication problems. Although each model maintains a separate logical clock, the clock of the driver program is the master clock and is used to send messages. The logical clocks of the models are used for debugging purposes and to maintain history data. Our design also allows additional models, such as an ice model or a land model, to be coupled. The addition of a new model entails the creation of a set of messages and code in the driver to communicate with the new model.


Figure 6.1: Design of the Message Driven Distributed Coupled Model. (The figure lists the messages exchanged in the coupling loop: for the ocean model, set grid & read restart, report time, then a loop of send ocean forcings, start run for a day, report SST, and write history, ending with write restart & quit; for the atmospheric model, set grid & read restart, report time, then a loop of send atmos forcings, start run for a day, and report fluxes, ending with write restart & quit.)

A receiver section of code must be added in the new model to handle communication with the driver. The code for the driver must handle interpolation between different grids and synchronization of all models. Two additional programs, a send program and a receive program, were added to the existing three programs (Fig. 6.2). The purpose of these two programs was to transmit ocean model history data generated on the Cray T3D to a mass storage system on the Cray C90. The history data was also converted from the eight byte Cray format to a four byte IEEE format. To isolate the communication overhead of the transmission of history data, the send and receive programs ran independently of the model programs. The ocean model running on the T3D wrote history files on the YMP and sent a message to the send program running on the YMP. The message passed the file name of the latest history file written. The send program then read the history data from the file and transmitted it to the receive program on the Cray C90. The receive program stored the history data in a mass storage system and sent an acknowledgement to the send program. The send program continued the asynchronous transmission of data until the ocean model sent a message indicating the completion of the simulation. If history data is not written frequently, then the transmission of history data can be overlapped with the computation of the coupled model. The volume of history data is much larger than the volume of forcing data exchanged between the models. However, the transmission of history data does not interfere with the simulation as long as there is sufficient time for the history data to be sent within a coupling interval. The coupling intervals are usually more frequent than the capture of history data.
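The receiver scheme described above can be sketched with standard PVM 3 calls as follows; the message ids, tags, and model entry points are hypothetical stand-ins, not the coupled model's actual message table.

#include <stdio.h>
#include "pvm3.h"

/* Hypothetical message ids understood by the ocean model's receiver;
   the real coupled model defines over 40 of these. */
enum { MSG_SET_GRID = 1, MSG_RUN_HOURS = 2, MSG_WRITE_RESTART_QUIT = 3 };
enum { TAG_CMD = 10, TAG_ACK = 11 };

/* Stand-ins for the real model entry points. */
static void ocean_set_grid(void)      { /* set grid and read restart   */ }
static void ocean_run(int hours)      { (void)hours; /* run for hours  */ }
static void ocean_write_restart(void) { /* write restart file          */ }

int main(void)
{
    int running = 1;
    while (running) {
        int bufid = pvm_recv(-1, TAG_CMD);      /* wait for a driver instruction */
        int bytes, tag, driver_tid, msgid, hours;
        pvm_bufinfo(bufid, &bytes, &tag, &driver_tid);
        pvm_upkint(&msgid, 1, 1);               /* integer identifying the request */

        switch (msgid) {                        /* table lookup on the message id */
        case MSG_SET_GRID:
            ocean_set_grid();
            break;
        case MSG_RUN_HOURS:
            pvm_upkint(&hours, 1, 1);           /* accompanying parameter */
            ocean_run(hours);
            break;
        case MSG_WRITE_RESTART_QUIT:
            ocean_write_restart();
            running = 0;
            break;
        }

        pvm_initsend(PvmDataDefault);           /* acknowledge completion */
        pvm_pkint(&msgid, 1, 1);
        pvm_send(driver_tid, TAG_ACK);
    }
    pvm_exit();
    return 0;
}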

6.1.2 Network Infrastructure

We have used two types of networks to interconnect the Cray YMP and C90: a satellite network and a terrestrial network. The terrestrial network is used by researchers across NASA to communicate between centers and is based on the TCP/IP protocol. The satellite network is based on the Advanced Communication Technology Satellite (ACTS), launched in 1993.


Figure 6.2: Control and Data Flows for Distributed Coupled Model. (The figure shows the five programs, Poseidon on the T3D, Hermes and Send on the YMP, and Aries and Receive on the C90, together with the data volumes flowing between them and to disk: 4.13 Meg, 10.02 Meg, 0.18 Meg, 0.44 Meg, and 19.46 Meg.)

The purpose of the ACTS system is to achieve a high data rate between remote earth stations. Receivers were installed at the Goddard Space Flight Center and the Jet Propulsion Laboratory. A high data rate (HDR) modem located at each site (Fig. 6.3) was used to communicate with the satellite. Data bursts in multiples of 32 secs were transported via the satellite between sites at a rate of 696 Mbits/sec. Data bursts arrived at the satellite based on a slot switching and beam hopping plan. Tight synchronization of the earth stations to the satellite's on-board switching and beam hopping plan allowed a very short guard time between bursts and satellite tracking. The components of the network for the distributed coupled model were three supercomputers, a bus based gateway or gigarouter, ATM switches, HDR terminals, and the ACTS satellite. A Hippi-ATM interface is required to connect the Cray machines to an ATM based network. Either a Cray Bus Based Gateway (BBG) or a gigarouter can be used for the Hippi-ATM interface. The Cray BBG is connected to the ATM switch via fiber optic cable with a bandwidth of about 69 Mbps. A higher bandwidth of 108 Mbps can be obtained with a gigarouter. An ATM switch connects the gigarouter or Cray BBG to the HDR terminal. The transmission of data at the application level between the Cray YMP at JPL and the Cray C90 at Goddard follows the path shown in Fig. 6.3. Data is sent from the C90 to the gigarouter via a High Performance Parallel Interface (Hippi) switch and then to the HDR terminal via an ATM switch. At the YMP, data is received at the HDR terminal and routed via an ATM switch, the Cray BBG, and a Hippi switch. The T3D and YMP exchange data via a high speed channel (HISP).

6.1.3 Implementation

All inter-process communication is performed through PVM [14]. On the T3D and YMP, the Cray version of PVM is used [7]. A PVM daemon exists on the YMP and on a single processor of the T3D. Within the T3D, communication is achieved through shared memory calls. The ORNL version of PVM is used on the Cray C90. The same version number of PVM is used on all machines to avoid problems due to a mismatch in versions. Due to the large memory requirements of the models, a master-slave scheme has not been used. The interface program spawns the ocean model on the T3D. All other processes are started independently. Group computation calls are used to identify processes and handle interprocess communication. A key issue in communication performance using the satellite is the effect of the longer delay (on the order of hundreds of milliseconds) introduced in the network by the satellite link. PVM uses TCP for intra-processor and UDP for inter-processor communication. The slowest PVM communication is via UDP. An application copies the message buffer to a system message buffer of the local PVM daemon. The local PVM daemon communicates with the remote PVM daemon via UDP. Communication via UDP has been made reliable by sending a packet and waiting for an acknowledgement. The round trip period for sending a packet via satellite is 540 msecs. Sending multiple packets using this scheme gives very low bandwidth. One possibility is to increase the packet size and send multiple packets instead of using a stop-and-wait approach. This scheme is not encouraged by the PVM developers since it involves modifying the PVM daemon [12].


Figure 6.3: Network Components for the Distributed Coupled Model. (At the Jet Propulsion Laboratory, the Cray T3D and Cray YMP are linked by a HISP channel; a Hippi switch, the Cray BBG, an ATM switch, and an HDR terminal connect the YMP to the ACTS satellite. At the Goddard Space Flight Center, the Cray C90 connects through a Hippi switch, a gigarouter, an ATM switch, and an HDR terminal.)

Another possibility is to use the PvmRouteDirect option. This option bypasses the PVM daemon and transmits directly via TCP to the remote process. PvmRouteDirect can be unstable and cause a process to lock up due to buffer race conditions [12]. A reasonable bandwidth of 3 Mbits/sec was obtained only when packets were larger than 10K bytes. In the distributed coupled model, a number of short and long messages are transmitted between the driver and the models. We concluded from the experiments with PVM over ACTS that satellite communication was not a viable solution for running the distributed coupled model, due to the high latency of the satellite and the design of PVM. A number of other ACTS experiments, such as telemedicine, seismic marine surveys, and video transmission, have been successful. In applications where a steady stream of data is transmitted, the high latency is not a factor: after the first packet of a group of packets has been transmitted, the latency of the following packets is not noticed. The PVM approach of sending a single packet and waiting for an acknowledgement is not recommended; instead, a number of packets should be transmitted at a time.
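A back-of-the-envelope bound (not a measurement from the experiment) shows why the stop-and-wait design dominates over the satellite. With only one packet outstanding per round trip,

\[
\text{throughput}_{\max} = \frac{\text{packet size}}{\text{round trip time}} = \frac{10\ \text{KB} \times 8}{0.54\ \text{s}} \approx 0.15\ \text{Mbit/s},
\]

so a single 10 KB packet per round trip cannot come close to the 696 Mbits/sec burst rate of the link; sustaining even the observed 3 Mbits/sec would require either packets of roughly 3 Mbits/sec × 0.54 s ≈ 200 KB or several packets in flight at once.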

6.2 Performance Analysis

Several tools have been developed to study the performance of distributed applications [2, 40, 32]. A two step process is used for performance analysis (Fig. 6.4). In the first step, a monitor collects trace data or event records from the distributed application. The trace data consists of multiple files which must be combined into a composite file.

In the second step, a visualization tool is used to display the trace data. Both steps are equally important: proper monitoring provides valid trace data to drive a visualization, and interpretation of the trace data requires a suitable visualization program. While online monitoring tools such as Xab [40] and IDES [35] have been used, a post mortem approach has been chosen. The post mortem approach with buffering does not add communication overhead that significantly alters the actual performance, and detailed performance information can be obtained with tracing alone. PVaniM [40], a package developed at the Georgia Institute of Technology, was customized to visualize the performance of the distributed coupled model (Fig. 6.5). A dual timestamping methodology is the basis for PVaniM. It consists of primary and secondary timestamps based on the Lamport scheme [27]. Logical timestamps are used instead of physical clocks. Every process of the distributed system only needs to know its own clock reading and the timestamps of the messages it receives. All processes must participate in the communication of messages. The discrete ticks of the logical clock must be frequent enough that two ordered events a and b occur in the predefined order. Typically, an event trace contains a timestamp, an event identifier, and event specific information. This information is sufficient for a basic visualization of performance. An advanced visualization should include the following trace information: a Lamport logical timestamp, which describes concurrent events in a distributed system, and support for application specific visualization. The physical clocks of the distributed system should be synchronized as much as possible.


Figure 6.4: Collection of Trace Data for Visualization. (The atmospheric and ocean models emit scene calls and event records, which a choreographer uses to drive the visualization.)

Tracing should not degrade performance so much that it misleads the user. To activate tracing, the model code must be modified. Selected PVM calls are traced by modifying the subroutine call: a PVaniM call is made instead of the standard PVM call. The PVaniM subroutine appends trace information to a trace file and then makes the PVM call. Some of the events traced include the initiation of a process, the sending and receiving of messages, and the termination of a process. PVaniM must also know the task ids of all processes participating in the tracing procedure. At the termination of a process, PVaniM collects the separate trace files and merges them into a single trace file. A separate program sorts the data in the file based on the timestamp.
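The logical-clock rule underlying the dual timestamping scheme can be sketched as follows; this is a generic Lamport clock in C with an illustrative trace record layout, not PVaniM's actual implementation.

#include <stdio.h>

/* One event record: logical timestamp, event id, event-specific info. */
struct trace_event {
    long lamport;        /* Lamport logical timestamp      */
    int  event_id;       /* e.g. send, receive, start, end */
    int  info;           /* event-specific field           */
};

static long clock_now = 0;                  /* this process's logical clock */

/* Local event (computation, file I/O, ...): tick the clock. */
static long local_event(void) { return ++clock_now; }

/* Sending: tick, then ship the timestamp along with the message. */
static long on_send(void) { return ++clock_now; }

/* Receiving: jump ahead of the sender's timestamp if it is larger,
   so that two ordered events always get increasing timestamps. */
static long on_receive(long ts_in_message)
{
    if (ts_in_message > clock_now)
        clock_now = ts_in_message;
    return ++clock_now;
}

int main(void)
{
    struct trace_event ev = { on_send(), /*event_id=*/1, /*info=*/42 };
    printf("send event at logical time %ld\n", ev.lamport);
    ev.lamport = on_receive(10);            /* message carried timestamp 10 */
    printf("receive event at logical time %ld\n", ev.lamport);
    (void)local_event();
    return 0;
}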

6.3 Visualization of Performance

The composite trace file is used as input for the visualization software. PVaniM currently provides three different views of the trace file [40]. Each view provides an animated scheme to view communication and computation, and a control panel varies the speed of the animation. The three views available are a causality view, a message passing view, and a Gantt chart view. The causality view is a 2D plot with the x axis labeled with Lamport clock values and the y axis with process ids. When a message is sent, an arrow grows from the coordinate where the message was sent to the Lamport delivery time on the coordinate where the message is received.

The message passing view is another approach to displaying communication patterns. In this view, all processes are laid out around an outer circle. Messages, represented as circles, move in the direction of the receiving process, towards the center of the circle. This helps to view the queue for a receiving process. Contrail lines are provided to show where a message originated. The Gantt view is a well known technique to display the history of a distributed system. The x axis is time and the y axis is a list of processes. Different colors are used to represent different activities within processes: a red band indicates waiting for a message, a green band indicates computing, and a yellow band indicates sending a message. The Gantt view can be used to summarize performance and identify processes in a 'waiting' mode for extended periods. It does not show message traffic, which is displayed in the two previous views. Although the three views provided by PVaniM display communication patterns and performance for a distributed system, a clear understanding of the current status of the entire system cannot be obtained. For example, over 35 different message types are sent within the coupled model system, and specifying a color for each message makes it difficult to determine which message is being sent. The Gantt view displays only three states for a process; it is preferable to have multiple, easily identified states for each process. Finally, viewing processes by number makes the task of identifying the type of process difficult, since a process number may change for a particular process between executions.

A customized view was developed for the distributed coupled model (Fig. 6.5). This view identifies processes by name instead of by number. Links are drawn only between processes which exchange messages. A brief text description of the last message sent and received is included for every process in the view. A status field for every process describes the current task being performed; the different tasks include reading/writing files, sending/receiving messages, initiation, termination, waiting, and computing. During the transmission of a message, a line is drawn from the sender to the receiver. A clock is also displayed on the screen showing the real time in seconds, so that idle time between messages and the time to exchange messages can be observed. PVaniM does not have this capability. Watching the progress of a simulation in real clock time gives a better picture of communication performance.

6.4 Distributed Computing Results

We used the customized view to study the performance of the distributed coupled model simulation over a terrestrial network. Aries, Hermes, and Poseidon spent just 4%, 6%, and 4% respectively of the total time on computation; the remainder of the time was spent in communication overhead. This may seem to be an unusually low value, but other experiments with a distributed coupled model by Mechoso and others [28] also reported a high communication overhead (approximately 75% of the total execution time). There are several reasons for the high communication overhead: 1) the use of PVM, which uses the stop-and-wait approach for every packet, 2) the use of production machines for the experiment, which meant that the models on any one of the machines could be swapped out, and 3) contention with other users of the network.


Figure 6.5: Window for Visualizing Performance of Distributed Coupled Model. (The window shows the five processes Poseidon, Hermes, Aries, Send, and Receive, each with its status and the last message sent and received, together with Start, Stop, Step, Replay, Refresh, and Close controls.)

Even though we did not obtain good results with the current implementation, the same design can be used for running on a cluster of machines which are not physically separated by a large distance. A cluster of machines will have a higher bandwidth between machines and should have a lower communication overhead. We can conclude from our experiments that while the idea of using machine cycles from multiple machines to solve a single problem can be useful, the implementation problems are many. With dedicated machines and networks, it may be feasible to run the distributed coupled model.

Chapter 7. Conclusions

We used a computationally intensive application, a coupled ocean model, to demonstrate the use of parallel algorithms and tools. The parallel algorithms described in this dissertation can be used for a variety of problems. In particular, the optimized grid partitioning technique and the recursive scheme can be used for problems which use 3D grids.

7.1 Summary of Contributions

The optimized grid partitioning technique provides a method to distribute processing for an irregular grid such that processor resources are used efficiently. The recursive scheme can be applied where data is distributed by a single processor and collected at a single processor. The recursive scheme uses fewer steps and transmits data at a higher bandwidth (a generic sketch of such a tree-based broadcast is given at the end of this section). The requirement to distribute and collect data frequently occurs during file I/O: data from a file must be distributed to many processors, where individual processors use different parts of the file. The recursive scheme can also be used when data must be broadcast. For use on a machine where the number of processors is not restricted to a power of 2, the recursive scheme must be modified to handle non-binary trees. The cache trace tool we used can be easily expanded to handle multi-level caches, which are common on many current microprocessors. The tool can be implemented on any uniprocessor or multi-processor machine where the use of cache memory must be evaluated.


The tool is especially useful for large applications with numerous functions. The cache performance for a single function can be isolated. The tool can also be used to identify which particular variable of a function has a low cache hit rate. The cache trace tool is processor independent and is implemented in software alone. We have compared SMP and MPP implementations for the same problem using the same number of processors. The higher scalability of MPP over SMP was shown. The visualization tool we developed to study message passing performance is an improvement over some of the current tools. The use of a real time clock as well as the identification of the type of message during transmission was useful in understanding bottlenecks due to message passing. We also generated statistics for each message to indicate which messages took more time than others. The enhancements in this tool can be used as an example for future distributed application performance tools.
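As a generic illustration of the recursive distribution idea summarized above (a sketch only, assuming processor 0 is the root and the processor count is a power of 2; send_block is a hypothetical stand-in for a shmem or PVM transfer, and this is not the dissertation's exact algorithm), a binary-tree broadcast reaches p = 2^k processors in k rounds rather than p - 1 sequential sends:

#include <stdio.h>

/* Hypothetical stand-in for a message passing transfer (shmem or PVM). */
static void send_block(int from, int to)
{
    printf("processor %d forwards the data to processor %d\n", from, to);
}

/* Binary-tree broadcast from processor 0: in each round, every processor
   that already holds the data forwards it to a partner 'stride' away, so
   npes = 2^k processors are covered in k rounds, and all sends within a
   round can proceed concurrently. */
static void tree_broadcast(int npes)
{
    for (int stride = npes / 2; stride >= 1; stride /= 2)
        for (int pe = 0; pe < npes; pe += 2 * stride)
            send_block(pe, pe + stride);
}

int main(void)
{
    tree_broadcast(8);   /* 8 processors reached in 3 rounds */
    return 0;
}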

7.2 Future Work

One of the problems of running a global ocean model including the north and south poles is the fine resolution and short timestep required near the poles to maintain numerical stability. The problem has been solved by using filters which maintain numerical stability but introduce errors in the fidelity of the simulation. An alternate solution is to use a rotated grid: a section of the northern hemisphere is rotated by 90°, tilting the polar ocean towards the equator. A future task is to modify the ocean model to handle rotated grids. Parallelizing the ocean model with a rotated grid is possible using the techniques described in Chapter 4. The difficulty lies in implementing a parallel interpolation scheme. When interpolation is performed, the entire ocean and atmospheric grids must be assembled on a single processor. This results in communication overhead to collect and distribute data from the other processors and a sequential interpolation, which lowers the parallel efficiency. Assembling the ocean and atmospheric grids on a single processor also increases the size of the executable program, which can cause memory problems when running on an MPP machine. A future task is to develop a parallel interpolation scheme which can handle the rotated ocean grid. Vis-5D [19] was used to study the results of the ocean simulation. One limitation of Vis-5D is that it can only display horizontal slices of uniform thickness. Other packages such as AVS and Data Explorer can be used to visualize ocean layers with varying thickness.

List of References

[1] Agarwal A., 1991: Limits on Interconnection Network Performance, IEEE Transactions on Parallel & Distributed Systems, Vol.2, No.4, pp 398-411.
[2] Beguelin A., Dongarra J., Geist A., and Sunderam V., 1993: Visualization and Debugging in a Heterogeneous Environment, Computer, 26(6), pp 88-95.
[3] Bleck R. et al., 1995: A comparison of data-parallel and message passing versions of the Miami Isopycnic Coordinate Ocean Model (MICOM), Parallel Computing, Vol.21, pp 1695-1720.
[4] Brown J. et al., 1993: Ocean Circulation, The Open University, Pergamon Press, New York, NY.
[5] Bryan K., 1969: A numerical method for the study of the circulation of the world ocean, Journal of Computational Physics, 4, pp 347-376.
[6] Cray Research, Inc., 1993: Cray MPP Fortran Reference Manual, Mendota Heights, MN.
[7] Cray Research, Inc., 1993: Cray PVM and HeNCE Programmer's Manual, Mendota Heights, MN.


[8] Denham J.M., Long P., and Woodward A.J., 1994: DEC OSF/1 Symmetric Multiprocessing, Digital Technical Journal, Vol.6, No.3, Digital Equipment Corporation, Maynard, MA.
[9] DeRose L., Gallivan K., Gallopoulos E., and Navara A., 1992: Parallel Ocean Circulation Modelling on Cedar, in Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing, pp 401-405.
[10] Dongarra J., Brewer O., Kohl J.A., and Fineberg S., 1990: A tool to aid in the design, implementation, and understanding of matrix algorithms for parallel processors, Journal of Parallel and Distributed Computing, 9, pp 185-202.
[11] Dowd K., 1993: High Performance Computing, O'Reilly & Associates, Sebastopol, CA.
[12] Dowd P.W. et al., 1995: Geographically Distributed Computing ATM over NASA ACTS Satellite, IEEE Milcom 95, New York, NY.
[13] Fox G., Johnson M., Lyzenga G., Otto S., Salmon J., and Walker D., 1988: Solving Problems on Concurrent Processors, Vol. 1, Prentice-Hall, Englewood Cliffs, NJ.
[14] Geist A. et al., 1994: PVM: Parallel Virtual Machine, A User's Guide and Tutorial for Networked Parallel Computing, MIT Press, Cambridge, MA.
[15] Goldberg A.J. and Hennessy J.L., 1993: Mtool: An Integrated System for Performance Debugging Shared Memory Multiprocessor Applications, IEEE Transactions on Parallel and Distributed Systems, Vol.4, No.1, pp 28-40.

[16] Gropp W., Lusk E., and Skjellum A., 1994: Using MPI: Portable Parallel Programming with the Message Passing Interface, MIT Press, Cambridge, MA.
[17] Gustafson J.L., Montry G.R., and Benner R.E., 1988: Development of parallel methods for a 1024 processor hypercube, SIAM Journal of Scientific & Statistical Computing, 9, pp 609-638.
[18] Henderson T. et al., 1994: Progress Toward Demonstrating Operational Capability of Massively Parallel Processors at the Forecast Systems Laboratory, Sixth ECMWF Workshop on the Use of Parallel Processors in Meteorology, Reading, England.
[19] Hibbard W.L. et al., 1994: Interactive Visualization of Earth and Space Science Computations, IEEE Computer, Vol.27, No.7, pp 65-72.
[20] Hord R.M., 1993: The Paragon XP/S System, Parallel Supercomputing in MIMD Architectures, CRC Press, New York, NY, pp 95-103.
[21] Hwang K. and Briggs F.A., 1984: Computer Architecture and Parallel Processing, McGraw-Hill, New York, NY.
[22] Kuck & Associates, 1995: KAP for DEC Fortran for DEC OSF/1, User's Guide, Digital Equipment Corporation, Maynard, MA.
[23] Konchady M., 1995: Implementation of a parallel coupled climate model, Proceedings of the Second International Conference on High Performance Computing, New Delhi, India.

[24] Konchady M., 1996: Distributed Supercomputing using ACTS, Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing, Syracuse, NY.
[25] Konchady M. and Sood A., 1995: Analysis of grid partitioning techniques for an ocean model, Proceedings of the Seventh High Performance Computing Symposium, Montreal, Canada, pp 407-419.
[26] Koster R.D. and Suarez M.J., 1992: Modeling the Land Surface Boundary in Climate Models as a Composite of Independent Vegetation Studies, Journal of Geophysical Research, 97, pp 2697-2715.
[27] Lamport L., 1978: Time, Clocks, and the Ordering of Events in a Distributed System, Communications of the ACM, 21(7), pp 558-565.
[28] Mechoso C.R. et al., 1993: Parallelization and distribution of a coupled atmosphere-ocean general climate model, Mon. Wea. Rev., 121, pp 2062-2076.
[29] Morse S.H., 1994: Practical Parallel Computing, Academic Press, Cambridge, MA.
[30] Morton D., Wang K., and Ogbe O.D., 1995: Lessons learned in porting Fortran/PVM code to the Cray T3D, IEEE Parallel & Distributed Technology, Spring 1995, pp 4-11.
[31] Numrich R., Springer P.L., and Peterson J.C., 1994: Measurement of the Communication Rates on the Cray T3D Interprocessor Network, Proceedings of the High Performance Computing Networks Symposium, Munich, Germany.

[32] Ogle D.M., Schwan K., and Snodgrass R., 1993: The Dynamic Monitoring of Distributed and Parallel Systems, IEEE Transactions on Parallel and Distributed Systems, 4(7), pp 762-778.
[33] Semtner J.A. and Chervin R.M., 1988: A simulation of the global ocean circulation with resolved eddies, Journal of Geophysical Research, Vol.93, No.C12, pp 15502-15522.
[34] Schopf P., 1994: Poseidon Ocean Model User's Guide Version 4, Goddard Space Flight Center, Greenbelt, MD.
[35] Schroeder B.A., 1995: On-line Monitoring: A Tutorial, Computer, 28(6), pp 72-78.
[36] Smith R.D., Dukowicz J.K., and Malone R.C., 1992: Parallel Ocean General Circulation Modeling, Physica D, Vol.60, pp 38-61.
[37] Suarez M.J., 1988: Atmospheric Modelling on a SIMD Computer, Multiprocessing in Meteorological Models, Springer-Verlag.
[38] Sunderam V.S., 1990: PVM: A Framework for Parallel Distributed Computing, Concurrency: Practice & Experience, 2(4), pp 315-339.
[39] Thompson T., 1996: The world's fastest computers, Byte, Vol.21, No.11, pp 45-65.
[40] Topol B., Stasko J.T., and Sunderam V., 1995: The Dual Timestamping Methodology for Visualizing Distributed Applications, Technical Report GIT-CC-95/21, Georgia Institute of Technology, Atlanta, GA.

[41] Trenberth K.E., 1992: Climate System Modelling, Cambridge University Press, Cambridge, UK.
[42] Washington W.M. and Parkinson C.L., 1986: An Introduction to Three-Dimensional Climate Modelling, University Science Books, Mill Valley, CA.
[43] Wolters L., 1993: Atmosphere and Ocean Circulation Simulation on Massively Parallel Computers, Technical Report, Dept. of Computer Science, Leiden University, Leiden, Netherlands.


Curriculum Vitae

About 10 years ago, I read an article in the newspaper about climate models and high performance computing. I was attracted to the benefits of climate modelling for society and the use of high performance computers. So began my studies in parallel computing at George Mason University. After completing courses and other requirements, I spent some time researching the use of climate models on parallel computers. In 1993, parallel climate models were rare and not used for public forecasts. I sought a suitable climate model for experimentation on a parallel computer. Dr. Schopf at NASA Goddard Space Flight Center was willing to provide me the source code for his ocean model. I rewrote the entire model in C to run on an IBM PC, which provided me with valuable experience in porting code from a vector supercomputer and an understanding of the operation of the model. Soon after, I joined Goddard Space Flight Center as a Research Associate. The benefits of working at Goddard Space Flight Center were twofold: 1) I had access to a range of parallel computers, and 2) I could work on parallelizing the ocean model along with other tasks. In about 6 months, I successfully ran a parallel implementation using a simple grid partitioning technique. Later, I experimented with the different algorithms and machines described in the dissertation. Learning the techniques of simulating climate on a computer has been a wonderful experience for me and I hope to follow the development of high resolution models.

Journal Publications:
Konchady M., 1997: Parallel Computing on LINUX, LINUX Journal, SSC, Seattle, WA.
Konchady M., Sood A., and Schopf P.S., 1996: Implementation and performance of a parallel ocean model, Parallel Computing (accepted for publication).

Conference Publications:
Konchady M., 1996: Distributed Supercomputing using ACTS, Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing, Syracuse, NY.
Konchady M. and Sood A., 1995: Analysis of Grid Partitioning Techniques for an ocean model, Proceedings of the Ninth High Performance Computing Symposium, Montreal, Canada.
Konchady M., 1995: Implementation of a Parallel Coupled Climate Model, Proceedings of the Second International Conference on High Performance Computing, New Delhi, India.

Konchady M., 1992: Parallel Simulated Annealing for network partitioning, Proceedings of the First GSD IBM Technical Symposium, White Plains, NY.

Miscellaneous:
Konchady M., 1996: Parallel Computing Experiences on SMP and MPP machines, a poster session at the Space Data and Computing Division of NASA Goddard Space Flight Center, Greenbelt, MD.
Konchady M., 1992: Animation of Julia sets on an IBM PC, IBM Technical Report.
Finalist in the 1995 DEC Internet Research Computing contest held at the Pittsburgh Supercomputing Center.
