Comparison of overhead in different Grid environments

Master thesis in computer science

by Thomas Zangerl

submitted to the Faculty of Mathematics, Computer Science and Physics of the University of Innsbruck in partial fulfillment of the requirements for the degree of Master of Science

supervisor: Dr. Maximilian Berger, Institute of Computer Science

Innsbruck, 24 January 2009

Certificate of authorship/originality

I certify that the work in this thesis has not previously been submitted for a degree, nor has it been submitted as part of the requirements for a degree, except as fully acknowledged within the text. I also certify that the thesis has been written by me. Any help that I have received in my research work and in the preparation of the thesis itself has been acknowledged. In addition, I certify that all information sources and literature used are indicated in the thesis.

Thomas Zangerl, Innsbruck on the 24 January 2009


Abstract

Grid middlewares simplify the access to Grid resources for the end user by providing functionality such as metascheduling, matchmaking, information services and adequate security facilities. However, the advantages of the middlewares usually come at the cost of temporal overhead added to the execution time of the submitted jobs. In this thesis, we group the overhead into two categories: the first type occurs before the jobs are executed, in the form of scheduling latency; the second is information service overhead, introduced by delays in the flow of job status information from the executing worker node to the end user. We analyse both types of overhead with respect to several factors, such as absolute values and variance, for SSH and the Grid middlewares Globus and gLite. We evaluate our experimental data regarding the influence of daytime, weekday, CE and queue, and discuss the results and their implications.


Contents

1 Introduction
  1.1 The Grid
  1.2 Grid middlewares
    1.2.1 Virtual Organisations
    1.2.2 Data staging
  1.3 Problem definition
  1.4 Motivation
    1.4.1 Impact of overhead on short-running jobs
    1.4.2 Impact of overhead variance on timeout values
    1.4.3 Impact of overhead variance on workflows
    1.4.4 Implications of measured overhead for middlewares
  1.5 Methodology
  1.6 Structure of the thesis
  1.7 Related work

2 The Middlewares
  2.1 SSH
  2.2 Globus
    2.2.1 Security concepts
    2.2.2 Resource Management
    2.2.3 Information services
    2.2.4 Data transfer
  2.3 gLite
    2.3.1 Security concepts
    2.3.2 Resource Management
    2.3.3 Information services
    2.3.4 Data transfer

3 Implementation of adaptor and testbench
  3.1 JavaGAT
    3.1.1 The motivation behind JavaGAT
    3.1.2 JavaGAT's overall architecture
    3.1.3 Writing adaptors for JavaGAT
  3.2 Implementing the gLite adaptor
    3.2.1 VOMS-Proxy creation
  3.3 Job submission
    3.3.1 External libraries
    3.3.2 License
  3.4 The testbench

4 Results
  4.1 Experiment setup
    4.1.1 Statistical methods
    4.1.2 SSH
    4.1.3 Globus
    4.1.4 gLite
  4.2 Scheduling times
    4.2.1 SSH
    4.2.2 Globus
    4.2.3 gLite
  4.3 Notification overhead
    4.3.1 SSH
    4.3.2 Globus
    4.3.3 gLite
  4.4 Staging overhead
    4.4.1 Globus
    4.4.2 gLite
  4.5 Comparison
    4.5.1 Scheduling times in Globus and gLite
    4.5.2 Information service overhead in Globus and gLite

5 Conclusions

A gLite Documentation

B GridLab license

List of Figures

List of Tables

Bibliography

Acknowledgements

First and foremost, I want to thank my supervisor, Dr. Maximilian Berger, who has always, when required, found the time to give his advice on problems in person. More than once, this meant spending a lot of time on a particular problem. He also gave me support and assistance whenever I needed it. My thanks also go to Dr. Alexander Beck-Ratzka, who developed the original gLite adaptor and has been very helpful in supplying me with the source code before its release and in so quickly answering the many questions that I asked him. Roelof Kemp from the JavaGAT project also gave me invaluable support by integrating the adaptor into JavaGAT and by providing me with SVN contributor access to the project. Furthermore, I want to thank the EGEE project, and particularly the VOCE VO, for providing infrastructure and user support. In this context, I also have to acknowledge all the people who helped me quickly, even on weekends, on the various developer mailing lists that I burdened with my problems. Roberto Pugliese was so kind as to mail me the source code of the glite-security-trustmanager from the CVS at the University of Trieste, which I couldn't access from outside due to the university's security policy. Last but not least, I want to thank my family, who have supported me on so many occasions during my studies, and my girlfriend Nadine, who has always motivated me to carry on when I was frustrated.


Chapter 1

Introduction

1.1 The Grid

Scientific research, especially but not exclusively in the fields of astronomy, physics, biology or chemistry, often means measuring and collecting large sets of data. In order to gain usable results from the collected data, the raw samples have to be evaluated, usually in the form of some sort of computational processing. As experiments grow larger and larger, the amount of data that needs to be processed and the computational power required for processing it increase. For example, as of 2008 the Large Hadron Collider (LHC) is the largest scientific instrument in the world and produces 15 petabytes of data per year [1]. If such amounts of data are to be processed, either large and expensive mainframes are needed or the computational task has to be distributed among many computers. To assist in achieving that task, Grid computing was introduced by Foster in [2] and given a more exact definition in [3]. Grid computing is about computational power provided by all kinds of computational resources in different geographic areas of the world, which can be harnessed to solve concrete computational problems on demand and as the resources are available. In order to highlight the conceptual similarity to a power grid, the term Grid was coined for this heterogeneous, distributed, worldwide computing network. Computing resources (so-called sites) participating in the Grid usually consist of standard, general-purpose hardware and a standard operating system (mostly a flavor of Linux). Network communication between the Grid components is usually routed using standard Internet protocols over existing Internet wires. Because standard components are commonly manufactured in large quantities, they are comparatively inexpensive. However, this advantage comes at the cost of heterogeneity of hardware and software, which may introduce complications and management overhead.


Figure 1.1: The Grid layer model with the components of each layer (Application: user applications; Collective: scheduling, workload management; Resource: resource access, bookkeeping, job information; Connectivity: communication, authentication; Fabric: hardware resources)

With this set of standard components, the Grid can provide considerable computational power to the end user. The theoretical performance of the LHC computing Grid alone has been estimated to equal 1 petaflop [4]. This is about the same value as the Rmax performance in the Linpack benchmark of the top site in the supercomputers Top 500 list of June 2008 (see [5]). Hence, the sheer mass of comparatively inexpensive computing components can amount to considerable performance, yet the resources somehow have to be coordinated to form a useful distributed computing environment. This is the task of the Grid middleware.

1.2 Grid middlewares

The management of the Grid environment and of Grid tasks is conducted by a Grid middleware. The middleware provides a layer between user-created Grid applications and the Grid infrastructure (figure 1.2). Thus the middleware typically implements the functionality of two or three layers of the Grid service stack (figure 1.1). This way it abstracts from the user details often considered unnecessary for application development and allows her to exploit Grid resources more conveniently and in a coordinated way. The middleware follows a client-server model: the client software stack is installed on the machine of the Grid user, while compatible server versions have to be present on the Grid sites where the computational tasks of the user are to be executed.


Figure 1.2: The middleware as a transparent component between APIs and services (user applications reach the middleware via C++ and Java language bindings or a command-line interface; the middleware communicates with Grid security servers and computation elements via SOAP/web-service interfaces and with Grid storage servers via GridFTP)

The set of Grid-related tasks that a middleware can take over on behalf of the Grid user varies among different middlewares; however, the minimal functionality that a scientific user would expect from a middleware includes access control, copying of needed data sets and the executable itself to the execution sites (input staging), copying of result data back to the user's computer (output staging), continually reporting the progress of the computational task (job monitoring) and relaying system messages to the user (staging of stdout and stderr).

1.2.1 Virtual Organisations

In order to implement access control, Grid middlewares often aggregate users in so-called virtual organisations. A virtual organisation (VO) is a group of individuals that can be mapped from real organisations or that consists of people sharing a common goal, geographic region or usage motivation, or having other incentives for cooperation. For example, in the EGEE Grid infrastructure there exist virtual organisations like VOCE (VO for Central Europe) or compchem (VO for people working in computational chemistry). With many middlewares, Grid access control is performed on the basis of virtual organisations, i.e. if a resource X may be used by VO Z and person Y is a member of VO Z, then person Y may use resource X. Of course, authenticity and authorisation have to be ensured. For this purpose, cryptographically well-defined public key infrastructure procedures with X509 certificates [6] are used. VOs maintain tables of their members along with the signatures of their digital certificates. Upon accessing a resource, the user is expected to show her certificate, or a temporally limited proxy created from it, as proof of VO membership.

1.2.2 Data staging

Many middlewares perform input and output staging of data using the GridFTP protocol [7]. GridFTP is an extension to standard FTP: additional commands have been introduced for parallel transmission and striping, and accordingly a transmission mode for out-of-order reception has been created. To make GridFTP fit into the standard Grid security concepts, it supports authenticated transmissions using GSS-API [8] tokens.

1.3 Problem definition

Grid middlewares exist to make life easier for the user; details are abstracted away and the user or programmer is usually supplied with command-line tools and/or programming language bindings that access middleware functionality in a uniform way. As an additional layer between the Grid applications or the user and the actual Grid resources, the middlewares can also introduce significant overhead for various operations. The overall temporal overhead in the execution of a job can result from latencies in submitting jobs and scheduling them on CEs (if a metascheduler is involved) and workqueues, data transfer times, queue waiting times and information service overhead. We define information service overhead as the latencies that can occur in information flow within the Grid. Mostly this concerns the middleware's job monitoring components. Since events such as the initialisation of output staging depend on information about the job's status, the information service overhead directly influences the execution time. This type of overhead can normally not be modelled by a constant factor. It varies among middlewares, execution sites and workqueues, and with different workloads of the respective resources. Due to the architecture of some middlewares, the overhead can also be subject to random variance. Various types of overhead can in the worst case lead to job executions which show latencies that are disproportionately high when compared to the expected net job execution time, or which even stay in a non-completed state for an indefinite time span. In a Grid environment, the goal is to avoid such executions, for example by using a timeout and resubmission strategy. To determine feasible timeout values, knowledge about approximate latency values and latency variance is critical.

1.4 Motivation

In this thesis, the overhead of different middlewares is measured and compared. Such overhead has different negative effects, depending on whether the total overhead or the overhead variance is the major problem.

1.4.1 Impact of overhead on short-running jobs

Middleware overhead prolongs execution time. This especially impacts experiments with many short-running jobs, because the middleware overhead adds to the job's execution time and, in case of high overheads, may even surpass the execution time itself. This can render the Grid infeasible for certain experiment configurations. Quick solutions from the view of the application developers are to bypass the middleware's workload or job management systems by creating applications that submit job controllers to the Grid, which then spawn several small jobs of their own. Because those small jobs are invisible to the workload management of the Grid middleware, they are not directly subject to the quality-of-service mechanisms imposed by it. It can neither be in the interest of a Grid middleware creator to render the Grid infeasible for certain types of jobs, nor can it be intended that users create their own middlewares within the middleware. Hence the goal should be to keep information service latency values at a minimum level.

1.4.2 Impact of overhead variance on timeout values

Not all jobs submitted to the Grid will terminate successfully and return the result of the computation. This is a natural result of the Grid's structure. Due to the heterogeneity of the Grid, the site on which the job is scheduled may lack a required library or may not be able to provide all the needed resources. Due to the decentralised organisation, sites may be offline but not reported as such, or network failures may occur. The job may get stuck indefinitely in a waiting queue on a computing element because other jobs with higher priorities receive preferential treatment.


Solutions to this problem are often based on timeouts and resubmission of jobs. The timeout values either include an estimation of the job's execution time and a statistical model of the latency, or are based on the estimated times a job should spend in a certain state (for example, in the "WAITING" state). Timeouts can be crucial in an environment without success guarantees. However, latency variance makes it hard to estimate such timeout values. Most notably, the latency should not depend on a random factor introduced by the middleware's way of organising information flow or by site schedulers with esoteric scheduling policies.
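As a toy illustration of the first kind of timeout value mentioned above (a job execution time estimate combined with a statistical latency model), a resubmission threshold could be computed along the following lines. This is a generic sketch rather than the model used in this thesis; the safety factor k is an arbitrary parameter and the sample values are made up.

    // Illustrative only: combine an execution-time estimate with a simple
    // mean/standard-deviation model of previously observed middleware latencies.
    public class TimeoutSketch {
        static double timeoutSeconds(double estimatedRuntime, double[] latencySamples, double k) {
            double mean = 0.0;
            for (double s : latencySamples) mean += s;
            mean /= latencySamples.length;
            double variance = 0.0;
            for (double s : latencySamples) variance += (s - mean) * (s - mean);
            double std = Math.sqrt(variance / latencySamples.length);
            return estimatedRuntime + mean + k * std;   // resubmit the job if it exceeds this
        }

        public static void main(String[] args) {
            double[] observedLatencies = {40, 55, 120, 60, 300};   // seconds, made-up values
            System.out.println(timeoutSeconds(600, observedLatencies, 2.0));
        }
    }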

1.4.3 Impact of overhead variance on workflows

If jobs are inter-dependent, i.e. if certain jobs require results of other jobs in order to complete execution, latency variance can heavily affect the performance of the experiments, because in order to finish execution, jobs may have to wait a long time for required intermediate results of other jobs. This holds especially for complex application workflows. If interdependence is high, even a small fraction of outliers can lead to the need for multiple resubmissions of the workflow. Here, again, this may provide an incentive for the user to submit complete workflow management processes as single jobs to the Grid, which may bypass quality-of-service mechanisms of the middleware and negatively influence overall stability even more.

1.4.4 Implications of measured overhead for middlewares

There is not much that middlewares can do to avoid overhead caused by hardware or network components, or by general workload in the Grid. Scheduling overhead mostly depends on such workload and on the configuration of the workqueue (prioritisation etc.). However, the reason for information service overhead is often to be found in the design of a middleware. By measuring this information service overhead with respect to overall latency and latency variance in different middlewares, and by taking a look at the design of the middleware components, conclusions concerning potential improvements can be drawn.


1.5 Methodology

In order to measure the latency values, a testbench was implemented in the Java programming language. To ensure uniform access to different middlewares from the testbench, the middleware wrapper API JavaGAT [9] was used. JavaGAT had to be extended with an adaptor for the gLite middleware. Part of that adaptor had already been developed by Dr. Alexander Beck-Ratzka and Andreas Havenstein (Max-Planck-Institut für Gravitationsphysik). However, the original adaptor was based on an outdated version of JavaGAT, lacked needed functionality and showed problematic behaviour in terms of memory usage. The latter was not the fault of the original authors but originated from third-party libraries with memory leaks. We have significantly improved the adaptor and added functionality, while fixing the memory leaks and porting it to the latest JavaGAT interfaces. We have used the adaptor as a part of the testbench to measure gLite execution times. The measurements themselves were conducted on a uniform timescale on different Grid sites with different middlewares. The results were statistically analysed and interpreted. The implementation of the testbench and the adaptor is described in detail in chapter 3.

1.6 Structure of the thesis

Section 1.7 presents previous work in the field of latency modelling, together with the most significant results obtained by it. To establish a context for the middleware performance comparison, Chapter 2 gives an overview of the components and internals of the analysed middlewares. Chapter 3 explains the mode of operation of the middleware wrapper JavaGAT, our work on the gLite adaptor and our middleware testbench, which uses JavaGAT as its base API. Having outlined all important parts of the testbench itself, Chapter 4 focuses on the collected results. We interpret overall scheduling latencies and middleware overhead for all middlewares, try to isolate obvious patterns in them, detect influences by factors such as submission date, used CE and configuration options, and compare the obtained values. Following that, we interpret the evaluation according to the peculiarities of the different middlewares. Chapter 5 briefly summarises our results. Appendix A contains a read-me written for the gLite adaptor, explaining some of the settings and pitfalls in practical use.


1.7 Related work

Previous work on latency modelling can be categorised according to the viewpoint. An abundance of work exists on the modelling of server-side queueing, while fewer authors have dealt with the latency as it is perceived by the user. In earlier days, this was not necessary, since due to the simple protocols that were involved in client-server communication, server-side queueing times and queueing times from the perspective of the client were identical up to a small, more or less constant factor. However, previous work covering differences between server-side queueing times and user-perceived latency exists in other research areas, e.g. with respect to network transmissions in the web [10]. Pure queueing times have been analysed by Mu'alem and Feitelson [11] on a typical cluster configuration with a backfilling FIFO scheduler. Backfilling denotes an advance reservation system based on future prediction for the jobs in the queue. The conclusion of the authors is that backfilling achieves lower execution times than simple FIFO strategies. Interesting corollaries of their results are that users are generally bad at estimating job execution times (which some cluster schedulers require as input) and that users seem to adapt their workload over time such that it best fits the deployed scheduling policy. In [12] Chun and Culler attach a utility function to jobs in order to evaluate queueing times in a user-based approach. It is argued that in a commercial supercomputing environment, transparent pricing and scheduling based on first-price auctions is most effective concerning the overall job waiting times. It is assessed that pre-emption does not speed up the waiting process if the differently priced queues have adequate sizes. The workload behaviour on an LCG cluster (LCG is the predecessor of gLite) is measured and analysed in [13]. The LCG cluster is managed by several queues with different policies for long-, medium- and short-running jobs respectively. The concept behind these multiple queues is that higher waiting times are acceptable for long-running jobs when seen in relation to the execution time. It is outlined that, despite the fact that VOs submit to queues with different lengths based on the estimated CPU time of their jobs, some VOs are treated unfairly concerning waiting times. Furthermore, the authors argue that jobs split into several subtasks will experience a comparatively longer waiting time than one large job performing all the tasks alone. In [14], Oikonomakos et al. also measure job waiting times on a Grid site running the LCG middleware. The observed waiting times have standard deviations that exceed the mean waiting times by large factors, which attests to the heavy-tailed distribution of job waiting times in the EGEE environment.


However, the systematic complexity of multi-layered hierarchical middlewares like gLite causes a divergence of user-perceived and cluster-generated overhead, since many middleware components may introduce overhead themselves. Consequently, some works focus on the overhead from the user perspective. Lingrand et al. present experiments conducted within EGEE's biomed VO over a long period of time in [15]. A timeout value of about 500 seconds is shown to significantly improve the expectation of the job execution time. Furthermore, the authors show that there is no relevant impact on the job execution time by the day of the week. The same authors show in [16] that the optimal timeout value based on their latency model varies significantly among different gLite sites. Furthermore, different queues on gLite sites are evaluated with respect to their latency behaviour. It is argued that the best result with regard to execution time is obtained by classifying the queues into two classes with different optimal timeouts. Concerning temporal influence, the observation is made that latencies are overall slightly higher on weekends. However, the presented data on day-of-the-week influences shows partial inconsistencies and would need further exploration, which is also acknowledged by the authors in the conclusion. Furthermore, it is questionable whether ANOVA is the best choice for the analysis of the class partitioning, because the CDFs indicate that the samples are not normally distributed, and ANOVA requires normally distributed underlying data as a prerequisite. Another probabilistic model for computing optimal timeout values for resubmission strategies is introduced in [17]. The authors apply the model to several well-known distributions in order to derive optimal timeout values for systems modelled by them. For the EGEE grid, the probability distribution corresponds to a mixture of log-normal and Pareto distributions. It is argued that by deriving optimal timeout values from the model, the expectation of the job duration moves into the range of outlier-free production systems. As for collecting the necessary input data for the model, the authors suggest that it could be taken from the workload management system logs. All three works focus on total execution times as perceived by the end user as the subject of their analysis. Our approach, on the other hand, allows us to analyse the different sources of overhead, namely scheduling and information service delays, separately.


Chapter 2

The Middlewares

2.1 SSH

In order to utilise Grid resources for job execution, the following minimal components are required:

• Authorisation and access control
• Computing resource access
• File transfer

These tasks can already be achieved with the standard SSH protocol [18]. Authorisation is part of the protocol itself and access control can be ensured using Unix ACLs and schedulers on the execution site. Computing resource access is a central part of the protocol's functionality, and file transfer can be implemented with SCP, which is based on SSH. Such a solution is very simple, because there is no matchmaking and no metascheduling. The authorisation solution mandates that each user owns an account on the machine on which the Grid application is executed, which is not very scalable. There is no inherent support for parallelism with this solution, and scheduling on the executing machine has to be performed by the operating system scheduler, which is often hardly apt to do this job in an uncontrolled multi-user environment with long-running, non-interactive applications. Furthermore, due to the lack of matchmaking, the user is obliged to know all computing elements, their respective addresses and their hardware and software properties, and to do the matchmaking herself. Hence, SSH is not a Grid middleware: even though it can be used for remote job execution, it lacks vital functionality of middlewares. However, the time overhead produced by this kind of solution is very close to zero, basically only comprising network latencies. Therefore it is a good base reference when the overheads of complex middlewares are evaluated.
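For illustration, remote execution of a single job over plain SSH reduces to a few lines of client code. The sketch below uses the JSch library; the host name, user name and key path are placeholders, and the library choice is ours, not something prescribed by this thesis.

    import com.jcraft.jsch.ChannelExec;
    import com.jcraft.jsch.JSch;
    import com.jcraft.jsch.Session;

    // Illustrative sketch of "SSH as a minimal middleware": run one remote command.
    public class SshJob {
        public static void main(String[] args) throws Exception {
            JSch jsch = new JSch();
            jsch.addIdentity("/home/user/.ssh/id_rsa");          // key-based authorisation
            Session session = jsch.getSession("user", "worker.example.org", 22);
            session.setConfig("StrictHostKeyChecking", "no");    // simplification for the sketch
            session.connect();

            ChannelExec channel = (ChannelExec) session.openChannel("exec");
            channel.setCommand("/bin/hostname -f");              // the "job"
            channel.setInputStream(null);
            channel.setOutputStream(System.out);                 // crude "stdout staging"
            channel.connect();
            while (!channel.isClosed()) Thread.sleep(100);       // wait for completion

            channel.disconnect();
            session.disconnect();
        }
    }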


2.2 Globus

The Globus Toolkit was one of the first Grid middlewares and is perhaps still the most popular one. Many technologies that have emerged from the Globus project are now de facto standards in Grid computing, such as the GridFTP protocol and the Grid Security Infrastructure (GSI). As a basis for our experiments, we have used the older, non-webservice-based version 2 of the Globus Toolkit. All essential components of Globus version 2 are still contained in the current major version 4 [19]; in particular, GT4 contains modules for webservice-based job management (WS GRAM) as well as pre-webservice job execution management services (Pre-WS GRAM). Generally, the Globus Toolkit is a collection of APIs provided by client libraries for different programming languages, server libraries which provide the Globus services, and command-line tools for end users. The middleware consists of several components, which implement different infrastructural tasks.

2.2.1 Security concepts

In order to guarantee access control and message protection, the Grid Security Infrastructure (GSI) was created as a part of the Globus middleware. GSI enforces that only members of an authorised VO may access Grid resources. Additionally, it provides cryptographic means for message integrity and authenticity. The access control mechanisms require the identity of the user to be established. For this purpose, X509 certificates that are issued by certificate authorities are used, as described in section 1.2.1. Such an X509 certificate consists of the following components:

• Version (in Globus, this is version 3)
• Serial number
• Signature algorithm ID (e.g. sha1WithRSAEncryption)
• Issuer (the Certificate Authority)
• Validity
• Subject (the Distinguished Name of the user in X500 format)
• Subject Public Key (used for message encryption)
• Optional extensions
• Signature algorithm ID
• Signature value, encoded in ASN.1 DER

The signature at the end is from the CA and attests to the validity of the information in the certificate, and especially to the ownership of the public key by the certificate's identity. For GSI, this certificate is required to be saved in the .pem format, which means that the above information is stored in a Base64 encoding. For access control, the user would be required to prove her identity every time she uses a Grid resource. Since her identity is established by challenging her knowledge of the private key that belongs to the public key bound to her identity in the certificate, this would mean having her enter the password for decryption of the private key on each resource access. Since this is hardly feasible in a production environment, GSI supports the delegation of a temporally limited proxy which can impersonate the user [20]. The delegation process consists of creating a new public/private key pair, the public key of which will be used for a new X509 certificate that is signed by the holder of the Grid certificate herself. The generated private key is stored alongside the certificate in a proxy file (see figure 2.1). For security reasons, file permissions on the proxy are set restrictively and its validity is limited in time. The proxy's subject corresponds to the subject name in the Grid certificate, and other properties are derived from the original certificate as well. A new critical X509 extension ensures that the proxy creator can define different policies concerning the proxy rights (i.e. whether it may delegate new proxies by itself). The generated private key is intended for one-time use, which makes proxy revocation an easy task: it suffices to delete the proxy file. Based on these proxy files, authentication in Globus is performed via the GSS-API and SSL version 3, using the OpenSSL implementation [21].
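The certificate fields listed above can be inspected with the standard Java certificate API. The following minimal sketch is our illustration, not part of the thesis; the file name is a hypothetical placeholder for a PEM-encoded user certificate of the kind GSI uses.

    import java.io.FileInputStream;
    import java.security.cert.CertificateFactory;
    import java.security.cert.X509Certificate;

    // Illustrative: print the main components of a PEM-encoded X509 certificate.
    public class CertInfo {
        public static void main(String[] args) throws Exception {
            CertificateFactory cf = CertificateFactory.getInstance("X.509");
            X509Certificate cert;
            try (FileInputStream in = new FileInputStream("usercert.pem")) {   // hypothetical path
                cert = (X509Certificate) cf.generateCertificate(in);
            }
            System.out.println("Version:   " + cert.getVersion());
            System.out.println("Serial:    " + cert.getSerialNumber());
            System.out.println("Sig. alg.: " + cert.getSigAlgName());
            System.out.println("Issuer:    " + cert.getIssuerX500Principal());
            System.out.println("Subject:   " + cert.getSubjectX500Principal());
            System.out.println("Valid:     " + cert.getNotBefore() + " - " + cert.getNotAfter());
        }
    }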

2.2.2 Resource Management

In Globus, resource management consists of the following components [22]:

• Resource Specification Language (RSL)
• Resource brokers (RBs)
• Site co-allocators (e.g. DUROC)
• Grid Resource Allocation Manager (GRAM)

• Information services (MDS)
• Local job queueing systems (SGE, PBS, LSF, fork, etc.)

Figure 2.1: The GSI proxy delegation process

With the Resource Specification Language (RSL), the Grid user specifies the executable, files for input and output staging, the estimated job duration (wall clock time), environment variables, arguments etc. This is done in a standard textual way, using name/value pairs (see the example below). The RSL is then sent to a Resource Broker, whose task is to specialise the RSL, i.e. to derive possible execution sites directly from the RSL's attributes. Such a specialisation can be done by querying the Globus information service (see section 2.2.3) or subsequent resource brokers for suitable sites according to certain properties derived from the RSL (e.g. available RAM size, number of CPUs, installed software etc.). The resulting processed RSL is called ground RSL. If the job requires parallel execution on different sites in terms of MPI or some other environment, the RSL will be passed to a site co-allocator. We don't have such a use case in our testbench scenario; hence, co-allocation will not be covered here. An in-depth description can be found in [23]. In our case, the specification refined by the resource broker is passed to a GRAM.
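To illustrate the name/value syntax described above, a request of the kind a user might hand to the resource broker could look roughly as follows. The attribute names follow common GRAM RSL usage and all values are purely illustrative, not taken from our experiments.

    & (executable = "/bin/hostname")
      (arguments = "-f")
      (count = 1)
      (maxWallTime = 5)
      (stdout = "hostname.out")
      (stderr = "hostname.err")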


Figure 2.2: GRAM job states (Pending, Active, Done, Failed)

Figure 2.3: Example GRAM job registration process

GRAM exists in a newer, webservice-based version (WS GRAM, GRAM4) and in an older version (Pre-WS GRAM, GRAM2) that is still supported by current releases of the Globus Toolkit. Since the AustrianGrid sites still exclusively provide GRAM2 service endpoints, our experiments rely on the older version of GRAM, which is described here. When the resource description reaches GRAM level, it is already on an execution site, i.e. on a cluster, blade, parallel machine or some other compute element. Usually, some local load balancer/scheduler runs on these compute elements, such as Sun Grid Engine (SGE), Portable Batch System (PBS) or LoadLeveler (LL). GRAM provides a uniform interface to all those schedulers while using them for queueing the Grid jobs. At the same time, GRAM supervises these executions, notifies the user of status changes using a callback URL specified at job submission and generates a job handle that can be used for cancelling the job or actively polling its status (figure 2.3). The job's legal status transitions can be modelled by an acyclic directed graph (see figure 2.2).

2.2.3 Information services

Information services in Globus are implemented using central LDAP directories [24] and a standardised information model called GLUE [25]. Along with the protocols for automated site information retrieval (GRIP) and notification about information availability (GRRP), those components form the Globus information system MDS [26]. A frequent configuration is known as the Berkeley Database Information Index (BDII [27]), which consists of two or more LDAP databases in combination with a well-defined update process. Aggregate LDAP directories may be hierarchically grouped and can notify each other about information updates within their controlled realms using GRRP. The resource broker or the user will likely search a top-level LDAP service which includes the information of all smaller domain-specific LDAP services. This method of information aggregation ensures scalability. On the other hand, the use of aggregation with notification protocols can mean that information in the aggregate directory services is outdated due to propagation delay. However, updates are often relatively infrequent, and the notification intervals can be fine-tuned to form a trade-off between system load due to useless messages and update notification delay.
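Since a BDII is essentially an LDAP server publishing GLUE data, it can be queried with any LDAP client; the sketch below uses the standard JNDI API. The host name, port and base DN are placeholders, and the filter and attribute names are meant as GLUE-style examples rather than a tested query against a specific deployment.

    import java.util.Hashtable;
    import javax.naming.*;
    import javax.naming.directory.*;

    // Illustrative sketch: query a BDII (LDAP) for GLUE information about CEs.
    public class BdiiQuery {
        public static void main(String[] args) throws NamingException {
            Hashtable<String, String> env = new Hashtable<>();
            env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
            env.put(Context.PROVIDER_URL, "ldap://bdii.example.org:2170");   // placeholder endpoint
            DirContext ctx = new InitialDirContext(env);

            SearchControls sc = new SearchControls();
            sc.setSearchScope(SearchControls.SUBTREE_SCOPE);
            sc.setReturningAttributes(new String[] {"GlueCEUniqueID", "GlueCEStateWaitingJobs"});

            NamingEnumeration<SearchResult> results =
                ctx.search("o=grid", "(objectClass=GlueCE)", sc);            // GLUE-style filter
            while (results.hasMore()) {
                System.out.println(results.next().getAttributes());
            }
            ctx.close();
        }
    }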

2.2.4 Data transfer

Globus 2 uses GridFTP (section 1.2.2) as the only protocol for input and output staging and inter-grid-site data transfer.

2.3 gLite

gLite is used as the middleware of the EGEE project (http://www.eu-egee.org/), an EU-funded multi-disciplinary Grid environment and one of the largest Grid infrastructures in the world. gLite is a de facto successor of the LHC computing grid (LCG) and uses components developed for the LCG. Since the LCG is a package of Globus, Condor and CERN-made components, many parts of the gLite middleware are based on protocols and technology developed as part of Globus. gLite is distributed as a set of command-line tools that run exclusively on Scientific Linux, various APIs for programming languages for the proprietary protocols of gLite, and WSDL files for the gLite webservices, from which bindings for different programming languages may be generated. Many components of gLite are webservice-based, so most bindings can be generated from the respective WSDL files.

2.3.1 Security concepts

gLite uses GSI and proxy delegation based on the Globus concepts (section 2.2.1). However, in a Grid environment with the dimensions of EGEE, the existing Globus practice of manually adding authorised users to map-files on every potential execution site for a certain VO has been found not to scale. Hence, authorisation itself is performed by a Virtual Organisation Membership Service (VOMS) [28]. The VOMS server has knowledge about the users of a certain VO and maintains that information in a relational database. If the user wants to construct a VOMS proxy, the client first handshakes with the VOMS server using a standard GSI-authenticated message and then sends a signed request. The user's request contains roles, groups and capabilities that can later be used by Grid sites for fine-grained access control. If the request can be parsed correctly by the VOMS server and the user has sufficient rights for the demanded group membership, role and capabilities, the server sends a structured VOMS authorisation response. The VOMS response contains the following data:

• VOMS user authorisation info (groups, roles, capabilities)
• User credentials
• VOMS server credentials
• Temporal validity of the proxy
• VOMS server signature

The client can now save the VOMS response as an X509v3 Attribute Certificate in the Grid proxy certificate and serialise the resulting certificate to a VOMS proxy file. For access control, the user transmits the full VOMS proxy to the resource broker, which only needs to check the validity of the VOMS information instead of consulting a user map-file. However, unwanted users can still be banned from resources by explicit blacklisting. VOMS proxies cannot be revoked, but have a limited lifetime which is included in the attribute certificate to avoid reuse of existing VOMS tickets by malicious users.

2.3.2 Resource Management

The gLite resource management consists of the following components [29] [30]:

• Job Description Language (JDL)
• Workload Management System (WMS)
• Logging and Bookkeeping Service (LB)
• Information supermarket (ISM)
• Computing Element (CE)
• Local Resource Management Systems (LRMS, e.g. PBS, Maui, SGE, ...)

Job Description Language

The structure of the Job Description Language is quite similar to the RSL format used in Globus (section 2.2.2). It also consists of name/value pairs, but offers functionality beyond the scope of RSL. The language is based on Condor's ClassAd language [31]; besides standard fields for executables, input and output sandbox files, stdout and stderr, arguments and environment variables, explicit GLUE attributes can be supplied as requirements for matchmaking (an illustrative JDL file is shown below). Since all GLUE attributes that the WMS can derive from the information supermarket may be specified in the requirements section, the user can become quite explicit about hardware and software requirements; for example, the user may desire more than 1 GB of RAM, RedHat Linux as the CE's operating system or an i686 processor on the executing machine. It is even possible to specify explicit workqueues for scheduling, or to exclude certain workqueues.

The Workload Management System

Like the Resource Broker in Globus, in gLite the WMS takes the user's JDL and tries to find an appropriate resource fitting the requirements expressed in the JDL for job scheduling. Unlike in Globus, there is no active layer between the WMS and the local resource manager, i.e. there is no specific gLite analogue to Globus' GRAM. However, the CE must provide a webservice interface for submitting and cancelling jobs, retrieving the job status and sending signals to jobs. This is called a computing element acceptance service (CEA). CEs can operate in push and pull models, i.e. they either receive jobs and execute them when they are idle, or ask for jobs when they are idle. In order to make informed matchmaking decisions, the WMS queries the information supermarket. If no resources are available, it keeps the submission request in a task queue. Clients can communicate with the WMS using a webservice interface and SOAP messages. The whole job registration process is depicted in figure 2.4.

The compute element

The CE is the subject of the WMS meta-scheduling process. As described above, it offers a webservice interface for job registration and cancellation called CEA. The CEA interacts with the job controller (JC), which controls the locally deployed cluster manager/load balancer (LRMS).


Figure 2.4: Job registration in gLite

These local resource managers queue arriving jobs and schedule them, based on autonomous decisions, onto various worker nodes (WN), which are responsible for performing the actual computation. Typically deployed LRMSes are PBS, SGE, LCG-PBS, LSF and LL, as well as Maui, which is a queueing plug-in for various LRMS products. The job monitor checks the status of the job (running, completed) and reports it to the logging and bookkeeping system.
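To make the JDL attributes discussed at the beginning of this section concrete, a minimal job description might look as follows. The attribute and GLUE names follow the gLite JDL conventions, and all values are illustrative rather than taken from our experiments.

    [
      Executable    = "/bin/hostname";
      Arguments     = "-f";
      StdOutput     = "std.out";
      StdError      = "std.err";
      OutputSandbox = { "std.out", "std.err" };
      RetryCount    = 3;
      Requirements  = other.GlueHostMainMemoryRAMSize >= 1024;
      Rank          = -other.GlueCEStateEstimatedResponseTime;
    ]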

Logging & Bookkeeping In gLite, the job’s status is tracked using a specialised component, the logging and bookkeeping (LB) server. The status is updated by events that are generated either by CEs themselves or by an aggregating WMS, using some logging API. The logging API passes the information to a physically close locallogger daemon, which stores it on a local disk file and reports success to the logging API. The interlogger daemon forwards the information from the locallogger daemon to the responsible bookkeeping server. The URL of the bookkeeping server is included in the server part of the job-id, which has already been chosen by the WMS. Hence, LB server and job remain interdependent and an LB server collects information about the job’s status during its complete lifetime. The incoming events are mapped to higher level gLite states (figure 2.6) by the LB-server, where the user may query for them, or actively receive them if she registered for update notifications (figure 2.5).


Figure 2.5: The information flow in L & B reporting (locallogger, interlogger, LB server and the local log file used by the WMS components)

Figure 2.6: gLite job states (bold arrows on successful execution path): Submitted, Waiting, Ready, Scheduled, Running, Done (OK / Failed / Canceled), Cleared, Aborted



2.3.3 Information services

The information supermarket (ISM) is a central repository for information about Grid resources. It can be queried by the WMS during the process of matchmaking for a JDL submission request. The architecture of the ISM itself is usually implemented in a quite similar way as in Globus (section 2.2.3), based on the BDII. As in Globus, GLUE is used as the information model for the data. More advanced query and update technologies for the information supermarket, such as R-GMA [32] or XML/SOAP, exist but are rarely used.

2.3.4 Data transfer

gLite supports multiple ways of transferring data. Like in Globus, input data can be sent to the WMS, which sends it along to the CEs where the job is executed. Output data will be sent from the CE to the WMS, which passes it back to the user. The whole data transfer is conducted using GridFTP; input and output files for data staging are specified in the job's JDL string. The WMS will store input and output in local directories, called the input/output sandbox. However, the input and output sandbox for a job is usually limited to a size between 10 and 100 MByte. For larger files, a Grid site dedicated to file storage, a so-called storage element (SE), should be used. gLite employs a number of different technologies to allow the user consistent SE access. Upload of a large file to an SE requires the following steps:

1. Space on the SE for the file needs to be allocated. For this purpose the Storage Resource Manager interface (SRM [33]) is used. SRM is a uniform webservice-based interface to different kinds of SEs and their different storage systems. SRM is not implemented by gLite, but by the storage systems, which are supposed to provide an SRM interface. If a space reservation request invoked by a user is granted by SRM, it will return a storage URL (SURL), which denotes the full path where the file is going to be stored, and a transport URL (TURL), which is an endpoint for direct GridFTP transfers of the file to the SE.

2. Register the file in the LCG File Catalog (LFC) [34]. For that purpose, the LFC replica server for the current VO has to be contacted using a proprietary protocol. Once the user sends in the address of one replica (the SURL received in step 1), the LFC server will respond with a unique identifier and a logical file name (LFN). This is a one-to-many mapping, i.e. more replicas with different storage URLs can be added to the same unique identifier. The file's GUID is immutable, while the LFN can be changed by the user.

3. Use the transport URL and GridFTP to copy the file to the SE.

4. Get the file during job execution. Currently, this is only possible using the respective command-line interfaces of the lcg-tools.

The WMS can be instructed to schedule jobs to CEs "close" to the SE where the files are stored by adding the LFN or the GUID to the DataRequirement field in the JDL. Hence, management of large files in gLite is still quite a complex matter, and disappointingly there exist only C reference implementations for the proprietary LFC protocol. Java APIs such as GFAL [35] are just C wrappers and require platform-dependent libraries to be installed.


Chapter 3

Implementation of adaptor and testbench

3.1 JavaGAT

JavaGAT [9] is a high-level Grid-access interface for abstracting the technical details of middleware-specific code, which is maintained independently from the interface.

3.1.1 The motivation behind JavaGAT

Learning all the details about specific Grid middlewares in order to use their APIs in a productive way can be a tedious task, especially for application programmers who don't know and don't care much about the Grid paradigms and the relations among the various components. Additionally, Grid APIs are constantly evolving and thus render Grid-based applications short-lived. To remove this burden from application programmers, who prefer to focus on their own field of expertise, the middleware wrapper JavaGAT was created. The aims of the JavaGAT project are:

• Hiding the diversity and frequent API changes of existing middlewares behind a uniform programming API
• Abstracting away the low-level nature of current middleware APIs and helping the developer concentrate on the core application rather than Grid internals
• Allowing applications to stay portable and making the exchange of employed middlewares easy

3.1.2 JavaGAT's overall architecture

JavaGAT uses an adaptor concept to achieve the goals detailed in section 3.1.1. Adaptors can be written for different Grid middlewares and provide different capabilities (such as job monitoring, resource broker contact, I/O operations etc.).


Figure 3.1: JavaGAT's structure

JavaGAT defines the interfaces for the middleware-specific adaptors and matches the generic API calls of the end user to middleware-specific code. When the user invokes a high-level operation, such as copying a file from a source to a (Grid) destination, JavaGAT, if not restricted by user-supplied preferences, will try to invoke all available middleware implementations until the first one succeeds in performing the requested operation. This process is called intelligent dispatching. The contract that middleware adaptors must fulfil is described in JavaGAT's capabilities provider interface (CPI). The term "interface" is misleading, since the CPIs are actually abstract classes that themselves implement all functionality that can be solved generically. The structure is shown in figure 3.1. Middleware adaptors that extend the CPI classes can leave methods described in the contract unimplemented, in which case an UnsupportedOperationException will be thrown upon invoking the respective method in the respective adaptor. JavaGAT hides failures in method invocation on a certain adaptor from the user, unless all applicable adaptors have failed. JavaGAT aims to seamlessly integrate with Java's programming paradigms.


For example, classes for file staging extend Java's java.io classes. To control the behaviour of JavaGAT itself and the instantiated adaptors, a set of meta-classes is available, such as GAT, GATContext and Preferences. GAT is used as a factory façade for the construction of actual objects behind the various interfaces. GATContext carries security parameters and additional GAT- or adaptor-specific preferences specified in the Preferences class. For instance, if the user wants to restrict the set of file adaptors that are going to be invoked to SFTP, she may add the preference ("File.adaptor.name", "sftp") to the GATContext and subsequently limit connections to passive mode by specifying the preference ("ftp.connection.passive", "true"). Thus, knowledgeable users can control the behaviour of JavaGAT, while the interface remains simple for inexperienced users.
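Put together, a file copy restricted to the SFTP adaptor could look roughly like the sketch below. It assumes the JavaGAT 2.x API as described in this chapter; the remote host and paths are placeholders, and the exact class and method names should be checked against the JavaGAT documentation.

    import org.gridlab.gat.GAT;
    import org.gridlab.gat.GATContext;
    import org.gridlab.gat.Preferences;
    import org.gridlab.gat.URI;
    import org.gridlab.gat.io.File;

    // Illustrative sketch of the Preferences mechanism described above.
    public class CopyExample {
        public static void main(String[] args) throws Exception {
            GATContext context = new GATContext();
            Preferences prefs = new Preferences();
            prefs.put("File.adaptor.name", "sftp");         // restrict file adaptors to SFTP
            prefs.put("ftp.connection.passive", "true");    // force passive-mode connections
            context.addPreferences(prefs);

            File src = GAT.createFile(context, new URI("sftp://host.example.org/data/input.dat"));
            src.copy(new URI("file:///tmp/input.dat"));     // intelligent dispatching picks the adaptor

            GAT.end();
        }
    }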

3.1.3 Writing adaptors for JavaGAT

Adaptors for JavaGAT have to fulfil the following requirements:

• Extend the respective CPI classes (e.g. an adaptor for file transfer should extend FileCpi, whereas an adaptor for resource brokerage should extend the ResourceBrokerCpi class).
• Implement the minimal set of methods that are necessary for the adaptor to provide useful functionality.
• Deploy the adaptor as a JAR file in the directory specified by the gat.adaptor.path environment variable.
• Include all the actual CPI implementations provided by the adaptor in the JAR's manifest file, with the respective CPI as attribute name and the implementing classes as the attribute value.

The invocation of the adaptor implementation by the GATEngine is conducted with the help of Java classloading and reflection mechanisms.

3.2 Implementing the gLite adaptor

Dr. Alexander Beck-Ratzka (Max-Planck-Institut für Gravitationsphysik / Albert-Einstein-Institut, http://www.aei.mpg.de) has kindly supplied us with the gLite adaptor that had been developed at his institute. The adaptor included basic functionality such as job submission, input/output data staging, status polling and retrieval of job information.


However, it lacked some functionality that we considered important. First and foremost, the original adaptor did not include methods to create VOMS proxies. If the adaptor cannot create VOMS proxies itself, a manual step is required before invoking the adaptor: one needs to create the required proxies with the assistance of the gLite command-line tools. Since those tools de facto exclusively support Scientific Linux, which is not a common Linux flavor on workplace computers, this mostly implies creating the proxy on a system running Scientific Linux (typically a Grid UI machine) and copying it back to the system executing the adaptor. We found this hardly acceptable with regard to our use cases, so the primary goal was to implement VOMS proxy support. Further improvements to the original adaptor were made as additional requirements were identified in productive use. Those improvements include:

• Copying of output files to user-specified directories
• Use of LDAP directory URLs as resource broker names and identification of available resource brokers from LDAP by the adaptor
• Custom poll intervals for the job status polling thread
• Support of additional JDL attributes (DataRequirements.InputData, RetryCount, GLUE matchmaking attributes)
• Support for passive status updates via metric listeners
• Resolving problems with multithreaded use
• Working around memory leaks in external libraries
• Making job information retrieval JavaGAT-compliant
• Porting the original adaptor to JavaGAT 2.0
• Refactoring the original adaptor's code

Attempts have been made to implement logical file handling (section 2.3.4), but this idea has been dropped due to unresolvable problems with the proprietary LFC protocol. Furthermore, with respect to JavaGAT's structure, such functionality would have had to be implemented as a subclass of LogicalFileCpi and not as a part of the gLite adaptor. Because CPIs are abstract classes, not interfaces, and Java does not support multiple inheritance, an adaptor can only be derived from one CPI.


Figure 3.2: Class diagram of the VOMS proxy management classes

The following sections are going to outline some aspects of the adaptor in detail.

3.2.1 VOMS-Proxy creation

We created the package org.gridlab.gat.security.glite, in which we implemented our gLite security classes (figure 3.2). The process of VOMS proxy creation is roughly explained in section 2.3.1. Our classes model these mechanisms. GlobusProxyManager implements the standard Globus proxy factory methods, mostly by using the respective commodity methods from the JavaCoG-Kit [36]. The VomsProxyManager extends the GlobusProxyManager and can create a VOMS proxy by executing the following steps:


1. A standard Globus proxy is created.
2. This proxy is used for obtaining a GSI-secured socket to the VOMS server.
3. The VomsServerCommunicator sends the request string containing the group membership, role and capabilities request, encapsulated in an XML message.
4. If the server was able to parse the request, the VomsServerCommunicator receives an XML response which contains the granted attribute certificate (AC), or an error message.
5. The AC is Base64-encoded and must be decoded in order to be usable with the certificate tools.
6. Finally, the AC is DER-encoded and stored as an extension in a new Globus proxy, which is now a full VOMS proxy.

3.3 Job submission

The architecture of the gLite resource brokerage adaptor is visualised in figure 3.3. JavaGAT's adaptor invocation mechanism will call the submitJob() method on the GliteResourceBrokerAdaptor when the user submits a job. If the brokerURI passed to the GliteResourceBrokerAdaptor upon construction is an LDAP address, the constructor uses the LDAPResourceFinder to locate WMSes corresponding to the virtual organisation specified in the GATContext. It will pick one such WMS randomly and set the brokerURI accordingly. In submitJob() it constructs a new GliteJob and attaches the given metric listener to it. The GliteJob class constructs a VOMS proxy using the classes from section 3.2.1, authenticates with the WMS via webservice calls, constructs a JDL string with the methods in the JDL class, copies the sandbox files with GridFTP and uses the respective webservice methods to get a job ID and start the job. Once it receives the job ID, it tracks the state at the LB server. Thus it closely models the steps depicted in figure 2.4.
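From the client side, the submission path described above is driven through the generic JavaGAT resource broker API. The sketch below assumes the JavaGAT 2.x interfaces discussed in this chapter; the WMS endpoint is a placeholder and the preference keys are illustrative examples rather than the adaptor's authoritative configuration names.

    import org.gridlab.gat.GAT;
    import org.gridlab.gat.GATContext;
    import org.gridlab.gat.Preferences;
    import org.gridlab.gat.URI;
    import org.gridlab.gat.resources.Job;
    import org.gridlab.gat.resources.JobDescription;
    import org.gridlab.gat.resources.ResourceBroker;
    import org.gridlab.gat.resources.SoftwareDescription;

    // Illustrative client-side view of a gLite submission through JavaGAT.
    public class SubmitExample {
        public static void main(String[] args) throws Exception {
            GATContext context = new GATContext();
            Preferences prefs = new Preferences();
            prefs.put("ResourceBroker.adaptor.name", "glite");   // restrict dispatching to the gLite adaptor
            prefs.put("VirtualOrganisation", "voce");            // hypothetical preference key for the VO
            context.addPreferences(prefs);

            SoftwareDescription sd = new SoftwareDescription();
            sd.setExecutable("/bin/hostname");
            JobDescription jd = new JobDescription(sd);

            ResourceBroker broker = GAT.createResourceBroker(context,
                    new URI("https://wms.example.org:7443/glite_wms_wmproxy_server"));  // placeholder WMS
            Job job = broker.submitJob(jd);
            System.out.println("submitted job, state: " + job.getState());
            GAT.end();
        }
    }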

3.3.1 External libraries

Due to the webservice nature and the security constraints of gLite, the adaptor depends on 30 external libraries. Among those are many webservice support libraries, such as Axis or JAX-RPC, the webservice stubs for the various webservices themselves (WMS, LB, security delegation, JDL processing), Globus libraries, security libraries (to enable SSL encryption in GSI-protected sessions), and libraries upon which the aforementioned ones are dependent.


Figure 3.3: Class diagram of the gLite adaptor


When loading the adaptors, JavaGAT identifies the adaptor JAR by its manifest file. In a second step, it saves the paths to the classes from the libraries on which the adaptor depends (which have to be in the same directory) into a URLClassLoader. This classloader is set as the thread's context classloader upon invoking the adaptor. Normally, conflicts between differing versions of the same libraries required by different adaptors should thus be avoided. Nevertheless, we found that different versions of Apache Axis can cause problems with that mechanism. For example, classes from the Axis library version found in the Globus adaptor are still in the classpath when the gLite adaptor is instantiated. The reasons for this behaviour are rather unclear; web discussions about Axis indicate that it may stem from Axis using thread context classloaders itself. A workaround is to include the required version of Axis in a classpath with higher priority than the thread context classloader's (e.g. Java's system classpath) when executing the gLite adaptor.
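The classloading scheme described above can be pictured with the following simplified sketch; the JAR path is a hypothetical example and the real JavaGAT adaptor loader is considerably more involved.

    import java.net.URL;
    import java.net.URLClassLoader;

    public class ContextClassLoaderSketch {
        public static void main(String[] args) throws Exception {
            // libraries shipped next to the adaptor JAR (hypothetical path)
            URL[] adaptorLibs = { new URL("file:lib/adaptors/GliteAdaptor/lib/axis.jar") };
            ClassLoader adaptorLoader =
                    new URLClassLoader(adaptorLibs, ContextClassLoaderSketch.class.getClassLoader());

            Thread current = Thread.currentThread();
            ClassLoader previous = current.getContextClassLoader();
            current.setContextClassLoader(adaptorLoader);  // adaptor code now resolves its own library versions
            try {
                // ... invoke the adaptor here ...
            } finally {
                current.setContextClassLoader(previous);   // restore the original classloader
            }
        }
    }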

3.3.2 License JavaGAT is published under the conditions of the GridLab Open Source license. This license is BSD-like and basically allows redistribution in source or binary form, modification, use and installation without restrictions. Because the gLite adaptor was submitted to the central JavaGAT repository and published as a part of the JavaGAT release, it is non-exclusively subject to the license provisions thereof. The full license text can be found in Appendix B.

3.4 The testbench

The testbench we have used for measuring our results has a structure roughly similar to JavaGAT. A core module includes all the functionality common to all middleware test runners, while middleware-specific modules provide the implementation. An illustration can be found in the simplified class diagram in figure 3.4. This decision has been made out of pragmatic considerations: it works well with JavaGAT's architecture, allows for easy creation of new adaptor tests, since only little code is adaptor-dependent, and makes deploying tests on remote sites easier, because only the dependency libraries of the used modules need to be copied and maintained. The testbench logs interesting information to an XML file. Typically one XML file is associated with one concrete testcase class that extends Testcase. For each started job, the GATTestRunner logs meta-information such as jobID, used middleware, virtual organisation, resource broker URI and the executables with their arguments and input files.


Following that, all status updates received from the middleware are logged along with the time at which they were received. Furthermore, upon construction, the GATTestRunner starts a RunListener, which creates a listening socket at the first unbound port in the Globus TCP port range (ports 40000-40500). Those ports were chosen because they are usually not blocked by firewalls in Grid environments. Before submitting the job specified in the testcase instance, the GATTestRunner creates a wrapper shell script. This shell script calls the executables specified in the testcase with the respective arguments and adds a wget call to the IP address of the computer on which the testbench runs and to the RunListener's port, once before and once after the execution of the main part. The GATTestRunner replaces the specified executable by /bin/sh and adds the wrapper script as the only input argument. The actual execution parameters, as they are specified in the testcase instance, are logged as part of the job's meta-information. When the shell script is executed, it first sends a callback via wget to the RunListener. Thus, the RunListener can record the exact time of execution start and notify the GATTestRunner about it. The same applies to the execution end. The GATTestRunner eventually logs the dates received from the RunListener along with the dates reported back by the middleware to the XML file. Because under certain circumstances jobs may exhibit running times that are virtually infinite (see section 1.3), a maximum execution time has to be defined in the concrete testcase instance. After that time interval, the TestInstanceGuardian will forcibly terminate the job. The GATTestRunner logs such terminations to the XML file.
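A minimal sketch of how such a wrapper script could be generated is given below; the callback host, the URL paths and the payload command are illustrative assumptions, not the exact strings produced by the GATTestRunner.

    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class WrapperScriptSketch {
        public static void main(String[] args) throws Exception {
            String callbackHost = "testbench.example.org";  // assumed address of the testbench machine
            int callbackPort = 40000;                        // first unbound port in the Globus TCP port range
            String payload = "python sieve.py 100000";       // the testcase executable and its arguments

            String script =
                  "#!/bin/sh\n"
                + "wget -q -O /dev/null http://" + callbackHost + ":" + callbackPort + "/start\n"  // execution start callback
                + payload + "\n"
                + "wget -q -O /dev/null http://" + callbackHost + ":" + callbackPort + "/end\n";   // execution end callback

            // the job then runs /bin/sh with this script as its only argument
            Files.write(Paths.get("wrapper.sh"), script.getBytes("UTF-8"));
        }
    }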


Figure 3.4: A simplified class diagram of the testbench


Chapter 4

Results

4.1 Experiment setup

We have grouped our results according to middlewares, scheduling times and callback overhead, and we have tried to quantify the influence of the weekday and the time of the day. Furthermore, we have aggregated data based on the queue to which the job was submitted, the queue management system and the configuration of this system. We have tried to statistically model the gathered results. In all middleware tests, the executable consisted of a small Python script which computes prime numbers up to 100000 using the Sieve of Eratosthenes algorithm. Furthermore, small files (smaller than 10 kilobytes) had to be pre- and poststaged. This problem was chosen because it renders the tests less synthetic than e.g. just executing /bin/hostname, while not imposing too much load on the computational resources. We believe that under such a simulation the behaviour of information services can be observed in a better way than with purely synthetic tasks. Python was chosen because a Python installation is a standard component of virtually every Linux distribution, for which reason it is universally available on the Grid. If the execution did not finish within 45 minutes, the job was forcibly terminated by the testbench.
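The payload itself was a Python script; purely for illustration of the workload, a functionally equivalent sketch of the computation (primes up to 100000 with the Sieve of Eratosthenes) is shown below in Java. The real payload and its exact output format are not reproduced here.

    public class SievePayloadSketch {
        public static void main(String[] args) {
            int limit = 100000;
            boolean[] composite = new boolean[limit + 1];
            for (int i = 2; (long) i * i <= limit; i++) {
                if (!composite[i]) {
                    for (int j = i * i; j <= limit; j += i) {
                        composite[j] = true;   // mark multiples of i as non-prime
                    }
                }
            }
            for (int i = 2; i <= limit; i++) {
                if (!composite[i]) {
                    System.out.println(i);     // written to stdout, which is post-staged
                }
            }
        }
    }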

4.1.1 Statistical methods

Important figures of the collected data are the median and the mean, because they give an indication of the average absolute waiting times and overhead values. The variance of the data, on the other hand, gives an intuition about the deviation of the values from the median. A fast way of visualising both absolute overhead and its variance is the boxplot. It shows the median of a sample along with the lower and upper quartiles, i.e. the box covers the area between the lowest and the highest 25% of the data. The length of this area is called the interquartile range (IQR).


The bars located 1.5*IQR above the upper and below the lower quartile are called whiskers. Every data point smaller than the lower whisker or larger than the upper whisker is marked as a dot on the graph and considered an outlier. Histograms show data values on the x-axis and the frequency at which they occur on the y-axis. Sometimes this allows for a fast understanding of the approximate distribution of data values. When the large spread of data values makes it appropriate, we use logarithmic scales on the x- or the y-axis. Whenever an axis is logarithmic, this is mentioned on the respective graph. To determine whether the partition of measurements into categories (executed on a certain weekday, scheduled to a certain queue/type of queue, etc.) is reasonable, we use different tests. The t-test can be applied to two independent vectors of data points and tests the hypothesis that the two samples are from distributions with equal means. The Kruskal-Wallis test is a one-way analysis of variance on the collected data and returns a p-value. The p-value is the probability of obtaining, under the null hypothesis (in the Kruskal-Wallis test the null hypothesis is that the samples from the different groups are equally distributed), a result as extreme as or more extreme than the one that is tested. If the p-value is smaller than the significance level, the null hypothesis is rejected. The significance level is the probability of falsely rejecting the null hypothesis (i.e. a false positive probability) and is often set to 5%. If the p-value exceeds the significance level, the only conclusion that can be derived is that the hypothesis cannot be rejected at the significance level. Both the t-test and the Kruskal-Wallis test share the advantage that, unlike ANOVA, they do not assume a normal distribution of the underlying measurement values. It must be kept in mind that such tests can merely give a hint and cannot replace thorough interpretation of the data itself. In particular, there may be statistically significant differences among certain groupings even when the practical implications, i.e. the subjective latency difference in relation to the total execution time or in relation to other delay factors, are comparatively small. Additionally, for large samples, as we have available in most cases, any one-way analysis of variance is likely to produce smaller p-values than for smaller samples. Hence, the tests will only be used in addition to interpretation based on other statistical benchmarks. The Kolmogorov-Smirnov test (abbreviated K-S test) is a goodness-of-fit test, which means that it can be applied to check whether a gathered sample matches a certain distribution, or whether two samples are from the same distribution. The null hypothesis in the K-S test is that the two tested samples are from the same distribution and, like in the Kruskal-Wallis test, it can either be rejected by the test or not be rejected at the given significance level.
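As a small worked example of the outlier criterion used throughout this chapter (the numbers are the gLite scheduling figures reported in section 4.2.3), with a third quartile of 120 seconds and an IQR of 77 seconds:

    \text{upper whisker} = Q_3 + 1.5 \cdot \mathrm{IQR} = 120 + 1.5 \cdot 77 = 235.5\ \text{s}
    \text{extreme outlier bound} = Q_3 + 3 \cdot \mathrm{IQR} = 120 + 3 \cdot 77 = 351\ \text{s}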


An empirical way to evaluate the distribution of the data is plotting its empirical cumulative distribution function (CDF) against the CDFs of well-known distributions. This method is not as exact as a K-S test, but it can point out tendencies and give hints concerning the distribution of the sample. The CDF is defined as

F(x) = \int_{-\infty}^{x} f(u) \, du

where f(x) is the probability density function, i.e. at point x the CDF represents the summed probability of the value x and all values smaller than x in the sample. The above mentioned techniques become interesting if they can be used in a practical context, e.g. if timeout values for a timeout and resubmission strategy can be derived from them. In [17], the authors propose the following model for computing the optimal timeout value:

E_j(t_\infty) = \frac{1}{F_R(t_\infty)} \int_0^{t_\infty} u \, f_R(u) \, du + \frac{t_\infty - t_\infty (1-\rho) F_R(t_\infty)}{(1-\rho) F_R(t_\infty)}    (4.1)

where t∞ denotes the timeout value, FR the CDF and fR the probability density function of the samples, and ρ is the probability of outliers. We are going to use this model to estimate the expected execution time for different timeout values in our analysis. However, it must be critically remarked that, even though this model takes the outlier ratio into account, it only considers the past distribution at each evaluation point and not the length of the remaining tail. Hence it may be too optimistic on long-tailed distributions. For the sake of simplicity, from now on we are going to refer to the above equation simply as the timeout model.
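A minimal sketch of how the timeout model can be evaluated numerically on an empirical sample is given below; the sample values and the outlier probability are made-up illustration inputs, and the empirical CDF is simply taken as the fraction of sample points below the candidate timeout.

    import java.util.Arrays;

    public class TimeoutModelSketch {
        // Expected execution time E_j(t) according to equation 4.1,
        // evaluated on a sorted sample of non-outlier latencies.
        static double expectedTime(double[] sorted, double rho, double t) {
            int k = 0;
            double sum = 0;
            for (double x : sorted) {
                if (x > t) break;
                k++;
                sum += x;                                   // empirical version of the integral of u*f_R(u)
            }
            if (k == 0) return Double.POSITIVE_INFINITY;
            double F = (double) k / sorted.length;          // empirical F_R(t)
            double firstTerm = sum / k;                     // (1/F_R(t)) times the integral term
            double secondTerm = (t - t * (1 - rho) * F) / ((1 - rho) * F);
            return firstTerm + secondTerm;
        }

        public static void main(String[] args) {
            double[] sample = { 45, 60, 80, 91, 110, 120, 150, 200, 300, 800 }; // illustration only
            Arrays.sort(sample);
            double rho = 0.06;                              // assumed outlier probability
            double bestTimeout = 0, bestExpectation = Double.POSITIVE_INFINITY;
            for (double t = 10; t <= 1000; t += 10) {
                double e = expectedTime(sample, rho, t);
                if (e < bestExpectation) {
                    bestExpectation = e;
                    bestTimeout = t;
                }
            }
            System.out.println("optimal timeout ~ " + bestTimeout + " s, expected time ~ " + bestExpectation + " s");
        }
    }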

4.1.2 SSH

A central remote computer (zid-gpl.uibk.ac.at) served as the execution environment for the SSH test. From this server, desktop computers in computer rooms of the university are reachable via SSH. These computers constitute the CEs in our test. Every 30 minutes, the testbench picks three user desktops from three distinct computer rooms (RR15, RR18 and RR20) independently and submits the test job to each of them. There is no a-priori way to make assumptions about whether the computer will be switched on and run Linux, which is a necessary condition, or about how big the existing user-created load at the "CE" will be at the time of job acceptance.


4.1.3 Globus

The Globus experiment consists of jobs that were submitted between 2008-10-10 and 2008-10-14 and between 2008-11-07 and 2008-11-15 in a 30 minute interval to the following Globus sites in the AustrianGrid (http://www.austriangrid.at); we will use the abbreviations in brackets from now on:

• http://blade.labs.fhv.at/jobmanager-sge (blade.labs)
• http://schafberg.coma.sbg.ac.at (schafberg)
• http://altix1.jku.austriangrid.at/jobmanager-pbs (altix1.jku)
• http://hydra.gup.uni-linz.ac.at/jobmanager-pbs (hydra.gup)

Due to problems with firewalls, we used the active job polling mechanism in Globus as opposed to GRAM callbacks (as they are shown in figure 2.3). Active job polling contacts the resource broker, which polls GRAM.

4.1.4 gLite

Every 30 minutes, 3 jobs were submitted to the same resource broker in the virtual organisation VOCE (VO for Central Europe, http://egee.cesnet.cz/en/voce/). The used resource broker was skurut67-6.cesnet.cz (now replaced by wms1.egee.cesnet.cz). Since the resource broker acts as a meta-scheduler, the decision about the queue in which the job will become executed is made by the WMS (see section 2.2.2). The measurements took place between 2008-09-22 and 2008-10-05 and between 2008-10-09 and 2008-10-14.

4.2 Scheduling times

In the following sections we discuss scheduling times for the SSH, Globus and gLite Grid environments. The terms scheduling time, scheduling latency and scheduling delay all refer to the time interval which passes between the first middleware notification that the job has been registered and the actual execution of the payload on a computational resource in the Grid.

4.2.1 SSH

SSH has a median scheduling latency of 0 seconds, the mean of all samples is 39 milliseconds, and the variance is 7847 millisec², which corresponds to a standard deviation of 0.08 seconds.


This means that nearly all SSH jobs were scheduled at the same time as or before JavaGAT could report the submission status. The few hundred milliseconds of latency that some job submissions exhibit is composed of network and authentication latency, but overall it can be said that the delays will be hardly noticeable by the end user.

4.2.2 Globus

After evaluating the measured scheduling times in general, we have tried to factor in influences on the scheduling latency by the date at which the job was submitted and the CE on which the job became executed. The results are presented in the following sections.

Overall scheduling latency

Figure 4.1 shows the Globus scheduling latencies that were measured during the experiments. It can clearly be seen that, with the exception of a few outliers, by far most jobs become executed after a scheduling delay below 10 seconds. The mean latency value of all measurements is 3.9 seconds. The median value of the scheduling times is 0. This implies that in the used Austrian Grid configuration with the Globus middleware one can expect that jobs get scheduled almost instantly. Note that in this case 0 means that the job was executed before or at the same time as the first middleware notification, i.e. the wrapper script callback arrived before the middleware could notify the testbench of the job's submission status. The histogram in figure 4.2 shows the distribution of scheduling latencies without extreme outliers. It can be seen that a scheduling latency of 0 is by far the most probable. The remaining scheduling latencies between 1 and 10 seconds are approximately equiprobable. Hence, little surprisingly, the cumulative distribution function (CDF) of the latency values starts at a high value at 0 and is light-tailed. Figure 4.3 contains this CDF along with the CDFs of other well-known distributions evaluated at the same mean and standard deviation. Truncated Gaussian refers to a normal distribution centred at the mean of the measured data, but without negative values, which cannot occur in a scheduling process. The latencies could probably be modelled approximately by a mixture of the log-normal and the exponential distribution. It can be seen that our claim of light-tailed behaviour is confirmed by the decay rate, which is faster than that of the exponential model. About 79% of all job submissions become executed at their compute element without any noticeable latency and over 90% may consume the required resources within 7 seconds.

Figure 4.1: Globus scheduling latencies. Panels: (a) Globus scheduling latencies; (b) Globus scheduling latencies (zoomed). Scheduling latencies (ms) against submission date, with median and mean marked.

Figure 4.2: Globus scheduling histogram (extreme outliers truncated).

Figure 4.3: Globus scheduling latency CDF. Empirical CDF of the scheduling latency (sec) compared with exponential, log-normal and truncated Gaussian CDFs.

Figure 4.4: Globus scheduling times per day of the week. Scheduling times (millisec) per weekday.

All latencies higher than 12 seconds can be considered outliers. Since the distribution of the scheduling latency tail roughly approximates an exponential distribution, setting a timeout value in a timeout and resubmission strategy necessarily implies the loss of jobs with a certain probability. However, by taking the last non-outlier value of 12 seconds, the probability of terminating a job that will not be stuck indefinitely is already lower than 2%. Since in the AustrianGrid higher latencies form not only outliers but extreme outliers (greater than three times the interquartile range above the upper quartile), it would be a good strategy to resubmit jobs after a 12 second timeout.

Scheduling latency with respect to execution date

Grouping the results according to the day of the week at which the job was scheduled exposes no significant differences (figure 4.4). Applying the Kruskal-Wallis test yields the result 4.7 × 10^-5, which testifies to a different distribution on the daily temporal scale. However, the millisecond scale is of little practical relevance when considering the latency as it is perceived by the user. Rounding the values to seconds and reapplying Kruskal-Wallis still yields 0.0141, which is below the significance level. However, since large samples are more likely to produce smaller p-values and the median is the same for all days, we conclude that there are weekday differences in the scheduling behaviour, but that they are negligible from a practical perspective.

Figure 4.5: Globus scheduling times per daytime periods. Box plots of scheduling latencies (millisec) for the periods 00-08, 08-16 and 16-24.

There is no visible correlation between the time period of the day and the scheduling latency. Scheduling latency is not lower at typical work periods as compared to typical spare time periods (figure 4.5). The Kruskal-Wallis test result does not contradict that statement.

Scheduling latency with respect to CE

Finally, we have evaluated the scheduling latency properties of the different queues. Figure 4.6 shows box plots of those latencies, with and without extreme outliers respectively. The sites seem to exhibit specific patterns. Only altix1.jku and hydra.gup show larger latency outliers, while all jobs were scheduled without noticeable delay on schafberg. blade.labs didn't produce any outliers, but its scheduling times seem to be more uniformly distributed in the interval between 0 and 16 seconds, with a median of about 4 seconds. Intuitively, different queue load balancing and scheduling managers seem to cause different latencies. schafberg uses the operating system to spawn a new process for each arriving job (jobmanager-fork), which explains the very low scheduling latencies. altix1.jku and hydra.gup both use the PBS (Portable Batch System) load-balancer, while blade.labs schedules arriving jobs with SGE (Sun Grid Engine). A survey of these and other cluster schedulers was conducted by Etsion and Tsafrir [37]. According to that survey, the default behaviour of a PBS queue is a shortest-job-first scheduling policy, while SGE schedules with a simple first-come first-served policy. This explains the low median scheduling times at the PBS queues, since our job was fairly short compared to other Grid workloads. Hence, the scheduling overhead in Globus seems to depend more on the remotely deployed load balancer than on the middleware. However, deriving detailed results with respect to different load-balancing products would require more in-depth investigation.

4.2.3 gLite

In addition to the evaluations we have conducted on the Globus measurements, we had the opportunity to evaluate gLite's behaviour on the same CE with different configuration options. Our findings are described in the following sections, along with possible influences by daytime, weekday and used CE.

Overall scheduling latency

With the gLite middleware and the used configuration, the scheduling latency is considerably higher than with the Globus middleware. The mean scheduling latency is 135 seconds and the median latency in gLite is 91 seconds. Figure 4.7 shows the distribution of the scheduling latency values. Over 90% of the jobs are scheduled within 200 seconds, as can be seen on the empirical cumulative distribution function in figure 4.8. When comparing the gLite scheduling CDF to the exponential function, it becomes obvious that the higher latency values of gLite form a heavy tail, since the distribution decays more slowly for higher values than the respective exponential distribution. The third quartile is 120 seconds, the IQR is 77 seconds and the upper whisker is 235.5 seconds. Therefore jobs taking more than 351 seconds (third quartile + 3 * IQR) before getting scheduled must be considered extreme outliers and become subject to resubmission. The resulting losses of up to 5% of otherwise completing jobs must be considered a good tradeoff, given the heavy tail and the corresponding low probability that the respective jobs will still be scheduled in what the user may perceive as a short time after that waiting period. Feeding our data into the timeout model yields 209 seconds as the optimal timeout value and confirms the observation of the authors that higher timeout values penalize the execution time behaviour less than aggressive ones (figure 4.9).

Figure 4.6: Scheduling latency box plots per queue. Panels: (a) latency box plot; (b) latency box plot (zoomed). Scheduling latency per Grid site (hydra.gup, blade.labs, altix1.jku, schafberg).

Figure 4.7: gLite scheduling latencies. Scheduling latencies (sec) against submission date, with median and mean marked.

Figure 4.8: gLite scheduling latency CDF. Empirical CDF of the latency values (sec) compared with exponential, log-normal and truncated Gaussian CDFs.

Figure 4.9: Expectation of scheduling time (y-axis) against scheduling timeout value (x-axis).

Scheduling latencies with respect to execution date

In the context of our measurement results, the variations in scheduling latency with respect to different weekdays and different daytimes seem marginal. Figures 4.10 and 4.11 show the respective boxplots. The Kruskal-Wallis test for the daytime period groups returns a p-value of 0.29, hence the hypothesis that the sample distributions are equal is not rejected by the test. The p-value obtained from the weekday grouping is 0.0048, which is below the 5% significance level. This rejects the null hypothesis that the distributions are equal. However, we consider the absolute differences based on submission date to be minimal when compared to other influences, such as the type of CE. Hence, we conclude that the scheduling times in EGEE are rather insensitive to the submission date.

Scheduling latencies with respect to CE

In figure 4.12 we use the abbreviations listed in table 4.1 for the different queues to which our job was submitted by the WMS. The queues show significantly different scheduling latencies. A clear pattern can be seen with respect to the deployed queue load-balancer. 8 queues are managed by PBS and 7 queues by the LCG-PBS load-balancer. While queues configured with PBS scheduled jobs with a small median latency but a high latency variance, the opposite holds for queues configured with the LCG-PBS system, where jobs were scheduled with a high median latency but a small latency variance.

Figure 4.10: gLite scheduling times per day of the week. Box plots, scheduling latency (sec, logarithmic scale).

Figure 4.11: gLite scheduling times per daytime period. Box plots for the periods 00-08, 08-16 and 16-24, scheduling latency (sec, logarithmic scale).

SRCE      ce1-egee.srce.hr:2119/jobmanager-sge-prod
NIIF      egee-ce.grid.niif.hu:2119/jobmanager-pbs-voce
LINZ      egee-ce1.gup.uni-linz.ac.at:2119/jobmanager-pbs-voce
ELTE      eszakigrid66.inf.elte.hu:2119/jobmanager-lcgpbs-voce
POZNAN    ce.reef.man.poznan.pl:2119/jobmanager-pbs-voce
BME       ce.hpc.iit.bme.hu:2119/jobmanager-lcgpbs-long
IRB       egee.irb.hr:2119/jobmanager-lcgpbs-grid
CYF       ce.cyf-kr.edu.pl:2119/jobmanager-pbs-voce
AMU       pearl.amu.edu.pl:2119/jobmanager-lcgpbs-voce
WROC      dwarf.wcss.wroc.pl:2119/jobmanager-lcgpbs-voce
SAVBA     ce.ui.savba.sk:2119/jobmanager-pbs-voce
KFKI      grid109.kfki.hu:2119/jobmanager-lcgpbs-voce
CESNET    ce2.egee.cesnet.cz:2119/jobmanager-pbs-egee voce
TUKE      ce.grid.tuke.sk:2119/jobmanager-pbs-voce
IJS       lcgce.ijs.si:2119/jobmanager-pbs-voce
OEAW      hephygr.oeaw.ac.at:2119/jobmanager-lcgpbs-voce

Table 4.1: Abbreviations used in gLite per queue boxplot

Figure 4.12: gLite scheduling latency per queue. Box plots per queue (SRCE, NIIF, LINZ, ELTE, POZNAN, BME, IRB, CYF, AMU, WROC, SAVBA, KFKI, CESNET, TUKE, IJS, OEAW, unknown), seconds on a logarithmic scale.

Among the queues on which our jobs were scheduled, there is one queue managed by the SGE load balancer (SRCE), which shows a low median scheduling latency and a high latency variance, but of course no conclusions can be drawn from the measurement results of just one queue. From the perspective of timeout and resubmission strategies, low latency variance is more desirable than lower median latency values, because the system behaviour becomes more predictable. Hence, if not many time-critical small jobs or complex workflows are submitted to EGEE, the LCG-PBS queues perform better. LCG-PBS is a PBS wrapper that increases PBS scalability by allowing the monitoring of all jobs from a user through the same resource broker. The reason for this fine-tuning of PBS is the need for the higher scalability inherent in a meta-scheduling architecture such as gLite [38]. Using the adapted version of PBS seems to be advantageous in terms of scheduling latency.

Scheduling latencies with respect to CE configuration

Patryk Lasoń, the administrator of the CYF (cyfronet) CE, has kindly provided us with the opportunity to run our test with different CE load balancer configuration options. The CYF CE is configured with the Maui cluster scheduler [39]. Maui is available as a pluggable queue to different load balancers such as PBS or LSF and uses them for resource management, while providing Maui-specific functionality such as reservations and policies to the outside. Maui has to poll PBS for node information. Maui's scheduling system is based on a priority mechanism, which runs and makes reservations for the highest-priority jobs and tries to fit lower-priority jobs into gaps in the reservation system. The interval in which Maui polls PBS is determined by the RMPOLLINTERVAL configuration option. We have measured the performance of the Maui-managed cyfronet queue with an RMPOLLINTERVAL setting of 60 seconds and later with a setting of 5 seconds. The resulting boxplots can be seen in figure 4.13. The median of the results obtained with the 5 second RMPOLLINTERVAL is 127 seconds, while the median of the 60 second RMPOLLINTERVAL measurement is 135 seconds. This marginal improvement comes at the cost of a much higher latency variance. The variance in the first sample is 23905 sec², while the second sample shows a variance of 196430 sec². These variances correspond to standard deviations of 155 and 443 seconds respectively. Due to the larger variance, the mean value of the queue with the small poll interval is also considerably larger: the mean scheduling latency on the CE with the more frequently updated scheduler is 287 seconds, whereas it is only 157 seconds for the CE with less frequent updates.

Figure 4.13: gLite scheduling latency with different Maui RMPOLLINTERVAL settings. Box plots for RMPOLLINTERVAL=60s and RMPOLLINTERVAL=05s, scheduling time (sec, logarithmic scale).

The reason for the higher mean value becomes obvious when one views the CDF in figure 4.14. With the shorter poll interval comes a higher share of jobs with very long scheduling times. Based on the collected data, 90% of all jobs become scheduled within less than 250 seconds with the 60 second RMPOLLINTERVAL setting. With the 5 second setting, however, one would have to wait about 600 seconds until a job is scheduled with 90% probability. About one third of the submitted jobs has to wait considerably longer for a computational resource with the short setting than with the long setting. One possible explanation for this is that with the smaller poll interval, Maui can make reservations for jobs with high priority more often and run them at a higher frequency. Thus, jobs with lower priority waiting in the CE queue may starve because there are fewer time gaps to be filled by the Maui backfill mechanism.

4.3 Notification overhead

Notification overhead denotes all kinds of overhead introduced by the reporting capabilities of a Grid middleware. Sometimes we use the terms middleware overhead, notification delay and reporting overhead, which can all be considered semantically equivalent in the following sections.

4.3 Notification overhead Notification overhead denotes all kinds of overhead introduced by the reporting capabilities of a Grid middleware. Sometimes, we use the terms middleware overhead, notification delay and reporting overhead, which can all be considered as semantically equivalent in the following sections.

Figure 4.14: CDF of gLite scheduling latency with different Maui poll intervals. Empirical CDFs for RMPOLLINTERVAL=60s and RMPOLLINTERVAL=05s.

We have measured two different types of notification overhead. The first type of notification overhead occurs between the actual start of a job's execution and the middleware reporting it to the end user. For the sake of simplicity, we are going to call this type of overhead "pre-execution" overhead. Although this is not strictly an exact term, because the overhead actually occurs while the job is already executing, it nevertheless gives a good intuitive notion of which type of overhead it denotes. Consequently, the second type of overhead, which occurs between the completion of a job and the middleware reporting about it, is going to be called "post-execution" overhead.

4.3.1 SSH

We have evaluated a sample of 132 jobs that were executed on the different desktop machines they were submitted to. The overhead between the RUNNING notification and the actual execution could not be measured, because the SSH JavaGAT adaptor did not consistently report the RUNNING state. The post-execution overhead is 787 milliseconds in the median and 885 milliseconds in the mean. The variance is 2140800 millisec², and thus the standard deviation is 1463 milliseconds. All values already include potential overhead added by the JavaGAT API. Hence, middleware notification overhead in SSH, when compared to typical payload execution times, is below noticeability. However, it must be considered that SSH by itself is not a middleware in the common sense of the term.


4.3.2 Globus

Out of 2098 submissions, 306 jobs needed to be forcibly terminated after the timeout period, which corresponds to a ratio of jobs without excessive waiting times of about 85%.

Overall IS overhead

Normally, it could be expected that pre-execution and post-execution overheads converge to the same value with a large number of samples. We have collected a sample of 1368 jobs for pre-execution overhead and 1417 jobs for post-execution overhead. The different numbers stem from the fact that sometimes notifications, either middleware-generated or wrapper-script callbacks, get dropped due to firewall issues or network errors. The collected data contradicts the above assumption that both information service overheads are approximately equal. The delay between the actual job start and the RUNNING notification is about 6 seconds in the mean and 5 seconds in the median, while after job completion the mean and median time until notification is about 11 seconds. No significant outliers can be found in the observed notification overhead times, as displayed in figure 4.15. Thus, the Globus information service reports job start faster than job completion. As displayed in figure 4.16, not only the mean and median of the notification overhead times differ between pre- and post-execution information service messages, but also the distribution of frequencies. With the exception of the spike at 6 seconds, the post-execution notification overhead follows a Gaussian distribution and has a variance of 23 sec². The pre-execution notification delays, on the other hand, are more bursty, but show a lower variance of 9.2 sec². Both samples can be fitted fairly well by a truncated Gaussian distribution, as confirmed by the CDFs in figure 4.17. The Kolmogorov-Smirnov test rejects that assumption for the pre-execution overhead (which is bursty in parts), but does not reject it for the post-execution overhead at the 5% significance level. Since both overheads are more or less normally distributed and don't contain outliers, in a timeout and resubmission strategy the timeout value is best set to infinity (i.e. no timeouting of jobs should take place).

IS overhead with respect to the CE

For the post-execution overhead, it must be noted that the state is set to POST STAGING as soon as execution end is detected; GridFTP is initiated after reporting the state change by JavaGAT and not by Globus. However, the fact that the histogram shows a spike points at the influence that CE reporting polling intervals might have.

Figure 4.15: Middleware notification overhead in Globus. Panels: (a) time between actual execution and RUNNING notification; (b) time between completion and POST STAGING notification. IS overhead (millisec) against submission date, with median and mean marked.

Figure 4.16: Histograms of middleware notification overhead in Globus. Panels: (a) histogram of latencies between actual execution and RUNNING notification; (b) histogram of latencies between completion and POST STAGING notification.

Figure 4.17: CDFs of middleware notification overhead in Globus. Panels: (a) CDF of latencies between actual execution and RUNNING notification; (b) CDF of latencies between completion and POST STAGING notification. Empirical CDFs compared with exponential, log-normal and truncated Gaussian CDFs.

Consequently, the differences could result from GRAM's job monitoring mechanisms (and the polling intervals) or from properties of the underlying load-balancing systems that are polled by GRAM. However, unlike in section 4.2.2, no clear patterns with respect to the load-balancing system can be observed, because the two PBS sites behave differently (figure 4.18). The information service overhead is highest for the SGE system, but this can also be due to the local GRAM installation of the site. In the measured sample, information service overhead can be as high as 15 seconds even on the system with a pure fork load-balancer. Interestingly, GRAM can deliver most running notifications within 3 respectively 4 seconds on two of the four systems, which intriguingly run different schedulers. Hence, without a larger sample regarding the load-balancers/schedulers, no exact conclusions can be drawn.

4.3.3 gLite

Out of 2655 jobs, 689 had to be terminated because they didn't finish within 2700 seconds. Accordingly, based on the collected data, the probability that a gLite job submission will not end up waiting in a state indefinitely is approximately 74%.

Overall information service overhead in gLite

The observation from section 4.3.2, that the notification overhead is different for pre- and post-job execution notifications, also holds for gLite. Figure 4.19 shows the overheads for pre- and post-execution notifications. As in Globus, the post-execution overhead is higher than the pre-execution overhead. However, now the overhead difference lies in the dimension of a factor of ten. The median value for pre-execution overhead is 9 seconds and the mean value 28 seconds. For post-execution overhead, these values are 198 and 209 seconds respectively. The variance is also larger for post-execution notifications: analysing the gathered data yields a variance of 2078 sec² for pre-execution overhead and 17352 sec² for post-execution overhead. The histogram of the pre-execution overhead is rather unsurprising, with a global maximum around the median, but smaller local maxima up to the highest measured delay. The histogram of the post-execution overhead shows two spikes around 200 and 300 seconds with a local minimum in between. Both histograms are depicted in figure 4.20.

Figure 4.18: Middleware notification overhead by Globus site. Panels: (a) time between actual execution and RUNNING notification; (b) time between completion and POST STAGING notification. IS overhead (sec) per Grid site (blade.labs, altix1.jku, schafberg, hydra.gup).

Figure 4.19: Middleware notification overhead before and after job execution in gLite. Panels: (a) time between execution start and RUNNING notification; (b) time between completion and DONE notification. IS overhead (sec, logarithmic scale) against submission date, with median and mean marked.

Figure 4.20: gLite IS overhead histograms. Panels: (a) histogram of the pre-execution gLite IS overhead; (b) histogram of the post-execution gLite IS overhead.

Figure 4.21: gLite IS overhead CDFs. Panels: (a) pre-execution gLite IS overhead CDF; (b) post-execution gLite IS overhead CDF. Empirical CDFs compared with exponential, log-normal and truncated Gaussian CDFs.

Figure 4.22: Expected notification times (y-axis) in relation to timeout values.

Figure 4.21 contains a visualisation of the respective CDFs. Neither distribution fits any well-known distribution. Among the distributions tried, but not shown on the CDF, were also a Pareto-tailed truncated Gaussian distribution and a generalised Pareto distribution. The tail of the pre-execution overhead vanishes around the maximum value of 166 seconds, so a conservative approach in a timeout and resubmission scenario would be to set the timeout to infinity. This is also the most reasonable value, since picking the upper whisker or the 0.75 quantile + 3*IQR would result in a massive loss of jobs that would otherwise become scheduled after a relatively short additional time interval. Applying the timeout model to the samples, on the other hand, yields an optimal timeout value of 14 seconds, resulting in an estimated average reporting time of about 12 seconds. The post-execution overhead has a long tail and therefore it makes sense to set a finite timeout value. Heuristically, timeouts could be 379 or 509 seconds, i.e. the values that form the upper whisker and the lower boundary for extreme outliers respectively. The timeout model returns an optimal value of 316 seconds. As can be seen in figure 4.22, setting too small timeout values heavily penalizes execution behaviour. The bursty post-execution overhead with its long absolute delay may originate from the hierarchical reporting system deployed within gLite's logging and bookkeeping mechanism (section 2.3.2), which necessitates multi-layered polling/pushing for state notification propagation.


Information service overhead with respect to CE

Apart from the insight that gLite's hierarchical logging system causes large information service overhead and overhead burstiness, it was within the scope of our interest to quantify the influence of the used CE respectively queue load balancer. For Globus we have shown that there are strong indications for an influence by those components. Analysis of the pre-execution middleware overhead per queue reveals significant differences. It can be seen on the box plots displayed in figure 4.23 that especially the overhead for pre-execution notification exhibits a strong queue dependence. CESNET and ELTE are obvious overshoots and testify to the CE's significance regarding IS overhead. Furthermore, the influence of the specific load balancer is an interesting aspect. Queues managed by LCG-PBS have a median pre-execution reporting overhead of 8 seconds, a mean overhead of 40 seconds and a variance of 3346 sec². All values are lower for PBS, the median being 3 seconds, the mean 11 seconds and the variance 701 sec². According to the t-test, the samples from the two load balancers are modelled by different distributions. Therefore, the collected data allows the interpretation that the information service overhead between the actual running state and the RUNNING notification mainly originates from the used scheduler/load-balancer and the CE configuration. The LCG enhancement to PBS, which allows the central resource broker to monitor all submitted jobs, seems to negatively influence IS overhead. Basically, the same holds for post-execution IS overhead. Queues managed by LCG-PBS report job completion with a higher mean (270 vs. 169 seconds) and median latency (250 vs. 168 seconds) than PBS and vary more in the expected notification duration (the variance for LCG-PBS is 28843 sec², whereas it is 5041 sec² for PBS). The t-test considers the LCG-PBS and PBS samples to be differently distributed. However, unlike in the pre-execution IS overhead measurements, the median overhead value is greater than 100 seconds for all queues equally, which strongly suggests that the hierarchical multi-stage logging process in gLite's L&B mechanism may cause parts of the overall delay.

4.4 Staging overhead

Poststaging includes copying stdout and stderr messages and the output of the Sieve of Eratosthenes algorithm. We only copied back a stdout file, typically of the size of a few hundred bytes, which is far smaller than what the used networks can transfer per second. Hence, the measured delays quantify the overhead imposed by the involved middlewares in the post-staging and clearing phases.

Figure 4.23: gLite IS overheads per queue. Panels: (a) box plot for pre-execution gLite IS overhead per queue; (b) box plot for post-execution gLite IS overhead per queue (seconds, logarithmic scale). Queues: SRCE, NIIF, LINZ, ELTE, POZNAN, BME, IRB, CYF, AMU, WROC, SAVBA, KFKI, CESNET, TUKE, IJS, OEAW, unknown.

Clearing usually involves deletion of the sandbox files and directories.

4.4.1 Globus

The Globus post-staging process exposes, as shown in figure 4.24, relatively low absolute latency values, with a median of 6 and a mean of 7.3 seconds, and a low variance of 12 sec². However, the tested sites behave differently, which can be seen on the boxplots in figure 4.25. blade.labs has a near-to-zero variance, while hydra.gup, altix1.jku and schafberg have a larger variance, with many outliers showing up at altix1.jku. The post-staging latency is not so much dependent upon the middleware, since it is exclusively dependent on the speed of the GridFTP transactions. GridFTP has some overhead compared to FTP, which originates from the need for GSI authentication. However, the performance of blade.labs shows that file transfer within Globus environments can be quite a deterministic process in terms of latency. The variance of the other sites is also comparatively low and may have its cause in network effects.

4.4.2 gLite

We have no post-staging results for gLite, but since the post-staging process is exactly the same as in Globus, with the WMS acting as the remote GridFTP endpoint, it can be expected that the results for post-staging will be quite similar.

4.5 Comparison

The values that we have presented in our analysis for the different middlewares are summarised in table 4.2. The optimal timeout values correspond to numbers obtained with the timeout model shown in equation 4.1 for our samples. All results computed with the model show that overestimating the timeout value does not have as negative an impact as underestimating it. The overhead of SSH with respect to scheduling and to information service overhead is close to 0, which is little surprising, since SSH only takes care of forwarding commands and data (via scp), but apart from that it does not include any of the functionalities of Grid middlewares, such as meta-scheduling, matchmaking, job state tracking and queueing. The queue is just a simple fork, which creates a new process that becomes subject to operating system scheduling. Hence, there is also close-to-zero scheduling overhead because jobs may start instantly. The values measured in SSH are merely a base reference which assesses that possible overhead introduced by JavaGAT and the testbench is negligible when compared to the measured values themselves.

Figure 4.24: Post staging overhead in Globus. Panels: (a) latencies in the Globus post staging process; (b) latencies in the Globus post staging process (zoomed). Latency values (sec) against submission date (not to scale), with median and mean marked.

Figure 4.25: Data staging latencies per Globus site. Poststaging latency (sec, logarithmic scale) for hydra.gup, blade.labs, altix1.jku and schafberg.

                                           SSH      Globus            gLite
Scheduling
  Mean (sec)                               0.04     3.9               135
  Median (sec)                             0        0                 91
  Variance (sec²)                          0.008    1802              74004
  Empirical distribution                   mix      mix               mix
  Outlier ratio                            2.5%     19%               6%
  Optimal timeout (sec)                    0        0                 209
  Daytime influence                        no       no                no
  Weekday influence                        no       minimal           small
  CE influence                             -        large             large
Pre-exec. IS-overhead
  Mean (sec)                               -        6.2               28
  Median (sec)                             -        4.6               9
  Variance (sec²)                          -        9.2               2078
  Empirical distribution                   -        Trunc. Gaussian   mix
  Outlier ratio                            -        0%                21%
  Optimal timeout (sec)                    -        ∞                 14
  CE influence                             -        large             large
Post-exec. IS-overhead
  Mean (sec)                               0.9      11                209
  Median (sec)                             0.8      10                198
  Variance (sec²)                          2.1      22                17352
  Empirical distribution                   mix      Trunc. Gaussian   mix
  Outlier ratio                            1.5%     0%                1.4%
  Optimal timeout (sec)                    1.6      ∞                 316
  CE influence                             -        large             large

Table 4.2: Summary of the overhead properties of all middlewares


4.5.1 Scheduling times in Globus and gLite

The more interesting middleware is Globus, because it contains the basic components of a middleware: most importantly with respect to the measurements, queueing mechanisms on the CE, which influence scheduling times, a central resource broker which performs matchmaking, and a central interface (GRAM) on the CE for job status tracking. The scheduling times may vary up to 12 seconds within a normal execution trace, but are typically close to 0. Timeouting should be performed nevertheless, since outliers may occur. We have shown that the times depend heavily on the deployed load-balancer and probably very much on the configuration of the queue itself. In our experiments, the shortest-job-first scheduling policy favours the average execution of our job but can also produce outliers, while the scheduling times obtained at the first-come first-served queue are longer in the median, but also resistant to outliers. These results may look different with other workloads. Compared to the influence of the CE, the influence of the date at which we submitted our jobs was minimal. The mean scheduling latency in gLite is 33 times the value of Globus and the median latency is much higher as well. On an absolute scale, the latencies are thus much higher than in Globus, but it must be kept in mind that the scale of the AustrianGrid is by dimensions smaller than the scale of the EGEE infrastructure, where multiple VOs access shared queues and the meta-scheduling has a larger search space. Furthermore, at least one queue was governed by a long-job scheduling policy, which is predestined to discriminate against our relatively short-running job. However, it must be said that the fraction of outliers produced by gLite casts a damning light on the middleware, the schedulers or the middleware/scheduler combination. The corresponding latency distribution is heavy-tailed. Outliers occur, as they do in Globus as well, but the distribution in gLite cannot be fitted with a faster-than-exponentially decaying function. Hence, with high probability there is a noticeable fraction of jobs one must consider to be outliers. Exactly as in Globus, the latencies are queue- and queue-configuration-dependent and do not vary drastically among different weekdays or daytime periods.

4.5.2 Information service overhead in Globus and gLite

Globus IS overhead roughly follows a truncated Gaussian distribution and is outlier-free. Hence, the delivery of the respective status update messages can be considered reliable in Globus. gLite shows outliers in its overhead, both in pre-execution and post-execution delays. Especially in the post-execution overhead, the outliers form a long tail. Therefore it must be said that gLite cannot guarantee reliable status updates and that waiting for job completion must be timeouted in practice. It is questionable whether gLite's hierarchical logging and bookkeeping process is optimal for reporting. The different poll intervals in the stages of this process manifest themselves as local maxima in the histograms of the overhead measurements. For both Globus and gLite, the post-execution notification overhead is larger than the pre-execution overhead. Since the overhead changes with the picked CE, it can be expected that some load-balancers are slow in registering job completion. We could not identify a clear load-balancer dependency for the four tested Globus sites, but for gLite it is quite obvious that the LCG-PBS scheduler performs worse than the PBS scheduler. LCG-PBS attempts to enhance scalability by allowing the resource broker to track all jobs submitted by the same user. This suggests that the worse performance of LCG-PBS stems from these decentralisation efforts. Furthermore, the post-execution notification delay in gLite is much more disproportionate compared to the pre-execution delay than in Globus. While in Globus the difference between the medians is about a factor of 2, in gLite the difference is more than a factor of 20. Accordingly, in direct comparison to Globus, gLite does a bad job in reporting job completion, even in consideration of the larger scale of EGEE. Since the logging process is the same for pre-execution and post-execution reporting, the conclusion remains that the interaction between the local resource management software and the CE-deployed job status monitor is suboptimal.


Chapter 5

Conclusions

In this thesis we presented an overview of the different Grid-enabling technologies SSH, Globus and gLite, of which Globus and gLite are fully-fledged Grid middlewares. Our interest in analysing the different delay components summing up to the total Grid delay motivated the construction of a comprehensive testbench, which not only measures total execution times or scheduling times, but also the overhead introduced by the middleware reporting mechanisms. To attain that goal, we first strove for the ability to use a single library as the basic Grid-enabling API for our testbench. We picked the Grid middleware API wrapper JavaGAT as the uniform Grid access API for our testbench and consequently introduced the basic concepts and functionalities of JavaGAT in the thesis. To conduct tests with the gLite middleware, we enhanced and improved an existing gLite adaptor for JavaGAT to fit our needs. The thesis describes the improvements made on the original work and how they can be of assistance to users in productive use. Problems with the library management of the adaptors are addressed and explained, and the architecture of the adaptor is outlined. From there, we introduced the testbench which served as the tool for our observations. We described the mechanisms involved in measuring and gave a short overview of the testbench's architecture. Finally, we presented the experimental results with detailed measurement evaluations for SSH, Globus and gLite with respect to scheduling and notification delay. We have concluded that scheduling overhead and IS overhead depend highly on the remotely deployed local resource management system, but not too much on the submission date. Additionally, we found that gLite has a larger overhead and a larger overhead variance than Globus concerning scheduling as well as middleware notifications. The post-execution overhead in gLite represents a particularly strong disturbance in the job execution cycle, since it introduces high absolute latencies and the corresponding distribution is heavy-tailed with outliers. Hence it forces one to set timeouts for resubmission on the waiting times for status notification of the job's completion, which should not be necessary otherwise. Therefore we have to say that the hierarchical reporting system in gLite, with the CE's monitor forwarding information received from the LRMS to a multi-staged L&B system, leaves room for improvement. Due to the fact that our testbench identifies latencies at all stages of execution, it is possible to timeout and resubmit the job at nearly every execution state if reasonable waiting times are exceeded. We presented such timeout values, using both intuitive considerations and the latency model introduced in [17]. With our testbench, new data on the execution behaviour can be gathered every time a middleware environment improves significantly with respect to the perceived delays, as has frequently happened in EGEE in the last few years. Due to the modular design of the testbench and the fact that only very little code is middleware-dependent, it should be easy to create new testcases on demand.


Appendix A

gLite Documentation


Glite Resource Broker Adapter Readme
Thomas Zangerl
January 7, 2009

Contents

1 VOMS-Proxy Creation
  1.1 Frequent errors
    1.1.1 "Unknown CA" error
    1.1.2 "Error while setting CRLs"
    1.1.3 "pad block corrupted"
    1.1.4 Could not create VOMS proxy! failed: null:
  1.2 Preference keys for VOMS Proxy creation
  1.3 Minimum configuration to make VOMS-Proxy creation work
  1.4 (Not) reusing the VOMS proxy

2 The gLite-Adaptor
  2.1 Adaptor-specific preference keys
  2.2 Supported SoftwareDescription attributes
  2.3 Supported additional attributes
  2.4 Setting arbitrary GLUE requirements

1 VOMS-Proxy Creation

1.1 Frequent errors

1.1.1 "Unknown CA" error

The proxy classes report an "Unknown CA" error (Could not get stream from secure socket). Probably the VomsProxyManager is missing either your root certificate or the root certificate of the server you are communicating with. It is best to include all needed certificates in ~/.globus/certificates/ (e.g. you can copy the /etc/grid-security/certificates directory from a UI machine of the VO you are trying to work with to that location). If this doesn't suffice, you should try to include a file called cog.properties in the ~/.globus/ directory. The content of this file could be something like this:

  #Java CoG Kit Configuration File
  #Thu Apr 05 15:59:23 CEST 2007
  usercert=~/.globus/usercert.pem
  userkey=~/.globus/userkey.pem
  proxy=/tmp/x509up_u
  cacert=/etc/grid-security/certificates/
  ip=

Also check the vomsHostDN preference value for typos/errors.

1.1.2 "Error while setting CRLs"

Try to create a cog.properties file in ~/.globus and set the cacert property to ~/.globus/certificates. This will cause the VOMS proxy classes not to look for CRLs in /etc/grid-security/certificates.
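For illustration, such a cog.properties could look like the following; the certificate paths are examples and have to match your own setup.

  #Java CoG Kit Configuration File
  usercert=~/.globus/usercert.pem
  userkey=~/.globus/userkey.pem
  cacert=~/.globus/certificates/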

1.1.3 "pad block corrupted"

Check whether you have given all necessary information (password, host DN, location of your user certificate etc., see section 1.3) as global preferences of the GAT context. If you have given all necessary information, check the password you have given for typos.

1.1.4 Could not create VOMS proxy! failed: null:

See the answer in section 1.1.3; maybe you forgot to specify the vomsServerPort preference value.


Key                   Description                                                             Example                            Required
vomsHostDN            distinguished name of the VOMS host                                     /DC=at/DC=uorg/O=org/CN=somesite   compulsory
vomsServerURL         URL of the VOMS server, without protocol                                skurut19.cesnet.cz                 compulsory
vomsServerPort        port on which to connect to the VOMS server                             7001                               compulsory
VirtualOrganisation   name of the virtual organisation for which the VOMS proxy is created    voce                               compulsory
vomsLifetime          the desired proxy lifetime in seconds                                   3600                               optional

Table 1.1: Necessary preference keys for VOMS proxy creation

1.2 Preference keys for VOMS Proxy creation

The necessary preference keys are shown in table 1.1. Additionally you need a CertificateSecurityContext which points to your user certificate and your user key. Add that CertificateSecurityContext to the GATContext. With the compulsory preferences, the proxy is created with a standard lifetime of 12 hours. If you want to have a different lifetime, add the optional vomsLifetime preference key. Do something like

  GATContext context = new GATContext();
  CertificateSecurityContext secContext =
      new CertificateSecurityContext(new URI(your_key), new URI(your_cert), cert_pwd);
  Preferences globalPrefs = new Preferences();
  globalPrefs.put("vomsServerURL", "voms.cnaf.infn.it");
  ...
  context.addPreferences(globalPrefs);
  context.addSecurityContext(secContext);
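Putting the compulsory keys from table 1.1 together, a complete context setup might look like the following sketch. The DN, server name, port and VO are the example values from table 1.1 and must be replaced with the values of your own VO; your_key, your_cert and cert_pwd are placeholders for your credential locations and key password, as in the snippet above.

  GATContext context = new GATContext();

  // security context pointing to the user's key and certificate
  CertificateSecurityContext secContext =
      new CertificateSecurityContext(new URI(your_key), new URI(your_cert), cert_pwd);

  Preferences globalPrefs = new Preferences();
  // compulsory preferences (example values, see table 1.1)
  globalPrefs.put("vomsHostDN", "/DC=at/DC=uorg/O=org/CN=somesite");
  globalPrefs.put("vomsServerURL", "skurut19.cesnet.cz");
  globalPrefs.put("vomsServerPort", "7001");
  globalPrefs.put("VirtualOrganisation", "voce");
  // optional: request a one hour proxy instead of the 12 hour default
  globalPrefs.put("vomsLifetime", "3600");

  context.addPreferences(globalPrefs);
  context.addSecurityContext(secContext);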

1.3 Minimum configuration to make VOMS-Proxy creation work

• A cog.properties file with lines as in section 1.1.1.
• The following global preferences set in the GAT context:
  – vomsHostDN
  – vomsServerURL
  – vomsServerPort
  – VirtualOrganisation

1.4 (Not) reusing the VOMS proxy

If multiple job submissions to the same VO happen within a small time interval, it is not necessary to create a new VOMS proxy for every submission. Hence, if the user sets the system property glite.newProxy to false (or doesn't set it at all), the system will check whether a valid VOMS proxy exists. If such a proxy is found, it is determined whether the proxy lifetime exceeds the one specified in the vomsLifetime property. If the vomsLifetime property is unspecified, it is checked whether the proxy is still valid for more than 10 minutes.

Very frequently, all job submissions of an application will happen within the same VO. If there is a need to submit jobs to multiple VOs, it won't be possible to reuse the old proxy, due to the VO-specific attribute certificates stored in the proxy file.
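The reuse decision described above can be summarised by the following sketch. The method and its parameters are purely illustrative and do not correspond to actual adaptor classes; only the glite.newProxy system property and the 10 minute threshold are taken from the behaviour described in this section.

  // illustrative sketch of the proxy reuse decision, not actual adaptor code
  boolean mustCreateNewProxy(long proxySecondsLeft, Long requestedLifetimeSecs) {
      // the user explicitly asked for a fresh proxy
      if (Boolean.getBoolean("glite.newProxy")) {
          return true;
      }
      if (requestedLifetimeSecs != null) {
          // reuse only if the existing proxy outlives the requested vomsLifetime
          return proxySecondsLeft < requestedLifetimeSecs;
      }
      // no lifetime given: reuse if the proxy is still valid for more than 10 minutes
      return proxySecondsLeft < 600;
  }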


2 The gLite-Adaptor

2.1 Adaptor-specific preference keys

The mechanisms provided by the GAT API alone did not suffice to provide all the control we found desirable for the adaptor. Hence, a few proprietary preference keys were introduced. They are useful in controlling adaptor behaviour but are by no means necessary if one just wants to use the adaptor. Nonetheless, they are documented here. To avoid confusion, all of them take strings as values, even if the name would suggest an integer or boolean. If you want to use them, set them using preferences.put(); for example write

  preferences.put("glite.pollIntervalSecs", "15");
  context.addPreferences(preferences);

The following preference keys are supported (a combined sketch follows the list):

• glite.pollIntervalSecs - how often the job lookup thread polls the WMS for job status updates and fires MetricEvents with status updates (value in seconds, default 3 seconds)
• glite.deleteJDL - if this is set to true, the JDL file used for job submission will be deleted when the job is done ("true"/"false", default is "false")
• glite.newProxy - if this is set to true, a new proxy is created even if the lifetime of the old one is still sufficient
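For illustration, the three keys could be combined as follows; the chosen values are arbitrary examples.

  Preferences preferences = new Preferences();
  // poll the WMS every 15 seconds instead of the 3 second default
  preferences.put("glite.pollIntervalSecs", "15");
  // remove the generated JDL file once the job is done
  preferences.put("glite.deleteJDL", "true");
  // force creation of a fresh VOMS proxy for this run
  preferences.put("glite.newProxy", "true");
  context.addPreferences(preferences);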

2.2 Supported SoftwareDescription attributes

The minimum supported attributes of the software description seem to be derived from the features that RSL (the Globus job submission file format) provides. Hence, they are easy to translate to RSL properties. However, the format used for gLite job submission is JDL, and attributes like count or hostCount are hard to translate to JDL. Most of the supported attributes are not even expressed in the JDL itself, but by adding GLUE requirements. Sadly, the JDL format does not provide much of the functionality covered by RSL, hence many attributes remain unsupported. On the other hand, to make use of the features that the JDL format provides in addition to the RSL format, a new attribute was introduced: set glite.retryCount to some String or Integer in order to use the retry count function of gLite.
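As a sketch of how this could be used in a job description: the executable, the retry count of 3 and the broker address placeholder your_wms_address are arbitrary examples, and the exact createResourceBroker call may differ between JavaGAT versions.

  SoftwareDescription sd = new SoftwareDescription();
  sd.setExecutable("/bin/hostname");
  // pass the gLite-specific retry count as an ordinary attribute
  sd.addAttribute("glite.retryCount", "3");

  JobDescription jd = new JobDescription(sd);
  ResourceBroker broker = GAT.createResourceBroker(context, new URI(your_wms_address));
  Job job = broker.submitJob(jd);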


2.3 Supported additional attributes

The attribute glite.DataRequirements.InputData can be set. It expects an ArrayList<String> as value, containing one or more InputData LFNs or GUIDs, which the matchmaker will use during scheduling in order to decide on which CE the job is going to be scheduled. Normally, this is a CE "close" (i.e. with low latency) to the SE holding the data.
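For instance (the LFN below is a made-up example; any LFN or GUID registered in the file catalogue of your VO could be used):

  ArrayList<String> inputData = new ArrayList<String>();
  inputData.add("lfn:/grid/voce/mydata/input.dat");  // made-up example LFN
  sd.addAttribute("glite.DataRequirements.InputData", inputData);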

2.4 Setting arbitrary GLUE requirements

If you would like to specify any GLUE requirements that are not covered by the standard set of SoftwareResourceDescription or HardwareResourceDescription keys, you may set glite.other either as Software- or HardwareResourceDescription attribute and add a *full* legal GLUE requirement as entry, such as for example

  !( other.GlueCEUniqueID == "some_ce_of_your_choice" )
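A possible sketch, assuming the requirement is attached via a HardwareResourceDescription and that the resource description is passed to the JobDescription constructor together with the software description (constructor details may differ between JavaGAT versions):

  HardwareResourceDescription hrd = new HardwareResourceDescription();
  // exclude one particular CE by adding a full GLUE requirement verbatim
  hrd.addResourceAttribute("glite.other",
          "!( other.GlueCEUniqueID == \"some_ce_of_your_choice\" )");

  JobDescription jd = new JobDescription(sd, hrd);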


Appendix B

GridLab license

GRIDLAB OPEN SOURCE LICENSE

The GridLab licence allows software to be used by anyone and for any purpose, without restriction. We believe that this is the best way to ensure that Grid technologies gain widespread acceptance and benefit from a large developer community.

Copyright (c) 2002 GridLab Consortium. All rights reserved.

This software includes voluntary contributions made to the EU GridLab Project by the Consortium Members: Instytut Chemii Bioorganicznej PAN, Poznańskie Centrum Superkomputerowo-Sieciowe (PSNC), Poznań, Poland; Max-Planck Institut fuer Gravitationsphysik (AEI), Golm/Potsdam, Germany; Konrad-Zuse-Zentrum fuer Informationstechnik (ZIB), Berlin, Germany; Masaryk University, Brno, Czech Republic; MTA SZTAKI, Budapest, Hungary; Vrije Universiteit (VU), Amsterdam, The Netherlands; ISUFI/High Performance Computing Center (ISUFI/HPCC), Lecce, Italy; Cardiff University, Cardiff, Wales; National Technical University of Athens (NTUA), Athens, Greece; Sun Microsystems Gridware GmbH, Germany; HP Competency Center, France.

Installation, use, reproduction, display, modification and redistribution with or without modification, in source and binary forms, is permitted provided that the following conditions are met:

1. Redistributions in either source-code or binary form along with accompanying documentation must retain the above copyright notice, the list of conditions and the following disclaimer.

2. The names "GAT Grid Application Toolkit", "EU GridLab" or "GridLab" may be used to endorse or promote software, or products derived therefrom.

3. You are under no obligation to provide anyone with bug fixes, patches, upgrades or other modifications, enhancements or derivatives of the features, functionality or performance of any software you provide under this license. However, if you publish or distribute your modifications, enhancements or derivative works without contemporaneously requiring users to enter into a separate written license agreement, then you are deemed to have granted GridLab Consortium a worldwide, non-exclusive, royalty-free, perpetual license to install, use, reproduce, display, modify, redistribute and sublicense your modifications, enhancements or derivative works, whether in binary or source code form, under the license stated in this list of conditions.

4. DISCLAIMER

THIS SOFTWARE IS PROVIDED "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. THE GRIDLAB CONSORTIUM MAKES NO REPRESENTATION THAT THE SOFTWARE, ITS MODIFICATIONS, ENHANCEMENTS OR DERIVATIVE WORK THEREOF WILL NOT INFRINGE PRIVATELY OWNED RIGHTS. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


List of Figures

1.1  The Grid layer model with the components of each layer
1.2  The middleware as a transparent component between APIs and services
2.1  The GSI proxy delegation process
2.2  GRAM job states
2.3  Example GRAM job registration process
2.4  Job registration in gLite
2.5  The information flow in L & B reporting
2.6  gLite job states (bold arrows on successful execution path)
3.1  JavaGAT's structure
3.2  Class diagram of the VOMS proxy management classes
3.3  Class diagram of the gLite adaptor
3.4  A simplified class diagram of the testbench
4.1  Globus scheduling latencies
4.2  Globus scheduling histogram (extreme outliers truncated)
4.3  Globus scheduling latency CDF
4.4  Globus scheduling times per day of the week
4.5  Globus scheduling times per daytime periods
4.6  Scheduling latency box plots per queue
4.7  gLite scheduling latencies
4.8  gLite scheduling latency CDF
4.9  Expectation of scheduling time (y-axis) against scheduling timeout value (x-axis)
4.10 gLite scheduling times per day of the week
4.11 gLite scheduling times per daytime period
4.12 gLite scheduling latency per queue
4.13 gLite scheduling latency with different Maui RMPOLLINTERVAL settings
4.14 CDF of gLite scheduling latency with different Maui poll intervals
4.15 Middleware notification overhead in Globus
4.16 Histograms of middleware notification overhead in Globus
4.17 CDFs of middleware notification overhead in Globus
4.18 Middleware notification overhead by Globus site
4.19 Middleware notification overhead before and after job execution in gLite
4.20 gLite IS overhead histograms
4.21 gLite IS overhead CDFs
4.22 Expected notification times (y-axis) in relation to timeout values
4.23 gLite IS overheads per queue
4.24 Post staging overhead in Globus
4.25 Data staging latencies per Globus site

List of Tables

4.1  Abbreviations used in gLite per queue boxplot
4.2  Summary of the overhead properties of all middlewares

Bibliography

[1] Worldwide LHC Computing Grid at a glance. Website. http://lcg.web.cern.ch/LCG/, visited on November 7th 2008.

[2] Ian Foster. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 1999.

[3] Ian Foster. What is the Grid? A three point checklist. GRIDtoday, 1(6), July 2002.

[4] Ein Petaflop Leistung. Website. http://www.netigator.de/netigator/live/fachartikelarchiv/ha_artikel/powerslave,pid,archiv,id,31673719,obj,CZ,np,,ng,,thes,.html, visited on November 7th 2008.

[5] Supercomputer Top500 list 06/2008. Website. http://www.top500.org/lists/2008/06, visited on November 9th 2008.

[6] D. Cooper, S. Santesson, S. Farrell, S. Boeyen, R. Housley, and W. Polk. Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile. RFC 5280 (Proposed Standard), May 2008.

[7] W. Allcock. GridFTP: Protocol Extensions to FTP for the Grid. GWD-R (Recommendation), Global Grid Forum, April 2003. http://www.globus.org/alliance/publications/papers/GFD-R.0201.pdf.

[8] J. Linn. Generic Security Service Application Program Interface Version 2, Update 1. RFC 2743 (Proposed Standard), January 2000.

[9] Rob V. van Nieuwpoort, Thilo Kielmann, and Henri E. Bal. User-friendly and reliable grid computing based on imperfect middleware. In Proceedings of the ACM/IEEE Conference on Supercomputing (SC'07), November 2007. Online at http://www.supercomp.org.

[10] Marik Marshak and Hanoch Levy. Evaluating web user perceived latency using server side measurements. Computer Communications, 26(8):872–887, 2003.

[11] Ahuva W. Mu'alem and Dror G. Feitelson. Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Transactions on Parallel and Distributed Systems, 12(6):529–543, 2001.

[12] B. Chun and D. Culler. User-centric performance analysis of market-based cluster batch schedulers, 2002.

[13] Emmanuel Medernach. Workload analysis of a cluster in a grid environment. In 11th Workshop on Job Scheduling Strategies for Parallel Processing, 2005.

[14] Michael Oikonomakos, Kostas Christodoulopoulos, and Emmanouel (Manos) Varvarigos. Profiling computation jobs in grid systems. In CCGRID '07: Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid, pages 197–204, Washington, DC, USA, 2007. IEEE Computer Society.

[15] Diane Lingrand, Johan Montagnat, and Tristan Glatard. Estimation of latency on production grid over several weeks. In ICT4Health, Manila, Philippines, February 2008. Oncomedia.

[16] Diane Lingrand, Johan Montagnat, and Tristan Glatard. Modeling the latency on production grids with respect to the execution context. In CCGRID '08: Proceedings of the 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID), pages 753–758, Washington, DC, USA, 2008. IEEE Computer Society.

[17] Tristan Glatard, Johan Montagnat, and Xavier Pennec. Optimizing jobs timeouts on clusters and production grids. In International Symposium on Cluster Computing and the Grid (CCGrid), pages 100–107, Rio de Janeiro, May 2007. IEEE.

[18] T. Ylonen and C. Lonvick. The Secure Shell (SSH) Protocol Architecture. RFC 4251 (Proposed Standard), January 2006.

[19] Ian T. Foster. Globus Toolkit version 4: Software for service-oriented systems. In Hai Jin, Daniel A. Reed, and Wenbin Jiang, editors, NPC, volume 3779 of Lecture Notes in Computer Science, pages 2–13. Springer, 2005.

[20] Von Welch, Ian Foster, Carl Kesselman, Olle Mulmo, Laura Pearlman, Jarek Gawor, Sam Meder, and Frank Siebenlist. X.509 proxy certificates for dynamic delegation. In Proceedings of the 3rd Annual PKI R&D Workshop, 2004.

[21] The OpenSSL project. Website. http://www.openssl.org.

[22] Karl Czajkowski, Ian T. Foster, Nicholas T. Karonis, Carl Kesselman, Stuart Martin, Warren Smith, and Steven Tuecke. A resource management architecture for metacomputing systems. In IPPS/SPDP '98: Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing, pages 62–82, London, UK, 1998. Springer-Verlag.

[23] Karl Czajkowski, Ian T. Foster, and Carl Kesselman. Resource management for ultra-scale computational grid applications. In PARA '98: Proceedings of the 4th International Workshop on Applied Parallel Computing, Large Scale Scientific and Industrial Problems, pages 88–94. Springer-Verlag, 1998.

[24] J. Sermersheim. Lightweight Directory Access Protocol (LDAP): The Protocol. RFC 4511 (Proposed Standard), June 2006.

[25] Sergio Andreozzi, Stephen Burke, Felix Ehm, Laurence Field, Gerson Galang, Balazs Konya, Maarten Litmaath, Paul Millar, and JP Navarro. GLUE specification v2.0.42. OGF specification draft, May 2008.

[26] K. Czajkowski, S. Fitzgerald, I. Foster, and C. Kesselman. Grid information services for distributed resource sharing, 2001.

[27] Berkeley Database Information Index. Website. https://twiki.cern.ch/twiki//bin/view/EGEE/BDII, visited on November 18th 2008.

[28] Roberto Alfieri, Roberto Cecchini, Vincenzo Ciaschini, Luca dell'Agnello, Alberto Gianoli, Fabio Spataro, Franck Bonnassieux, Philippa J. Broadfoot, Gavin Lowe, Linda Cornwall, Jens Jensen, David P. Kelsey, Ákos Frohner, David L. Groep, Wim Som de Cerff, Martijn Steenbakkers, Gerben Venekamp, Daniel Kouril, Andrew McNab, Olle Mulmo, Mika Silander, Joni Hahkala, and Károly Lörentey. Managing dynamic user communities in a grid of autonomous resources. CoRR, cs.DC/0306004, 2003. Informal publication.

[29] Stephen Burke, Simone Campana, Patricia Méndez Lorenzo, Christopher Nater, Roberto Santinelli, and Andrea Sciabà. gLite 3.1 Users Guide, 2006.

[30] Erwin Laure. EGEE middleware architecture planning (release 2). EU Deliverable DJRA1.4, July 2005. http://edms.cern.ch/document/594698.

[31] Rajesh Raman, Miron Livny, and Marvin Solomon. Matchmaking: Distributed resource management for high throughput computing. In Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing (HPDC7), Chicago, IL, July 1998.

[32] R-GMA: Relational Grid Monitoring Architecture. Website. http://www.r-gma.org/index.html.

[33] A. Shoshani, A. Sim, and J. Gu. Storage resource managers: Middleware components for grid storage, 2002.

[34] Tony Calanducci. LFC: The LCG File Catalog. Slides for NA3: User Training and Induction, June 2005. http://www.phenogrid.dur.ac.uk/howto/LFC.pdf.

[35] GFAL Java API. Website. https://grid.ct.infn.it/twiki/bin/view/GILDA/APIGFAL.

[36] Gregor von Laszewski, Ian Foster, Jarek Gawor, and Peter Lane. A Java Commodity Grid Kit. Concurrency and Computation: Practice and Experience, 13(8-9):643–662, 2001.

[37] Yoav Etsion and Dan Tsafrir. A short survey of commercial cluster batch schedulers. Technical Report 2005-13, School of Computer Science and Engineering, the Hebrew University, Jerusalem, Israel, May 2005.

[38] Jens Jensen, Graeme Stewart, Matthew Viljoen, David Wallom, and Steven Yound. Practical Grid Interoperability: GridPP and the National Grid Service. In UK AHM 2007, April 2007.

[39] Brett Bode, David M. Halstead, Ricky Kendall, Zhou Lei, and David Jackson. The Portable Batch Scheduler and the Maui Scheduler on Linux clusters. In Proceedings of the 4th Annual Linux Showcase and Conference, 2000.
