Autonomous Configuration of Grid Monitoring Systems

Ken'ichiro Shirose, Tokyo Institute of Technology, E-mail: [email protected]
Satoshi Matsuoka, Tokyo Institute of Technology / National Institute of Informatics, E-mail: [email protected]
Hidemoto Nakada, National Institute of Advanced Industrial Science and Technology / Tokyo Institute of Technology, E-mail: [email protected]
Hirotaka Ogawa, National Institute of Advanced Industrial Science and Technology, E-mail: [email protected]
Abstract

The problem with practical, large-scale deployment of Grid monitoring systems is that it takes considerable management cost and skill to maintain the level of quality required for production use, since the monitoring system will fundamentally be distributed, needs to run continuously, and will itself likely be affected by the various faults and dynamic reconfigurations of the Grid itself. Although automated management would be desirable, there are several difficulties: distributed faults and reconfigurations, component interdependencies, and scaling to maintain performance while minimizing the probing effect. Given our goal of developing a generalized autonomous management framework for Grid monitoring, we have built a prototype on top of NWS that automatically configures its “clique” groups and copes with single-node faults without user intervention. An experimental deployment on the Tokyo Institute of Technology’s Campus Grid (the Titech Grid), consisting of over 15 sites and 800 processors, has shown the system to be robust in handling faults and reconfigurations, automatically deriving an ideal clique configuration for the head login nodes of each PC cluster in less than two minutes.
1. Introduction

The users of Grids access large numbers of distributed computational resources concurrently. As an example, an Operations Research researcher performing a large branch-and-bound parallel search on the Grid may need hundreds of CPUs on machines situated across several sites. Another example would be multiple physicists requiring terascale or even petascale storage, database access, and processing of data across the globe in a large datagrid project. In all such practical deployments of shared resources on the Grid, Grid monitoring systems are an absolute must at all levels: for users observing the status of the Grid and planning their runs accordingly; for applications that adapt to the available cycles and network bandwidth; for middleware that attempts to do the same, such as job brokers and schedulers; and not least for the administrators who observe the overall “health” of the Grid and react accordingly if any problem arises at their sites. The problem with practical, large-scale monitoring deployment across the Grid is that it takes considerable management cost and skill to maintain the level of quality required for production use, since the monitoring system will fundamentally be distributed, needs to run continuously, and will itself likely be affected by the various faults and dynamic reconfigurations of the Grid itself. The requirement for automated management of the monitoring system itself, in the spirit of industrial efforts such as IBM’s “Autonomic Computing”[9], may seem apparent but has not been adequately addressed in existing Grid monitoring systems such as the Network Weather Service (NWS)[11] or R-GMA[2]. The technical difficulties with automated management of Grid monitoring systems are severalfold. Firstly, their constituent components will be fundamentally distributed and subject to faults and reconfigurations, as mentioned earlier. Another challenge is that Grid monitoring systems usually consist of several functional components that are heavily dependent on each other and interconnected by physical networks. Such dependencies complicate the handling of faults and reconfigurations, as the effect of any system alteration will propagate through the system and thus cannot be isolated. Also, a single point of failure would be undesirable, given the system’s distributed nature. Finally, the system should not add
significant probing effect to the Grid system being monitored, as the monitoring itself already introduces some level of performance intrusiveness. Given such a background, our goal is to develop a generalized autonomous management framework for Grid monitoring systems. Currently, a prototype has been built on top of NWS; it automatically configures its “clique” groups and copes with single-node faults without user intervention. An experimental deployment on the Tokyo Institute of Technology’s Campus Grid (the Titech Grid), consisting of over 15 sites and 800 processors on campus interconnected by a multi-gigabit backbone, has shown the system to be robust in handling faults and reconfigurations, automatically deriving an ideal clique configuration for all the nodes in less than two minutes.
2. Overview of Existing Grid Monitoring Systems

We first briefly survey existing Grid monitoring systems to investigate their component-level architecture.
2.1. Grid Monitoring Architecture

The Grid Monitoring and Performance working group of the Global Grid Forum (GGF)[5] proposed a Grid Monitoring Architecture (GMA)[10] specification in one of its documents. The specification describes the basic architecture of Grid monitoring systems, identifies the functionality required of each component, and allows for interoperability between different Grid monitoring systems. In particular, the GMA specification defines three components:

Producer: retrieves performance data from various Sensors and makes it available to other GMA components. Producers can be regarded as a component class of data sources. The GMA specification does not define how the Producer and the Sensors interact.

Consumer: receives performance data from Producers and processes it, for example by filtering or archiving. Consumers can be regarded as a component class of data sinks.

Directory Service: supports publication and discovery of components as well as of the monitored data.

The GMA components defined above, as well as external ones such as Sensors, are not stand-alone. Rather, they are interdependent, and an autonomous monitoring system must identify and maintain such dependencies in a low-overhead, automated fashion. In particular, there are two kinds of dependencies among GMA components, including the sensors.

Data transfer dependency: Since Producers, Consumers, and “external” GMA components such as sensors or archives communicate with each other over networks to transfer monitored data, there are natural dependencies among them. For example, a Producer may assume that a certain sensor is efficiently reachable over the network, so that monitored data can be pulled from the sensor. Any fault or reconfiguration that would invalidate such a dependence assumption requires the system to take some action to cope with the situation, based on the dependency information.

Registration dependency: In a similar fashion, Producers, Consumers, Sensors, etc. register themselves with the Directory Service, and as such there is a natural central dependence assumption that registrations will persist and be accessible from the Directory Service. Any change in the system must automatically be reflected in the component registrations, and the failure of the directory service itself must also be coped with.

The following are actual implementations of Grid monitoring systems used in some form of production:
2.2. The Network Weather Service (NWS)
The Network Weather Service (NWS) is a wide-area distributed monitoring system developed at SDSC, and consists of the following four components:

Sensorhost: measures CPU, memory, and disk usage, as well as network performance.

Memoryhost: stores and manages the short-term monitored data temporarily, and provides it to client programs.

Nameserver: synonymous with the GMA Directory Service. The sensorhosts and memoryhosts register themselves and their relevant information (e.g., the IP address of the node they run on) with the Nameserver.

Client program & Forecaster: the client program extracts monitored data for its own use, while the forecaster is a special client that makes near-term predictions from the monitored data.

NWS requires the Grid administrators to make configuration decisions upon starting up the system. For example, each sensorhost must be registered with the nameserver upon startup, at which point an administrator must decide which memoryhost this sensorhost sends its data to. This also implies that the nameserver and memoryhost must be running prior to sensorhost startup. Another issue with older versions of NWS had been that bandwidth measurements were required for every valid pair of nodes in the Grid; this meant continuous bandwidth measurement of O(n²) complexity (n: number of machines), which would have put considerable network traffic pressure on the entire Grid. To resolve this issue, later versions of NWS introduced the network “clique” feature, which greatly reduces the measurement cost. Each site administrator groups machines, typically machines on a local site with full mutual connectivity and no significant differences in mutual bandwidth, into “clique” groups,
and picks a representative node in each clique to perform O(m²) bandwidth measurements, where m is now the number of cliques. The bandwidth measured between the representative nodes of two cliques is then regarded as the bandwidth between any pair of nodes drawn from the respective cliques. The problem is that the clique grouping may not always be obvious, nor is it obvious how the representative nodes should be chosen. In the days of high-bandwidth wide-area networks, some machines in a local area may have less mutual bandwidth than machines across the wide area. And it is not obvious, without performing proper bandwidth measurements, which Grid node would best serve as a clique’s representative node. Moreover, when faults and changes occur in the network, the cliques as well as the representative nodes may have to be changed to best suit the new network configuration. To expect the network administrators to coordinate to maintain such a structure in a very large Grid is a daunting task.
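To make the saving concrete, the following short Python sketch counts the continuously measured node pairs with and without cliques. The numbers used (15 head nodes split into cliques of 8 and 7) are illustrative assumptions, not figures taken from the paper.

```python
def all_pairs(n):
    """Node pairs measured without cliques: every valid pair, i.e. O(n^2)."""
    return n * (n - 1) // 2

def clique_pairs(clique_sizes):
    """Pairs measured with cliques: all pairs inside each clique, plus one
    representative pair per pair of cliques, i.e. O(m^2) across cliques."""
    m = len(clique_sizes)
    return sum(s * (s - 1) // 2 for s in clique_sizes) + m * (m - 1) // 2

# Illustrative numbers only: 15 head nodes split into two campus cliques of 8 and 7.
print(all_pairs(15))            # 105 continuously measured pairs
print(clique_pairs([8, 7]))     # 28 + 21 + 1 = 50 measured pairs
```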
2.3. Globus MDS

The Globus Alliance[6] distributes the Monitoring and Discovery Service (MDS)[1] as part of the Globus Toolkit. The MDS focuses on being a scalable and reliable Grid information system, embodying various Grid configuration as well as monitoring information. The MDS consists of two components:

GRIS (Grid Resource Information Service): serves as a repository and collector of local resource information. For example, it collects, and sends to a GIIS on request, information about the local hardware (CPU, memory, storage, etc.) and software (OS version, installed applications and libraries, etc.). The Globus Toolkit itself uses the GIIS for various purposes, but arbitrary applications and middleware can also utilize GRIS by registering their own private information.

GIIS (Grid Index Information Service): nodes form an MDS tree or DAG that maintains a hierarchical organization of information services, with GIIS as the interior nodes and GRIS as the leaf nodes. A GIIS receives registrations from subordinate GRIS and GIIS nodes using a dedicated protocol called GRRP (Grid Registration Protocol). Middleware and applications look up monitoring information stored in the MDS hierarchy via query commands, through a Globus proxy.
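As a rough illustration of the hierarchical index idea (a toy model only, not the actual MDS protocol or API), a GIIS can be thought of as an interior node that fans lookups out to the GRIS and GIIS children registered with it:

```python
class GRIS:
    """Leaf of the hierarchy: holds local resource information for one site."""
    def __init__(self, info):
        self.info = dict(info)

    def query(self, key):
        return [self.info[key]] if key in self.info else []

class GIIS:
    """Interior node: indexes registered GRIS/GIIS children and aggregates lookups."""
    def __init__(self):
        self.children = []

    def register(self, child):           # stands in for GRRP registration
        self.children.append(child)

    def query(self, key):                # fan the lookup out to all children
        return [value for child in self.children for value in child.query(key)]

# Example: a top-level GIIS indexing two site-level GRIS servers.
top = GIIS()
top.register(GRIS({"os": "Linux 2.4", "cpus": 64}))
top.register(GRIS({"os": "Linux 2.4", "cpus": 128}))
print(top.query("cpus"))                 # [64, 128]
```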
2.4. Hawkeye

Hawkeye[7] is a monitoring system developed as part of the Condor Project[3]. It is based on Condor and ClassAd technology, in that it uses the Condor job execution mechanism itself to run the monitoring processes in a fault-tolerant way, and the protocol employed is essentially the Condor ClassAd. Hawkeye modules, which can be binaries or scripts, collect data from individual machines, and agents gather the monitored data from the modules. The Manager in turn gathers data remotely from the agents. These roughly correspond to the Sensors, Producers, and Consumers of the GMA model. Command-line, GUI, and web front-end tools are available for observing the monitored data.
2.5. R-GMA

R-GMA is being developed as part of Work Package 3 of the EU DataGrid Project[4]. It is based on the GMA architecture but also combines relational database and Java servlet technology: R-GMA implements the producer, consumer, and registry as Java servlets, and uses relational databases for registration. In addition, producers and consumers provide their own interfaces. Although R-GMA is flexible enough to accommodate different types of sensors, the EU DataGrid will employ Nagios for data collection.
2.6. Summary of Existing Monitoring Systems

As pointed out above, the existing Grid monitoring systems have much in common in their architecture, although each system names its components differently and there are some subtle differences among them. Table 1 summarizes the Grid monitoring components.
[Figure 1. GMA Components and Sources of Data: a Consumer, Directory Service, and Producer linked by registration and data transfer dependencies, with Sensors, Applications, other Monitoring Systems, and Databases as the sources of data feeding the Producer.]
3. Autonomous Configuration of Grid Monitoring Systems

Based on the analysis of the previous section, we are currently designing and developing a general framework for autonomously configuring Grid monitoring systems.
Table 1. Summary of Grid Monitoring Systems

GMA Component       NWS          MDS                 Hawkeye           R-GMA
Directory Service   nameserver   GIIS                Manager           Registry Servlet
Producer            memoryhost   GRIS                Agent & Manager   Producer Servlet
Consumer            Commands     Commands            Commands          Consumer Servlet
Source of Data      sensorhost   GRIS & middleware   modules           some sensors
The system will be “aware” of the correct configuration of the Grid, based on various information, including its own probing data, that the system uses to determine the configuration of the Grid, especially the status of the nodes and the network topology. The following is the set of requirements we imposed on such a system:

• Applicability: support for multiple, existing Grid monitoring systems.
• Scalability: scaling to large numbers of nodes interconnected by complex network topologies.
• Autonomy: management of the Grid monitoring system, including coping with dynamic faults and reconfigurations, must be largely autonomous, with very little user intervention.
• Extensibility: the framework should be extensible to incorporate various autonomic, self-management features.

The autonomic management of Grid monitoring systems centers largely on the following issues. The first is the configuration of the monitoring system: identifying component dependencies, registering with the directory service, starting the sensor, producer, and consumer processes, preparing the storage for data collection, and so on. This may not be done all at once; rather, it must be possible for (re-)configurations to occur gradually as nodes enter and leave the Grid. Any groupings, such as NWS cliques, must also be handled here, by observing and choosing the appropriate groupings as well as the representative nodes via dynamic instrumentation. The second issue is detecting and handling faults in the Grid monitoring system itself. There can be several types of faults:

• Monitoring process termination: a process of the monitoring system gets terminated, for example by accidental signalling or an OS reboot.
• Node loss: a node is physically lost due to hardware failure, power loss, etc. In this case the system must recover what it can of the current monitoring information, and also configure an alternate node for running the affected component; the recovery action depends on which component the node had been running.
• Network loss: although difficult to distinguish from node loss, sometimes a network may become disconnected while alternate paths remain available for indirect
communication. In this case some proxies may be designated, or the monitoring system may temporarily split up and operate as separate parts, to be merged again when communication recovers.

In all cases, the loss of communication between components is the first sign of failure. The system must then proceed to determine what fault has actually occurred, by probing the system dynamically to discover whether the subject node is dead or alive, whether alternate network paths exist, and so on. The monitoring feature of the monitoring system itself may be used for this purpose where appropriate.

To satisfy the above goals, we apply the following component allocation and execution strategies in our current prototype.

• The system first estimates the network topology of the Grid nodes, and diagnoses whether the particular Grid monitoring components will execute correctly on each node.
• It then determines and forms the node groups that serve as cliques. For newly added nodes it will edit and re-form the group memberships accordingly.
• It next decides which nodes the respective Grid monitoring components should execute on.
• Finally, it actually starts up the components on the assigned nodes, and registers them with the relevant directory service(s).

The executability of each component can be determined by the ready-made diagnostic utilities that come with each Grid monitoring system. Such a check is essential, since some components serve roles so important that all other monitoring components depend on them; one such example is the directory server component. Once the system starts executing, it must support dynamic removal of faulty nodes from a group, and addition of machines that have recovered or been designated as replacements. If the group configuration changes in any way, such a change must be registered, advertised, and readily noticed by the other parts of the system. For example, when the representative node of a particular group changes, the change must be made known to all the other representative nodes as well as the other components that need it. Such actions must be performed without incurring significant CPU or networking costs.
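As a sketch of the fault-determination step just described, the classification below maps observable probe results onto the three fault types. The probe inputs (direct reachability, local process status, reachability over an alternate path) are assumptions about what the framework can measure, not a description of the prototype's code.

```python
from enum import Enum, auto

class Fault(Enum):
    PROCESS_TERMINATION = auto()   # the monitoring process itself died
    NODE_LOSS = auto()             # the node appears dead from everywhere
    NETWORK_LOSS = auto()          # the node is alive but not directly reachable

def classify_fault(node_reachable, process_running, alternate_path_alive):
    """Classify a detected loss of communication according to the fault types above."""
    if node_reachable:
        # The node answers: either the component died, or the loss was transient.
        return Fault.PROCESS_TERMINATION if not process_running else None
    if alternate_path_alive:
        return Fault.NETWORK_LOSS
    return Fault.NODE_LOSS
```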
4. Prototype Implementation

We have implemented a prototype of an autonomously configuring Grid monitoring system on top of NWS. The current prototype configures all the NWS components automatically, and recovers from process and node failures of some, if not all, of the components. Currently, the autonomous management functions are executed on a single Grid node; this is not ideal, as it hampers scalability and makes the node a single point of failure within the system. We plan to replicate and distribute the management functionality to solve both problems; as a proof of concept for many of the features, the current system suffices. The prototype configures NWS automatically in the following manner:

• Sensorhosts are executed on all the nodes that the administrator listed.
• The nameserver is executed on one of the nodes the administrator listed.
• Network distances between the nodes are determined by actively measuring and averaging the RTT between the nodes.
• Given the RTT information, the nodes are grouped into cliques, and a representative node is chosen for each. A node is also chosen to act as the memoryhost of each group.

To determine the RTTs for the node grouping above, we systematically ICMP “ping” the nodes in parallel in an n-by-n fashion. From a list of machines provided by the Grid administrator, we generate two shell scripts: one that runs on the machine acting as the autonomic monitor manager, and another that runs on all machines and pings all other nodes in the network. The scripts are transferred to the necessary nodes and executed using a secure invocation mechanism provided by the Grid itself, or some other mechanism such as ssh. For each Grid node, we record the node with the minimum RTT, excluding the node itself, as its most proximal node. Each node also calculates the average of the RTTs to all other nodes that have responded. After all the RTT measurements have finished, the average RTT data are sent to the autonomic monitor manager by every node in the system. The autonomic monitor manager in turn organizes the nodes into (clique) groups in the following bottom-up fashion (a code sketch of this procedure is given at the end of this section). A node is chosen as the “current node”, a singleton group is created with the current node as its only member, and that group is designated the “current group”. Then, the following process is repeated:

• If the most proximal node of the current node belongs to another group, the two groups are merged. Then a new group is created with some arbitrary non-member node as its only member, and that node and group are designated the new current node and current group, respectively.
• If the most proximal node belongs to the same group, a new group is created with some arbitrary non-member node as its only member, and that node and group are designated the new current node and current group, respectively.
• If the most proximal node does not belong to any group, it becomes the new current node and is added to the current group.

The autonomic monitor manager then designates as the NWS nameserver the Grid node with a) the most ping connectivity with other machines and b) the minimum average RTT from the other nodes, as recorded above. Then, for each group, the node that is most proximal to the largest number of nodes in the group is designated the NWS memoryhost, and that node is also chosen to be the group (clique) representative. Now that the system has sufficient configuration information, it determines the dependencies between the NWS components and the nodes the components will actually be running on. As described earlier, the NWS memoryhost needs to know the node and port number where the nameserver will run and listen. The NWS sensorhost also needs the nameserver information, as well as the host and port number of the memoryhost it should send its data to, and so on. Such configurations are formatted as command-line startup options for the NWS components, to be executed on the respective nodes via some Grid execution service, in the order of nameserver, memoryhost, and sensorhosts.
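The bottom-up grouping procedure described above can be sketched as follows. This is a minimal illustration assuming only the per-node “most proximal node” information produced by the RTT measurements; the function and variable names, the tie-breaking choices, and the representation of groups are ours, not the prototype's.

```python
def group_nodes(proximal):
    """Bottom-up clique grouping (a sketch of the procedure in Section 4).
    `proximal` maps each node name to its most proximal node, i.e. the node
    with the minimum RTT from it.  Returns a list of groups (sets of nodes)."""
    nodes = list(proximal)
    group_of = {}                        # node -> group id
    groups = {}                          # group id -> set of member nodes
    next_id = 0

    def new_group(seed):
        nonlocal next_id
        gid = next_id
        next_id += 1
        groups[gid] = {seed}
        group_of[seed] = gid
        return gid

    current = nodes[0]
    current_gid = new_group(current)

    while True:
        prox = proximal[current]
        if prox not in group_of:
            # Proximal node is still ungrouped: pull it into the current group
            # and continue walking the proximity chain from it.
            groups[current_gid].add(prox)
            group_of[prox] = current_gid
            current = prox
            continue
        if group_of[prox] != current_gid:
            # Proximal node already belongs to another group: merge the two.
            other = group_of[prox]
            for n in groups[other]:
                group_of[n] = current_gid
            groups[current_gid] |= groups.pop(other)
        # Restart from an arbitrary node that is not yet in any group.
        ungrouped = [n for n in nodes if n not in group_of]
        if not ungrouped:
            break
        current = ungrouped[0]
        current_gid = new_group(current)

    return list(groups.values())

# Toy example: a-nodes are mutually proximal, as are b-nodes (two "campuses").
rtt_neighbours = {"a1": "a2", "a2": "a1", "a3": "a1", "b1": "b2", "b2": "b1"}
print(group_nodes(rtt_neighbours))   # e.g. [{'a1', 'a2', 'a3'}, {'b1', 'b2'}]
```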
5. Fault Handling and Recovery of NWS Components in the Prototype

The current prototype handles two types of NWS faults. The first is when a component simply fails to execute, or terminates unexpectedly. The second is when the autonomic monitor manager loses network access to a node due to trouble in the node's hardware, OS, or network. (We currently do not distinguish between node failure and network failure.) In the first case, the autonomic monitor manager simply attempts to re-execute the failed component. If the component fails again after repeated attempts within a very short interval, the node is deemed to have a problem and is regarded as a node failure (i.e., the second case). In the second case, if the failed node was executing either the NWS memoryhost or the nameserver, the autonomic monitor manager must designate, prepare, and restart the service on a replacement machine. Moreover, the components running on other nodes must be notified as well; for example, when the nameserver is restarted on another node with a different IP address, all the sensorhosts and memoryhosts must be notified of this fact and re-registered with the new nameserver under the new configuration. Similarly, if the memoryhost crashes and is restarted on another node, all the affected sensorhosts must be told to redirect their data.
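The two-level policy above can be summarized in a short sketch. The callable parameters stand in for the prototype's restart, replacement-selection, and notification mechanisms, which the paper does not specify; the retry limit is likewise an assumption.

```python
def handle_failure(component, node, restart, choose_replacement, notify_dependents,
                   retry_limit=2):
    """Two-level recovery policy sketched from Section 5: try to re-execute the
    failed component in place; if it keeps failing within a short interval,
    regard the node itself as failed, move the component to a replacement node,
    and notify the components that depend on it."""
    for _ in range(retry_limit):
        if restart(component, node):
            return node                       # case 1: simple process failure
    # Case 2: repeated failure in a short interval -> treat as a node failure.
    replacement = choose_replacement(component, exclude={node})
    restart(component, replacement)
    # A restarted nameserver forces every memoryhost and sensorhost to
    # re-register; a restarted memoryhost forces its group's sensors to redirect.
    notify_dependents(component, replacement)
    return replacement
```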
For the first case, the current prototype simply monitors component execution status using the ps command. The configuration information is held in a file on the node where the autonomic monitoring manager is running. To detect whether a node is up, we periodically check whether we can establish an ssh connection to the machine. To restart the components, the manager generates a shell script for doing so and executes it on the replacement host.
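A minimal sketch of this detection and restart mechanism is given below, assuming passwordless ssh access to the Grid nodes; the use of `ps -C`, the timeout value, and the function names are assumptions of the sketch, not details taken from the prototype.

```python
import subprocess

def component_running(host, process_name, timeout=10):
    """Check via ssh + ps whether a monitoring component is still running on
    `host`.  Returns True/False, or None when the ssh connection itself fails,
    which we treat as a possible node or network failure."""
    try:
        result = subprocess.run(["ssh", host, "ps", "-C", process_name],
                                capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None                      # host did not answer in time
    if result.returncode == 255:
        return None                      # ssh-level failure (host unreachable)
    return result.returncode == 0        # ps exits 0 iff the process was found

def run_restart_script(host, script_path):
    """Execute a generated restart script on a (replacement) host via ssh."""
    return subprocess.run(["ssh", host, "sh", script_path]).returncode == 0
```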
Table 2. Time for initial configuration of our prototype on the Titech Grid

Clusters   RTT measurement (sec)   NWS execution (sec)   Total (sec)
3          21                      19                    40
6          39                      30                    69
10         76                      52                    128
6. Evaluation of the Prototype

For evaluation, we installed our prototype on the Campus Grid at the Tokyo Institute of Technology, the “Titech Grid” for short. The Titech Grid has 15 PC clusters totaling over 800 processors, spread over the two Titech campuses, which are situated approximately 30 km apart. The entire campus is covered by SuperTITANET, a multi-gigabit campus backbone. Each Titech Grid node is designed to be connected directly to the backbone via a managed switch, and can communicate peer-to-peer with any other node on the Grid. For the experiment, we employed the head login nodes of each PC cluster to run the NWS components.
6.1. Initial Setup
We show the time required for the initial setup of the system in Table 2. The time required is on the order of tens of seconds, and is proportional to the number of clusters. A breakdown shows that approximately 50% of the time is spent collecting the RTT data, and the rest on executing the components via ssh. The grouping algorithm worked very well, automatically splitting the clusters into two groups, one for each campus. This is due to the fact that the average RTT between the campuses is approximately 2-3 times greater than the RTT between nodes located at the farthest ends of the same campus. (The groups are naturally cliques in this case, since all nodes have peer-to-peer connectivity.)
[Figure 2. Result for 10 clusters on the Titech Grid: the head nodes (tgn002001, tgn007001, tgn008001, tgn010001, tgn011001, tgn013001, tgn014001, tgn015001, tgn016001, tgn018001) are split into an Ookayama group and a Suzukakedai group. N marks the NWS nameserver, M the memoryhosts, S the sensorhosts; data flows from the sensors to the memoryhosts, and network performance is measured between the group representatives.]
6.2. Fault Recovery

We first investigate the scenario in which the cluster node that executes a memoryhost crashes. The upper half of Figure 3 shows the configuration prior to the crash. There are two groups in the Grid, and tgn015001 and tgn005001 execute the memoryhost components (the cluster nodes are real ones from the Titech Grid). Sensors send the monitored data to the memoryhost of their own group. When tgn015001 crashes, recovery is performed automatically, and tgn013001 is designated as the replacement. The appropriate memoryhost restart as well as the sensor redirections are performed, as shown in the bottom half of Figure 3. Notice that only the cluster nodes in the right-hand group are affected. Similarly, Figure 4 shows what happens when the cluster node that runs the NWS nameserver crashes. The whole recovery process took 39 seconds according to our measurements, 38 seconds of which was spent re-measuring the RTTs and the status of the components; the actual configuration decision and restart took less than 1 second. Again, the entire process worked automatically as intended. The results show that (1) our prototype can cope well with the limited fault scenarios under the current configuration, but (2) for the current system to scale beyond hundreds of nodes (clusters), the measurement time must be reduced drastically. Currently, much of the overhead is attributable to the ssh execution of the measurement processes; we must devise a faster, more persistent measuring scheme to amortize this overhead.

[Figure 3. Example of memoryhost fault recovery. tgn*** is the name of each node; N: Nameserver, M: Memoryhost, S: Sensor; arrows indicate data flow from the sensors to the memoryhosts. The upper half shows the configuration before tgn015001 crashes; the lower half shows tgn013001 taking over as the memoryhost of the right-hand group, with that group's sensors redirected to it.]

[Figure 4. Example of NWS nameserver recovery. After the node running the nameserver crashes, the nameserver is restarted on another node and the sensorhosts and memoryhosts re-register with it.]
7. Conclusion and Future Work

We have argued that Grid monitoring systems need to configure themselves automatically, and presented a prototype on top of NWS that deals with a limited but common set of faults. The prototype worked well in a limited setting on the Titech Grid, showing reasonable startup as well as recovery performance, and the system executed autonomously as expected.
As future work, we must make the algorithm more distributed, for several reasons. One is performance: we need to parallelize the measurements to attain scalability. The other is to distribute the functions of the autonomic monitoring manager so that no single point of failure exists, and so that the system can cope with “double” faults. The algorithm for determining the cliques must be made distributed and improved. Finally, we must test the effectiveness of our algorithm on a much larger Grid. One idea is to regard every cluster node of the Titech Grid as a Grid node, which would give more than 400 nodes throughout the entire Grid. We will also employ the Grid testbed being built by the NAREGI project[8], which will come into operation in the spring of 2004.
Acknowledgments

This work is partly supported by the National Research Grid Initiative (NAREGI), initiated by the Ministry of Education, Culture, Sports, Science and Technology (MEXT).
References

[1] W. Allcock, J. Bester, J. Bresnahan, A. Chervenak, I. Foster, C. Kesselman, S. Meder, V. Nefedova, D. Quesnel, and S. Tuecke. Data management and transfer in high-performance computational grid environments, 2001.
[2] B. Byrom et al. DataGrid information and monitoring services architecture: Design, requirements and evaluation criteria, 2002.
[3] Condor Project: http://www.cs.wisc.edu/condor/.
[4] EU DataGrid Project: http://eu-datagrid.web.cern.ch/eudatagrid/.
[5] Global Grid Forum: http://www.ggf.org/.
[6] Globus Alliance: http://www.globus.org/.
[7] Hawkeye: http://www.cs.wisc.edu/condor/hawkeye/.
[8] National Research Grid Initiative: http://www.naregi.org/.
[9] P. Pattnaik, K. Ekanadham, and J. Jann. Autonomic computing and Grid. In Grid Computing (edited by Fran Berman et al.), 2003.
[10] B. Tierney, R. Aydt, D. Gunter, W. Smith, V. Taylor, R. Wolski, and M. Swany. A Grid monitoring architecture, 2002.
[11] R. Wolski, N. T. Spring, and J. Hayes. The Network Weather Service: a distributed resource performance forecasting service for metacomputing. Future Generation Computer Systems, 15(5-6):757-768, 1999.