Autonomic computing and IBM System z10 active resource monitoring
T. B. Mathias and P. J. Callaghan
Among the essential components of the IBM System z10* platform are the hardware management console (HMC) and the IBM System z* support element (SE). Both the SE and the HMC are closed fixed-function computer systems that include an operating system, many middleware open-source packages, and millions of lines of C, C++, and Java** application code developed by IBM. The code on the SE and HMC is required to remain operational without a restart or reboot over long periods of time. In the first step toward the autonomic computing goal of continuous operation, an integrated, automatic software resource monitoring program has been implemented and integrated in the SE and HMC to look for resource, performance, and operational problems, and, when appropriate, initiate recovery actions. This paper describes the embedded resource monitoring program in detail. Included are the types of resources being monitored, the algorithms and frequency used for the monitoring, the information that is collected when a resource problem is detected, and actions executed as a result. It also covers the types of problems the resource monitoring program has detected so far and improvements that have been made on the basis of empirical evidence.
Introduction

A typical IBM System z10* platform is shown in Figure 1(a). The customer operates the z10* system using a graphical user interface (GUI) available on the hardware management console (HMC) [1]. The HMC is a desktop computer running an operating system (OS), many integrated open-source packages (such as communications, security, and GUI packages), and a few million lines of custom software, often described as firmware or Licensed Internal Code (LIC). Using a local area network (LAN), it communicates with the two support elements (SEs) that are located in the Z-frame [2]. While only one HMC is required, customers normally purchase more for redundancy. Both the HMC and SEs are closed fixed-function devices, and the customer cannot install any code or applications on the systems other than the firmware supplied by IBM. A typical high-availability configuration is to have at least two HMCs communicating via two LANs to the two
SEs, as shown in Figure 1(b). An SE is a laptop computer running an OS, many integrated open-source packages, and several million lines of LIC, developed by several IBM development teams throughout the world. The SEs and HMCs are critical to the continuous operation of the System z10 platform, because they are essential for controlling the system. Autonomic computing is part of the IBM information technology (IT) service management vision. Autonomic computing systems have the ability to manage themselves and dynamically adapt to change in accordance with business policies and objectives, enabling computers to identify and correct problems, often before they are noticed by IT personnel. A key goal of IBM autonomic computing is continuous operation, achieved by eliminating system outages and making systems more resilient and responsive. Autonomic computing takes its inspiration from the autonomic nervous system, which automatically regulates many lower-level functions in
Figure 1  IBM System z* mainframe: (a) basic configuration, in which the A-frame contains the processors, memory, and some I/O, the Z-frame contains the support elements (SEs) and some I/O, and the hardware management console (HMC) communicates with the SEs over a local area network; (b) a typical configuration with two HMCs and two SEs.
animals [3], for example, balance, digestion, and blood circulation [4, 5]. According to Ganek and Corbi [3], four key fundamental attributes are required for a computer system to be autonomic: It must be self-configuring, self-healing, self-optimizing, and self-protecting. Ganek and Corbi outline how these features can be implemented in a computer system via a control loop that can be described as a monitor, analyze, plan, and execute (MAPE) loop. Also required for implementation are the resources that can be monitored (via sensors) and changed (via effectors). This means that an autonomic system will monitor resources, analyze the data, figure out what to do, and then make the necessary changes, all within a loop intended to ensure that the system performs the desired functions. Additionally, autonomic elements may be cascaded together to interact with each other over an autonomic signal channel [6].

Autonomic function can be implemented in three different ways: locally, using knowledge obtainable on the system; through a peer-group autonomic function that requires the local community to share knowledge; and through a network-based autonomic function that can include software updating and backup and restore [7]. Bantz et al. [7] state that "a computer system is autonomic if it possesses at least one of the four key attributes." Referring to Figure 1, one could consider the SEs and HMCs to be merely part of a single z10 system. However, for the purposes of this paper, Figure 2 is a more appropriate representation because it shows the HMCs and SEs as separate computer systems that happen to work with other firmware components to form the complete IBM System z10 platform. Thus, here we treat the SE and the HMC as separate systems.

Solution requirements
The SE and HMC have had considerable problem detection and reporting capability for many years. For example, the firmware is written to verify the ability to allocate new resources and, if resources are exhausted, report an error back to IBM through the IBM RETAIN* database (Remote Technical Assistance Information Network) used for system health monitoring and preventive maintenance. This would result in the customer taking some action through the GUI or in service personnel being dispatched to resolve the problem. In severe cases, the SE or HMC would automatically reboot, with the intention of cleaning up the resource problem; however, even the few minutes required to do this can have a negative impact on the customer's business. Despite a rigorous development process that includes design reviews, code reviews, and extensive manual and automated simulation testing, code bugs still persist.

As the first step toward an autonomic SE and HMC, it was decided to focus on the self-healing aspect of autonomic computing and, in particular, on an autonomic element (AE) that primarily monitors and performs low-risk actions to remediate problems when they are detected. In addition to detecting and solving problems, any solution also had to meet the following requirements:

- Any management or remediation to be performed had to be low risk. Because the AE is code, there is a potential for it to make a wrong decision. It is important that an action taken does not create a more severe problem or adversely affect other parts of the system.
- The solution had to have a relatively minimal impact on SE and HMC performance and other resources, such as RAM (random access memory) and direct access storage devices (DASDs).
- The solution had to be something that could be shipped to the field with few to no false positives (see the section "False positives" below), but at the same time it had to be powerful enough to use during development and test in order to find resource problems. (To date, quite a few resource problems have been detected and fixed in development and test, and a few were found in the field.)
- The solution had to work with different OSs and not be limited to only a few compiler languages.
- The solution could not require the LIC in the SE or HMC to be instrumented, as this was impractical given the many different packages that comprise the LIC. An additional concern was that instrumentation could slow down the SE or HMC.
- The solution had to work well in two environments: the customer site where the SE and HMC were shipped, and the development test environment, where the SEs and HMCs are restarted on a very regular basis, often daily.
- The solution could not be limited to only the System z platform SE and HMC. It also had to be operational in the HMC for the IBM pSeries* [8] product.
Related work

Monitoring of computer systems is not new. OSs normally offer basic tools, and additional resource monitoring programs are available for common OSs. Examples of basic monitoring tools include the procps package for Linux** [9] (which includes several commands, such as top, ps, free, and vmstat), the IBM z/OS* Resource Measurement Facility (RMF) [10], and the Microsoft Windows** Task Manager [11]. These tools show utilization of resources, but they are premised on an expert reviewing the data to determine whether there is a problem and where it lies. There are also many excellent tools available to help with problem determination, but in order to function, the tools have to be in use while the problem is occurring. These tools, for example, Valgrind [12], are not normally run in production because their impact on system performance is too severe. There are basic resource-limiting tools, for example, the ulimit command built into shells such as Bash [13], that can be used to restrict the resource usage of a process started by the shell. However, the configuration of these limits does not allow for notification or first failure data capture (FFDC) as the limits are approached. For more details, see the section "First failure data capture" below. More sophisticated tools include the Microsoft Management Console (MMC) Performance Console [14]
Figure 2  System z10 firmware structure, showing the primary and alternate SEs with their HMCs, the LPAR hypervisor, i390 code, CFCC, processing unit millicode, I/O processor code, channels, FSPs, and power (GA n and GA n + 1 code levels).
and Nagios** [15]. Both of these tools provide a means to monitor resources, log problems, and issue notifications, but they still require an expert to set up the thresholds and, when a notification of a problem is received, to diagnose the system to determine the cause of the problem. Similar to the active monitoring solution, Nagios also has plug-ins to perform actions, such as restarting an HTTP (Hypertext Transfer Protocol) server if it goes into a nonoperational state. Nagios also provides a rich infrastructure for plug-ins to be developed. This infrastructure can be used to monitor resources and to configure services for event handling and event notification. However, it does not support the concept of the infrastructure aggregating information from several sources and sending it along with a notification, a capability that is required to analyze the problem offline from a system perspective. For example, when the IBM solution described in the next section detects a problem, data is collected on how the entire system is performing, including aggregated system-level information (such as the levels of CPU, memory, file handling, and network resource usage) and internal process information (such as the trace events in the HMC and SE firmware processes, which are spread across multiple processes and address spaces). When the collection phase has completed, the aggregated information is automatically sent to the RETAIN system. Several autonomic monitors have been described, for example, Personal Autonomic Computing Tools [16]. This work describes self-monitoring for potential personal system failures and outlines several components
that were monitored in a systems monitor utility prototype: processor work rate, memory usage, thread monitoring, and services and processes. However, this was only a prototype and it did not include any support to actually execute an action automatically. Another area of personal computer health monitoring is the concept of a pulse monitor [17]. The assumption behind pulse monitoring is that dying or hanging processes can indicate how healthy or unhealthy a system is. The problem with using an approach such as this on the HMC or SE is that many processes are considered so critical to normal operations that if they die, the SE or HMC must be restarted. In other words, even one dying or hung process is a critical failure that must be avoided.
The IBM solution: Active resource monitoring

The approach we took was to implement a permanent active resource monitoring (ARM) program as an AE for the SE and HMC using the standard MAPE approach. In the monitoring phase, it periodically runs and gathers resource information. Then it analyzes the data and determines whether a problem exists. In the plan and execute phases, a set of actions is determined and performed. For example, upon detection of a problem, FFDC information is collected and saved, and a notification that service is required is generated. In some cases, the program will also initiate a recovery action or actions. Currently, the actual recovery actions are limited by the previously stated requirements to minimize risk. They consist of erasing selected files and terminating programs. Only programs that are deemed noncritical are eligible for termination, unless the overall SE or HMC resources are so low that terminating the program is preferable to the problems that would occur if the entire SE or HMC ran out of that resource.

Compared with the related work described above, the ARM solution has the following benefits:
1. Automatic cleanup of some resources is performed.
2. Under certain conditions, programs with unhealthy amounts of resource utilization are terminated.
3. When a problem occurs, instead of just knowing that there is a problem, sufficient data is captured from multiple sources to enable an offsite expert at IBM to later investigate the problem and correct the firmware bug without requiring a recreation of the problem.
4. The code is not a prototype; it is being used by IBM customers. It has undergone extensive use and modification to verify that it captures the problems we want to find and that it does not report nonproblems (false positives).
5. Programmatic control of the monitor and its thresholds is provided.

Overview

When ARM is initiated, it reads in a properties file that is used to configure it. The properties file is a standard key and property value file and is easily edited if an expert needs to alter the behavior of ARM. The properties file contains the following: a flag to disable the checking if desired, the checking frequency, the thresholds (i.e., the rules to indicate when something is a problem and to control automatic recovery actions), and a list of monitor extensions to call.

ARM is basically a large MAPE loop. It periodically checks resources (60 seconds is currently specified in the properties file). It obtains data about a type of resource and then checks that resource against the thresholds. If needed, it will initiate recovery actions at that time if supported. If any problem is detected, at most one service call is placed to service personnel. In addition, after a second, longer period (currently 6 hours), additional long-term trends are checked and a snapshot of the collected resource data is saved in a file. This file is one of the pieces of data collected when a resource problem is identified because it can show long-term trends in the system. Because ARM monitors some resources that are inaccessible to low-privilege users, and since the monitor performs actions such as collecting FFDC data and terminating processes, it runs under a high-privilege user identification.
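For illustration only, such a file might contain key and value pairs of the following general shape. The key names and most of the values below are invented for this sketch (only the 60-second checking interval, the 6-hour snapshot interval, and the 80.0% DASD threshold mentioned in this paper are taken from the text); they are not the actual keys shipped with ARM.

# Hypothetical ARM properties excerpt (illustrative key names and values)
monitor.enabled=true
monitor.check.interval.seconds=60
monitor.snapshot.interval.hours=6
threshold.dasd.used.percent=80.0
threshold.memory.used.percent=90.0
threshold.process.default.virtual.mb=512
threshold.process.someJavaProcess.virtual.mb=1200
extensions=jvmMonitor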
Configurable thresholds

All resource issues that can be reported have configurable thresholds in the properties file. The thresholds are relative; that is, instead of absolute values, they are expressed as a percentage of the capacity of a resource. For example, most DASD partitions have a threshold of 80.0%, so if that DASD is 80% or more full, then it is considered a problem. If we change the size of the DASD partition, the properties file does not have to change. Likewise, the threshold for the total amount of memory (RAM plus swapped memory) is also a percentage. Again, if the amount of RAM or swap space changes, it does not require a change in the properties file. Absolute thresholds include those related to memory usage within a process. The SE and HMC are currently running a 32-bit OS, so it is important to make sure that all processes fit within the address space limitations of such a system.

In addition to the properties file, application programming interfaces (APIs) are provided with ARM to allow other firmware to dynamically adjust the thresholds. For example, some firmware needs an extra large amount of memory when performing a specific task, such as the power-on reset. This program calls the API to raise its process limits, and then, when the power-on reset is complete, it calls the API again to restore the checking to normal limits. All process-specific thresholds can be overridden, and the system-wide limit for processor utilization can be overridden as well. The DASD limits cannot be overridden at this time because sufficient DASD is provisioned on the SE and HMC so that the thresholds should not be reached.

Establishing thresholds

Initially, thresholds were established by simply looking at the available resources in the SE and HMC. For example, it was decided that we did not want used file space to exceed 80% of that available. This threshold has remained set at this value; when we found we were nearing this limit due to the system design, the file system was changed to provide more space. For process memory utilization, the monitoring program's long-term trend files were captured from test systems that had run for a lengthy period of time or that had performed testing that stresses the SE or HMC. Fortunately for the SE and HMC, most processes have a unique name. A tool was written to find the maximum memory utilization for each process, and the thresholds were then set to 125% of this value. During the internal test cycles, the monitoring continues to run, and as problems are detected, they are manually analyzed. Some have turned out to be due to resource problems; in those cases, the program was corrected. However, if the resource utilization was determined to be correct (perhaps it increased due to a required design change), then the threshold was increased.

The SE and HMC do have advantages over a normal personal computing environment in that they are closed. The SE or HMC installs itself from the master media, and after restoring any previous user customizations, it automatically starts and runs. This master image is created through a very tightly controlled development process, so it is well known when a new piece of code will appear that requires experimentation to determine the proper threshold. While we could fully automate the setting of thresholds on the basis of a collection of trend data from various systems, we decided that, for now, we still want to have an expert in the process. Requiring that a developer justify a significant increase in resource utilization forces a design or code inspection of the changed area, and that ensures that the design and code are acceptable.
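The threshold-override API described under "Configurable thresholds" is not published in this paper. Purely as a hedged sketch of the pattern (raise a limit around a long task such as power-on reset, then restore it), the two calls could be wrapped as follows; every name here is invented.

#include <cstdint>
#include <iostream>

// Hypothetical stand-in for the ARM threshold-adjustment API; the real API is
// not shown in the paper, so the class and method names are invented.
class ArmThresholds {
public:
    void raiseProcessMemoryLimit(std::uint64_t newLimitBytes) {
        std::cout << "override: limit raised to " << newLimitBytes << " bytes\n";
    }
    void restoreDefaults() {
        std::cout << "override: limits restored from properties file\n";
    }
};

// RAII helper so that the restore call cannot be forgotten on any exit path.
class ScopedMemoryOverride {
public:
    ScopedMemoryOverride(ArmThresholds& arm, std::uint64_t limitBytes) : arm_(arm) {
        arm_.raiseProcessMemoryLimit(limitBytes);
    }
    ~ScopedMemoryOverride() { arm_.restoreDefaults(); }
private:
    ArmThresholds& arm_;
};

The scoped wrapper is our own suggestion for illustration; as described in the text, the firmware simply makes the raise and restore calls explicitly at the start and end of the task.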
Specific resource monitoring

Many different types of resources and usage characteristics are checked with each iteration of the checking loop. Unless otherwise stated, the checks described in the following sections are performed on every loop.

Memory usage and trends

ARM checks that the amount of free memory (RAM plus swap space) is at least a minimum percentage. It also checks that the total amount of free space in the swap area is above a minimum threshold and that the total swap size is above a minimum. Together, these two checks ensure that the swap area is accessible and functioning as expected. In earlier versions, these checks were not made, so we did not become aware of swap problems until the total amount of free memory was too small. With these new checks, we can immediately identify a problem rather than waiting for the system to run low on memory.

The percentage of free memory in the Java** Virtual Machine (JVM**) [18] is checked. For each process, the amount of real and virtual memory used is checked. There is one threshold for reporting an error and a second set of higher thresholds that, if exceeded, cause ARM to terminate the offending process. In addition to the data directly collected, existing firmware in the SE or HMC then captures data from the terminating process. Also, the existing firmware may restart the application if it is critical enough. The thresholds for a process can be tailored to each process. Regular expressions are used to match the line in the properties file with the name of the process. This allows a process to be assigned unique thresholds while ensuring that all processes have a threshold.

In addition to checking on every loop, a snapshot of the usage is stored in a table periodically at a longer interval, called the snapshot interval (currently once every 6 hours). It is then analyzed for memory leaks. The algorithm is fairly straightforward: a process is deemed to be leaking if, over the last n snapshots (currently n = 8, which means the code is looking at the last 48 hours of data), its memory usage has increased m times (currently m = 7) and has never gone down. The one process that is skipped for this analysis is Java. A JVM loads in classes only as needed, so it takes a very long time (longer than 48 hours) for it to stop increasing in size. (See the section "JVM operational monitoring," below, for details on how ARM detects memory leaks in the JVM.)
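As a minimal sketch of this snapshot heuristic (not the ARM source; the function and container here are invented), the test over the last n snapshots might look like this:

#include <cstdint>
#include <vector>

// Hypothetical sketch: decide whether a process "looks leaky" from its last
// n memory-usage snapshots (taken every 6 hours). A process is flagged when
// usage rose at least m times and never decreased across those snapshots.
bool looksLikeLeak(const std::vector<std::uint64_t>& snapshots,
                   std::size_t n = 8, std::size_t m = 7) {
    if (snapshots.size() < n) {
        return false;               // not enough history yet
    }
    std::size_t increases = 0;
    // Examine only the most recent n snapshots (n - 1 sample-to-sample steps).
    for (std::size_t i = snapshots.size() - n + 1; i < snapshots.size(); ++i) {
        if (snapshots[i] < snapshots[i - 1]) {
            return false;           // any decrease clears the suspicion
        }
        if (snapshots[i] > snapshots[i - 1]) {
            ++increases;
        }
    }
    return increases >= m;          // e.g., 7 rises across 8 samples (48 hours)
}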
DASD usage: File size, number of files, and number of open files

For each DASD partition, the amount of free space and the number of free inodes are checked. (An inode is a Linux structure that stores basic information about a regular file; it is possible to have so many files that the file system runs out of inodes but still has room on the hard drive.) In addition, the number of file descriptors used by each process is checked, as is the total number of file descriptors in use by the entire system. Currently, the list of partitions checked is fixed in the code in order to ensure that a unique error code is reported for each partition. If a DASD partition has a problem, then a full list of all files and their sizes is collected for FFDC purposes. Then, special cleanup firmware is called. This cleanup firmware will automatically erase files in known temporary directories or with certain names that, by convention, indicate files that should be erasable.

CPU usage

The average CPU usage of a process and of the entire system is examined over the last n minutes (currently 10 minutes). There is a threshold for a process and a higher threshold for the system, and a problem is reported if either threshold is exceeded. Certain programs tend to make the CPU busy for a much longer period of time. These programs use APIs to completely disable the CPU checking while they execute. The override is valid only for a relatively short period of time (currently 1 hour). Thus, if the program forgets to restart the CPU checking, the checking will automatically resume. Additionally, a record of all overrides of CPU usage is kept in a log file for later analysis if necessary.
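A minimal sketch of a self-expiring override of this kind is shown below; the class and method names are invented and are not the ARM API.

#include <chrono>

// Hypothetical sketch of a CPU-check override that lapses on its own.
// If the caller never re-enables checking, isCpuCheckingActive() starts
// returning true again once the grace period (currently 1 hour) has elapsed.
class CpuCheckOverride {
public:
    explicit CpuCheckOverride(std::chrono::minutes gracePeriod = std::chrono::minutes(60))
        : gracePeriod_(gracePeriod) {}

    void disableChecking() {
        disabledAt_ = std::chrono::steady_clock::now();
        active_ = true;
    }
    void enableChecking() { active_ = false; }

    bool isCpuCheckingActive() const {
        if (!active_) {
            return true;
        }
        // The override silently expires after the grace period.
        return std::chrono::steady_clock::now() - disabledAt_ > gracePeriod_;
    }

private:
    std::chrono::minutes gracePeriod_;
    std::chrono::steady_clock::time_point disabledAt_{};
    bool active_ = false;
};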
Performance and dispatch testing

Not all code on the SE or HMC runs with the same priority level. This means that it is possible for a high-priority program to monopolize the CPU, thus starving other programs of CPU time. Also, DASD problems can sometimes result in programs not being dispatched (i.e., not being given processor time to run). In addition, firmware developers sometimes think the SE or HMC is running too slowly, either because it is encountering a timeout in its firmware or because they are examining FFDC data for another problem. In order to detect problems such as these during the time between full resource checks, ARM divides the full resource checking time (currently 60 seconds) into 1-second intervals. Every second, it wakes up, adds the elapsed time to the remaining number of seconds, and compares that against the expected time. If the computed time exceeds a threshold (currently 150%), then it concludes that dispatching has been delayed and it reports a problem. For example, suppose the code has iterated ten times with 50 iterations (i.e., 50 seconds) remaining. If the elapsed time since the start of the loop is 12 seconds, then no problem is detected because 12 seconds plus the remaining 50 iterations would yield a time of only 62 seconds, which is below the 150% threshold. However, if it woke up and found an elapsed time of 45 seconds, then an error would be reported because the elapsed time of 45 seconds plus the remaining time of 50 seconds totals 95 seconds, which is more than 150% of the nominal 60-second period.

When ARM wakes up, the time consumed by the previous periods of sleeping is not used in estimating how long the remaining loops might take, because we only want to know whether it looks as if there has been a problem. If there are, for example, 45 loops remaining, we know that these will take at least 45 seconds. What we really want to know is: given the length of time already taken, if we factor in the minimum amount of time the remaining loops will take, will we be over the threshold? If so, then we want to flag an error immediately, because we want to collect data as close to the onset of the problem as possible.

If a firmware developer thinks that the SE or HMC is running too slowly, there is an API call to trigger an immediate report of a performance-related problem, and appropriate data is collected. Not only is the current performance data collected, but so is the performance data associated with the previous main polling period. The delta between these two sets of data can be used to see which threads ran in the interval and, ideally, to identify whether a high-priority thread monopolized the CPU during the time period. Thus, if someone is looking at FFDC data for a different problem, such as a timeout, and they think that a slow system is involved, then the presence or absence of one of these problem reports can prove or disprove the hypothesis.
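As a sketch of this projection test (simplified, with invented names; the real check runs inside ARM's 1-second wake-up loop):

#include <chrono>

// Hypothetical sketch of the dispatch-delay projection. Called on each
// 1-second wake-up: 'elapsed' is the time since the 60-second loop started,
// 'secondsRemaining' is how many 1-second sleeps are still to come. Returns
// true when the elapsed time plus the minimum possible remaining time already
// exceeds 150% of the nominal period, i.e., dispatching looks delayed.
bool dispatchLooksDelayed(std::chrono::seconds elapsed,
                          int secondsRemaining,
                          std::chrono::seconds nominalPeriod = std::chrono::seconds(60),
                          double threshold = 1.5) {
    const double projected =
        static_cast<double>(elapsed.count()) + static_cast<double>(secondsRemaining);
    return projected > threshold * static_cast<double>(nominalPeriod.count());
}

// Example from the text: 12 s elapsed with 50 s remaining projects to 62 s,
// below 90 s (150% of 60 s), so no problem; 45 s elapsed with 50 s remaining
// projects to 95 s, above 90 s, so a problem would be reported.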
JVM operational monitoring

A key process monitored on the HMC and SE is the Java bytecode interpreter (i.e., the JVM) because this process provides many key functions, including the internal Web server and the servicing of all GUI functions selected by the user. Therefore, ARM ensures that this process does not have operational problems, such as hang conditions, or resource problems, such as running out of memory.

ARM provides the ability to add extensions via code supplied in a shared library. These extensions are listed in the configuration file and are coded to a designed interface. The interface allows the extension to be presented with the values in the active monitor's configuration file and allows the extension to return a free-format description of its status. The JVM monitoring support is provided as one of these extensions.

The JVM monitoring extension provides four functions by default. The first is to perform an HTTP request to the internal Web server running in the JVM. This call is made in order to measure the round-trip response time for a call between the active monitoring process and the Web server. If a response is not received in a configurable amount of time (the default is 10 seconds), then the request is assumed to fail. If there is a configurable pattern of these failures (the current default being eight consecutive failures), the JVM is assumed to be not operating normally, and FFDC information is obtained.

The second function performed by the JVM monitoring support is to use the infrastructure by which other non-JVM processes on the system communicate with the JVM. The ARM process uses this infrastructure to communicate with the JVM and to ask the JVM for its unique instance identifier. This request is performed for two purposes. First, the identifier is used so that the active monitor can track unique JVM invocations and ensure that any JVM problem is reported once and only once for that invocation. Second, the call is used to measure the response time for the round-trip processing of a request from the active monitor process to the JVM process. If a response is not received in a configurable amount of time (the default is 10 seconds), then the request is assumed to fail, and if there is a configurable pattern of these failures (the default is eight consecutive failures), the JVM is assumed to be not operating normally. Also, if there are configurable patterns of failure in both the first function and the second function (the default is four consecutive failures), the JVM is assumed to be not operating. In other words, failures in both functions provide stronger evidence that the JVM is unhealthy.

The third function performed is to measure the Java heap usage in the JVM. If the usage is over a configurable threshold, a log entry is taken along with a JVM heap dump [19]. Additional log entries are not taken for high usage unless the heap usage drops below a lower configurable threshold and then, once again, subsequently surpasses the threshold.
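To make the third function's re-arming behavior concrete, here is a minimal sketch of the two-threshold (hysteresis) logic, with an invented class name rather than the shipped extension code.

// Hypothetical sketch of the heap-usage check with hysteresis: a log entry
// (and heap dump) is produced when usage crosses the high threshold, and no
// further entries are produced until usage first falls below the low threshold.
class HeapUsageCheck {
public:
    HeapUsageCheck(double highPercent, double lowPercent)
        : high_(highPercent), low_(lowPercent) {}

    // Returns true when a log entry and heap dump should be requested.
    bool onSample(double heapUsedPercent) {
        if (armed_ && heapUsedPercent >= high_) {
            armed_ = false;          // suppress repeats while usage stays high
            return true;
        }
        if (!armed_ && heapUsedPercent < low_) {
            armed_ = true;           // re-arm once usage has genuinely recovered
        }
        return false;
    }

private:
    double high_;
    double low_;
    bool armed_ = true;
};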
The fourth function performed by the JVM monitoring support is to attempt to discover memory leaks in the JVM process that are outside of the Java heap, for example, in the JNI** (Java Native Interface) [20] extensions. To implement this support, a function was created, ProcessCheck(x, y, z, b), which returns true if the virtual address space of the JVM increases at least x times out of y checks, where the duration between checks in seconds is z and b is the minimum number of bytes by which usage must increase; false is returned if the specified increase does not occur. The algorithm is coded such that it can detect large memory leaks quickly, before they cause the system to become unstable, and at the same time it can detect smaller memory leaks if they are observed over a prolonged period, again before they can cause the system to become unstable. This is done while attempting to prevent false positives from occurring (see the section "False positives," below) and without significantly impacting system performance. For example, if a single ProcessCheck(48, 48, 60*60, 3*1024*1024) rule were defined, then a large memory leak could be detected, but it would not catch a small memory leak. Conversely, if a single ProcessCheck(24, 24, 6*60*60, 5*1024*1024) rule were defined, a large memory leak would probably cause the system to become unstable before the leak could be detected. The algorithm in Listing 1 attempts to catch large, medium, and small memory leaks as quickly as possible.
Listing 1  Graduated memory-leak detection algorithm.

if (ProcessCheck(48, 48, 60*60, 3*1024*1024)) {
    // 3 Meg per hour for 48 consecutive hours
    logErrorLog();
} else if (ProcessCheck(96, 96, 60*60, 1024*1024)) {
    // 1 Meg per hour for 96 consecutive hours
    logErrorLog();
} else if (ProcessCheck(16, 16, 6*60*60, 10*1024*1024)) {
    // 10 Meg per 6 hours consecutive for 4 days
    logErrorLog();
} else if (ProcessCheck(24, 24, 6*60*60, 5*1024*1024)) {
    // 5 Meg per 6 hours consecutive for 6 days
    logErrorLog();
} else if (ProcessCheck(40, 40, 6*60*60, 2*1024*1024)) {
    // 2 Meg per 6 hours consecutive for 10 days
    logErrorLog();
} else if (ProcessCheck(22, 24, 6*60*60, 5*1024*1024)) {
    // 5 Meg per 6 hours for 22 of 24 checks (over 6 days)
    logErrorLog();
} else if (ProcessCheck(75, 80, 6*60*60, 2*1024*1024)) {
    // 2 Meg per 6 hours for 75 of 80 checks (over 20 days)
    logErrorLog();
} else if (ProcessCheck(110, 120, 6*60*60, 1024*1024)) {
    // 1 Meg per 6 hours for 110 of 120 checks (over 30 days)
    logErrorLog();
}
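The implementation of ProcessCheck is not given in the paper. The following is a speculative sketch of the x-out-of-y growth test it describes; the class name, the history buffer, and the bookkeeping are invented, and the sampling interval z is assumed to be handled by the caller's schedule.

#include <cstdint>
#include <deque>

// Hypothetical sketch of ProcessCheck(x, y, z, b): looking at the last y
// samples of the JVM's virtual address space size (taken every z seconds by
// the caller), return true if the size grew by at least b bytes in at least
// x of the y sample-to-sample steps. This mirrors the described behavior,
// not the actual ARM code.
class ProcessGrowthCheck {
public:
    void addSample(std::uint64_t virtualBytes) { samples_.push_back(virtualBytes); }

    bool check(int x, int y, std::uint64_t b) const {
        if (static_cast<int>(samples_.size()) < y + 1) {
            return false;                       // not enough history yet
        }
        int growths = 0;
        // Look at the y most recent steps.
        const std::size_t start = samples_.size() - static_cast<std::size_t>(y);
        for (std::size_t i = start; i < samples_.size(); ++i) {
            if (samples_[i] >= samples_[i - 1] + b) {
                ++growths;
            }
        }
        return growths >= x;
    }

private:
    std::deque<std::uint64_t> samples_;
};

Under this reading, ProcessCheck(48, 48, 60*60, 3*1024*1024) in Listing 1 flags the JVM only when its address space grew by at least 3 MB in every one of the last 48 hourly steps.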
Reporting problems found

First failure data capture

The LIC package for the System z platform was designed to support FFDC. The goal is to collect all of the data necessary to fix a problem (hardware or LIC) anywhere in the system at the time the problem occurs, without having to rely on the ability to reproduce a problem in order to fix it. ARM was designed to meet this goal. On any detected problem, it logs all of the resource data to a file and also to an error log. Firmware in the SE or HMC then analyzes the problem and gathers the error log, the resource data file, and other data into a bundle that is sent to IBM for further investigation and service. Included in this bundle are many types of data, including various views of performance and recent trace data from the firmware. IBM service personnel have access to tools that sort and interpret the data. The tools can convert the data to a human-readable form, consolidate the information as a sequence of events sorted by time, and automatically attempt to find problems in the data.

One of our lessons learned is that it is extremely important to make sure that enough data is collected to understand the problem being identified and where in the customer's system the problem might originate. A second lesson learned on the SE and HMC is that the collection of data must be done as close to the detection of the problem as possible. In early implementations, when this was not always the case, we would sometimes see that the DASD was very full, but by the time the data was collected, the large files were gone. We had a similar problem with performance problems, where the offending process had terminated or had stopped doing whatever had led to the issue.

Long-term history and trend file

ARM maintains a long-term history and trend file. Every so often (every 6 hours), the program writes a snapshot of all resource information available. It prunes this file as needed to make sure that it does not exceed a size specified in its properties file. The size of this file was selected to allow for several weeks of data to be retained. This file is included with any problem reported and can be used to determine whether the problem began sometime in the past or whether it occurred suddenly. In addition to the periodic entries, if an error is detected, then a snapshot of the data at the time of the error is also appended to the file.

Automatic recovery actions

If the memory usage of a process exceeds a relatively high threshold, then the process is terminated. Even before ARM was implemented, the firmware contained code to restart a process that trapped or otherwise terminated abnormally, and it also contained support to restart all of the firmware code if a critical piece were to fail. ARM can, therefore, terminate a process if necessary; the process, or even the entire SE or HMC, is automatically restarted if necessary. If a DASD partition becomes too full, then there is a program that will automatically erase files that are deemed by convention to be erasable.
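As a small illustration of this layered response (the enum and names are invented; the actual limits live in the properties file), the per-process memory decision could be expressed as follows.

#include <cstdint>

// Hypothetical sketch: outcome of checking one process's memory usage against
// its two configured limits. Exceeding the lower limit only reports a problem;
// exceeding the higher limit also terminates the process (existing firmware
// may then restart it if it is critical).
enum class MemoryAction { None, ReportOnly, ReportAndTerminate };

MemoryAction classifyMemoryUsage(std::uint64_t usedBytes,
                                 std::uint64_t reportLimitBytes,
                                 std::uint64_t terminateLimitBytes) {
    if (usedBytes >= terminateLimitBytes) {
        return MemoryAction::ReportAndTerminate;
    }
    if (usedBytes >= reportLimitBytes) {
        return MemoryAction::ReportOnly;
    }
    return MemoryAction::None;
}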
Extensions

On every pass through the checking loop, ARM calls the extensions described in the properties file. The properties file controls the order in which these extensions are called and describes the module and function name to call. Each called function must conform to a standard template that gives it access to the options file (so that all thresholds are in one file) and provides a way for it to return resource information. This allows ARM to save a snapshot of monitored resource data when desired. Currently, the JVM monitoring described in this paper is implemented via an extension.
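The extension template is described only in prose; the sketch below shows one plausible shape for such an interface, with invented type and function names, simply to make the calling convention concrete.

#include <map>
#include <string>

// Hypothetical shape of an ARM monitor extension loaded from a shared library.
// ARM would pass in the key/value pairs from its properties file and expect
// back a free-format status string that can be saved with resource snapshots.
struct ExtensionResult {
    bool problemDetected;     // true if the extension wants a problem reported
    std::string statusText;   // free-format description of the extension's state
};

using PropertyMap = std::map<std::string, std::string>;

// Each extension exports a function with this signature; the properties file
// names the module and function so ARM can call it on every checking loop.
using ExtensionEntryPoint = ExtensionResult (*)(const PropertyMap& properties);

// Example extension body (illustrative only): always healthy.
ExtensionResult exampleExtension(const PropertyMap& /*properties*/) {
    return ExtensionResult{false, "example extension: no problems detected"};
}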
Practical experience

This section outlines a number of resource problems that we have investigated and fixed. In most cases, the monitor worked well, and many resource problems were found. However, we found instances where we needed to enhance the monitoring and FFDC.

Java thread leak

In one Java thread leak, Java code that made calls to the JNI was incorrect and did not properly terminate the native threads it created. Over time, a large number of these threads built up. ARM detected this when the number exceeded the threshold, and there was enough FFDC data to investigate a handful of specific Java classes and determine the source of the problem.
Java heap leak

Since the introduction of the Java heap support in ARM, there have been numerous detections of excessive heap usage. For all of these, a Java heap dump was requested and made available in the FFDC information collected for these types of problems. When this heap dump was processed with JVM tools, it was always clear which Java objects were being leaked, and usually an examination of the source code was enough to determine the problem. For some instances of these problems, examination of the source code was not enough, and trace buffers included in the FFDC information were used to determine the problem. All occurrences of this type of problem have been successfully diagnosed and corrected.

Priority inversion problem

In this classic priority inversion problem, some low-priority code obtained a resource, but then high-priority code ran and tried to obtain the resource. Because the high-priority program polled instead of yielding the processor, the low-priority program could not finish with the resource and release it. Eventually the high-priority program timed out, and the firmware resumed normal operations. This problem could be recreated only every so often (perhaps once every few hours), so it was hard to track down. It was finally identified by altering ARM to collect more of the historical CPU utilization kept by the OS, which allowed an expert to manually discover the thread that was consuming an unexpectedly large amount of CPU; after a brief code inspection, he found the problem. As a result of this problem, the CPU dispatching test support was added to ARM, and additional historical CPU utilization data from the OS was also permanently collected.

Memory leak

Before support was added to monitor long-term process memory usage, we had a problem from the field whereby a process was using too much memory and the memory leak was fairly severe. Within a few weeks, a process would leak enough memory to trigger the threshold for problem reporting and, within a few more weeks, would have completely run the system out of memory or exhausted its addressing space. While we did have enough information to find the problem, the lesson learned was to try to identify memory leaks sooner. As a result of this problem, support for long-term memory checking was added to ARM. Since its introduction, this support has successfully identified many memory leaks in the development and testing phase.
Running out of space on a DASD partition

When space began to run out on a DASD partition, ARM started to report a problem on that partition. Further investigation revealed that this was a design problem: the DASD partition was not large enough to handle what needed to be stored. The result was a decision to increase the size of the DASD partition prior to shipment to customers.

JVM hang detection

Before JVM monitoring support was added to ARM, there were several JVM hangs that went undiagnosed. This was because the system user, upon encountering the HMC or SE in an unusable state, would reboot the system and then report the problem to development. Since the system had been rebooted, there was little information about what caused the problem. After JVM monitoring support was added, users performed the same action of power recycling the system, but in all instances the JVM had been hung long enough that the monitor detected the problem, collected FFDC information, and saved the data to disk. This FFDC information proved to be very useful in diagnosing these types of problems.

False positives

One thing that all implementations of ARM must deal with is false positives, that is, incidents in which ARM reports a problem when no problem actually exists. One example relates to checking the long-term memory usage. The initial check was for no decreases over the last eight 6-hour samples and at least four increases. This was found to be too sensitive, so the threshold was changed to the current one, which is to report a potential leak if there are no decreases in the last seven 6-hour samples and at least six increases. Another example of a false positive occurred when checking the long-term memory usage of the JVM. Because the JVM loads a class the first time it is needed, it was found that it was still loading classes several days after it was started. This fooled the long-term analysis routine enough that we had to develop the more sophisticated JVM memory monitoring described above in the section "JVM operational monitoring."
Potential enhancements

Automatic learning

One potential enhancement to ARM is to have it learn what amount of memory usage is considered normal for each process and to report unusual changes, in order to minimize the number of false positives. This will require further investigation and experimentation.
Restarting the JVM

Another potential enhancement to ARM is in the area of monitoring the JVM. Currently, upon detecting a hung JVM, FFDC information is captured and a log entry is created. This is useful because it captures FFDC information very close to the time of the failure. However, it probably does not alert anyone about the problem until the JVM is restarted, because the code to send an alert about the problem runs in the hung JVM process. Therefore, a potential enhancement is to automatically restart the JVM upon detecting that the process is hung. After it is restarted, the problem analysis code detects that the problem was logged but not yet reported, and reports it. (Problem analysis is a component on the SE and HMC that analyzes all errors. It correlates these errors, decides which one or ones are the most important, and then proceeds to collect data related to these problems before transmitting the data to IBM for service.) In addition, the restart of the JVM should allow the HMC or SE to be usable once again. Because restarting the JVM is a potentially destructive action, there would be two levels of confidence that the JVM is hung: a lower level of confidence needed for capturing FFDC information and a higher level needed before restarting the JVM.
Conclusions

ARM has proven to be a useful technique for detecting design and programming defects. It has also proven to be extendable, monitoring different types of resources and detecting different types of problems than the original implementation did. We expect to continue to extend ARM as necessary in the future.
Acknowledgment

We thank Kurt Schroeder, who contributed some code to monitor the Java heap utilization of the JVM.

*Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both.

**Trademark, service mark, or registered trademark of Linus Torvalds, Microsoft Corporation, Nagios Enterprises, LLC, and Sun Microsystems, Inc., in the United States, other countries, or both.
References

1. IBM Corporation, System z Hardware Management Console Operations Guide, Version 2.10.0, Document No. SC28-6867, July 22, 2008; see http://www-1.ibm.com/support/docview.wss?uid=isg2bac11e0b02e3aa73852573f70056c860.
2. IBM Corporation, System z10 Enterprise Class Support Element Operations Guide, Version 2.10.0, Document No. SC28-6868, February 26, 2008; see http://www-1.ibm.com/support/docview.wss?uid=isg2e4d256a8a69d49da852573f7006c82db.
3. A. G. Ganek and T. A. Corbi, "The Dawning of the Autonomic Computing Era," IBM Syst. J. 42, No. 1, 5-18 (2003).
4. D. M. Russell, P. P. Maglio, R. Dordick, and C. Neti, "Dealing with Ghosts: Managing the User Experience of Autonomic Computing," IBM Syst. J. 42, No. 1, 177-188 (2003).
5. IBM Corporation, "An Architectural Blueprint for Autonomic Computing," white paper (June 2005); see http://www-03.ibm.com/autonomic/pdfs/AC%20Blueprint%20White%20Paper%20V7.pdf.
6. R. Sterritt and D. Bustard, "Towards an Autonomic Computing Environment," Proceedings of the 14th International Workshop on Database and Expert Systems Applications, Prague, Czech Republic, 2003, pp. 694-698.
7. D. F. Bantz, C. Bisdikian, D. Challener, J. P. Karidis, S. Mastrianni, A. Mohindra, D. G. Shea, and M. Vanover, "Autonomic Personal Computing," IBM Syst. J. 41, No. 1, 165-176 (2003).
8. IBM Corporation, Operations Guide for the Hardware Management Console and Managed Systems, Version 7, Release 3, Document No. SA76-0085-04, April 2008; see http://publib.boulder.ibm.com/infocenter/systems/scope/hw/topic/iphdx/sa76-0085.pdf.
9. SourceForge.net, procps - The /proc File System Utilities; see http://procps.sourceforge.net/.
10. IBM Corporation, Resource Measurement Facility User's Guide, Document No. SC33-7990-11, September 2006; see http://publibz.boulder.ibm.com/epubs/pdf/erbzug60.pdf.
11. Microsoft Corporation, Task Manager; see http://www.microsoft.com/technet/prodtechnol/windows2000serv/reskit/core/fneb_mon_oyjs.mspx?mfr=true.
12. Valgrind Developers, Valgrind; see http://valgrind.org.
13. GNU Project, Bash Reference Manual; see http://www.gnu.org/software/bash/manual/bashref.html.
14. Microsoft Corporation, Microsoft Management Console; see http://technet2.microsoft.com/windowsserver/en/library/329ce1bd-9bb4-4b63-947e-0d1e993dc27d1033.mspx?mfr=true.
15. Nagios Enterprises, LLC, Nagios Open Source Project; see http://www.nagios.org.
16. R. Sterritt, B. Smyth, and M. Bradley, "PACT: Personal Autonomic Computing Tools," Proceedings of the 12th IEEE International Conference and Workshops on the Engineering of Computer-Based Systems, Greenbelt, MD, 2005, pp. 519-527.
17. R. Sterritt and S. Chung, "Personal Autonomic Computing Self-Healing Tool," Proceedings of the 11th IEEE International Conference and Workshop on the Engineering of Computer-Based Systems, Brno, Czech Republic, 2004, pp. 513-520.
18. IBM Corporation, System z10 Enterprise Class System Overview, Document No. SA22-1084, June 2008; see http://www-1.ibm.com/support/docview.wss?uid=isg29ea3f936978cba27852573f900774732.
19. IBM Corporation, Java Diagnostics Guide 5.0; see http://publib.boulder.ibm.com/infocenter/javasdk/v5r0/index.jsp?topic=/com.ibm.java.doc.diagnostics.50/diag/welcome.html.
20. S. Liang, The Java Native Interface: Programmer's Guide and Specification, Prentice Hall PTR, Upper Saddle River, NJ, 1999; ISBN 0-201-32577-2.

Received January 18, 2008; accepted for publication June 4, 2008
Thomas B. Mathias  IBM Systems and Technology Group, 1701 North Street, Endicott, New York 13760 ([email protected]). Mr. Mathias is a Senior Engineer. He received his B.S. degree in electrical engineering from Ohio State University. He worked in System z hardware development and later in firmware development. He is a licensed Professional Engineer in the state of New York. He is coinventor of three U.S. patents, and he has one pending patent application. He has received numerous IBM awards.

Patrick J. Callaghan  IBM Systems and Technology Group, 1701 North Street, Endicott, New York 13760 ([email protected]). Mr. Callaghan is a Senior Engineer. He received his B.S. degree in computer science from the State University of New York at Buffalo. He worked on a variety of advanced technology projects and recently worked on the team developing System z firmware. He is the inventor or coinventor of three U.S. patents, and he has four pending patent applications. He has published five articles as IBM Technical Disclosure Bulletins and received numerous IBM awards.