OmniVision and Capacity Management
White Paper September 2008
© Copyright 2008 Systar. All rights reserved.
Abstract
Organizations are striving to operationalize their approach to managing capacity for both virtual and physical server investments. As their server environments grow more complex, IT operations teams face an important challenge: providing sufficient IT capacity to meet the demands of the business at all times while keeping costs and operational efficiency under control. Unfortunately, the traditional, non-scientific method of assessing and managing capacity for a few servers falls short of the intelligence, analytics, and scalability required to properly and quickly assess environments extending beyond 100 virtual or physical servers.
Hardware investments are often 50% of an IT organization’s annual budget. Because the investment is so substantial, IT organizations and their management teams are under constant pressure to:
• rationalize new purchases;
• maintain the right level of staff expertise to manage the environment;
• ensure the right management tools and processes are in place to provide proper levels of visibility and control; and
• optimize service performance in the existing landscape.
This paper discusses how OmniVision capacity management software addresses these challenges by providing clear insight into IT’s preparedness to meet the demands of the business across the ever-changing virtualized and physical server landscape. It explores not only how OmniVision can be used to manage capacity across the entire server environment, but also how it facilitates the capacity management process. Organizations today rarely have the time or expertise to introduce new ITIL practices like capacity management. To accelerate adoption of industry best practices for capacity management, OmniVision automates the processes and analysis sought by IT operations managers and system administrators. Although OmniVision is well suited for experienced capacity planners, it is also a powerful solution for non-specialists. This white paper describes the features and functionality of OmniVision Version 5.9.2.
Table of Contents
OmniVision Methodology
OmniVision Architectural Basics
OmniVision Metrics
Capacity Management: Quality Reports
  Introduction to Quality Reports
    Quality Reports as a Guide to Capacity Issues
    Quality Incidents
    Quality Severity
Capacity Management: Resource Reports
  Introduction to Resource Reports
    Resource Reports Classification System
    Resource Reports as a Guide to Capacity Issues
Capacity Management: Virtual Capacity Reports
  Introduction to Virtual Capacity Reports
  Virtual Capacity Trends and Forecasts
  Virtual Capacity Reports as a Guide to Capacity Issues
Capacity Management: Virtualized Performance Reports
  Introduction to Virtualized Performance Reports
  Virtualized Performance Metrics
  Virtualized Performance Reports as a Guide to Capacity Issues
Summary
  Component-In-A-Haystack Management
  OmniVision Domains of Value for Capacity Management
OmniVision Methodology
Organizations have both budgetary and service quality reasons to "right size" their resources against demands from the business and its key stakeholders. OmniVision addresses these issues by focusing on overload risks across the entire IT enterprise. Managing capacity can be achieved on a granular level focusing on individual elements within the IT infrastructure, but is even more powerful when applied to the IT server environments as a whole.
At a macro level, managing capacity aids IT managers in rationalizing new and optimizing current investments, prioritizing staff assignments, communicating priorities objectively to key stakeholders, and finding the proper balance of a complex array of assets used to deliver a better quality of service. At the granular level, managing capacity to minimize overload risk is a logical, automated extension of the methods used in past decades, before the scale of the IT enterprise became the limiting factor in that approach.
One basic premise of capacity management is to understand the balance of supply and demand that equates to risk levels. A resource is at risk of saturation when the amount of work placing demands on the resource exceeds its ability to respond optimally. As the imbalance between the demand and the ability to respond becomes greater, the risk for saturation of that resource increases.
OmniVision is a capacity management software solution built specifically for the management of hundreds to thousands of virtualized or physical systems within a distributed infrastructure. OmniVision’s powerful analytics are able to constantly assess performance, availability, and capacity saturation risks across the data center, something that is impossible to achieve through manual calculations or traditional enterprise performance reporting tools.
In virtualized environments, OmniVision capacity management focuses not only on the host systems, but also on virtual partitions, pools, clusters, and datacenters. OmniVision's approach is to minimize all aspects of system saturation risk by staying in
front of capacity issues without jeopardizing any of the benefits from proactive management.
OmniVision accomplishes this in two ways. The first is achieved by distilling the fundamentals of capacity management’s best practices into a multi-level algorithmic approach that distributes a certain amount of detection and intelligence across the infrastructure. The objective of this method is to isolate and identify the risks of saturation or performance degradation for any resource individually as well as within the context of the enterprise as a whole. The second pertains to delivering comprehensive, intuitive, and automated analysis to managers and staff who often do not have the time or skills to contribute to lengthy data collection, analysis, and interpretation efforts. OmniVision provides best-in-class capacity analysis for non-specialists and capacity management experts alike.
The OmniVision approach codifies the type of analysis that a skilled performance or capacity analyst performs, sometimes without conscious thought. If you present an experienced analyst with a set of performance indicators evolving over time, they can determine the level of saturation risk shown by the indicators. OmniVision can perform this same level of analysis, automatically, on thousands of systems and then provide the results in a series of easily managed web-based reports. OmniVision presents its analysis of capacity and performance through four types of report:
• Quality Reports
• Resource Reports
• Virtualized Capacity Reports
• Virtualized Performance Reports
Understanding the OmniVision approach to capacity management requires an understanding of the basic structure of OmniVision, the application of performance and saturation metrics and the methodology behind the evaluations in each type of report offered. The sections that follow will discuss how each type of OmniVision report guides IT operations managers through common capacity management issues.
OmniVision Architectural Basics
Seen at the elementary level, OmniVision has a three-tier architecture.
The first tier is responsible for the analytic evaluation and risk determination of data collected from each virtualized or physical system being monitored in the environment. At this tier, OmniVision collectors are assigned to each system or virtual machine. These collectors gather capacity-related information multiple times each minute. Hundreds of low-level, system-specific technical data points are used to construct a few normalized measurements used in analytic evaluations and risk determination. Data from heterogeneous systems is also normalized at this level.
Once an hour, the OmniVision collector compresses the capacity-related data and transmits a small block of metrics to the second OmniVision tier, which contains an active capacity database (CDB). This tier accepts the hourly metrics from each collector and stores them in the CDB. Where other performance reporting tools are limited in scale to 100 or 200 servers, OmniVision has been built for short- and long-term analysis of thousands of systems. Intelligence features inside the second tier perform two types of combination analysis on the collected information:
• the first type of analysis aggregates and stores the normalized metrics in logical groups reflecting business assignments (e.g., geographies, applications, lines of business) or technical organizational structures (e.g., clusters, pools, virtualized servers); and
• the second type of analysis organizes metrics into daily, weekly, and monthly trend categories.
The final tier automatically generates out-of-the-box capacity reports that IT operations managers and systems administrators can use to easily get a global or granular sense of the well-being of the entire virtual and physical server environment. OmniVision reports can also be personalized, using an ad hoc query-based interface, to fit any specific need for reporting on the capacity and performance of virtualized and physical server environments. Both automated reporting methods, out-of-the-box and intelligent ad hoc, use the OmniVision active capacity database as a source of information to build reports or audit information system resources.
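The hourly compression and group aggregation described above can be illustrated with a minimal sketch. It is not OmniVision's implementation: the function names, the sample layout (timestamp in seconds, normalized value), and the choice of a simple mean for group aggregation are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def hourly_rollup(samples):
    """Compress (timestamp_seconds, value) samples into per-hour avg/min/max.

    Hypothetical stand-in for the collector's hourly compression step.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts // 3600].append(value)  # hour index since the epoch
    return {
        hour: {"avg": mean(vals), "min": min(vals), "max": max(vals)}
        for hour, vals in sorted(buckets.items())
    }


def group_average(rollups_by_system, groups):
    """Average the hourly 'avg' of member systems for each logical group.

    'groups' maps a group name (e.g. a location) to a list of system names.
    """
    result = {}
    for group, members in groups.items():
        per_hour = defaultdict(list)
        for system in members:
            for hour, stats in rollups_by_system[system].items():
                per_hour[hour].append(stats["avg"])
        result[group] = {h: mean(v) for h, v in sorted(per_hour.items())}
    return result


# Several samples per minute collapse into one small record per hour.
samples = [(0, 20.0), (20, 40.0), (40, 60.0), (3600, 80.0), (3620, 100.0)]
rollup = hourly_rollup(samples)

rollups = {"a": rollup, "b": hourly_rollup([(0, 60.0)])}
by_group = group_average(rollups, {"florida": ["a", "b"]})
```

Storing only the hourly summary keeps the volume sent to the second tier small, which is the property the architecture relies on to scale to thousands of systems.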
OmniVision Metrics
Quickly assessing capacity, availability, and performance across large, complex data centers, for any given time, is impossible for any individual to achieve manually. Too many variables exist, including the number of servers, subsystems, configurations, applications, OS types, OS vendors, and workload behaviors.
To facilitate capacity evaluations, a number of other enterprise management software vendors rely on the collection, storage, and presentation of raw performance metrics. This approach expedites data collection, but continues to leave analysis of the variables mentioned above to the user. To make matters more complex, proper analysis requires knowledge of individual nuances within the raw data from system to system, OS to OS, and configuration to configuration. For example, memory utilization measures from Linux are not calculated the same way as measures from HP-UX, AIX, or Windows.
OmniVision facilitates the collection, analysis, and presentation of metrics across the data center. Multiple times per minute, OmniVision collects hundreds of raw performance metrics from each server. These raw metrics are then evaluated, analyzed, and consolidated into about 30 normalized metrics relevant to short- and long-term analysis. OmniVision’s analysis accounts for the nuances of each individual system in order to create the normalized metrics. Because the metrics are normalized by OmniVision, users can evaluate systems across a server farm, cluster, pool, etc., down to the individual server level. Without normalization, accurate evaluation of the capacity of environments with over 100 VMs or physical servers would be impossible.
From the normalized metrics, OmniVision derives saturation risk metrics. These are used to analytically evaluate the level of risk that an important resource (on any given host node) could become overloaded. The metrics provide qualitative information about resource saturation risk that is also normalized across differing operating systems and system platforms. Saturation metrics also quantify the level of risk to permit rational assessment of the relative importance of the risk.
At its heart, the process of producing a saturation risk metric is the same for each evaluated resource. Multiple technical metrics are compared against specific behavior thresholds. The results of these comparisons are weighted and consolidated to produce a standardized measurement. OmniVision expresses the resulting saturation risk levels as a value between 1 and 5, where a value of 1 represents low risk and a value of 5 represents high risk. For each saturation risk metric, OmniVision tracks three hourly statistics: the average, the highest, and the lowest values observed. OmniVision calculates three saturation risk metrics:
• CPU
• Memory
• Disk I/O
No other capacity management software vendor offers saturation risk metrics. The algorithms used to calculate and normalize these metrics are proprietary to Systar.
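Because the actual algorithms are proprietary, the following is only a sketch of the general shape described above: compare several metrics against thresholds, weight the results, and consolidate them onto a 1 to 5 scale. Every metric name, threshold, and weight here is a hypothetical value chosen for illustration.

```python
def saturation_risk(metrics, rules):
    """Return a saturation risk level from 1 (low) to 5 (high).

    rules: list of (metric_name, threshold, weight) tuples.
    A rule contributes its weight when the metric exceeds its threshold.
    """
    total_weight = sum(weight for _, _, weight in rules)
    exceeded = sum(
        weight for name, threshold, weight in rules if metrics[name] > threshold
    )
    fraction = exceeded / total_weight  # 0.0 (nothing exceeded) to 1.0 (all)
    return 1 + int(fraction * 4 + 0.5)  # round half up onto the 1..5 scale


# Hypothetical CPU rules; these are not OmniVision's real thresholds.
CPU_RULES = [
    ("cpu_util_pct", 85.0, 2),
    ("run_queue_per_cpu", 2.0, 1),
    ("context_switches_k", 50.0, 1),
]

busy = {"cpu_util_pct": 92.0, "run_queue_per_cpu": 1.2, "context_switches_k": 60.0}
idle = {"cpu_util_pct": 10.0, "run_queue_per_cpu": 0.1, "context_switches_k": 5.0}
maxed = {"cpu_util_pct": 99.0, "run_queue_per_cpu": 6.0, "context_switches_k": 90.0}

busy_level = saturation_risk(busy, CPU_RULES)
idle_level = saturation_risk(idle, CPU_RULES)
maxed_level = saturation_risk(maxed, CPU_RULES)
```

Using per-platform rule sets that all map onto the same scale is one plausible way to obtain a value that is comparable across operating systems, which is the property the text attributes to the saturation risk metrics.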
Capacity Management: Quality Reports
Introduction to Quality Reports
One of the most important aspects of any capacity management practice is to ensure that all systems are meeting, and have been meeting, the demand requirements of the business. OmniVision quality reports provide timely analysis of, and visibility into, any incidents where services have been impacted due to capacity constraints. Quality reports are designed for IT managers, executives, and capacity management teams who want timely, automated reports on critical capacity-related incidents.
In order to prioritize efforts to fix (or avoid) the most critical capacity problems, IT operations managers will benefit by first taking a bird’s-eye view of their system capacity. OmniVision quality reports can quickly tell managers where capacity problems exist, what the problem is, and who has been affected. Quality reports identify problems related to CPU, memory, I/O, and network performance, capacity saturation, and overall availability. The saturation threshold is evaluated from a set of indicators that assess whether the system is able to process information without excessive delays.
Beyond the immediate operational impact of any problem, OmniVision quality reports also inform the manager when a capacity incident occurred, along with an assessment of its severity and trend. To help determine the proper priority for fixes, the quality reports also inform managers of the duration and number of occurrences of each capacity incident.
To provide further clarity about an incident, the quality reports reveal historical trends leading up to it. Details of the capacity incident (e.g., I/O saturation) are also compared to previous days and weeks in order to identify whether the problem offers any historical clues or whether the incident is relatively new in nature. By assessing historical patterns of capacity behavior, IT operations teams avoid making short-sighted decisions that may minimize the immediate disturbance but fail to correct the root cause of an issue.
Quality Reports as a Guide to Capacity Issues
Typically, IT operations managers and capacity managers review the weekly quality reports to identify any ongoing capacity issues that require short-term action. Another common application of these reports is to review them in conjunction with traditional system performance and availability management alerts. For example, if an alert for high CPU utilization appears in the traditional performance management console, it can quickly be compared to OmniVision’s quality reports for the same system. By reviewing the OmniVision quality reports before any action is taken, administrators can better determine if the alert is related to a one-time event or to a longer-term trend of capacity constraints.
These OmniVision reports offer a short summary of service quality for each week using a group perspective. The OmniVision user may choose from a number of different service perspectives, including functional group, location, technology (platform/OS), and organization.
Figure 1 - Weekly Service Quality Report
Figure 1 shows a sample weekly quality report from OmniVision highlighting capacity status through easy-to-read "weather reports". Incidents from distributed systems are grouped by server OS. Managers can use the report to quickly pinpoint saturation issues or other related incidents affecting capacity. By drilling down within any of the troubled groups showing poor "weather conditions", managers are able to get more detail on trends, severity, and impact of the incidents.
Quality Incidents
OmniVision highlights systems and groups of systems based upon the detection and classification of service quality incidents. An incident can be either an interval when the risk of resource saturation was likely to impact optimum performance, or a period of application or server downtime.
Quality Severity
Saturation incidents are given a severity rating based upon the duration and scale of their overall risk during monitored production hours. OmniVision supports the identification of production hours through the use of service windows that are based upon time of day and day of week.
Figure 2 - Quality Reports Reflecting Saturation
Figure 2 shows a sample dashboard from an ESX server displaying I/O saturation. The reports on this dashboard show the duration of incidents on the ESX server, daily averages, and severity risk levels (ranked 1 – 5) at intervals over the last week. Similar reports are available for virtual machines running on each ESX server.
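To make the ideas of quality incidents, service windows, and severity more concrete, here is a small sketch. It assumes hourly saturation risk values on the 1 to 5 scale; the Monday to Friday 08:00 to 18:00 window, the risk floor of 4, and the severity labels are hypothetical choices, since the paper does not publish the actual rules.

```python
from datetime import datetime, timedelta


def in_service_window(ts, days=(0, 1, 2, 3, 4), start_hour=8, end_hour=18):
    """True if ts falls inside the assumed Mon-Fri 08:00-18:00 window."""
    return ts.weekday() in days and start_hour <= ts.hour < end_hour


def find_incidents(hourly_risk, risk_floor=4):
    """Group consecutive in-window hours with risk >= risk_floor."""
    incidents, current = [], None
    for ts, risk in sorted(hourly_risk):
        hit = in_service_window(ts) and risk >= risk_floor
        if hit and current and ts - current["end"] == timedelta(hours=1):
            current["end"] = ts  # extend the running incident by one hour
            current["hours"] += 1
            current["peak"] = max(current["peak"], risk)
        elif hit:
            current = {"start": ts, "end": ts, "hours": 1, "peak": risk}
            incidents.append(current)
        else:
            current = None  # gap or out-of-window hour closes the incident
    return incidents


def severity(incident):
    """Rate an incident by duration and scale (labels are illustrative)."""
    if incident["peak"] == 5 and incident["hours"] >= 3:
        return "critical"
    if incident["hours"] >= 2:
        return "major"
    return "minor"


hourly = [
    (datetime(2008, 9, 5, 11), 4),  # Friday, inside the window
    (datetime(2008, 9, 5, 12), 5),
    (datetime(2008, 9, 5, 13), 4),
    (datetime(2008, 9, 5, 16), 4),  # isolated one-hour incident
    (datetime(2008, 9, 6, 11), 5),  # Saturday, outside the window
]
incidents = find_incidents(hourly)
ratings = [severity(i) for i in incidents]
```

Filtering by service window first means that saturation during, say, a weekend batch run does not count against production service quality.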
Capacity Management: Resource Reports
Introduction to Resource Reports
Another critical aspect of capacity management practices is being able to quantify capacity levels. IT operations managers and capacity managers need to know how much capacity is available to meet the demands of the businesses they serve. Additionally, IT managers would like to know how much of their capacity is currently under-utilized or over-saturated. One benefit of knowing which servers are under-utilized is that targets for upcoming server consolidation or virtualization efforts can quickly be identified. A benefit of knowing which servers are over-saturated is that considerations for upgrades, workload reassignment, or performance tuning can be prioritized toward areas where change is needed most. Using OmniVision’s resource reports to communicate IT’s capacity priorities to line-of-business managers, procurement, or even the CFO can help to justify purchase requests or denials from an objective perspective.
OmniVision resource reporting provides IT and capacity managers with normalized information about heterogeneous server environments in order to help them objectively assess where and when to optimize server resources, justify new server investments, and plan for infrastructure growth.
The capacity utilization assessments performed by OmniVision are based on a minimum of the past three weeks of data collected. Reports often assess weekly intervals (9-week view) and monthly intervals (13-month view). Custom reports that analyze one or several servers over the last (n) weeks or last (n) months can also be developed. These reports study the behavior of specific servers through time to support an objective decision on whether upgrades or other capacity remedies are needed.
In addition to displaying the status of under-utilized and over-saturated systems, the resource reports describe other elements of capacity. Saturation of systems may be the result of recent changes to the server environment, including new machines, subsystem upgrades (e.g., CPU, memory), and new instances of existing applications or virtual machines. OmniVision’s resource reports identify these configuration or environmental changes, which may offer clues to growing or declining capacity constraints.
Resource Reports Classification System
Resource reports enable quick system-wide server capacity reviews using six levels of use classification. Use classification assesses the balance between the configuration of a server and the workload it supports. The six levels of classification are:
• Underused – the workload has no measurable impact on system resources
• Nearly underused – the workload is measurable, but system resources are underused
• Normal – the workload and system resources are in balance
• Normal with Risk – resources are in balance but the trend is worrisome
• Nearly overused – the system is experiencing a measurable level of saturation risk or is trending towards saturation
• Overused – the resources of this system are being saturated by the intensity of the workload
The system classification is performed weekly for all system groups in the enterprise and is reported at both the enterprise and group level.
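A use classification of this kind could be sketched as a simple decision ladder. The inputs (weekly average utilization, weekly average saturation risk on the 1 to 5 scale, and week-over-week risk trend) and every cut-off below are assumptions for illustration; the paper does not state how OmniVision assigns the levels.

```python
from collections import Counter


def classify(utilization_pct, avg_risk, risk_trend):
    """Map one system's weekly figures to a use classification level.

    All thresholds are hypothetical.
    """
    if avg_risk >= 4.0:
        return "Overused"
    if avg_risk >= 3.0 or (avg_risk >= 2.0 and risk_trend >= 0.5):
        return "Nearly overused"  # measurable risk, or trending to saturation
    if utilization_pct < 5.0:
        return "Underused"
    if utilization_pct < 20.0:
        return "Nearly underused"
    if risk_trend >= 0.2:
        return "Normal with Risk"  # in balance, but the trend is worrisome
    return "Normal"


def summarize(systems):
    """Count classification levels per group for a weekly summary."""
    summary = {}
    for group, util, risk, trend in systems:
        summary.setdefault(group, Counter())[classify(util, risk, trend)] += 1
    return summary


week = [
    ("Florida", 95.0, 4.5, 0.0),
    ("Florida", 70.0, 2.5, 0.6),
    ("California", 50.0, 1.5, 0.3),
    ("Corporate", 2.0, 1.0, 0.0),
]
report = summarize(week)
```

Checking the overload conditions first ensures that a saturated system is never reported as underused, whatever its average utilization.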
Resource Reports as a Guide to Capacity Issues
The automated analysis presented within the centralized resource reports from OmniVision acts as the eyes and ears of IT and capacity managers, allowing them to continually stay in front of capacity and workload issues. Typically, capacity managers use the weekly resource reports as a shopping list of areas where capacity issues will need attention, whether to expand, contract, or reallocate resources. The weekly reports show a global synopsis of system capacity, with drill-down reports available at several system levels (e.g., server, pool, cluster, VM, partition, zone) to identify IT server elements that are out of balance with workload demand.
Figure 3 - Global Resource Reporting System Classification
Figure 3 shows a sample of the high-level capacity classification for an enterprise with four location groups. Florida and California show the presence of overloaded systems (red), while Florida and the corporate location both display nearly overloaded systems (orange).
Capacity Management: Virtual Capacity Reports
Introduction to Virtual Capacity Reports
Server virtualization technology can be deployed quickly and with relative ease. Once deployed, virtualized server environments require continuous planning and careful monitoring. Properly maintaining the right level of capacity in these environments requires new tools, improved insight, enhanced skills, and updated processes. Complex configurations of virtual machines, host servers, pools, and clusters can quickly overwhelm any systems administrator trying to assess the current state of capacity within their virtualized server environments.
When the number of objects under management is small (e.g., under 50 virtual machines), capacity can be managed manually by a few individuals with the right technical skills. But the approach that works well for 50 virtual machines fails to be cost effective for 500 or 5000, due to the investment in skills, manpower, and infrastructure overhead needed to assess the many complex interactions that define the capacity of the environment. The challenge is multiplied when more than one virtualization technology or operating system platform is added to the mix.
To facilitate management of these complex environments, OmniVision capacity reports provide a clear picture of virtualized server capacity across the business, including both Wintel- and Unix-based virtualization technologies. Capacity reports allow IT operations managers to quickly assess how much virtualized server capacity is available, where it is located, and whether it is meeting the performance expectations of the business it serves.
Figure 6 - Virtualized Capacity Report
Figure 6 shows the current state of capacity in three virtualized data centers. Within each data center, a number of ESX server clusters are described by pool, server, and VM configurations. The report also shows the capacity status of the server clusters, where red status bars represent frequent saturation of elements, yellow represents occasional saturation, and green represents no capacity constraint.
Virtual Capacity Trends and Forecasts
Configurations of a virtualized environment can be in constant flux. Capacity planners and systems administrators may be under continuous pressure to meet physical-to-virtual consolidation project objectives, optimize virtual pools and server clusters to meet application demands, or add new VMs to support a growing user base. At the same time, business demands of users and applications may fluctuate due to seasonal activities, market promotions, mergers and acquisitions, or batch processing windows. Underused capacity can quickly or gradually become saturated.
Where capacity in virtualized environments approaches a saturation point gradually, capacity managers and systems administrators can track the trends using OmniVision’s capacity reports. Reports such as the one in Figure 7 reveal the total capacity of virtualized servers, pools, clusters, and VMs compared to their used capacity over a period of several weeks. In addition to showing historical trends for these virtualized elements, the reports also provide insight into future capacity usage.
Figure 7 - Weekly Virtualized Capacity Metrics
Capacity trends and forecasts are revealed in OmniVision’s “9-week” reports. Trend and forecast reports can be viewed by day or week for data centers, clusters, pools, and servers.
Virtual Capacity Reports as a Guide to Capacity Issues
Managing capacity within a complex virtualized environment needs to begin with an understanding of what is running where, how it is configured, and how well the existing infrastructure is handling the demands of the business. IT operations managers and systems administrators in charge of virtualized servers use OmniVision capacity reports on a regular basis to get a clear, consolidated picture of their environment. Using OmniVision’s capacity indicators, they can get a quick sense of where configurations are challenged, then instantly dive into the details of which infrastructure elements are at or near saturation risk. Using the capacity trend and forecast reports, managers can also see if virtualized infrastructure elements have been or will soon be at risk.
With this understanding, the team can quickly prioritize changes to the environment that will enhance its overall performance, including options like VM reassignments, new VM creation, assignment of new CPUs or memory, or reconfiguration of pools, clusters, and servers. Additionally, where reports show under-utilized virtual capacity, managers can prioritize efforts to increase the density of VMs on ESX servers or aim to further consolidate workloads from physical servers into the virtual environment.
Capacity Management: Virtualized Performance Reports
Introduction to Virtualized Performance Reports
Tracking down the cause of capacity-related performance issues can be a challenge in virtual server environments. Where traditional server environments had relatively stable configurations, virtual server environments introduce the flexibility of resource pools, server clusters, and limitless virtual machines. OmniVision performance reports start with a global view of virtual capacity that guides IT operations and capacity managers toward locations where performance has been impacted due to capacity constraints.
Figure 4 - Virtualized Performance Report
Figure 4 shows a sample virtualized performance report from OmniVision highlighting capacity status for multiple datacenters through easy-to-read "weather reports". Managers can use these reports to quickly pinpoint saturation issues or other related incidents affecting capacity within each virtualized data center, pool, and server. By drilling down within any of the troubled groups, managers are able to get more detail on trends, severity, and impact of the incidents affecting virtual machines, servers, pools, and clusters.
Virtualized Performance Metrics
Each datacenter displayed in the virtualized performance report contains a set of clustered ESX servers. Each of these clusters can represent a number of virtualized pools, servers, and VMs. OmniVision tracks incidents of saturation risk and performance degradation for each of these elements.
OmniVision performance reports provide a summary of the operational performance incidents for each CPU, I/O, and memory element inside each datacenter, down to the virtual machine level. This allows IT operations managers and their teams to quickly assess where resource saturation exists, the duration of any incident, and recent capacity trends related to that element (e.g., cluster, ESX server, VM, CPU). OmniVision virtual performance reports also profile capacity incidents by showing daily hour-by-hour views, daily averages from the last week, and averages over the past 30 days. The capacity profile allows staff viewing the reports to quickly assess whether a problem is new or reflects repeated saturation incidents.
Figure 5 - Virtualized Performance Metrics
Figure 5 shows saturation reports for CPU and I/O on an ESX server. Capacity incidents are summarized by hour, day, week, and month. The charts at the bottom of the report show the average, minimum, and maximum measures of saturation risk.
The performance reports also help administrators of virtualized systems to assess how pool and VM configurations are behaving within a single ESX server or cluster. Activity reports are generated to show:
• the inventory of VMs and pools running on a server
• pools running on a server by CPU consumption
• pools running on a server by memory consumption
• VMs running on a server by CPU consumption
• VMs running on a server by memory consumption
• VM events (e.g., VM “moon” was moved to server XYZ, VM “sun” was created on server MNO)
Virtualized Performance Reports as a Guide to Capacity Issues
IT operations managers and systems administrators in charge of virtualized servers often use OmniVision performance reports on a weekly basis to review where saturation incidents are repeatedly impacting performance or where virtualized infrastructure has reached capacity limits for long durations. By gaining a better idea of where the saturation of subsystem elements (CPU, I/O, memory) is impacting performance, virtual server administrators can begin to plan remedies to optimize performance in those areas. For example, if memory within a virtual server pool is constantly saturated from 11am – 2pm on Friday, administrators might want to allocate more memory to the pool overall, or just within a specific service window.
Summary
Component-In-A-Haystack Management Capacity management is a challenging task whenever the number of systems exceeds the logical grasp of any one performance or capacity analyst. The challenge is not so much a technical one – the components and construction of a good capacity study don't really change with scale – but more of a management one. How, given an enterprise with hundreds or thousands of physical and virtual servers, can you locate a server or virtual pool that most needs attention right now? Which out of all those ESX servers will be running out of CPU horsepower this week, and which ones are likely to run out next week, or next month?
That is the challenge. Assuming an organization has the talent and expertise to understand capacity issues once they are identified, the barrier to their success is this simple: a basic inability to be everywhere and to anticipate everything at once.
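The question of which servers will run out of capacity this week, next week, or next month can be illustrated with a simple trend extrapolation. The sketch below fits a least-squares line to a 9-week history of used capacity and ranks servers by the projected number of weeks until the trend reaches total capacity. The linear model, the server names, and the figures are assumptions for illustration; the paper does not describe the forecasting method OmniVision uses.

```python
def linear_fit(values):
    """Least-squares slope and intercept for values at weeks 0..n-1 (n >= 2)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def weeks_until_full(used_history, capacity):
    """Weeks after the last observation until the trend reaches capacity.

    Returns None when usage is flat or shrinking.
    """
    slope, intercept = linear_fit(used_history)
    if slope <= 0:
        return None
    last_week = len(used_history) - 1
    return max(0.0, (capacity - intercept) / slope - last_week)


def rank_by_urgency(histories, capacity=100.0):
    """Sort servers so the one projected to saturate soonest comes first."""
    forecasts = {name: weeks_until_full(h, capacity) for name, h in histories.items()}
    return sorted(
        forecasts.items(),
        key=lambda kv: (kv[1] is None, kv[1] if kv[1] is not None else 0.0),
    )


# Hypothetical 9-week histories of used CPU capacity, in percent.
histories = {
    "esx01": [50, 52, 54, 56, 58, 60, 62, 64, 66],
    "esx02": [40, 40, 40, 40, 40, 40, 40, 40, 40],
    "esx03": [80, 82, 84, 86, 88, 90, 92, 94, 96],
}
ranking = rank_by_urgency(histories)
```

A ranked list like this turns the "haystack" into a short worklist: the analyst starts at the top rather than inspecting every server.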
OmniVision Domains of Value for Capacity Management
OmniVision has a definite role to play in any technical organization maintaining 50 or more production servers or virtual machines. The part OmniVision can play takes four forms:
• Immediate planning support – OmniVision’s capacity database (CDB) can be accessed through a query-based interface to interrogate the enterprise data store of activity and risk metrics. The CDB can provide quick answers to questions such as "what systems are most at risk for CPU overload, anywhere in my enterprise, right now?"
• Short-term planning – OmniVision quality and virtual performance reports offer a top-down, exception management approach that identifies where capacity issues are impacting service quality. Capacity actions based upon the systems and incidents identified in these reports show a quick return on investment in terms of immediate solutions to current service issues.
• Mid-term planning – OmniVision’s resource and virtual capacity reports highlight areas and systems where capacity is falling short of demand. As virtual and physical systems become classified as nearly overloaded, it is generally time to start planning for expansion, balancing, or reallocation. Resource and virtual capacity reports provide early warning of trends in growth, giving enough time for proactive planning.
• Long-term planning – Longer-term planning with OmniVision can now be driven by organization and business growth plans, using the automated analysis to fine-tune the general plans into specific actions as needed. Capital purchases can be postponed until the capacity needs appear in OmniVision reports, rather than made in anticipation of unquantified needs.
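As an illustration of the kind of ad hoc question the CDB is meant to answer, the sketch below uses an in-memory SQLite database as a stand-in. The table name, columns, and sample rows are entirely hypothetical, and the use of SQL is an assumption: this paper does not describe the CDB schema or the syntax of OmniVision's query interface.

```python
import sqlite3

# Hypothetical stand-in for the capacity database (CDB).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE hourly_risk (
           system TEXT, location TEXT, hour TEXT,
           resource TEXT, avg_risk REAL, max_risk REAL)"""
)
rows = [
    ("esx01", "Florida", "2008-09-05T11", "cpu", 3.2, 5.0),
    ("esx02", "California", "2008-09-05T11", "cpu", 2.1, 3.0),
    ("db07", "Corporate", "2008-09-05T11", "cpu", 4.4, 5.0),
    ("db07", "Corporate", "2008-09-05T11", "memory", 1.0, 1.0),
]
conn.executemany("INSERT INTO hourly_risk VALUES (?, ?, ?, ?, ?, ?)", rows)


def most_at_risk(conn, resource, hour, limit=3):
    """Systems with the highest saturation risk for a resource in one hour."""
    query = """SELECT system, location, max_risk
               FROM hourly_risk
               WHERE resource = ? AND hour = ?
               ORDER BY max_risk DESC, avg_risk DESC
               LIMIT ?"""
    return conn.execute(query, (resource, hour, limit)).fetchall()


# "What systems are most at risk for CPU overload right now?"
top_cpu = most_at_risk(conn, "cpu", "2008-09-05T11", limit=2)
```

Because the risk values are normalized across platforms, a single ordering over all systems is meaningful regardless of operating system or location.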
As discussed at the beginning of this document, organizations have both budgetary and service quality reasons to "right size" their resources against business demand and need. OmniVision addresses these issues by focusing on overload risks across the entire IT enterprise. Managing capacity to minimize overload risk ensures the best performance seen from both service quality and investment vantage points.