Managing Capacity in VMware® Environments
Part I – Introduction and Concepts
February 2009
Overview

Server virtualization must reduce costs without increasing business risk, and the application of sound capacity management practices is paramount to achieving those cost reduction objectives. It is said time and again that server virtualization presents a paradigm shift in IT management, and particularly in capacity management. But what does this really mean?

Suppose you have migrated significant amounts of production workload to VMware, yet the server environment remains considerably underutilized compared to the cost reductions initially promised by the shift to virtualization. To reduce costs further, you need to raise the average utilization of your capacity without introducing substantial risk to the business. Capturing the planned investment returns of VMware without a mature understanding of capacity management is a real challenge.

Taking advantage of VMware's resource management capabilities – with a proper understanding of their capacity management implications – will enable you to further optimize the utilization of your virtualized capacity. Combining VMware's resource management with an automated capacity management solution will enable you to assess, plan, and optimize your VMware environments to meet the promised returns. Firms that continue to rely on the default resource settings expose the quality of their IT services to significant risk as workloads in their VMware environments expand; an IT organization that relies on the defaults will never fully optimize its VMware environment.

If you are aiming to maximize investment returns from your VMware cluster environments, the best place to start is a good understanding of the fundamentals. This initial white paper begins with a discussion of VMware resource management concepts and then focuses on how these concepts affect capacity management. The concepts in this white paper apply to VMware Virtual Infrastructure 3.x and, for simplicity, assume a Distributed Resource Scheduling (DRS) enabled cluster with a single resource pool. VMware Administrators, Architects, and Capacity Planners who are expanding from departmental to enterprise-wide deployments of VMware, while seeking to reduce costs through optimization of the virtual environment, will benefit most from reading this paper.
VMware Resource Management Concepts

Let's start by looking, at a high level, at how VMware handles resource management, and establish the key terms and controls. The key unit described here is the Virtual Machine (VM), since it is the granular element of work in a VMware Virtual Infrastructure (VI) environment. Each virtual machine has various sizing properties and policies, including configured size, reservations, limits, shares, and entitlements. We will visit each of these concepts to understand its impact on the capacity of the VM, the host, and the cluster in which it resides. Your understanding of these concepts is key to optimizing your VMware environment and has the potential to save your business hundreds of thousands of dollars in capital expenses.
Figure 1. VMs are configured with a number of different resource settings. These settings help the ESX host and cluster establish the amount of virtual CPU and memory available for each VM's operation. Please note: this graphic intentionally over-simplifies complex configuration policies; it is intended to provide a visual foundation for the discussion below.
Each VM has the following sizing properties:

Configured Size

The Configured Size is the number of virtual CPUs (vCPUs) and the amount of memory (in MB) assigned to the VM. These values represent the size of the "physical machine" presented to the VM and serve as a hard cap on resources unless a lower Limit is specified (see the Limit definition below). The Configured Size is specified at VM creation time and is not generally changed dynamically.
Reservation

A Reservation is the amount of vCPU and memory (in absolute units) that a VM is guaranteed, should it need it. It has been called "Min" and "Guaranteed" in the past, but Reservation is now the accepted term. Resources reserved for a VM are not allocated to it unless they are actually used – thus CPU and memory that are reserved, but not requested by a VM, are made available to other VMs. For example, suppose an application runs in a VM configured with one vCPU and uses an average of 50% of that vCPU over time, with spikes up to 80%. A Reservation of 0.5 vCPU (50%) will guarantee that the application gets its average CPU requirement and allows it to compete for CPU needed above that. This balances resource sharing against resource requirements.

A VM cannot be Powered On in a host unless there are resources available to meet the Reservations specified for the VM. The resources available for starting new VMs are calculated by subtracting overhead and the sum of all existing Reservations from the total capacity. The default setting is no Reservation. Without Reservations specified, it is possible for a host to become over-subscribed as a result of new VM additions or migrations. If the sum of the Reservations for all VMs exceeds capacity, the host is considered in violation of its Reservations. This condition will trigger a migration on a periodic Distributed Resource Scheduling (DRS) balancing operation. Although a single triggered migration can improve quality of service, frequent migrations can severely degrade performance within VMware clusters. DRS concepts are discussed in more detail later in the paper.

On a resource-constrained host, a VM should be provided at least its reserved amount of resources if requested; the inability to do this is a key indicator of an imbalance in a host or cluster. If a Reservation is not specified, a value of 0 is used: no resources are required for a Power On and none are guaranteed. In the section Impact on Capacity, we will examine how Reservation policies can affect capacity using three different configuration and usage scenarios.
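To make the admission control arithmetic concrete, the following minimal sketch (in Python, with hypothetical host and VM numbers) checks whether a new VM's CPU Reservation fits in a host's unreserved capacity. It illustrates the rule described above; it is not VMware's actual implementation, which also accounts for per-VM virtualization overhead and resource pool hierarchies.

```python
# Sketch of the Power On admission check described above.
# All numbers and names are hypothetical.

def can_power_on(host_capacity_mhz, host_overhead_mhz,
                 existing_reservations_mhz, new_vm_reservation_mhz):
    """Return True if the host can still guarantee the new VM's CPU Reservation."""
    unreserved = (host_capacity_mhz
                  - host_overhead_mhz
                  - sum(existing_reservations_mhz))
    return new_vm_reservation_mhz <= unreserved

# Example: a 2 x 3000 MHz host with 500 MHz of overhead, already hosting
# VMs that reserve 1500, 1000, and 2000 MHz.
ok = can_power_on(
    host_capacity_mhz=6000,
    host_overhead_mhz=500,
    existing_reservations_mhz=[1500, 1000, 2000],
    new_vm_reservation_mhz=800,
)
print("Power On admitted" if ok else "Power On rejected")  # admitted (800 <= 1000)
```

Note that a VM with the default Reservation of 0 always passes this check, which is why relying on the defaults never prevents a host from becoming over-subscribed.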
Shares

Shares are relative units that determine a VM's priority among sibling VMs (all VMs, in the absence of multiple resource pools) and are used to determine resource allocation under contention. VMware always uses a fair-share allocation scheme, so a VM gets a share of the available resources based on its virtual configuration: a single-vCPU machine gets half of the CPU resources of a two-vCPU SMP machine, and the same is true for memory allocations. However, the allocation can be modified by setting the Shares property for a virtual machine; a VM with 2000 Shares will get twice the CPU resources per vCPU as a VM with 1000 Shares. It is important to remember that Share properties affect the per-vCPU and per-MB fair-share allocations.

When visiting a prospect recently, we learned of a scenario in which 10 VMs were installed on a host. Without thinking much about the consequences, the business established equal fair-share allocations for all 10 VMs. Later, when 1 of the 10 was experiencing heavily degraded
performance, the company was challenged to diagnose the performance issue. As a short-term remedy, they decided to power down 5 of the 10 VMs and noticed that the troubled VM returned to acceptable performance levels. Where the 10 VMs had each been sharing 10% of the configured resources, the remaining 5 VMs were now each provided a 20% share. Although performance returned to normal for the impacted VM, the question remained of what to do with the 5 VMs that had been powered down. Taking advantage of the point system of Shares, new policies could be set to provide the required capacity, or, as an alternative, the VMs could be migrated to another host in the cluster considered to have sufficient capacity. When managing environments where several business applications access the same server resources, establishing Shares allows VMware architects and administrators to set priorities according to the importance of each application to the business.

Shares determine resource allocation under contention. Resources allocated to a VM based on Shares are bounded by the Reservation setting on the low end and the Limit on the high end. This concept is further illustrated in Figure 1 above.
Entitlement

Although referred to in the past as "EMIN", Entitlement is now the preferred term. Entitlements are the computed result of configurations, Reservations, Limits, and Shares, and establish the resource allocation given to each VM for its operation. The Entitlement always falls between the Reservation and the Limit, based upon the VM's Shares. The general measure of the capacity and health of a cluster is its ability to deliver the entitled resources to all VMs. Similarly, a good measure of a host's ability to provide expected capacity is its total entitlements versus its total capacity. Unlike Reservations, the host will not flag a violation when entitlements exceed capacity. By understanding the sum of all VM entitlements on a host and within a cluster, VMware architects and administrators will have a clear picture of the resources being made available to meet demands on their capacity.
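To illustrate how these properties interact, the sketch below computes CPU Entitlements for a set of contending VMs: each VM starts at its Reservation, the remaining host capacity is handed out in proportion to Shares, and no VM is allocated more than its Limit. This is a simplified model for illustration only; the redistribution scheme, names, and numbers are our assumptions, not VMware's published scheduling algorithm.

```python
# Illustrative model of how Reservations, Limits, and Shares combine into
# an Entitlement under contention. Not VMware's internal algorithm.

def compute_entitlements(vms, host_capacity_mhz):
    """vms: list of dicts with 'name', 'reservation', 'limit' (MHz), 'shares'.
    Returns {name: entitlement}; each value stays between the VM's
    Reservation and its Limit."""
    ent = {vm["name"]: float(vm["reservation"]) for vm in vms}
    spare = host_capacity_mhz - sum(ent.values())
    active = [vm for vm in vms if ent[vm["name"]] < vm["limit"]]
    while spare > 1e-6 and active:
        total_shares = sum(vm["shares"] for vm in active)
        handed_out = 0.0
        still_active = []
        for vm in active:
            grant = spare * vm["shares"] / total_shares   # fair share of the spare
            room = vm["limit"] - ent[vm["name"]]          # cannot exceed the Limit
            take = min(grant, room)
            ent[vm["name"]] += take
            handed_out += take
            if ent[vm["name"]] < vm["limit"] - 1e-6:
                still_active.append(vm)
        spare -= handed_out
        active = still_active
    return ent

# Example: a 6000 MHz host with three contending VMs.
vms = [
    {"name": "erp-db",  "reservation": 1000, "limit": 4000, "shares": 2000},
    {"name": "web-01",  "reservation": 500,  "limit": 2000, "shares": 1000},
    {"name": "test-vm", "reservation": 0,    "limit": 3000, "shares": 1000},
]
print(compute_entitlements(vms, host_capacity_mhz=6000))
# -> {'erp-db': 3250.0, 'web-01': 1625.0, 'test-vm': 1125.0}
```

Summing the resulting Entitlements for all VMs on a host, and comparing that sum to the host's capacity, gives exactly the host-level health measure described above.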
Limit

A Limit serves as a hard cap on resource allocation for a VM. In some cases, Limits may be defined to represent allocations based upon service agreements. Limits can provide both benefit and burden. For example, a VM might have 2 GB of memory configured but a Limit set to 1 GB for business priority reasons. In general, Limits should be used only after careful consideration of the capacity impact. In a recent client visit, we reviewed reports for a cluster configured with 32 GB of memory. Looking at capacity trends over a 9-week period, we noticed that 19 GB of the 32 GB was consistently used. With 92 VMs in the cluster, each VM had been blindly assigned a memory Limit of 4 GB in order to provide protection in the over-subscribed environment. In one instance, the memory Limit was insufficient for a key application that required more than 4 GB during a heavy processing period, resulting in excessive disk swapping. End-user performance was degraded for over three hours each week as a result of this Limit. If a Limit is not specified, then the Configured Size is the Limit.
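A simple way to catch the kind of problem described in this example is to compare each VM's observed peak memory demand against its configured Limit. The sketch below is a hypothetical illustration; the sample data, field names, and the 90% headroom threshold are assumptions, not the output of any particular tool.

```python
# Hedged sketch: flag VMs whose observed peak memory demand approaches or
# exceeds their configured Limit, a likely cause of swapping.
# Sample data and the headroom threshold are hypothetical.

def vms_at_risk(vm_samples, headroom=0.9):
    """vm_samples: {vm_name: {'limit_mb': int, 'peak_demand_mb': int}}.
    Returns the names of VMs whose peak demand exceeds headroom * Limit."""
    return [name for name, s in vm_samples.items()
            if s["limit_mb"] and s["peak_demand_mb"] > headroom * s["limit_mb"]]

samples = {
    "batch-app-01": {"limit_mb": 4096, "peak_demand_mb": 5300},  # will swap
    "web-frontend": {"limit_mb": 4096, "peak_demand_mb": 2100},
}
print(vms_at_risk(samples))  # ['batch-app-01']
```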
From the resource management concepts described above, you can see that VMware capacity management is more complex than traditional capacity management of non-virtualized server environments. In a non-virtualized server environment, the limits of CPU and memory are established by the physical representation of those resources (e.g., 4 processors = capacity of 4 processors). In the VMware environment, new settings, metrics, and concepts determine the total, used, and available capacity of a VM (e.g., 2 vCPUs may equate to 20% of a physical CPU).
Distributed Resource Scheduling

Distributed Resource Scheduling (DRS) provides a watchful eye over VMs in clustered environments. With the intention of providing each VM its required resources, DRS carefully tracks workload activity on each host within a cluster. When unsatisfactory conditions are observed for a VM on one host, DRS assesses other hosts within the cluster where conditions may be more attractive. If DRS finds a suitable location, it facilitates the VM move, known as a VM migration. DRS has three goals (all with respect to a DRS-enabled cluster):
• Load balancing (cluster balance factor)
• QoS enforcement (shares, reservations, and limits)
• Policy enforcement (admin roles, power management, access control, maintenance mode, etc.)
Figure 2. DRS manages load balancing of VMs in order to maintain expected quality of service levels. Source: VMware.
DRS uses two primary mechanisms to achieve these goals:
• Periodic Invocation
  o Load Balancing
  o Host evacuation (used in High Availability configurations, or in conjunction with dynamic power management policies)
  o Reservation balancing
  o Affinity/Anti-affinity (rules to keep certain VMs together or apart)
• VM Initial Placement
  o VM Power On
A common approach is to use the "Automatic" setting of DRS for each cluster and to set the migration threshold to "Moderately Aggressive". At first these settings may trigger VM migrations, but migrations tend to settle down after a while if workload demands are similar for the VMs in the cluster. However, since most clusters contain a variety of workload profiles and demand patterns, there tend to be more migrations than optimal over time. In order to get a better understanding of how DRS impacts capacity management, let's take a closer look at how it works.

Using DRS, VMware Administrators can specify automatic or manual initial placement of a VM in a cluster. Placement has a large impact on capacity since resources are shared. DRS obeys all rules, such as affinity/anti-affinity and Reservations, and then selects the most lightly loaded host for the new VM. Although it is logical for DRS to place a VM on the most lightly loaded host, this may not always be the best choice – especially when considering whether certain VMs should be placed on the same host (note: VMware advises using affinity/anti-affinity policies only sparingly). DRS's lightest-load policy does not consider how well a new VM's workload will fit with existing VM workloads. Nor does it consider strategies like Distributed Power Management (DPM), where you may want to minimize the load on a host so it can be evacuated and powered off during low-demand periods.

On a periodic basis, a DRS invocation occurs and does the following:
• DRS computes an Imbalance Metric using a formula that effectively compares the variance in all hosts' ability to deliver on their entitlements (see the sketch after this list).
• If the Imbalance Metric is greater than a threshold (set as the migration aggressiveness factor), migration from highly loaded to less loaded hosts begins.
• If any host is in violation of its Reservation policy (more reservations than capacity) or in violation of affinity or anti-affinity rules, this causes migrations as well.
• DRS also calculates which migrations need to be made and then looks ahead to see whether the desired migration will trigger additional movements in order to fully optimize the environment.
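The exact formula and thresholds are internal to DRS, but the sketch below captures the idea behind the Imbalance Metric: compute a per-host load ratio from the sum of VM entitlements versus host capacity, take the standard deviation of that ratio across hosts, and trigger rebalancing when it exceeds a threshold tied to the migration aggressiveness setting. The use of standard deviation, the threshold value, and all names and numbers are assumptions for illustration, not VMware's actual algorithm.

```python
# Illustrative sketch of a DRS-style periodic imbalance check.
# The real DRS formula and thresholds differ; everything here is assumed.
from statistics import pstdev

def imbalance_metric(hosts):
    """hosts: list of dicts with 'entitled_mhz' (sum of VM entitlements)
    and 'capacity_mhz'. Returns the standard deviation of per-host load."""
    loads = [h["entitled_mhz"] / h["capacity_mhz"] for h in hosts]
    return pstdev(loads)

def needs_rebalancing(hosts, threshold=0.1):
    """A more aggressive migration setting corresponds to a lower threshold."""
    return imbalance_metric(hosts) > threshold

cluster = [
    {"name": "esx01", "entitled_mhz": 9500,  "capacity_mhz": 12000},
    {"name": "esx02", "entitled_mhz": 4000,  "capacity_mhz": 12000},
    {"name": "esx03", "entitled_mhz": 11000, "capacity_mhz": 12000},
]
print(round(imbalance_metric(cluster), 3), needs_rebalancing(cluster))
# -> 0.251 True: the spread across hosts is wide enough to trigger migrations
```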
DRS invocation allows companies to maximize the utilization, and thus the capacity, of all resources in a cluster (if the VMs and DRS have been configured correctly). We should also note that VMware recommends an "Auto" setting for DRS, but also recommends that Administrators override the "Auto" setting at the VM level for important application workloads.

As you can see, VMware's DRS is designed to deliver the best capacity for a given set of hardware and workloads, and it can do this job well. However, its assessment of capacity takes a short-term focus, initiating migrations as needed. DRS does not consider the longer-term view of capacity, taking into account historical trends or projecting future growth that may impact VM or cluster performance. In order to reduce costs through optimization of the VMware environment, it is critical to combine the short-term perspective of DRS with the longer-term perspective and detailed analysis offered
by automated capacity management solutions. Where automated capacity management solutions help to provide sustainable, high-performance environments (especially through periods of rapid growth), DRS is best suited to handling occasional shifts in workload that are difficult to plan for.
Impact on Capacity

Now that we have reviewed the fundamentals of VMware resource management and Distributed Resource Scheduling, we can examine how these concepts impact capacity management. We will use three examples to highlight common VMware configurations, their uses, and their impact on capacity:
• Default: no Reservation, no Limit, normal Shares
• Conservative: high Reservations, Share priority matches business priority
• Optimized: Reservation matched to historical baseline/trends, Share priority matches business priority
Default Settings

Using the Default settings is the most common approach to VM configuration, especially in the early stages of VMware adoption. The Default approach uses the standard settings for VM resource management: no Reservation, no Limit, and normal Shares. In the early stages of VMware deployments, experienced Administrators with a history in physical server environments hesitate to tweak the default VMware settings too much as they strive to become comfortable with operating the new virtualized environment. For this example, we will assume that all VMs have been assigned single-vCPU, 1 GB configurations.

In the Default environment, a VM can always be added to the host and Powered On, because Reservations will never exceed the capacity of the host (n x 0 = 0); technically, if you pushed this configuration to its limits, you could stack VMs on the host until its storage filled up. Resources will be allocated to a VM when needed, providing every VM as much resource as it needs. When workload demand for resources begins to exceed available capacity, allocation under contention begins. As noted earlier, Shares determine resource allocation under contention. Since all Shares are the same and the configurations are the same, VMware's fair-share allocation scheme will allocate available resources to each VM equally, and entitled resources will be the same for each VM. With no real allocation scheme, each new VM reduces each existing VM's allocation to 1/numVMs of the total. For example, if 40 VMs are located on a host, each VM will receive 1/40 of the CPU and memory capacity available to the host. The host is not in Reservation violation but will quickly become overcommitted on required resources, given enough demand.

Without the ability to clearly quantify VM resource utilization trends, Administrators and Architects have stacked fewer VMs per host in order to avoid contention for vCPU and memory. This
type of low density configuration helps to maintain service levels, but results in higher VMware costs than environments that are properly balanced for capacity requirements. Capacity and flexibility have been prioritized higher than stability. For those organizations striving to reduce costs and optimize the performance of their VMware environments, this approach is not sustainable.
Conservative Settings

At the opposite end of the spectrum, consider the case where every VM is given an overly high Reservation setting, there is a mix of configurations, and each VM is given a Shares property that matches its priority to the business (or at least to the application). This is a very conservative approach and is the second most common technique we have seen – especially in organizations that are still "getting their feet wet" with VMware in production environments.

In this scenario, each VM will run quite smoothly, even above its normal load, since it is guaranteed a very large amount of resources. The host will stop allowing new VMs via Admission Control when its total Reservations exceed available capacity. Since the VMs are over-sized, the host will be limited in how many new VMs can be started. Although quality of service is optimized for the few VMs running in this environment, resource sharing is not optimized for the applications running on the VMs. Additionally, due to the bloated size of the VMs in this configuration, the load-balancing mechanism of DRS will be limited in effectiveness: as DRS searches for open space within a cluster, the high Reservation requirements of the VMs will limit potential destination hosts.

Although the density of the data center may have increased with the Conservative configuration, Architects will not be able to achieve the VM density and resource utilization objectives initially promised by VMware. Capacity and flexibility have been traded for stability. For organizations striving to reduce costs and optimize capacity as the footprint of VMware usage expands within their IT organization, this approach is also not sustainable.
Optimized Settings

In our third scenario, every VM is given an accurate Reservation setting, there is a mix of configurations, and each VM is given a Shares property that matches its priority to the business (or at least to the application). With the Optimized approach, the host will stop allowing new VMs via Admission Control when its total Reservations, including those of a new VM, would exceed capacity; thus the host cannot be over-subscribed at the Reservation level. Because the VM sizing is accurate, minimal resources are wasted. Resources will be allocated to VMs as they are required until their Reservation is met. Beyond that, if VMs require more resources when demand spikes, the host will strive to provide each VM with its entitled resources (VMs contend for these extra resources based on Shares priority).
Although this approach may seem like the most logical configuration choice, it is not often pursued, due to a lack of understanding of the historical capacity requirements of the applications being placed on VMs in the data center. For example, to set an accurate Reservation of 0.5 vCPU, VMware Architects must first have established a baseline for capacity requirements during normal operations as well as during periods of peak demand. Baselines, peaks, and performance trends can be assessed easily with automated capacity management solutions. Following the capacity assessment, VM demand for these resources can then be balanced across the hosts and clusters in the data center.

This approach is less common among Architects planning enterprise-wide deployments of VMware, but it is essential to keeping costs under control as the environment expands. In addition to reducing the hardware costs required to support the VMware environment, it also helps reduce server administration and operational costs. For organizations targeting enterprise-wide expansion, the Optimized approach is preferred.
Recommendations for Policy Setting

In environments where IT organizations are still "getting their feet wet" with VMware and growth is not significant, Default or Conservative settings within DRS and its VM policies are sufficient. Where IT organizations have decided to dramatically expand their VMware environments across the enterprise, and cost considerations are a constant concern, it is very important to set the Reservation and Shares for each VM carefully – avoiding the risky default settings. For organizations looking to reduce costs while managing expanding VMware environments, we offer the following guidelines:
• Taking the default Reservation setting of 0 is not a good practice. It is imperative to set reasonably accurate Reservations for VMs for VMware resource management and DRS to work effectively.
• Setting the Reservation for a VM too low allows good resource sharing but can trigger resource contention as more VMs are added – even during periods of normal demand.
• Setting Reservations too high can prevent the host from accepting a new VM even when excess capacity is available. It also allows a VM to acquire too many resources too easily when demand spikes, impacting other workloads unfairly.
• Thus the goal is to size the Reservation to handle the "normal" workload with guaranteed resources, and then let the VM contend for resources to handle peaks based on its priority. For now, one technique is to set the Reservation around the 50th percentile of historical resource utilization; drop below that to be more aggressive, or go above it to be more conservative (see the sketch after this list). This works best for sustained workloads – we will look at peaky and other workload types in an upcoming paper.
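As a minimal sketch of the percentile guideline in the last bullet, a starting Reservation can be derived directly from historical utilization samples. The sample data and helper name below are hypothetical; Python's standard statistics module stands in for whatever tooling actually collects the utilization history.

```python
# Sketch of the percentile-based Reservation sizing guideline above.
# Sample data and the helper name are hypothetical.
from statistics import quantiles

def reservation_from_history(cpu_samples_mhz, percentile=50):
    """Return the given percentile of observed CPU usage (MHz), to be used
    as the VM's Reservation. A lower percentile shares more aggressively;
    a higher percentile guarantees more conservatively."""
    cuts = quantiles(cpu_samples_mhz, n=100)  # 1st..99th percentile cut points
    return cuts[percentile - 1]

# Example: a sample of hourly CPU readings (MHz) for one sustained workload.
history = [300, 350, 420, 500, 520, 610, 640, 700, 900, 1500, 2200]
print(reservation_from_history(history))      # 50th percentile -> 610
print(reservation_from_history(history, 75))  # more conservative -> 900
```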
Summary

In this white paper, we have seen that capacity management is very different for a VMware DRS-enabled cluster. We have examined the capacity impact of the sizing concepts that VMware uses for
resource allocation, such as Reservations, Shares, and Limits. Finally, we have discussed what DRS can and cannot do to manage capacity. We can conclude this introductory white paper with some basic ideas that will help you make the best capacity management decisions possible at this point:
• Remember that it is imperative to set the Reservation property for your VMs. Relying on the default settings can adversely impact a number of VMware capabilities by:
  o Taking away some of the effectiveness of Admission Control, allowing any and all VMs to be added to a host without regard to available capacity.
  o Requiring a VM to compete for resources at all levels of demand, instead of being guaranteed the resources it needs to handle "normal" demand.
  o Weakening the effectiveness of DRS load balancing.
• Do not expect DRS to completely solve your capacity management issues. DRS is very good at balancing load across the hosts in a cluster and enforcing policies.
• DRS works best when paired with automated capacity management analysis, reporting, and proactive actions.
• As VMware Administrators and Architects first begin to work with VMware, Default and Conservative resource management settings and policies are acceptable, because optimization is not an early goal in the technology adoption cycle.
• As Administrators and Architects aim to expand their VMware environments while keeping costs and quality of service under control, Optimized settings and policies are required. Without Optimized settings, cost reduction objectives will not be achievable.
In the next white paper in this series, we will offer some new definitions of "Effective Capacity" for a VMware cluster and discuss concepts for optimal management in VMware High Availability (HA) environments. We will also begin to explore the concept of workload profiling and how it can be invaluable in setting the Reservation property and placing workloads proactively to minimize migrations.
About Systar
Systar is a leading worldwide provider of performance management software. Systar’s OmniVision product suite enables customers to achieve the optimal alignment between IT resources and business requirements in both distributed and virtualized server environments. Systar’s proven capacity management solutions deliver the full benefits of virtualization by enabling customers to gain visibility into these complex environments, tune for optimal capacity and move business-critical applications into production with full confidence.
United States
8618 Westwood Center Dr., Suite 240, Vienna, VA 22182
Tel. +1 703-556-8400  Fax +1 703-556-8430
[email protected]

France
171 bureaux de la Colline, 92213 Saint-Cloud Cedex
Tel. +33 (0)1 49 11 45 00  Fax +33 (0)1 49 11 45 45
[email protected]

United Kingdom
Systar Ltd, Ground Floor Left, 3 Dyer's Buildings, London EC1N 2JT
Tel. +44 2072 692 799  Fax +44 2072 429 400
[email protected]

Germany
Mergenthallerallee 79-81, D-65760 Eschborn
Tel. +49 211 598 8520
[email protected]

Spain
Centro de Negocios Eisenhower, C/ Cañada Real de las Merinas, 17, Edificio 5 - 1º D, 28042 Madrid
Tel. +34 91 747 88 64  Fax +34 91 747 54 35
[email protected]
Systar, BusinessBridge, OmniVision, BusinessVision, ServiceVision, WideVision and Systar’s logo are registered trademarks of Systar. VMware, ESX Server, and all other brand names, product names and trademarks are the property of their respective owners.