Beyond performance testing part 1: Introduction

Scott Barber, Performance testing consultant, AuthenTec

Summary: What happens after performance test results have been collected? This article introduces a series focused on exploring what test results mean and what can be done to improve them.

Performance testing is the discipline concerned with determining and reporting the current performance of a software application under various parameters. My series "User experience, not metrics" discusses and demonstrates how to do it, with the help of the IBM® Rational Suite® TestStudio® system testing tool. Computers excel at this stuff, crunching numbers and displaying answers in neat ways. But there comes a time after the tests are run when someone who's reviewing the results asks the deceptively simple question, "So what, exactly, does all this mean?!?" This point beyond performance testing is where the capabilities of the human brain come in handy.

"Computers are good at swift, accurate computation and at storing great masses of information. The brain, on the other hand, is not as efficient a number cruncher and its memory is often highly fallible; a basic inexactness is built into its design. The brain's strong point is its flexibility. It is unsurpassed at making shrewd guesses and at grasping the total meaning of information presented to it," writes British journalist Jeremy Campbell in Chapter 16 of Grammatical Man: Information, Entropy, Language, and Life (1982).

"Beyond performance testing" will address what happens after initial test results are collected, the part it takes a human brain to accomplish. We'll explore what performance test results mean and what can be done to improve them. We'll focus on performance tuning, the subset of performance engineering that complements performance testing. We'll examine the process by which software is iteratively tested using Rational Suite TestStudio and tuned with the intent of achieving desired performance, by following an industry-leading performance engineering methodology that complements the IBM® Rational Unified Process® approach.

This first article is intended to introduce you to the concept of performance engineering that underlies the series and to give you an overview of the articles that follow. "Beyond performance testing" builds on and is a companion to the "User experience, not metrics" series, so you should be familiar with the topics presented in that series before you begin reading this one.

About performance engineering

Performance engineering is the process by which software is tested and tuned with the intent of realizing the required performance. This process aims to optimize the most important application performance trait, user experience. Historically, testing and tuning have been distinctly separate and often competing realms. In the last few years, however, several pockets of testers and developers have collaborated independently to create tuning teams. Because these teams have met with significant success, the concept of coupling performance testing with performance tuning has caught on, and now we call it performance engineering. Let's begin by exploring this concept at a high level.

The performance testing part of performance engineering encompasses what's commonly referred to as load, spike, and stress testing, as well as validating system performance. You may also have heard other terms used to describe aspects of what I'm calling performance testing.
Regardless of the terms you may use to describe it, performance can be classified into three main categories:

• Speed -- Does the application respond quickly enough for the intended users?
• Scalability -- Will the application handle the expected user load and beyond?
• Stability -- Is the application stable under expected and unexpected user loads?
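To make these three categories concrete, here's a minimal sketch of how each might be phrased as a pass/fail check against collected test results. It's written in Python rather than TestStudio, and every function name, threshold, and sample value is an illustrative assumption, not a figure from this series:

```python
# Hypothetical checks for the three performance categories; all thresholds are placeholders.

def speed_ok(response_times_s, threshold_s=5.0, percentile=0.95):
    """Speed: does the chosen percentile of measured response times fall within the threshold?"""
    ordered = sorted(response_times_s)
    index = min(int(percentile * len(ordered)), len(ordered) - 1)
    return ordered[index] <= threshold_s

def scalability_ok(sustained_users, required_users=500):
    """Scalability: did the system sustain at least the required user load?"""
    return sustained_users >= required_users

def stability_ok(errors, total_requests, max_error_rate=0.01):
    """Stability: did the error rate stay acceptable under expected and unexpected loads?"""
    return total_requests > 0 and errors / total_requests <= max_error_rate

# Example with invented measurements:
print(speed_ok([1.2, 2.8, 3.3, 4.1, 4.9]))          # True: the slowest checked sample is 4.9 s
print(scalability_ok(sustained_users=550))           # True: 550 >= 500
print(stability_ok(errors=3, total_requests=1000))   # True: 0.3% error rate is under the 1% ceiling
```

The checks themselves are trivial; the value is in forcing the categories into numbers you can actually measure and re-measure.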

The engineering part comes in as soon as the first measurement is taken and you start trying to figure out how to get from that measurement to the desired performance. That's really what engineering is all about: solving a problem to achieve a desired and beneficial outcome. As a civil engineering student at Virginia Tech in the early 1990s, I was taught an approach to solving engineering problems that can be summarized as follows:

• Given -- What are the known quantities?
• Find -- What are the requirements for the desired item/outcome?
• Solution -- How can we build or acquire the desired item/outcome, meeting the requirements, within the parameters of the known quantities?

Or as my father used to say, "What have you got? What do you want? How do you get there?" In civil engineering, the implied problem was always "build or modify a structure to satisfy the given need." In performance engineering the implied problem may seem different, but it's fundamentally the same -- that is, "build or modify a computer system to satisfy the given performance need." With that said, I suspect you would agree that performance is the trait of the system that we wish to engineer, thus performance engineering.

Overview of this series

"Beyond performance testing" is a fourteen-part series, with parts to be published regularly. Each article will take a practical approach; it will discuss how to immediately apply the methodology or technique being introduced and will give examples and case studies taken from real performance engineering projects. Where feasible, articles will also include "Now you try it" exercises so that you can practice applying the methodology or technique on your own. Articles that don't contain specific exercises with solutions will contain approaches, checklists, and warnings that will help you apply what you learn directly to your projects. In many cases, I won't be able to address all of the technical issues specific to your system under test, but you'll find out what questions to ask.

Each article will be identified as beginner, intermediate, or expert level, depending on the technical complexity of the code or systems analysis required to accomplish the technique being discussed. The concepts presented in each article will be valuable to all members of the performance engineering team, from the project manager to the most junior developer. Following is a brief description of the articles.

Performance engineering housekeeping

Parts 2, 3, and 4 start us out with what I think of as "performance engineering housekeeping" topics. These three articles aren't particularly technical but will give us the common footing we need for the remaining articles.



• Part 2: A performance engineering strategy -- The performance engineering methodology that serves as a basis for applying the concepts in later articles is described here. This article also outlines a performance engineering strategy document (equivalent to a functional test plan) that can be used to help organize and document performance engineering activities, plus a performance engineering results document.
• Part 3: How fast is fast enough? -- One of the biggest challenges we face in performance engineering is collecting performance-related requirements. This topic was touched on in the "User experience, not metrics" series and is discussed here in more detail.
• Part 4: Accounting for user abandonment -- If an application is too slow, users will eventually abandon it. This must be accounted for in both modeling and analysis, because not accounting for abandonment can drastically change your results and tuning strategies.

Detecting and tuning bottlenecks

Parts 5 through 10 are technical articles focused on detecting, exploiting, and preparing to tune bottlenecks. These articles follow the steps of the approach we discuss in Part 2 and continually relate back to the performance requirements we discuss in Part 3.

• Part 5: Determining the root cause of script failures -- The obvious symptom is very rarely the actual cause of a script failure. It's imperative to determine whether the failure is due to the script or the application and to be able to quantify the root cause, not just the symptoms.
• Part 6: Interpreting scatter charts -- Scatter charts are the most powerful evaluation tool at a performance engineer's disposal and must be used wisely to pinpoint performance issues. This article expands on discussions of the scatter chart in the "User experience, not metrics" series.
• Part 7: Identifying the critical bottleneck -- One part of the system is always slowest, and until you remedy that bottleneck, no other tuning will actually improve the performance of the application along that usage path. Before you tune it, you must first conclusively identify it.
• Part 8: Modifying tests to focus on failure/bottleneck resolution -- Once a failure or bottleneck is found from a functional perspective, resolution can be reached more quickly if you modify your existing tests to eliminate distraction from ancillary issues.
• Part 9: Pinpointing the architectural tier of the failure/bottleneck -- Just because you've identified a bottleneck or failure from a functional perspective and can reproduce it doesn't mean you know where it is. Pinpointing exactly where the bottleneck is can be an art all its own.
• Part 10: Creating a test to exploit the failure/bottleneck -- Once you know what the bottleneck is functionally and where it is architecturally, you will likely need to create a test to exploit the failure/bottleneck in order to help the developer or architect with tuning. This test needn't bear any resemblance to real user activity; rather, it needs to exploit the issue, and that issue alone. In fact, these scripts often don't even interact with the system in ways that users would and may include direct interaction with back-end tiers. TestStudio can save you significant time and effort in developing such a test.
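To give a feel for what such a single-purpose exploit test can look like, here's a rough sketch in plain Python (deliberately not TestStudio); the URL, concurrency level, and call counts are invented placeholders standing in for whatever suspect operation you've isolated:

```python
# Hypothetical focused "exploit" test: exercise one suspect operation over and over,
# timing only that operation and deliberately ignoring realistic user behavior.
import time
import concurrent.futures
import urllib.request

SUSPECT_URL = "http://localhost:8080/reports/summary"  # placeholder for the suspect call
WORKERS = 20            # concurrency aimed squarely at the suspected bottleneck
CALLS_PER_WORKER = 50

def hammer(worker_id):
    timings = []
    for _ in range(CALLS_PER_WORKER):
        start = time.perf_counter()
        try:
            with urllib.request.urlopen(SUSPECT_URL, timeout=30) as resp:
                resp.read()
            ok = True
        except Exception:
            ok = False
        timings.append((time.perf_counter() - start, ok))
    return timings

if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS) as pool:
        results = [t for worker in pool.map(hammer, range(WORKERS)) for t in worker]
    durations = [d for d, ok in results if ok]
    failures = sum(1 for _, ok in results if not ok)
    print(f"calls: {len(results)}  failures: {failures}")
    if durations:
        print(f"min/avg/max (s): {min(durations):.2f} / "
              f"{sum(durations)/len(durations):.2f} / {max(durations):.2f}")
```

The point is simply that the test does one thing repeatedly and reports timings for that one thing, which is exactly what keeps the developer's tuning feedback loop fast.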

Advanced topics

Parts 11 through 14 are "clean-up" topics. These are areas that are referred to in one or more of the previous articles but not done justice there due to topic and length constraints. These articles expand on topics that are applicable throughout the engineering process.

• Part 11: Collaborative tuning -- Once you can exploit the failure or bottleneck, you need to know how to work with the rest of the team to resolve it.
• Part 12: Testing and tuning on common tiers -- Web, application, and database servers are common tiers related to testing and tuning. Here we discuss specific issues, methods, and resolutions for these areas.
• Part 13: Testing and tuning load balancers and networks -- These components require a significantly different approach from that called for by common tiers. This article discusses issues specific to these areas and approaches for handling them.
• Part 14: Testing and tuning security -- Security is the most performance-intensive activity of an application. Tuning security issues often involves taking a risk-based approach to balancing necessary security with desired performance.

Summing it up

Our job isn't done when we accurately report the current performance of an application; in fact, it's just begun. The "Beyond performance testing" series is designed to discuss and demonstrate how to take the step from simply reporting results to achieving desired performance with IBM Rational TestStudio and a proven performance engineering methodology. The articles will focus on the topics that are hardest to find information about and point you to where you can find more information. I hope you're looking forward to reading this series as much as I'm looking forward to writing it!

Part 2: A performance engineering strategy

Without a strategy, performance engineering is simply an exercise in trial and error. Following a sound strategy in the engineering effort will increase your performance engineering team's efficiency and effectiveness. This article outlines a strategy that complements the Rational Unified Process® approach, is easily customizable to your project and organization, and has been validated by numerous clients worldwide. The templates I provide will give you a starting point for documenting your performance engineering engagement. Applying this strategy, coupled with your own experience, should significantly improve your overall effectiveness as a performance engineer.

This is the second article in the "Beyond performance testing" series, which focuses on isolating performance bottlenecks and working collaboratively with the development team to resolve them. If you're new to this series, you may want to begin by reading Part 1, the series introduction. This article is intended for all levels of users of the IBM® Rational Suite® TestStudio® system testing tool, as well as managers and other members of the development team.

A closer look at the process of performance engineering

As defined in Part 1, performance engineering is the process by which software is tested and tuned with the intent of realizing the required performance. Let's look more closely at this process. In the simplest terms, this approach can be described as shown in Figure 1.

Figure 1: Performance engineering in its simplest terms

I've seen this chart, or a similar one, in many software performance presentations and seminars. Although this chart makes great common sense, it doesn't shed much light on what we really want to discuss here, which is "How, exactly, do I detect, diagnose, and resolve?" Figure 2 gives a much more detailed picture of the various aspects of the process.

Figure 2: Aspects of the performance engineering process

The strategy detailed in Figure 2 has been applied successfully in many performance engineering projects and has been adopted successfully internationally. Following is a short overview of each of the eight major aspects of this performance engineering strategy, indicating where else in this "Beyond performance testing" (BPT) series, or the previous "User experience, not metrics" (UENM) series, more information on that aspect can be found. My Web site gives a full description of each of these aspects and their subcomponents as well.

Please note that while many people would refer to these aspects as phases, I'm using the word aspect here intentionally to make a distinction. The word phase implies a certain sequence. While some of the aspects of the performance engineering process are performed in order, many of them are completed in a very fluid manner throughout the project. Don't think of this as a step-by-step approach, then, but rather as a list of things to consider.

Evaluate system

Evaluation of the system under test is critical to a successful performance testing or engineering effort. The measurements gathered during later aspects are only as accurate as the models that are developed and validated in this aspect. The evaluation also needs to define acceptable performance; specify performance requirements of the software, system, or component(s); and identify any risks to the effort before testing even begins. Evaluating the system includes but isn't limited to the following steps:

• determine all requirements related to system performance (BPT Part 3)
• determine all expected user activity, individually and collectively (UENM Part 2, Part 3, Part 4)
• develop a reasonable understanding of potential user activity beyond what's expected (UENM Part 2, Part 3, Part 4)
• identify and schedule all non-user-initiated (batch) processes (UENM Part 2, Part 3, Part 4)
• develop a reasonable model of actual user environments
• identify any other processes/systems using the architecture
• define all system/component requirements in testable terms (BPT Part 3)
• define expected behavior during unexpected circumstances (BPT Part 3)

As performance engineers, we need to become intimate with the core functions of the system under test. Once we know and understand those functions, we can guide the client to develop performance acceptance criteria as well as the user community models that will be used to assess the application's success in meeting the acceptance criteria.

Develop test assets

A test asset is a piece of information that will remain at the completion of a performance engineering project. Some people refer to these items as "artifacts." These assets are developed during this aspect of the process:

• Performance Engineering Strategy document (discussed later in this article)
• Risk Mitigation Plan (discussed later in this article)
• automated test scripts (referenced throughout both series)

The Develop Test Assets aspect begins before performance testing is scheduled to start. The Performance Engineering Strategy document and the Risk Mitigation Plan can be started immediately upon completion of the Evaluate System aspect. Automated test script development can begin only after development of a stand-alone component or when the entire application is believed to be stable and has undergone at least initial functional testing. This aspect concludes when:

• the Performance Engineering Strategy document has been completed and approved by the stakeholders,
• mitigation strategies have been defined for all known risks, and
• all load-generation scripts have been created and individually tested (for the "testable" sections of the application).

Execute baseline/benchmark tests

The Execute Baseline/Benchmark Tests aspect is where test execution actually begins. The intention here is twofold:

• All scripts need to be executed, validated, and debugged (if necessary) collectively (as they've already been validated individually in order to move beyond the Develop Test Assets aspect).
• Baseline and benchmark tests need to be conducted to provide a basis of comparison for all future testing.

Initial baselines and benchmarks are taken as soon as the test environment is available after the necessary test assets have been developed. Rebenchmarking occurs at the completion of every successful execution of the Tune System aspect. Designed exploratory scripts are baselined and executed at volume if necessary during this aspect.

It's important to analyze the results of baseline and benchmark tests. While the methodology we're discussing makes this clear, many people I've talked to don't fully appreciate the necessity of analyzing the results of these early low-volume tests. It's our responsibility as performance engineers to ensure that this analysis isn't left out.

Analyze results

Analysis of test results is both the most important and the most difficult part of performance engineering. Proper design and execution of tests as well as proper measurement of system and/or component activities make the analysis easier. Analysis should identify which requirements are being met, which ones aren't, and why. When the analysis shows why systems or components aren't meeting requirements, then the system or component can be tuned to meet those requirements.

Analysis of results may answer the following questions (and more):

• Are user expectations being met at various user loads? (BPT Part 3, Part 5, Part 6)
• Do all components perform as expected under load? (BPT Part 3, Part 5, Part 6)
• What components cause bottlenecks? (BPT Part 6, Part 7, Part 9)
• What components need to be or can be tuned? (BPT Part 11)
• Do additional tests need to be developed to determine the exact cause of a bottleneck? (BPT Part 8, Part 10)
• Are databases and/or servers adequate? (BPT Part 12)
• Are load balancers functioning as expected? (BPT Part 13)
• Is the network adequate? (BPT Part 13)
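As a minimal illustration of turning those questions into something repeatable, the sketch below compares a handful of measured values against their acceptance criteria and flags the ones that aren't met. The requirement names, thresholds, and measurements are all made-up placeholders, and in practice this comparison can just as easily live in a spreadsheet or the results document:

```python
# Hypothetical results-vs-requirements check; every name and number below is a placeholder.

requirements = {
    "95th percentile page load (s)": ("<=", 5.0),
    "transactions per second":       (">=", 40.0),
    "error rate under target load":  ("<=", 0.01),
}

measured = {
    "95th percentile page load (s)": 6.2,
    "transactions per second":       43.5,
    "error rate under target load":  0.004,
}

def requirement_met(op, threshold, value):
    return value <= threshold if op == "<=" else value >= threshold

for name, (op, threshold) in requirements.items():
    value = measured[name]
    status = "MET" if requirement_met(op, threshold, value) else "NOT MET -- candidate for tuning"
    print(f"{name}: measured {value} (required {op} {threshold}) -> {status}")
```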

The Analyze Results aspect focuses on determining if the performance acceptance criteria have been met, and if not, what the bottlenecks are and whose responsibility it is to fix those bottlenecks. This aspect involves close coordination with stakeholders to ensure that both the performance engineering team and stakeholders agree that all requirements are being validated. System administrators may also be involved in results analysis. Keeping a record of the tests being analyzed and the results of that analysis is an important part of this activity.

Execute scheduled tests

Scheduled tests are those that are identified in the Performance Engineering Strategy document to validate the collected performance requirements. Scheduled tests shouldn't be conducted until baseline and benchmark tests are shown to meet the related performance requirements.

There are countless types of measurements that can be gathered during scheduled tests. Requirements, analysis, and design will dictate what measurements will be collected and later analyzed. Also, required measurements may change throughout the course of testing based on the results of previous tests. Measurements collected during this activity may include but aren't limited to the following:

• end-to-end system response time (user experience) (UENM Part 5, Part 8)
• transactions per second for various components
• memory usage of various components by scenario
• CPU usage of various components by scenario
• component throughput
• component bandwidth
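As a small example of how a couple of those measurements can be derived from raw data, the sketch below takes hypothetical (completion time, response time) samples exported from a test run and computes an average, a 90th-percentile response time, and a rough transactions-per-second figure. The sample values and the nearest-rank percentile method are illustrative assumptions, not prescriptions from this series:

```python
# Hypothetical post-processing of raw timing samples from a test run.
# Each sample is (completion_timestamp_s, response_time_s); the values are invented.
import math

samples = [
    (0.8, 0.8), (1.4, 0.9), (2.1, 1.3), (2.9, 1.1), (3.6, 1.6),
    (4.2, 1.2), (5.0, 2.4), (5.8, 1.8), (6.5, 1.5), (7.9, 3.1),
]

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

response_times = [rt for _, rt in samples]
elapsed = max(ts for ts, _ in samples) - min(ts for ts, _ in samples)

print(f"average response time: {sum(response_times) / len(response_times):.2f} s")
print(f"90th percentile:       {percentile(response_times, 90):.2f} s")
print(f"transactions/second:   {len(samples) / elapsed:.2f}")
```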

Other measurements as requested by stakeholders may also be included. Applications or environments with back-end processes that aren't directly triggered by user activity usually require transactions-per-second measurements to be collected. Measurements collected here will be compared to measurements collected during baseline and benchmark testing. Multiuser tests will be executed to determine actual current performance, find "knees" in performance (places where performance degrades dramatically rather than smoothly; see UENM Part 10), and determine bottlenecks (BPT Parts 6-10). Exploratory tests may need to be developed and executed to help find or tune bottlenecks based on analysis of the measurements collected at this time (BPT Part 10).

Identify exploratory tests

This is the aspect of performance engineering in which unplanned tests to detect and exploit performance issues are developed to aid in the tuning effort. To be effective, these tests must be researched in collaboration with the technical stakeholders who have intimate knowledge of the area of the system exhibiting performance issues. The results of this research lead the project back into the Develop Test Assets aspect, where exploratory tests are created and documented.

Exploratory tests are designed to exploit the specific system area or function suspected to contain a performance bottleneck based on previous results analysis. Typically, the suspect tier or component of the bottleneck is identified, and then decisions are made about the metrics that need to be collected to determine if the bottleneck does, in fact, reside in that area and to better understand the bottleneck. Finally, the type of test that's required is identified and described so that it can be developed. We'll discuss this in detail in Part 10 of this series.

Tune system

Tuning must occur at a component level while keeping the end-to-end system in mind. Even tuning each component to its best possible performance won't guarantee the best possible overall system performance under the expected user load. After tuning a component, it's important not only to retest that component but also to rebenchmark the entire system. Resolving one bottleneck may simply uncover another when systemwide tests are reexecuted. The Tune System aspect may address but isn't limited to the following topics, all of which are explored in more detail in BPT Part 11 and some in Part 12 or Part 13:

• Web server configuration
• database design and configuration
• application or file server configuration
• cluster management
• network components
• server hardware adequacy
• batch process scheduling/concurrency
• load balancer configuration
• firewall or proxy server efficiency

This aspect of the performance engineering project is a highly collaborative effort involving the performance engineering team and the development team. Often, once tuning begins the performance engineer must be available to retest and analyze the effects of the changes made by the developer before any other activity can occur. The information gained from that analysis is critical to the developer or system administrator who's making the actual changes to the system. It's very rare for the performance engineering team to make actual changes to the system on their own. The activities associated with tuning the system need to be at least loosely documented so that differences from the original design can be captured and lessons can be passed on to future developers and performance engineers.

Complete engagement

Documentation is created primarily to be used as a historical reference and as validation of requirements being met. For applications that are likely to have future releases, upgrades, or increased future load, it's important to document the capabilities of the system, known bottlenecks, and areas where most improvement can be realized in the future. The format of the documentation should be agreed upon during the Develop Test Assets aspect.

The results document, which is discussed in detail below, is specifically geared to show stakeholders whether the system under test meets the performance acceptance criteria. If the criteria haven't been met, the document should explain why not, particularly if significant tuning or upgrading of the system, which may fall outside the scope of the project, is required. The document should also identify areas of future improvement if bottlenecks are detected but not resolved.

The performance engineering strategy document

Since I started writing articles and moderating forums, one of the most common questions I've been asked is "Where can I get a performance test plan template?" As much as we may want to deny the fact, we have to concede that it's important to document our projects. During my tenure as a performance engineer, with the input of countless clients, friends, and coworkers, I've developed a document template based on the process I've just outlined. I call the resulting document the Performance Engineering Strategy. You can download a .pdf version of the template if you want to use it.

Why, you may wonder, do I call this an engineering strategy rather than a test plan? It's my opinion that while these two things fundamentally serve the same purpose, they're quite different documents. A functional test plan tells the who, what, when, why, and how of the functional testing effort. It lists specific tasks assigned to specific people to be completed by specific dates. In our performance engineering strategy, on the other hand, once we execute our first test we reach a set of decision points. It's simply not possible to put a predetermined structure around activities such as "tune system" or "identify and develop exploratory tests." We don't even know if any tuning will be required, or how many exploratory tests may ultimately be developed. How can we assign the optional activity of "tune system" to a person when we don't yet know what will need to be tuned? Obviously, we can't. What we can do, however, is explicitly detail a strategy to address the question "What do we do when?" So I like to make the distinction between a test plan and an engineering strategy up front.

While it's possible to build a performance test plan, I've found that it becomes more of a hindrance than a help by the conclusion of the first battery of executed tests. I find it more useful to have a document that outlines the strategy, and then when a performance issue presents itself, to create a "mini-plan" for resolving that performance issue that's consistent with the overall strategy.

Let's discuss the basics of this strategy document. The template I've provided for you to download includes example verbiage in the sections that are unique to the application under test, and since we'll be discussing some of the sections in more detail in other articles, I won't go into too much detail here. Instead, I'll outline the document and describe what I recommend including in each section.

1. Introduction

1.1. Description -- Describes the document, not the performance engineering effort, to let the reader know what to expect.

1.2. Purpose -- Gives a high-level overview of the document's purpose.

1.3. Scope -- Is similar to the purpose statement but focuses on boundaries and what's not covered in the document rather than what is covered.

1.4. Related Documents -- Lists other documents that provide information referenced in this document and may also list project documents that aren't directly referenced but could be valuable to the readers of this document.

2. Performance acceptance criteria

2.1. Introduction -- Gives a brief definition of what we mean by performance acceptance criteria.

2.2. Performance Criteria -- Defines the specific types of performance-related criteria being used for the engagement.

2.2.1. Requirements -- Details those performance criteria that must be satisfied at a minimum in order for the application to be put into production.

2.2.2. Goals -- Details those performance criteria that would ideally be satisfied when the application is put into production. These are always more stringent than the criteria listed in "Requirements."

2.3. Engagement Complete Criteria -- Defines what it will mean to be done with the engagement. Assuming all criteria and/or goals can't be achieved, details how the performance engineering engagement will conclude.

3. Workload distribution

3.1. Introduction -- Describes what a workload distribution is and how it relates to performance engineering.

3.2. Workload Distribution for <Application Name> -- Details the workload distribution(s) and/or user community model(s) to be simulated during the performance engineering engagement.

4. Script descriptions

4.1. Introduction -- This introduction is very important for stakeholders who don't understand automated load-generation tools and must be customized for your particular audience. It should describe how all the tools you'll be using work, how the application will be scripted, and how this relates to the workload distribution(s) described in Section 3 of the document. This may also be a good place to describe how measurements will be collected and how that relates to scripts.

4.2. <Script Name 1> Script Overview -- Discusses what the script does and how it relates to the workload distribution.

4.2.1. <Script Name 1> Script Measurements -- Discusses what measurements will be collected by the script and what measurements may be collected by other means while that script is executing.

4.2.2. <Script Name 1> Script Think Times -- Discusses what the delay times and distributions are for the pages included in the script, and how those times were determined.

4.2.3. <Script Name 1> Script Data -- Discusses what data will be used/required for this script to simulate real application usage, such as unique IDs and passwords for each simulated user, or "test" credit card information. If this data doesn't already exist, also describes how this data will be obtained.

4.3. <Script Name 2, etc.> Script Overview

5. Test execution plan

If you adopt the methodology we're discussing in this article, this part of the template won't change. Rather than duplicating what's written there, I'll simply include the outline here and let you review the template for more detail.

5.1. Introduction
5.2. Evaluate System
5.3. Develop Test Assets
5.4. Execute Baseline/Benchmark Tests
5.5. Analyze Results
5.6. Execute Scheduled Tests
5.7. Identify Exploratory (Specialty) Tests
5.8. Tune System
5.9. Project Closeout

6. Risks

Since this is sometimes an entirely separate document, I'll discuss it separately below and therefore have left just the basic outline here for completeness.

6.1. Introduction
6.2.
6.3.

That's the outline of the Performance Engineering Strategy document, then. Now I'll describe the Risk Mitigation Plan, which can be either Section 6 of the strategy document or a separate document.

The risk mitigation plan

Risk identification and mitigation are critical to any project, and performance engineering is no exception. My experience has shown that it's absolutely imperative to publicly identify risks to performance engineering efforts as soon as they present themselves. Note that I'm not talking about the high-level risks that are managed by the project manager, such as "The application may not scale to the proper number of users." That type of risk should be identified in the overall project plan. The kind of risk I'm talking about is more along the lines of "The performance test environment may be late, thus eliminating some of the time dedicated to performance testing prior to the 'go live' date."

Risks like these need to be raised and documented so that at the end of the project there will be a history of all the identified risks, their potential impact, the mitigation strategy, and the resolution. I like to document these risks in the Performance Engineering Strategy document, but some organizations prefer that this be a separate document. Either way, I've found the following format to be easy to use for tracking risks:

• Risk name -- Give each identified risk a descriptive name that stakeholders will immediately recognize. This will ensure that the rest of the discussion about that risk gets reviewed.
• Discussion -- The discussion of the specific risk and its potential impacts needs to be very detailed. The discussion doesn't pass judgment but simply states facts. Quantifiable facts are best.
• Mitigation strategy -- The mitigation strategy includes two parts: (1) How are we going to try to keep this risk from happening? (2) If this risk does happen, how do we minimize its impact? The more detailed the plan, the better. The mitigation strategy is the most important part of risk management.
• Owner -- Each risk should have an owner, preferably an individual rather than a group or an organization. The owner of the risk isn't necessarily responsible for taking all of the action related to that risk but rather is responsible for ensuring that any required action is accomplished by the right people at the right time.
• Status -- Because this is a living document and should be updated no less often than weekly, a current status should always be included.

It's beyond the scope of this article to discuss formal risk mitigation techniques, and there are plenty of quality books and resources available on this topic that go far deeper than I could in one small section of an article. The point I want to make here is that risks associated with a performance engineering engagement should be managed and documented independently of general project risks.

The performance engineering results document

One of the most often overlooked aspects of a performance engineering engagement is the documentation of results. What typically happens is that performance testing falls behind, testing continues frantically until "go live" day, the application goes live, it doesn't crash, and everyone forgets about performance. Then a couple of months later someone finds a performance problem and asks, "Did we find this during testing? Does anyone know how to fix this? Don't we have scripts to help isolate the problem?" And no one can answer these questions. Why does this happen? Because we didn't document the results. And now, not only is it our fault that performance "suddenly" got bad, but we're also being accused of not really doing a good job of testing in the first place!

There's one simple way to fix this -- by compiling a Performance Engineering Results document. I've created a template for this document that I'll outline below. You can download a .pdf version of the template if you want to use it. You'll notice that this document duplicates much of the information in the strategy document. Experience shows that stakeholders like to have all of this information in one place, rather than having to go back and forth between two documents. I recommend that you discuss this format with your stakeholders to ensure that it meets their needs before starting the document.

1. Executive summary

This one-page summary of the results should provide the information that a high-level stakeholder needs to make a "go live" decision about the application. Focus is on the actual performance of the application at the time of the final test and may include recommendations if appropriate.

2. Introduction

2.1. Scope -- Is similar to the purpose statement but focuses on boundaries and what's not covered in the document rather than what is covered.

2.2. Purpose -- Gives a high-level overview of the document's purpose.

2.3. Related Documents -- Lists other documents that provide information referenced in this document and may also list project documents that aren't directly referenced but could be valuable to the readers of this document.

3. Performance acceptance criteria

3.1. Introduction -- Gives a brief definition of what we mean by performance acceptance criteria.

3.2. Performance Criteria -- Defines the specific types of performance-related criteria being used for the engagement.

3.2.1. Requirements -- Details those performance criteria that must be satisfied at a minimum in order for the application to be put into production.

3.2.2. Goals -- Details those performance criteria that would ideally be satisfied when the application is put into production. These are always more stringent than the criteria listed in "Requirements."

3.3. Engagement Complete Criteria -- Defines what it will mean to be done with the engagement. Assuming all criteria and/or goals can't be achieved, details how the performance engineering engagement will conclude.

4. Workload distribution

4.1. Introduction -- Describes what a workload distribution is and how it relates to performance engineering.

4.2. Workload Distribution for <Application Name> -- Details the workload distribution(s) and/or user community model(s) to be simulated during the performance engineering engagement.

5. Baseline results

5.1. Introduction -- Describes the baseline tests as they were actually conducted, in detail.

5.1.1. System Architecture -- Describes the environment that the baseline tests were conducted against.

5.2. Baseline Results -- Summarizes results. Supporting data can be included in an appendix as appropriate.

6. Benchmark results

6.1. Introduction -- Describes the benchmark tests as they were actually conducted, in detail.

6.1.1. System Architecture -- Describes the environment that the benchmark tests were conducted against.

6.2. Benchmark Results -- Summarizes results. Supporting data can be included in an appendix as appropriate.

6.2.1. Benchmark Results -- Briefly summarizes any points of interest for specific benchmark test executions.

7. Other Scheduled Test Results

Follows the same format as used for baselines and benchmarks for each of the types of tests conducted.

7.1. Scheduled tests
7.1.1. User Experience Tests
7.1.2. User Experience Test Results
7.1.3. Common Tasks Tests
7.1.4. Remote Location Tests
7.1.5. Stability Tests
7.1.6. Batch Baselines
7.1.7. Production Validation Tests

7.2. Exploratory (specialty) tests
7.2.1. Concern/Issue 1
7.2.2. Concern/Issue 2

8. Conclusions and recommendations

8.1. Consolidated Results -- Contains charts and narratives that summarize the overall results, as described in UENM Parts 8-10. This is more detailed than the executive summary but still doesn't include the supporting data.

8.2. Tuning Summary -- Summarizes the performance bottlenecks that were found and how they were resolved. It's not necessary to give a complete list of all the activities leading to detecting the bottleneck and the resolution.

8.3. Conclusions -- Is an expanded version of the executive summary, providing more detail for an audience of stakeholders rather than technical team members.

8.4. Recommendations -- Is a discussion of all of the performance test team's recommendations, not just a yes/no on the "go live" decision. Often includes recommendations for future testing and tuning of the application, and insight into capacity and scalability planning.

Beyond performance testing part 3: How fast is fast enough?

Scott Barber, Performance testing consultant, AuthenTec

Summary: There's no industry standard for Web application performance, so we must depend on our own best judgment to determine just how fast is fast enough. Learn how to gather performance expectations and convert them into explicit, testable requirements.

"You thought that was fast? I thought it was fast. Well, was it?" -- Jodie Foster as Annabelle in the movie Maverick (1994)

As a moderator of performance-related forums on QAForums.com, I've seen questions like this one posed numerous times: "I'm desperately trying to find out what the industry standard response times are. What are reasonable benchmarks that are acceptable for Web sites at the moment? Is 1.5 seconds a reasonable target????"

My answer to questions like this one always starts with "It depends on . . . " My friend Joe Strazzere addressed this question particularly well, as follows:

There are no industry standards. You must analyze your site in terms of who the customers are, what their needs are, where they are located, what their equipment and connection speed might be, etc., etc.

I suspect 1.5 seconds would be a rather short interval for many situations. Do you really require that quick of a response?

The bottom line is that what seems fast is different in different situations. So how do you determine how fast is fast enough for your application, and how do you convert that information into explicit, testable requirements? Those are the topics this article addresses.

This is the third article in the "Beyond performance testing" series. Here's what the series has covered so far:

Part 1: Introduction
Part 2: A performance engineering strategy

This article is intended for all levels of users of the IBM® Rational Suite® TestStudio® system testing tool and will be particularly useful to managers and business analysts involved in determining the performance requirements of a system. It expands on concepts mentioned in "User experience, not metrics, part 5: Using timers," so you should have read that article before tackling this one.

Considerations affecting performance expectations

Let's start by discussing the leading factors that contribute to what we think fast is. I believe these considerations can be divided into three broad categories:

• user psychology
• system considerations
• usage considerations

None of these categories is any more or less important than the others. What's critical is to balance these considerations, which we'll explore individually here, when determining performance requirements.

User psychology

Of the three categories, user psychology is the one most often overlooked -- or maybe a better way to say this is that user psychology is often overridden by system and usage considerations. I submit that this is a mistake. User psychology plays an important role in perceived performance, which, as we discussed in detail throughout the "User experience, not metrics" series, is the most critical part of evaluating performance.

Consider this example. I recently filled out my tax return online. It's a pretty simple process: you navigate through a Web application that asks you questions to determine which pages are presented for you to enter data into. As I made a preliminary pass through my return, I was happy with the performance (response time) of the application. When I later went back to complete my return, I timed the page loads (because I almost always think about performance when I use the Internet). Most of the pages returned in less than 5 seconds, but some of the section summary pages took almost a minute! Why didn't I notice the first time through that some pages were this slow? Why didn't I get frustrated with this seemingly poor performance? I usually notice performance as being poor at between 5 and 8 seconds, and at about 15 seconds I'll abandon a site or at least get frustrated. There's no science behind those numbers; they're just my personal tolerance levels. So what made me wait a minute for some pages without even noticing that it was slow?

The answer is that when I requested a section summary page, an intermediate page came up that said: "The information you have requested is being processed. This may take several minutes depending on the information you have provided. Please be patient." When I received that message, I went on to do something else for a minute. I went to get a drink, or checked one of my e-mail accounts, or any of a million other things, and when I came back the page was there waiting for me. I was satisfied. If that message hadn't been presented and I had found myself just sitting and waiting for the page to display, I would have become annoyed and eventually assumed that my request wasn't submitted properly, that the server had gone down, or maybe that my Internet connection had been dropped.

So, getting back to the initial question of how fast is fast enough, from a user psychology perspective the answer is still "it depends." It depends on several key factors that determine what is and isn't acceptable performance.

The first factor is the response time that users have become accustomed to based on previous experience. This is most directly tied to the speed of their Internet connection. My mother, for example, has never experienced the Internet over anything other than a fuzzy phone line with a 56.6-kilobits-per-second modem. I'm used to surfing via high-speed connections, so when I sign on at my mother's house I'm frustrated. My mother thinks I'm silly: "Scott, I think you're spoiled. That's as fast as we can get here, and a lot faster than we used to get! You never were very patient!" She's right -- I'm not very patient, so I have a low tolerance for poor Web site performance, whereas she has a high tolerance.

Another factor is activity type. All users understand that it takes time to download an MP3 or MPEG video, and therefore have more tolerance if they're aware that that's the activity they're performing. However, if users don't know that they're performing an activity like downloading a file and are just waiting for the next page to load, they're likely to become frustrated before they realize that the performance is actually acceptable for the activity they're performing.

This leads us to the factor of how user expectations have been set. If users know what to expect, as they do with the tax preparation system I use, they're likely to be more tolerant of response times they might otherwise think of as slow. If you tell users that the system will be fast and then it isn't, they won't be happy. If you show users how fast it will be and then follow through with that level of performance, they'll generally be pretty happy.

The last factor we should discuss here is what I call surfing intent. When users want to accomplish a specific task, they have less tolerance for poor performance than when they're casually looking for information or doing research. For example, when I log on to the site I use to pay bills, I expect good performance. When I'm taking a break from work and searching for the newest technology gadgets, I have a lot of tolerance for poor performance.

So with all of these variables you can see why, as Joe Strazzere said, "There are no industry standards." But if there are no industry standards, how do we know where to start or what to compare against? I'll describe some rules of thumb later, when we get to the topic of collecting information about performance requirements.

System considerations

System considerations are more commonly thought about than user psychology when we're determining how fast is fast enough. Stakeholders need to decide what kind of performance the system can handle within the given parameters. "Fast-enough" decisions are often based purely on the cost to achieve performance. While cost and feasibility are important, if they're considered in a vacuum, you'll be doomed to fielding a system with poor performance. Performance costs. The cost difference between building a system with "typical" performance and building a system with "fast" performance is sometimes prohibitive. Only by balancing the need for performance against the cost can stakeholders decide how much time and/or money they're willing to invest to improve performance. System considerations include the following:

• system hardware
• network and/or Internet bandwidth of the system
• geographical replication
• software architecture

Entire books are dedicated to each of these considerations and many more. This is a well-documented and well-understood aspect of performance engineering, so I won't spend more time on it here.

Usage considerations

Usage considerations are related to but separate from user psychology. The usage considerations I'm referring to have to do with the way the Web site or Web application will be used. For example, is the application a shopping application? An interactive financial planning application? A site containing current news? An internal human resources data entry application? "Fast" means something different for each of these different applications. An application that's used primarily by employees to enter large volumes of data needs to be faster for users to be satisfied than a Web shopping site. A news site can be fairly slow, as long as the text appears before the graphics. Interactive sites need to be faster than mostly static sites. Sites that people use for work-related activities need to be faster than sites that are used primarily for recreational purposes. These considerations are very specific to the site and the organization. There really isn't a lot of documentation available about these types of considerations because they're so individual, depending on the specific application and associated user base. What's important is to think about how your site will be used and to determine the performance tolerance of expected users as compared to overall user psychology and system considerations. I'll say more about this in the next section.

Collecting information about performance requirements

So how do you translate the considerations described above into performance requirements? My approach is to first come up with descriptions of explicit and implied performance requirements in these three areas:

• user expectations
• resource limitations
• stakeholder expectations

In general, user and stakeholder expectations are complementary and don't require balancing between the two. For this reason, I start by determining these requirements. Once I've done this, I try to balance those with the system/financial resources that are available. Truth be told, I generally don't get to do the balancing. I usually collect the data and identify the conflicts so that stakeholders can make decisions about how to balance expectations and resources to determine actual requirements. Determining the actual requirements in the areas of speed, scalability, and stability and consolidating these into composite requirements is the final step. I'll describe that process in detail later on, but first let's look at each of the three areas where you'll be collecting information about requirements.

User expectations

A user's expectations when it comes to performance are all about end-to-end response time, as we touched on earlier in our look at user psychology. Individual users don't know or care how many users can be on the system at a time, how the system is designed to recover in case of disaster, or what the cost of building and maintaining the system has been.

When a new system is replacing an old one, it's critical from the user's perspective for the requirements of the new system to be at least as stringent as the actual performance of the existing system. Users won't be pleased with a new system if their perception is that its performance is worse than the system it's replacing -- regardless of whether the previous system was a Web-based application, client/server, or some other configuration. Aside from this situation, there's no way to predict user expectations. Only users can tell you what they expect, so be sure you take the time to poll users and find out what their expectations are before the requirements are set. Talk to users and observe them using a similar type of system, maybe even a prototype of the system to be built. Remember that most users don't think in terms of seconds, so to quantify their expectations you'll have to find a way to observe what they think is fast, typical, or slow.

During my tenure as a performance engineer, I've done a lot of research in the area of user expectations. I believed at first in the "8-second rule" that became popular in the mid-1990s, simply stating that most Web surfers consider 8 seconds to be a reasonable download time for a page. But since then I've found no reliable research backing this rule of thumb, nor have I found any correlation between this rule of thumb and actual user psychology. I'm going to share with you what I have found, not to suggest these findings as standards but to give you a reasonable place to start as you poll your own users. I've found that most users have the following expectations for normal page loads when surfing on high-speed connections:

• no delay or fast -- under 3 seconds
• typical -- 3 to 5 seconds
• slow -- 5 to 8 seconds
• frustrating -- 8 to 15 seconds
• unacceptable -- more than 15 seconds
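If you want to fold these rules of thumb into your own results analysis, a tiny sketch like the one below (written in Python; the thresholds come from the list above, while the sample times are invented) tags each measured page load with the category a typical user on a high-speed connection would likely assign it:

```python
# Illustrative classification of measured page-load times against the
# rule-of-thumb categories above (thresholds in seconds, high-speed connection).

def classify(seconds):
    if seconds < 3:
        return "no delay / fast"
    if seconds <= 5:
        return "typical"
    if seconds <= 8:
        return "slow"
    if seconds <= 15:
        return "frustrating"
    return "unacceptable"

# Example with made-up measurements from a test run:
for t in (1.2, 4.4, 7.0, 12.5, 22.0):
    print(f"{t:>5.1f} s -> {classify(t)}")
```

Remember that these are starting points for polling your own users, not standards.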

In my experience, if your site is highly interactive or primarily used for data entry you should probably strive for page load speeds about 20% faster than those listed. For mostly static or recreational sites, performance that's about 25% slower than the response times listed may still be acceptable.

For any kind of file download (MP3s, MPEGs, and such):

• If the link for the file includes a file size and the download has a progress bar, users expect performance commensurate with their connection speed.
• If users are unaware they're downloading a file, the guidelines for normal pages apply.

For other activity:

• Unless user expectations are set otherwise, the guidelines for normal pages apply.
• If users are presented with a notice that this may take a while, they'll wait significantly longer than without a notice, but the actual amount of time they'll wait varies drastically by individual.

When users are made aware of their connection speed, their expectations about performance shift accordingly. Part of my experiments with users was to create pages with the response times in each of the categories above over a high-speed connection, then to throttle back the connection speed and ask the users the same questions about performance. As long as I told them the connection rate I was simulating, users rated the pages in the same categories, even though the actual response times were very different.

One final note: There's a common industry perception that pages that users find "typical" or "slow" on high-speed connections will be "frustrating" or "unacceptable" to users on slower connections. My research doesn't support this theory. It does show that people who are used to high-speed connections at work but have slower dial-up connections at home are often frustrated at home and try to do Internet tasks at work instead. But people who are used to slower connections, rating pages over their own typical connection, place those pages into the same categories (fast, typical, slow, or unacceptable) as people who are used to high-speed connections do when loading the same pages over theirs.

Resource limitations

Limitations on resources such as time, money, people, hardware, networks, and software affect our performance requirements, even though we really wish they didn't. For example, "You can't have any more hardware" is a resource limitation, and whether we like it or not, it will likely contribute to determining our requirements. Anecdotally, there's a lot to say about the effects of resource limitations on performance requirements, but practically all it really comes down to is this: Determine before you set your performance requirements what your available resources are, so that when you're setting the requirements, you can do so realistically.

Stakeholder expectations

Unlike user expectations, stakeholder expectations are easy to obtain. Just ask any stakeholder what he or she expects.

"This system needs to be fast, it needs to support ten times the current user base, it needs to be up 100% of the time and recover 100% of the data in case of down time, and it must be easy to use, make us lots of money, have hot coffee on my desk when I arrive in the morning, and provide a cure for AIDS." OK, that's not something an actual stakeholder would say, but that's what it feels like they often say when asked the question. It's our job to translate these lofty goals into something quantifiable and achievable, and that's not always an easy task. Usually stakeholders want "industry standards as written by market experts" to base their expectations on. As we've already discussed, there are no standards. In the absence of standards, stakeholders generally want systems so fast and scalable that performance becomes a nonissue . . . until they find out how much that costs. In short, stakeholders want the best possible system for the least possible money. This is as it should be. When it comes to stakeholders, it's our job to help them determine, quantify, and manage system performance expectations. Of the three determinants of performance requirements that we've been discussing, the stakeholders have both the most information and the most flexibility. User expectations very rarely change, and resource limitations are generally fairly static throughout a performance testing/engineering effort. Stakeholder expectations, however, are likely to change when decisions have to be made about tradeoffs. Consider this. Recently I've been involved with several projects replacing client/server applications with Web-based applications. In each case, the systems were primarily data entry systems. Initially, stakeholders wanted the performance of the new application to match the performance of the previous client/server application. While this is in line with what I just said about user expectations, it's not a reasonable expectation given system limitations. Web-based systems simply don't perform that fast in general. And I've found that even users who are accustomed to a subsecond response time on a client/server system are happy with a 3-second response time from a Web-based application. So I've had stakeholders sit next to users on the prototypes (that were responding in 3 seconds or less) and had those users tell the stakeholders how they felt about performance. When stakeholders realize that users are satisfied with a "3-second application," they're willing to change the requirement to "under 3 seconds." Speed, of course, isn't the only performance requirement. Stakeholders need to either inform you of what the other requirements are or be the final authority for making decisions about those requirements. It's our job to ensure that all of the potential performance requirements are considered -- even if we determine that they're outside the scope of the particular project.

Determining and documenting performance requirements

Once you've collected as much information as possible about user and stakeholder expectations as well as resource limitations, you need to consolidate all of that information into meaningful, quantifiable, and testable requirements. This isn't always easy and should be an iterative process. Sending your interpretation of the requirements back to the people you gathered information from for comment will allow you to finalize the requirements with buy-in from everyone. As I've mentioned before, I like to think of performance in three categories:

• speed
• scalability
• stability

Each of these categories has its own kind of requirements. We've been discussing speed and scalability extensively in both the "User experience, not metrics" series and the "Beyond performance testing" series. Stability is a slightly different type of performance that we won't be discussing much in either series. However, I think the topic of collecting stability requirements is important enough to include here. In the sections that follow, we'll discuss how to extract requirements from expectations and limitations, then consolidate those requirements into composite requirements -- or what some people would refer to as performance test cases.

Speed requirements

If you've gone through the exercise of collecting expectations and limitations, I'm sure that you have lots of information about speed. Remember that we want to focus on end user response time. There may be individual exceptions in this category for things like scheduled batch processes that must complete in a certain window, but generally, don't get trapped into breaking speed requirements down into subcomponents or tiers. I like to start by summarizing the speed-related information I've collected verbally -- for example:

• normal pages -- typical to fast
• reports -- under a minute
• exception activities (list) -- fast to very fast
• query execution -- under 30 seconds
• nightly backup batch process -- under an hour

You'll see that some of that information is fairly specific, while some isn't. For this step, what's important is to ensure that all activities fall into one of the categories you specified. You don't want every page or activity to have a different speed requirement, but you do want the ability to have some exceptions to "typical" performance. Now we must assign values to the verbal descriptions we have and extrapolate the difference between goals and requirements. You may recall from the Performance Engineering Strategy document that performance requirements are those criteria that must be met for the application to "go live" and become a production system, while performance goals are desired but not essential criteria for the application. Table 1 shows the speed requirements and goals derived from the descriptions above.

Activity type                              Requirement    Goal
Normal pages                               5 sec          3 sec
Reports                                    60 sec         30 sec
Exception activities (listed elsewhere)    3 sec          2 sec
Query execution                            30 sec         15 sec
Nightly backup                             1 hour         45 min

Table 1: Speed requirements and goals example

Of course, speed alone doesn't tell the whole story. Even getting agreement on a table like this doesn't provide any context for these numbers. To get that context, you also need scalability and stability requirements.

Scalability requirements

Scalability requirements are the "how much" and "how many" questions that go with the "how fast" of the speed requirements. These might also be thought of as capacity requirements. Scalability and capacity are used interchangeably by some people. Much of the information that we need in order to extract specific scalability requirements is contained in "User experience, not metrics, part 4: Modeling groups of users." Please refer to that article for more detail. Here's an example of scalability requirements that go with the speed requirements above:

• peak expected hourly usage -- 500 users
• peak expected sustained usage -- 300 users
• maximum percentage of users expected to execute reports in any one hour -- 75%
• maximum percentage of users expected to execute queries in any one hour -- 75%
• maximum number of rows to be replicated during nightly backup -- 150,000

As you can see, we now have some context. We can now interpret that the system should be able to support 300 users with about a 3-second typical response time, and 500 users with an under-5-second typical response time. I'm sure you'll agree this is much different from single users achieving those results.
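As a rough illustration of how these scalability numbers feed a workload model, here's a back-of-the-envelope calculation in the style of the VU listings later in this series. The 500-user and 75% figures come from the list above; the variable names and the arithmetic itself are purely illustrative.

/* Illustrative arithmetic only: rough hourly activity counts for a workload model. */
int peak_hourly_users, report_pct, max_reports_per_hour;

peak_hourly_users = 500;   /* peak expected hourly usage                       */
report_pct = 75;           /* maximum percentage of users executing reports    */

max_reports_per_hour = (peak_hourly_users * report_pct) / 100;   /* 375 reports */

/* 375 reports per hour works out to roughly one report request every 10 seconds
   (3600 / 375 = 9.6) on average, before considering peaks within the hour.      */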

Another topic related to scalability is user abandonment. We'll discuss user abandonment in detail in Part 4 of this series; for now, suffice it to say that a general requirement should be to minimize user abandonment due to performance.

Stability requirements

Stability covers a broad range of topics that are usually expressed in terms of "What will the system do if . . . ?" These are really exception cases; for instance, "What is the system required to do if it experiences a peak load of double the expected peak?" Another, broader term for these types of requirements is robustness requirements. Ross Collard, a well-respected consultant, lecturer, and member of the quality assurance community, defines robustness as "the degree of tolerance of a component or a system to invalid inputs, improper use, stress and hostile environments; . . . its ability to recover from problems; its resilience, dependability or survivability." While robustness includes the kind of usage stability we're focusing on, it also includes overall system stability. For our purposes we'll focus on usage stability and not on topics such as data recovery, fail-over, or disaster recovery from the system stability side. Some examples of stability requirements are as follows:

• System returns to expected performance within five minutes after the occurrence of an extreme usage condition, with no human interaction.
• System displays a message to users informing them of unexpected high traffic volume and requests they return at a later time.
• System automatically recovers with no human interaction after a reboot/power down.
• System limits the total number of users to a number less than that expected to cause significant performance degradation.

Now, let's put these together into some real requirements.

Composite requirements

The types of requirements discussed above are very important, but most of them aren't really testable independently, and even if they are, the combinations and permutations of tests that would need to be performed to validate them individually are unreasonable. What we need to do now is to consolidate those individual requirements into what I term composite requirements. You may know them as performance test cases. The reason I shy away from calling these performance test cases is that my experience has shown that most people believe that once all the test cases pass, the testing effort is complete. I don't believe that's always the case in performance engineering, though it may be for performance testing. Meeting the composite requirements simply means the application is minimally production-ready from a performance standpoint. Meeting the composite goals means that the application is fully production-ready from a performance standpoint according to today's assumptions. Once these composites are met, a new phase begins that's beyond the scope of this series -- capacity planning, which also makes use of these composite requirements but has no direct use for test cases. Let's look at how the individual requirements we came up with map into composite requirements and goals.

Composite requirements

1. The system exhibits not more than a 5-second response time for normal pages and meets all exception requirements, via intranet, 95% of the time under an extended 300-hourly-user load (in accordance with the user community model) with less than 5% user abandonment.
2. The system exhibits not more than a 5-second response time for normal pages and meets all exception requirements, via intranet, 90% of the time under a 500-hourly-user load (in accordance with the user community model) with less than 10% user abandonment.
3. All exception pages exhibit not more than a 3-second response time 95% of the time, with no user abandonment, under the conditions in items 1 and 2 above.
4. All reports exhibit not more than a 60-second response time 95% of the time, with no user abandonment, under the conditions in items 1 and 2 above.
5. All reports exhibit not more than a 60-second response time 90% of the time, with less than 5% user abandonment, under the 75% report load condition identified in our scalability requirements.
6. All queries exhibit not more than a 30-second response time 95% of the time, with no user abandonment, under the conditions in items 1 and 2 above.
7. All queries exhibit not more than a 30-second response time 90% of the time, with less than 5% user abandonment, under the 75% report load condition identified in our scalability requirements.
8. Nightly batch backup completes in under 1 hour for up to 150,000 rows of data.
9. The system fully recovers within 5 minutes of the conclusion of a spike load.
10. The system displays a message to users starting with the 501st hourly user informing them that traffic volume is unexpectedly high and requesting that they return at a later time.
11. The system limits the total number of users to a number less than that expected to cause significant performance degradation (TBD -- estimated 650 hourly users).
12. The system automatically recovers to meet all performance requirements within 5 minutes of a reboot/power down with no human interaction.

Composite goals

1. The system exhibits not more than a 3-second response time for normal pages and meets all exception requirements, via intranet, 95% of the time under a 500-hourly-user load (in accordance with the user community model) with less than 5% user abandonment.
2. All exception pages exhibit not more than a 2-second response time 95% of the time, with no user abandonment, under the conditions in item 1 above.
3. All reports exhibit not more than a 60-second response time 95% of the time, with no user abandonment, under the conditions in items 1 and 2 above.
4. All reports exhibit not more than a 30-second response time 95% of the time, with no user abandonment, under the conditions in item 1 above.
5. All reports exhibit not more than a 30-second response time 90% of the time, with less than 5% user abandonment, under the 75% report load condition identified in our scalability requirements.
6. All queries exhibit not more than a 15-second response time 95% of the time, with no user abandonment, under the conditions in item 1 above.
7. All queries exhibit not more than a 15-second response time 90% of the time, with less than 5% user abandonment, under the 75% report load condition identified in our scalability requirements.
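To show what evaluating one of these composites against test results actually involves, here's a minimal sketch -- not part of the original requirements -- that checks composite requirement 1 using integer math in the spirit of the VU examples later in this series. All of the counts are sample numbers and the variable names are illustrative.

/* Illustrative only: evaluating composite requirement 1 from the list above
   against summarized test results. All values are sample numbers.           */
int samples_total, samples_under_5, users_total, users_abandoned;
int pct_within, pct_abandon, requirement_met;

samples_total   = 1200;   /* normal-page timers recorded in the 300-hourly-user test */
samples_under_5 = 1152;   /* of those, how many completed in 5 seconds or less       */
users_total     = 300;    /* hourly users simulated                                  */
users_abandoned = 12;     /* virtual users that abandoned during the test            */

pct_within  = (samples_under_5 * 100) / samples_total;   /* 96 -- want >= 95 */
pct_abandon = (users_abandoned * 100) / users_total;     /*  4 -- want <  5  */

/* Composite requirement 1 is met only if both conditions hold. */
requirement_met = (pct_within >= 95) && (pct_abandon < 5);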

These requirements may be more detailed than you're used to, but I hope you can see the value of insisting upon explicit composite requirements such as these.

Beyond performance testing part 4: Accounting for user abandonment
Scott Barber, Performance testing consultant, AuthenTec

Summary: Although users routinely abandon Web sites when they get tired of waiting for a page to load, user abandonment isn't often discussed in connection with performance testing. Learn here how to model abandonment and adapt VU scripts accordingly.

Have you ever gotten tired of waiting for a Web page to load and exited the site before completing the task you had in mind? No doubt you have, just like every other Web user. Although user abandonment is a routine occurrence, it isn't commonly discussed in the context of developing and analyzing performance tests. Here in the fourth article of the "Beyond performance testing" series, we'll explore performance-testing issues related to user abandonment and how to account for these issues using the IBM® Rational Suite® TestStudio® system testing tool. So far, this is what we've covered in this series:

• Part 1: Introduction
• Part 2: A performance engineering strategy
• Part 3: How fast is fast enough?

This article is intended for all levels of Rational TestStudio users and will be particularly useful to advanced VU scripters. It discusses why users abandon a site, explains why abandonment must be accounted for in performance testing, describes in detail how to create an abandonment model for your Web site, and outlines how to adapt VU scripts to correctly handle abandonment using the model you've created.

Before we get started, I want to acknowledge the work of Alberto Savoia, a true thought leader in the field of performance testing and engineering. Before reading his article "Trade Secrets from a Web Testing Expert," I hadn't been considering user abandonment in my testing. After reading his article, I expanded on what I read and applied it in my VU scripting. What you're about to read paraphrases and extends Savoia's ideas about user abandonment. The application to Rational VU scripts is my own work.

Why do users abandon a site?

User abandonment occurs when a user gets frustrated with a site, usually due to performance, and discontinues using that site either temporarily or permanently. As we discussed in the previous article in this series, "Part 3: How Fast Is Fast Enough?," different users have different tolerance levels for performance. The reasons users abandon involve some of the same factors that we discussed when we were talking about determining performance requirements -- namely, user psychology, usage considerations, and user expectations. When we get to the section about how to create an abandonment model, I'll show you a method for quantifying each of these factors to account for user abandonment.

User psychology

Abandonment is all about user psychology. And when it comes to abandonment, user psychology is more than just "Why do we abandon sites?" That answer is simple: "Because we get tired of waiting." The answer to the follow-up question -- "Why do we get tired of waiting?" -- is not so simple. Why do I get tired of waiting and subsequently decide to abandon? Here are a few of my common reasons:

• The site is just painfully slow.
• I lose interest while waiting for a page to download.
• I get distracted during download.
• I figure I can find what I need somewhere else faster.
• I just plain get bored.
• While waiting I check my e-mail and forget to go back.

While all of those reasons can be summarized as "the site is too slow," what "too slow" means to me can vary depending on the particular reason. For instance, after about 8 or 10 seconds I start getting bored and may move to another browser window, even though 8 to 10 seconds isn't painfully slow. As another example, it seems that almost every night I manage to be on the computer when dinner's ready. If I'm waiting for a page, I get right up and proceed to the dining room. If I'm reading a page, I finish what I'm reading. If what I'm reading continues on to the next page, I click the link. If it comes up very quickly, I read that one too, but if it hesitates, I go to dinner. Once I go to dinner, I almost never come back to what I was reading. While I'm sure my wife appreciates those slow sites because I show up for dinner sooner, those sites are definitely losing me as a viewer based on performance -- and not excessively poor performance at that.

Usage considerations

Usage considerations are even more relevant to user abandonment than they are to determining requirements. In Part 3, I mentioned that in filling out my federal income tax return online I was more tolerant of slowness because my expectations had been set properly. While this expectation setting may have kept me from getting annoyed as quickly, the truth is that I wasn't going to abandon that site no matter how slow it got. The thought of losing my return and having to start over or of having an incomplete return submitted that might lead to my being audited would have kept me from abandoning. In contrast, when I'm just catching up on the news I'll abandon one site and check another long before I'll wait for a slow site.

When thinking about usage considerations, we're really examining how much longer than a user's typical tolerance she's willing to wait before abandoning a task she considers very important. The perceived importance of the activity being performed online is the key factor here. An activity's importance is usually perceived as higher in cases that involve the following:

• real or perceived loss of money as a result of not completing the transaction
• concern that inaccurate information will be submitted if the transaction isn't completed
• lack of an alternative means to accomplish the task at hand
• existence of a real or perceived deadline for completing the transaction

Later we'll discuss how to determine and apply this "importance factor" to our abandonment model.

User expectations

I mentioned expectations above. Abandonment is unquestionably affected by user expectations, as demonstrated by my experience in preparing my tax return. Users abandon sites when their expectations aren't met, or when their expectations haven't been set properly in advance. Our concern in performance testing isn't to manage user expectations but rather to ensure that their expectations drive our requirements. We probably don't have the ability to poll the potential users of the system with questions like these:

• What speed connection do you use?
• How fast do you expect this Web site to be?
• How long will you wait before abandoning your session?

If polling a subset of potential users is possible, we should probably do it. But if we can't, how do we account for user expectations? Our best bet is to evaluate what potential users are employing now to accomplish the same activity and assume that they'll become frustrated if our solution is noticeably (let's say 20% or more) slower than that. This isn't a hard science, but doing this will give us a place to start while deriving our abandonment model. Ideally, this thought process was part of your requirements gathering and contributed to your overall performance requirements. If you collected requirements well, I submit that your users' expectations should be (statistically speaking) equivalent to the performance requirement associated with each page, meaning that if a requested page downloads within the time specified in the performance requirements, users won't abandon the site for performance reasons.

Why account for user abandonment?

Before we create a user abandonment model based on the factors I just discussed, let's take a step back and talk about why we should be concerned about accounting for user abandonment in performance testing. Alberto Savoia says this in "Trade Secrets from a Web Testing Expert": "You should simulate user abandonment as realistically as possible. If you don't, you'll be creating a type of load that will never occur in real life -- and creating bottlenecks that might never happen with real users. At the same time, you will be ignoring one of the most important load testing results: the number of users that might abandon your Web site due to poor performance. In other words, your test might be quite useless."

One of the great things about most Web sites is that if the load gets too big for the system/application to handle, the site slows down, causing people to abandon it, thus decreasing the load until the system speeds back up to acceptable rates. Imagine what would happen if once the site got slow, it stayed slow until someone "fixed the server." Luckily, abandonment relieves us of that situation, at least most of the time. Assuming that the site performs well enough with a "reasonable" load, performance is generally self-policing, even if at a cost of some lost customers/users. So one reason to correctly account for user abandonment is to see how your system/application behaves as the load grows and to avoid simulating bottlenecks that might never really happen.

A second reason to correctly account for abandonment is that it provides some extremely valuable information to all of the stakeholders of the system. You may recall that we specified acceptable abandonment rates in our composite requirements in Part 3. Correctly accounting for abandonment allows you to collect data to evaluate those requirements.

Another way to think about why correctly accounting for user abandonment is important is to ask what happens if we don't account for abandonment at all or if we misaccount for abandonment.

Not accounting for abandonment

If we don't account for abandonment at all, the script will wait forever to receive the page or object it requested. When it eventually receives that object, it will move on to the next object like nothing ever happened. If the object is never received, the script never ends. I can think of no value this adds to the performance-testing effort, unless you have a need to show some stakeholder, "Under the conditions you specified, the average page-load time was roughly 2.5 hours." Unfortunately, we do occasionally have to make a point by generating ridiculous and meaningless numbers, but that's not a performance test, and it doesn't get us any closer to delivering a quality, well-performing application. On the other hand, if you don't account for abandonment and virtually all of your pages load faster than your requirements, then your test was perfectly accurate (in terms of abandonment simulation) . . . by accident. Don't settle for being correct by accident. Take the extra few minutes to include abandonment in your performance tests and have the confidence that your results are honestly accurate as opposed to accidentally representative.

It should be pointed out that there are some cases where not accounting for abandonment is OK. Many Web-based applications that my coworkers and I have tested are exclusively for internal audiences that have no choice but to wait. For example, my current client has a policy that all employees (and consultants) must enter the hours they worked that week between noon and 5 p.m. every Friday -- unless they aren't working that day, in which case they're required to notify their managers of their hours so the managers can enter this information into the system. With roughly 3500 employees and consultants accessing this system during a five-hour period on top of the typical traffic, the system gets very slow. Under other circumstances, users might abandon the site and try later or go somewhere else. In this case, they have no choice but to wait, so abandonment isn't an issue.

Misaccounting for abandonment

Misaccounting for abandonment is what load-generation tools do by default. They assume that all users abandon at a predetermined time (in TestStudio, that default time works out to be 240 seconds) yet still continue on, requesting the following page like nothing happened. Of course, you can change settings to improve that by changing the time limit or having the virtual user actually exit, but that's still not context specific by page. If you were really motivated, you could change the parameters before every request based on how long you thought a user would wait for that particular object, but that still wouldn't account for the page as a whole. Improper accounting for abandonment can cause results that are even more misleading than if abandonment weren't accounted for at all. To support this statement, let's consider a couple of examples and their side-effects:
"At 240 seconds, stop trying to get this object, log a message, and move on to the next object" (TestStudio default) -- If you have objects taking more than 240 seconds to load, this may cause unexpected errors in situations where subsequent objects need the former objects to have loaded successfully, because the tool will now be "forcing" the application to serve pages that a real user couldn't reach. This will also skew page and object load times, because you don't actually know how long the object would have taken to load, yet that 240 seconds is calculated as if the download were successful. Worst of all, the subsequent errors normally mask the initial cause, making it appear as if the script is flawed. This is all not to mention that the additional load applied after the time-out (that a real user wouldn't be applying) may skew your results. (Note that TestStudio's default action, which is to ignore failures and continue playback, can be modified with the Timeout_act variable. The default string setting is "IGNORE" but this can be changed to "FATAL" to cause virtual users to bail when an error takes place.) "Just log when people would have abandoned for analysis but don't actually exit the virtual user" -- While this may be useful during early testing (which I'll say more about in a minute), it paints a very inaccurate picture of the actual abandonment rate for a laundry list of reasons, including these:

• Once a VU gets a page slow enough to abandon, it usually gets more of them if it isn't exited, resulting in statistics that show an artificially high abandonment rate.
• Allowing a VU that would have abandoned to continue keeps its load on the system rather than reducing the total load for others, which causes other VUs to experience response times in the abandonment range -- once again inflating the apparent abandonment rate.
• If the abandonment-level response time was actually due to an error, subsequent page requests may also produce errors, making the actual problem (one slow page) much more difficult to detect.

Please note that we're talking here about grossly misrepresenting real abandonment. Mismodeling the abandonment range by a few seconds isn't going to cause this kind of problem. Your abandonment model needs to be reasonable, not perfect. I mentioned a minute ago that just logging potential abandonment and not exiting the VU may be useful during early testing. This is true for several reasons. For example, suppose you have an abandonment model that says all users will abandon if they encounter a page-load time of 30 seconds, but while your site is under development it's taking an average of 45 seconds to return a page, even at very low user loads. You'll still want your scripts to run all the way through to gather information and create system logs to help track down the reason the times are so slow. In this situation, abandoning all of the VUs when they hit the home page gives you no information to help tune the system. Use your best judgment early in testing about whether to just log or actually exit users when they reach the abandonment response time, but always exit users when you're executing a test intended to predict the experience of real users in production.

Building a user abandonment model

Now I'll show you my approach to building representative user abandonment models for a particular Web site. For each type of page on the site, you'll want to establish these four parameters:

• the abandonment distribution
• the minimum abandonment time
• the maximum abandonment time
• the importance level

These parameters relate to the factors discussed earlier that affect user abandonment. You'll initially organize these parameters in a table similar to Table 1. This sample table doesn't model any site in particular but contains values for five page types that have different abandonment parameters. Later on you'll adjust the performance requirement parameter and the absolute abandonment time parameter based on each page's importance level and your appraisal of the context in order to arrive at the model you'll use in your VU scripts.

Table 1: Sample abandonment parameters for five different page types

The abandonment distribution parameter

Earlier I listed some reasons that users might get tired of waiting for a page to load. These reasons can be extremely variable, not only between different users but also between visits by the same user. A user's tolerance for waiting might change dramatically from session to session. While this makes it hard to predict when an individual might abandon, it also means that abandonment is likely to follow standard distribution models such as normal or negative exponential distributions. Our task is to determine for each page type in our table which distribution model best represents the likely abandonment behavior of users.

You might want to review my discussion of distribution models in "User experience, not metrics, part 2: Modeling individual user delays" at this point, because I'm not going to go into much detail about distribution models here. Suffice it to say that either a normal (bell curve) or a uniform (linear) distribution will be most accurate in the majority of cases. If you've ever taken a statistics or psychology course, you know that almost everything that real human beings do (over a large enough sample) can be represented by a bell curve. You may also recall that the key to an accurate bell curve is the standard deviation. We know two things about standard deviations when it comes to Web usage: (1) they're exceptionally large (statistically) in comparison to the range of values, and (2) they're almost impossible for nonmathematicians to calculate accurately. What that means is that in most cases we actually end up with a very flat bell curve that, in effect, approaches a linear distribution. Statistics aside, if you don't have a strong reason to do otherwise, choose between either a normal or a uniform distribution based on your best judgment.

The negative exponential or negexp (logarithmic or one-tailed) distribution is much less common. It applies in cases where most users will behave one way but a few will behave in an opposite manner. I'll give two examples.

• As I've mentioned, I was willing to wait as long as it took to prepare my taxes, but if the performance had gotten bad enough, eventually the system would have ceased responding and I would have effectively abandoned. While it's likely that I would actually have waited until the system timed out, I might have abandoned sooner if I'd believed the system wasn't responding. This situation is represented by a one-tailed distribution where a few users may abandon in a short period of time but most users hang in there as long as possible.

• On some Web sites, fields in a form are automatically populated when the user starts entering data and, for example, chooses a value from a list box. Usually we don't even expect this to happen, but as long as it happens quickly (and the values that appear are correct) we don't mind -- or at least, I don't. But if it doesn't happen quickly, all we know is that our page is frozen and we can't enter data in the next field (or worse, we can enter data, but it gets erased when the screen finally does refresh). That situation will cause users to abandon a site faster than any other situation I can think of. Thus, I represent user abandonment of field population with a logarithmic distribution that has most people abandoning quickly and only a few people hanging in there.

If you really don't know which distribution is best, use a random distribution. As Savoia put it, "It's important to realize that even the most primitive abandonment model is a giant leap in realism when compared to the commonly used 60- or 120-second timeouts."

The minimum abandonment time parameter

The minimum abandonment time parameter is our estimate of the amount of time that users expect to wait for a page to load. This is the minimum amount of time we think they'll wait before abandoning a site. Recall that I said above that users' expectations should be (statistically speaking) equivalent to the performance requirement associated with a page. To supply this parameter, simply copy the performance requirement from your Performance Test Strategy document. If you have multiple requirements based on user connection speed, you'll want to follow this process and ultimately create an abandonment model for each connection speed.

The maximum abandonment time parameter

No matter how patient a user may be, sometimes a Web site simply fails to respond due to circumstances like these:

• secure session time-out (requiring users to either abandon or start over)
• browser "page cannot be displayed" time-out errors
• temporary (or permanent) Internet connectivity interruptions on either end

While this doesn't technically count as user abandonment (since the Web site abandoned the user and not the other way around), it does provide the upper bound for how long users could potentially wait before ceasing their Web-surfing session. Most load-generation tools (including Rational TestStudio) assume this category is the only time abandonment occurs, but as you've seen, this is highly misleading when trying to predict production-level performance accurately.

The maximum abandonment time parameter is the most scientific of the parameters. This is simply the time after which either your browser stops waiting for a response (in our example case, 120 seconds) or your secure session expires and you have to start your session over (in our example case, 900 seconds or 15 minutes). To determine these numbers, simply ask the architect/developer responsible for the presentation tier (or Web server) to provide you with the information.

The importance level parameter

The importance level parameter is simply a commonsense assessment of the perceived importance of a particular activity to the typical user. You can certainly poll users and stakeholders to obtain this information, but I wouldn't spend the time to get more scientific than the four-tier rating system (low, medium, high, and very high) used here. What we're going to do with these importance levels is to assign them values that we'll use to adjust the abandonment min and max times. Remember, users are likely to wait longer to abandon a page when the perceived importance of completing the task on that page is high.

Adjusting the parameters to create the abandonment model

Now you'll make the necessary adjustments to the parameters table to create the abandonment model you can use in your VU scripts. Table 2 shows the values I suggest using to adjust the minimum and maximum abandonment times in your table for each level of importance. Allow me to stress that these are guidelines; be sure to apply a healthy dose of common sense when applying these factors and take the context of your particular situation into account.

Table 2: Factors for adjusting minimum and maximum times for importance

There are two notes I want to make about these guidelines. First, you'll notice that the min time factor for low importance and the max time factor for very high importance are both 1. This is by definition in our model. If you find yourself wanting to change those factors, consider reassessing your parameters rather than the importance factor. Second, you'll notice that for small ranges, applying these factors blindly could result in the minimum time being larger than the maximum time. If this happens, simply revise the importance factors and recalculate until your minimum is once again smaller than your maximum. Applying those factors to our example parameters, we come up with the abandonment model shown in Table 3. Take a moment to review the table and form your own opinions about the times as they're listed.
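Since Table 2 itself isn't reproduced in this text, here's a small worked example of the adjustment arithmetic before we look at Table 3. The 1.5 and 0.75 factors below are placeholders chosen purely for illustration (they are not the values from Table 2); the unadjusted times are the 5-second requirement and 120-second browser time-out used elsewhere in this example.

/* Illustrative only: adjusting one page's abandonment times for a "high"      */
/* importance rating, using placeholder factors rather than Table 2's values.  */
int min_ms, max_ms, adj_min_ms, adj_max_ms;

min_ms = 5000;     /* minimum abandonment time = the page's performance requirement */
max_ms = 120000;   /* maximum abandonment time = the browser time-out               */

adj_min_ms = (min_ms * 150) / 100;   /*  7500 ms -- users wait a bit longer than the requirement */
adj_max_ms = (max_ms * 75) / 100;    /* 90000 ms -- but bail out before the technical maximum    */

/* Per the notes above, confirm adj_min_ms is still smaller than adj_max_ms; */
/* if not, revise the factors and recalculate.                               */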

Table 3: Preliminary abandonment model

As I review Table 3, I'm quite comfortable with those values . . . except two. I don't believe that anyone is going to wait 450 seconds (7.5 minutes) for confirmation of a bill payment. This is probably because the session keep-alive is configured for longer than it needs to be (15 minutes versus maybe 5 minutes), but assuming that configuration is nonnegotiable, I'm simply going to arbitrarily change the abandonment max time value for the Pay Bill page to 240 seconds. I also don't believe anyone is going to wait 60 seconds for some fields to dynamically populate, so I'm going to change that value to 20 seconds. Does that mean that I disagree with my own guidelines? I don't think so. I think it just means that I'm taking a step back to look at my model in the context of the unique aspects of the system I'm modeling. (I strongly recommend that you take the time to do that throughout your projects.) Finally, Table 4 shows the model that we'll apply to our example script.

Table 4: Final abandonment model

Adapting VU scripts to correctly handle abandonment

With our abandonment model in hand, we're ready to accurately account for user abandonment using Rational TestStudio. I suggest you either record a short script with blocks or timers, or pick an existing script, and follow along. Any script that's currently working on an active site will do.

Adding the abandonment procedure

In "User experience, not metrics, part 2: Modeling individual user delays" I introduced the normdist function to create bell curve distributions. The first thing you'll need to do is add that function, shown in Listing 1, to the top of your script as we did in that article, since it's used in the abandonment procedure we're about to discuss.

#include <VU.h>

int func normdist(min, max, stdev)
/* Specifies input values for normdist function. */
/* min: Minimum value; max: Maximum value; stdev: degree of deviation */
int min, max, stdev;
{
   /* Declare range, iterate, and result as integers - VU C does not support floating-point math. */
   int range, iterate, result;

   /* Range of possible values is the difference between the max and min values. */
   range = max - min;

   /* This number of iterations ensures the proper shape of the resulting curve. */
   iterate = range / stdev;

   /* Integers are not automatically initialized to 0 upon declaration. */
   result = 0;

   /* Compensation for integer vs. floating-point math. */
   stdev += 1;

   for (c = iterate; c != 0; c--)                   /* Loop through iterations. */
      result += (uniform (1, 100) * stdev) / 100;   /* Calculate and tally result. */

   return result + min;                             /* Send final result back. */
}

Listing 1: The normdist function

Immediately after the normdist function but before anything else in your script, you'll need to insert the abandonment procedure shown in Listing 2. If you're not familiar with C, this procedure does nothing more than take in the parameters you've created in your model, determine if the abandonment threshold has been met, and if it has, print a message to the log file and exit the virtual user. For cases like we discussed above where you want to just log the abandonment and not exit the user, simply comment out the user_exit line by preceding it with a double slash (//).

int int_tmpstart, int_tmpend, int_tmptime;   /* Variables used to calculate page load times */

proc abandon(int_tmptime, int_min, int_max, str_distro)
/* Specifies input values for abandonment procedure. */
/* int_tmptime: time it took the last page to load;
   int_min, int_max and str_distro: abandonment model parameters */
int int_tmptime, int_min, int_max;
string str_distro;
{
   int int_abandon = 0;

   /* This block determines the abandonment threshold based on the passed parameters. */
   if (str_distro == "norm")
      int_abandon = normdist(int_min, int_max, (int_max - int_min) / 3);
   else if (str_distro == "uniform")
      int_abandon = uniform(int_min, int_max);
   else if (str_distro == "negexp")
   {
      int_abandon = negexp(int_min) + int_min;
      if (int_abandon > int_max)
         int_abandon = int_max;
   }
   else if (str_distro == "invneg")
   {
      int_abandon = int_max - negexp(int_min);
      if (int_abandon < int_min)
         int_abandon = int_max;
   }
   else
      int_abandon = ((rand() / 32767) * (int_max - int_min)) + int_min;

   /* If the threshold has been met, write a message to the log. */
   if (int_tmptime > int_abandon)
   {
      testcase ["Abandon"] 0, "", "Virtual User " + itoa(_uid) + " Abandoned after Command_id "
         + _cmd_id + " after " + itoa(int_tmptime) + " milliseconds";
      user_exit(0, "User Abandoned " + itoa(int_tmptime));
   }
}

Listing 2: Basic abandonment procedure

You may choose to add the following line to this procedure between the testcase and user_exit lines to create a separate output file that just contains messages about abandonment:

printf("Virtual User " + itoa(_uid) + " Abandoned after Command_id " + _cmd_id + " after " + itoa(int_tmptime) + " milliseconds");

Once you've completed this (and recompiled your scripts to make sure you didn't copy in anything extra -- like I did when I was reviewing this), you're ready to take the next step. You may also choose to save these into a separate .h file and include that file instead. If you want to do that, copy and paste the code above into a text editor (such as Notepad) and save the file as common_functions.h (for example) into the include folder below the TMS_Scripts folder. Then, at the top of the script you want to apply the abandonment procedure to, simply add the line

#include <common_functions.h>

immediately following the script's existing #include <VU.h> line. This works exactly the same way and saves you from having to copy and paste that code to the top of all your scripts.

Adapting existing scripts

Adapting the rest of the script is actually very simple. Let's say your script looks like the one in Listing 3.

start_time ["tmr_home"];
hq2unx144_fhlmc_com_1 = http_request ["AP_top_~1.016"] "www.yahoo.com", HTTP_CONN_DIRECT,
   . . .
http_nrecv ["AP_top_~1.018"] 100 %% ;   /* 3119 bytes */
http_disconnect(hq2unx144_fhlmc_com_1);
stop_time ["tmr_home"];

Listing 3: Unadapted script

All you need to do is capture the actual start and stop times, calculate the difference between those times, and then pass that difference and the rest of your abandonment model parameters to the abandonment procedure. See the adapted script segment in Listing 4.

int_tmpstart = start_time ["tmr_home"];
hq2unx144_fhlmc_com_1 = http_request ["AP_top_~1.016"] "www.yahoo.com", HTTP_CONN_DIRECT,
   . . .
http_nrecv ["AP_top_~1.018"] 100 %% ;   /* 3119 bytes */
http_disconnect(hq2unx144_fhlmc_com_1);
int_tmpend = stop_time ["tmr_home"];
int_tmptime = int_tmpend - int_tmpstart;
abandon(int_tmptime, 5000, 30000, "norm");

Listing 4: Adapted script

Remember when entering the parameters into the abandonment procedure call to convert seconds to milliseconds and to put the name of the distribution in quotes. As written, the abandonment procedure accepts the following distribution parameters:

• "norm" for normal or bell curve distributions (with a standard deviation of one third of the range)
• "uniform" for uniform or linear distributions
• "negexp" for negative exponential or logarithmic distributions weighted toward the minimum value
• "invneg" for negative exponential or logarithmic distributions weighted toward the maximum value
• any other entry in quotes for a random distribution between the minimum and maximum values

Making a couple of these replacements by hand isn't too bad, but for long scripts I copy the scripts into a text editor with good search/replace functionality to modify the scripts for me. Then I just go back and adjust the abandonment parameters according to the model.

Interpreting results

Most of the rest of this series is about interpreting results, so I won't spend a lot of time talking about it here. But here are a few things I wanted to point out:

• Check your abandonment rate before you evaluate your response times. If your abandonment rate for a particular page is less than about 5%, look for and handle outliers. If your abandonment rate for a particular page is more than about 5%, you probably have a problem worth researching further on that page.
• Check your abandonment rate before drawing conclusions about load. Remember, every user who abandons is not applying load. Your response time statistics may look good, but if you have 25% abandonment, your load may be 25% lighter than you were expecting.
• If your abandonment rate is more than about 20%, consider commenting out the user_exit line and reexecuting the test to help gain information about what's causing the problem.
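A minimal sketch of that triage logic, using the rough 5% and 20% thresholds above (the counts are sample numbers and the variable names are illustrative):

/* Illustrative only: triage based on the abandonment rate for one page. */
int page_vus, page_abandons, abandon_pct;

page_vus = 300;        /* VUs that reached this page during the test        */
page_abandons = 21;    /* of those, how many triggered the Abandon testcase */

abandon_pct = (page_abandons * 100) / page_vus;   /* 7% */

if (abandon_pct <= 5)
   ;   /* probably just outliers -- handle them, then evaluate response times */
else if (abandon_pct <= 20)
   ;   /* likely a real problem on this page -- research it further           */
else
   ;   /* very high -- consider rerunning with user_exit commented out        */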

For advanced users

If you're thinking "That's great . . . but once a user closes his browser (abandons) it's not going to request the remaining objects, whereas this procedure will," keep reading. To take the next step and model abandonment down to the individual request/receive pair, you'll need to replace the abandonment procedure I gave you above with the one in Listing 5.

proc abandon(cmd_id, int_tmptime, int_min, int_max, str_distro, int_wholepage)
/* Specifies input values for abandonment procedure. */
/* cmd_id: read-only variable identifying the commands;
   int_min, int_max and str_distro: abandonment model parameters;
   int_wholepage identifies if the abandonment being evaluated is for the whole page
   or a single request/receive pair */
int int_tmptime, int_min, int_max, int_wholepage;
string cmd_id, str_distro;
{
   int int_abandon, total_bytes = 0, nrecv_total = 1;

   /* This block determines the abandonment threshold based on the passed parameters. */
   if (str_distro == "norm")
      int_abandon = normdist(int_min, int_max, (int_max - int_min) / 3);
   else if (str_distro == "uniform")
      int_abandon = uniform(int_min, int_max);
   else if (str_distro == "negexp")
   {
      int_abandon = negexp(int_min) + int_min;
      if (int_abandon > int_max)
         int_abandon = int_max;
   }
   else if (str_distro == "invneg")
   {
      int_abandon = int_max - negexp(int_min);
      if (int_abandon < int_min)
         int_abandon = int_max;
   }
   else
      int_abandon = ((rand() / 32767) * (int_max - int_min)) + int_min;

   /* If this isn't a whole page -- receive the requested data and time it. */
   if (int_wholepage != 1)
   {
      /* Get how much data should be transmitted. */
      total_bytes = atoi(http_header_info("Content-Length"));

      /* Loop until the bail time is reached or everything is downloaded. */
      while ((time() - _lr_ts < int_abandon) && (nrecv_total < total_bytes))
      {
         if (n = sock_isinput())
            http_nrecv [cmd_id] n;
         nrecv_total += _nrecv;
      }
   }

   /* If everything wasn't downloaded in time or the page took too long, abandon. */
   if ((int_tmptime >= int_abandon) || (nrecv_total < total_bytes))
   {
      testcase ["Abandon"] 0, "", "Virtual User " + itoa(_uid) + " Abandoned during Command_id "
         + cmd_id + " after " + itoa(int_abandon) + " milliseconds";

      /* And exit the user (may be commented out when appropriate). */
      user_exit(0, "User Abandoned " + itoa(int_abandon));
   }
}

Listing 5: Advanced abandonment procedure

Without explaining the code, I'll say that the significant difference between these two procedures is the input parameters. Look at how the new procedure is called in Listing 6, a modified version of the script from Listing 3, and then I'll explain the new parameters.

int_tmpstart = start_time ["tmr_home"];
hq2unx144_fhlmc_com_1 = http_request ["AP_top_~1.016"] "www.yahoo.com", HTTP_CONN_DIRECT,
   . . .
//http_nrecv ["AP_top_~1.018"] 100 %% ;   /* 3119 bytes */
abandon(_cmd_id, 0, 5000, 30000, "norm", 0);
http_disconnect(hq2unx144_fhlmc_com_1);
int_tmpend = stop_time ["tmr_home"];
int_tmptime = int_tmpend - int_tmpstart;
abandon(_cmd_id, int_tmptime, 5000, 30000, "norm", 1);

Listing 6: Adapted advanced abandonment script

The first thing we see is that we're now replacing every http_nrecv command with a call to the abandonment procedure. The syntax is:

abandon(_cmd_id, 0, [min], [max], [distribution], 0);

where [min], [max], and [distribution] are the same parameters we've already discussed, _cmd_id is the command ID read-only variable to identify which specific command we're referring to, the first zero is a hard-coded int_tmptime that's not used in this instance, and the final zero is used as a flag to show that this isn't an entire page-load abandonment call.

After our stop time, the abandonment procedure has the same general parameters but with different default values. The syntax is:

abandon(_cmd_id, int_tmptime, [min], [max], [distribution], 1);

where _cmd_id, int_tmptime, [min], [max], and [distribution] are the same parameters we've already discussed and the final parameter serves as a flag set to 1 to indicate that this is a full-page abandonment call.

Beyond performance testing part 5: Determining the root cause of script failures
Scott Barber, Performance testing consultant, AuthenTec

Summary: The obvious symptom is very rarely the actual cause of a script failure. This article teaches you how to analyze script failures with the intent of finding their root cause so you can debug your scripts effectively.

In the forums on performance engineering that I participate in and moderate, I get questions like these almost daily:

• "I recorded my script and it worked just fine for one user, but when I tried two users they both failed. What's wrong with my Web server?"
• "My scripts pass but the database doesn't get updated. What's wrong with my database?"
• "My scripts work for five users, but when I play back more than ten users my scripts time out. Is my application overloaded already?!"

In each case, the author assumes that since the script passed in some scenarios, or since the IBM® Rational® TestManager software showed a Pass result, the problem must be with the application. While it's true that the application isn't performing as expected and while it may be true that the application is at fault, "appearances often are deceiving," as Aesop (620-560 BC) warned. An application can act in unexpected ways because scripts present it with situations that real users could never create. Besides, the obvious symptom is very rarely the actual cause of a script failure.

Here in the fifth article of the "Beyond performance testing" series, we'll take a look at how to analyze script failures with the intent of finding their root cause so we can debug our scripts effectively and then build stable, robust, reusable scripts to test our applications with. We'll explore some common scripting issues that can cause failures (both true failures and false failures) and I'll show you some methods for collecting information about script failures. So far, this is what we've covered in this series:

• Part 1: Introduction
• Part 2: A performance engineering strategy
• Part 3: How fast is fast enough?
• Part 4: Accounting for user abandonment

This article is intended for all levels of IBM® Rational TestStudio® VU scripters.

Recognizing and minimizing false failures (and passes)

In the Test Log for a VU script, TestManager classifies executed commands as having either passed or failed, but these classifications don't always mean what you might assume they do. Some of the Fail results that show up there aren't actually indicative of problems with the script or application and can be prevented by adjusting HTTP return code or file size parameters. By the same token, some of the Pass results that appear actually mask critical errors. We'll consider how to minimize these false failures and passes first. You should evaluate these areas when developing your scripts as well so you don't waste time troubleshooting in other areas only to find out that you've been misled by the Test Log.

Failures based on HTTP return codes

HTTP return codes are sent back to the client for each object retrieved from the Web server when you record a script. These are the code numbers you'll see in the http_header_recv lines of each block that look like the following:

http_header_recv ["script~1.017"] 200;   /* OK */
http_header_recv ["script~1.025"] 304;   /* Not Modified */
http_header_recv ["script~1.042"] 302;   /* Moved Temporarily */

By default, when you play back your scripts TestManager expects to get that same code when the same object is requested and will only classify the executed command as having passed if that's the case. This can be either a good thing or a bad thing. It's a good thing in that you won't get a passing result if you're expecting a 200 (OK) and you get a 404 (File Not Found). On the flip side, if you're expecting a 200 (OK) and you get a 304 (Not Modified) -- which implies that the request was filled from the cache -- you get a failure. In most cases this will be a false failure, because you do want to be able to retrieve objects from the cache. You can help eliminate some of these false failures by changing the defaults to allow redirects (301, 302, and possibly 303 in place of an expected 200 or 304) and/or cache responses (304 in place of an expected 200, 301, 302, or possibly 303). There are a couple of ways to do this. One way is to modify the push HTTP_control line in each script to include the HTTP_CACHE_OK parameter if you want to allow cache responses and the HTTP_REDIRECT_OK parameter if you want to allow redirects. That modification has been made to the line below.

push Http_control = HTTP_CACHE_OK | HTTP_REDIRECT_OK;

Alternatively, through TestManager you can modify the parameters for all of the scripts in a suite at once. (Note that these changes will only affect scripts when they're played back as a part of that suite.) First, open the desired suite and choose Suite > Edit Settings from the menu bar. You'll see a screen like the one in Figure 1.

Figure 1: Settings screen for a suite

Then click the TSS Environment Variables button for all user groups (indicated by the red arrow in Figure 1) and click the VU HTTP tab. This brings you to the screen in Figure 2.

Figure 2: VU HTTP tab, TSS Environment Variables dialog box (default)

You can see that the default is to not allow the HTTP Control options. To change that, simply check the desired boxes, as shown in Figure 3.

Figure 3: VU HTTP tab, TSS Environment Variables dialog box (modified for return codes)

False passes based on HTTP return codes can also be a problem during playback. Sometimes, for example, a 200 is both expected and received, but in fact what's been received is the wrong thing. For instance, when you click a link and get a message that says "I'm sorry, the site is down for maintenance, please come back tomorrow," you've received a valid page with a 200 code, but you haven't received the page you were looking for. In general, detecting false passes is a matter of spot checking responses. Pages with embedded error messages are discussed in more detail below.

Failures based on file sizes

File sizes are another way that TestManager determines if a request passed or failed during playback. In your scripts you'll see lines like the following:

http_nrecv ["script~1.003"] 100 %% ;   /* 56304 bytes */
http_nrecv ["script~1.006"] 100 %% ;   /* 2103 bytes - From Cache */
http_nrecv ["script~1.012"] 2048;      /* 2048/2048 bytes */

While there are several variations on how file sizes are evaluated for correctness during playback, they really boil down to either allowing response sizes to vary or not. In the first case, you ensure the command shows 100 %%, so downloading a file of any size bigger than 0 bytes counts as a Pass. In the second case, you ensure the values in the command match the number of bytes when recorded; any file that's downloaded with a size different from that number of bytes counts as a Fail. In most cases, allowing response sizes to vary will keep you from getting tons of false failures, since anything dynamic on an HTML page -- even if you can't see it on the screen -- can cause the file size to change and thus a failure to occur if response sizes aren't allowed to vary. You may choose to dictate specific sizes for certain graphics or screens if, for example, you know that the correct screen/graphic will always be one size and that the error screen/graphic will always be a different size.

Other than modifying the code line by line, you can set the parameter globally like we did for the HTTP return codes above. For an individual script, you can do this by adding or deleting the HTTP_PARTIAL_OK parameter in the push HTTP_control line. For an entire suite, you can do this through TestManager by returning to the VU HTTP tab of the TSS Environment Variables dialog box and checking or unchecking the "Allow partial responses" option (see Figure 4). Once again, this is a useful tool to help minimize false passes and fails.
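For an individual script, the modified line might look like the following. This is a sketch on my part rather than a line from the original article; it simply combines the partial-response option with the cache and redirect options shown earlier, so verify the option names against your own generated scripts:

push Http_control = HTTP_CACHE_OK | HTTP_REDIRECT_OK | HTTP_PARTIAL_OK;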

Figure 4: VU HTTP tab, TSS Environment Variables dialog box (modified for sizes)

Passes returned for embedded errors

Embedded errors are those red words on a page that's substituting for the page you actually want, telling you that you failed to get to the page you were looking for. So instead of the Web server returning a 403 or 404 error, it redirects you to a custom error page that tells you why your request didn't work. These pages are generally thought of as user friendly, but they're also unfriendly to performance test scripters. As mentioned above, TestManager (and virtually all other load-generation tools) expects errors to arrive as return codes, not as text on a page. The bottom line is this: embedded errors are very good at presenting results that, on the surface, appear to be passing results, when in fact your script is just being redirected to the same custom error page over and over again. One indicator that this may be occurring during your test execution is that page response times are much faster than expected and don't seem to slow down even under relatively extreme loads.

Embedded errors really deserve an article all their own, but allow me to at least explain conceptually how to detect and handle them. We'll discuss other methods of detecting this issue later. If you have an application that uses embedded errors rather than HTTP return codes to signify errors, you'll have to write custom functions to read the actual HTML of the response from _response and look for those error messages. If your custom function finds one of those error messages, it must return an error using the testcase command and/or terminate the virtual user with the user_exit command. These concepts have been presented previously in different contexts in several "User Experience, Not Metrics" and "Beyond Performance Testing" articles.
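As a rough sketch of the idea (not code from the original article), a custom procedure patterned after the output_html and view procedures shown later in this article could scan _response for the text your custom error page displays. The procedure name and the error string are placeholders, and the exact argument list for user_exit is an assumption to confirm against the VU language reference:

proc check_embedded_error()
{
   /* 'down for maintenance' is a placeholder -- substitute the text your custom error page actually displays */
   if (match('down for maintenance', _response)) {
      /* terminate this virtual user; the user_exit arguments shown here are an assumption -- confirm them in the VU reference */
      user_exit(-1, "Embedded error page received instead of the expected page");
   }
}

You would call such a procedure after each http_nrecv, just as output_html and view are called in the examples later in this article.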

Addressing common causes of script failures

Now we'll look at some common scripting issues that can cause real failures: datapools, navigation, data correlation, and authorization/authentication. I'll suggest ways to approach script failures when any of these issues might be the cause.

Datapools

Datapools, particularly when used for user names and passwords, often cause script failures that are deceiving. Without going into detail about creating or maintaining datapools, I do want to point out these key areas to verify when a script failure occurs:

• Ensure that in the datapool section of the script, the variables you want to be read from the datapool are marked as INCLUDE, not EXCLUDE. This is a very easy mistake to make.
• Ensure that the data in the datapool is correct.
• Ensure that you have enough data, or that you specify DP_WRAP_OFF if overlapping/duplicate data is a potential issue.

Remember always to check the request or requests before the first identified failure to verify input data. It often takes several requests past the actual script error to generate an error that TestManager recognizes as a failure. If you're unsure how to check what data was actually passed to the application for a particular virtual user, see the sections below under "Collecting Information About Script Failures."

Navigation

If you're using split scripts of any sort, you can end up with some extremely strange-looking errors that turn out to be the result of requesting pages in the wrong order. This can even happen with entire-path scripts when previous requests don't process properly. If you suspect that navigation could be involved in an error at either the script or the application level, I suggest the following approach:

• Ensure that the navigation path your virtual user is following is valid by testing it manually.
• Ensure that the last page that showed a Pass result in TestManager is the correct page and has the correct content (see "Viewing Returned Pages After the Test Run" for how to do this).
• Ensure that the data being passed in the request for the failing page is both correct and properly formatted (see "Viewing Native Logs" for how to do this).
• Step through the script one page at a time, evaluating both requests and responses, until you can narrow down the cause of error to one request/response pair (see "Stepping Through Scripts" for how to do this).

A significant majority of errors appear on the surface to be navigation related. The ones that actually turn out to be navigation related are usually due to either split scripts or the application. Even though issues related to split scripting are fairly common, they're beyond the scope of this article.

Data correlation

Data correlation is another area that commonly leads to both script and application errors. Data correlation is just a fancy term for "I need to grab some data from the last page that changes every time someone accesses that page." Part 11 of the "User experience, not metrics" series discussed data correlation as it relates to authentication and session tracking. Data correlation issues are often fairly difficult to detect.

By default, Robot doesn't correlate any values during script generation. This may or may not be acceptable for your application. If you record a script and it doesn't work, I recommend regenerating that same script with the recording options set to correlate values. First go to Tools > Session Record Options, click the Generator per Protocol tab, and change the "Correlate variables in response" item to "All" (see Figure 5).

Figure 5: Generator per Protocol tab, Session Record Options dialog box

Then regenerate the script in Robot by going to Tools > Regenerate Test Scripts from Session. If the script works this way, you have two options. The first is to leave it as is and the second is to evaluate exactly which variables need to be correlated for the script to work properly. The latter is a tedious process no matter how you do it and is beneficial only if (1) conserving small amounts of script overhead is more important than hours of your time, or (2) making your code look "clean" is very important. The one good thing about data correlation issues is that they almost always present themselves the first time the script is run. The script will almost never pass, even with a single virtual user, if you have a data correlation issue.

Authorization/authentication

Authorization and/or authentication issues are really just special cases of datapool and data correlation issues. All I want to add here is a list of some common indicators that an error may be related to either authorization or authentication:

• The logon page appears, but every subsequent page yields an error for one or more virtual users.
• The HTTP return code on the failing pages is either 401, 403, or any code in the 500s.
• The script seems to be working, but back-end data isn't being updated.
• The script seems to be working, but searches are returning no data when you know the data exists and the request for data is formatted correctly.

In these cases, first make sure that all of your user credentials are valid for the actions you're assigning users to. Next, ensure that all session identification information is being correlated properly (see Part 11 of the "User experience, not metrics" series for more information). Finally, get the member of the development team who's responsible for security involved. Security can be implemented in so many different ways that it's simply impossible for me to make further generalizations about how to debug these issues.

Collecting Information About Script Failures

There are probably more ways to collect information about performance script failures than there are performance testers in the world. Needless to say, there's simply no way I can discuss all of them in this article. Besides, it would be terribly dishonest of me to lead you to believe that I know them all -- or even most of them! What I can do, however, is discuss some of the common methods at our disposal through TestManager plus a few custom methods that aren't native to TestManager, some of which you may be familiar with if you've used other industry-leading tools. Please note that although I'm going to discuss only information available through TestManager, I'm not advocating ignoring other sources of information. On the contrary, I strongly encourage you to use all sources of information available to you, which may include but not be limited to the following:

• Web server logs
• application server logs
• network or application monitoring software
• database logs and/or queries
• system admins/developers
• system documentation
• RFC documentation

Viewing Native Logs

By default, TestManager captures and saves in log files all of the request and response data transmitted during a script execution. While the logging settings can and should be changed (on the Logging tab of the TSS Environment Variables dialog box) after scripts are fully developed and large loads are being simulated, the default settings are correct for determining the cause of script failures. If you're already familiar with these logs, feel free to jump to the next section.

These log files can be viewed in two different ways. The first way is to right-click a Fail (or a Pass) result in the Test Log in TestManager and then choose Properties (see Figure 6).

Figure 6: Test Log window in TestManager

Clicking on the Virtual Tester Associated Data tab will show you the view demonstrated in Figure 7.

Figure 7: Virtual Tester Associated Data tab of Log Event window

As you can see in this case, the emulation command failed because TestManager was expecting an HTTP return code of 200 but received a return code of 500 instead. Viewing the logs this way allows you to easily navigate between requests and responses associated with a specific virtual user. Clicking the General tab gives you all of the information to identify that virtual user and the specific command ID for the request or response you're viewing. All of the information that's viewable through this interface in TestManager is stored in flat files in the project repository, which brings us to the other way to view these logs.

The second way to view native logs is to open the files that hold the data presented in the TestManager interface directly. Those files are located in the following directory:

[Drive]:\[RepositoryName]\TestDatastore\TMS_Builds\[BuildName]\[SubBuild]\[TestRun]\perfdata\

Three types of log files may be included in that directory:

• d00# -- file(s) containing most of the data viewable in TestManager, one for each VU
• e00# -- file(s) containing script execution errors, one for each VU
• o00# -- file(s) containing custom output (discussed in the next section), one for each VU

(Note that unless you've edited your scripts to provide custom output, you won't find any o00# files in the directory.) To view any of these files directly, simply right-click, select "View With" (or "Open With" in XP), and select your favorite text editor (Notepad, Wordpad, and Microsoft Word all work fine, though sometimes the files are too big for Notepad's buffer). While you won't find any new information in these files, they do present the information in a format that lends itself to searching, copying and pasting, and such. I use both viewing methods, depending on what I'm looking for.

Creating custom output

Creating custom output is a powerful yet simple way to evaluate what's really happening in a script. The easiest way to demonstrate custom output is with an example. Let's say you have a script that logs a user on to your site, navigates content for a while, and logs off. Your script works just fine for one user, but at a hundred users you realize that 6% of your users are failing to log on. You suspect that six of the user names in your datapool are invalid, but you don't know which ones. You start scrolling through the test logs to find out which users got the logon error, then start reading through the request data to see which user names they have. This is a rather tedious process, as you can imagine.

The alternative is to create custom output. Immediately following the stop_time command for your logon page (an arbitrary decision), place the following line of code:

printf("Virtual User # "+ itoa(_uid) +" was username "+datapool_value(DP1, 'username'));

Then compile the script and run it again. Upon completion, you'll find 100 o0# files in the ./perfdata/ directory. Each one of these files will contain text like this:

Virtual User # 17 was username Jamet
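If you also want the output to identify which request was in flight, the same technique can be extended with the _cmd_id variable used elsewhere in this article. This is a variation of mine, not part of the original example, and the wording of the string is arbitrary:

printf("Virtual User # "+ itoa(_uid) +" was username "+ datapool_value(DP1, 'username') +" at command "+ _cmd_id);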

A quick glance at the User Start (User Group) line that you had to expand to see the individual command failures in the Test Log will show you which virtual user numbers were associated with the six failures. Luckily enough, you'll notice that the o0# file for virtual user #17 is o017. So all you have to do is open the o0# file that matches the virtual user number associated with the failed log on, and you've got the user name that you can now test manually in order to see if it was the cause of the error. In our example, the user name was supposed to be Janet, not Jamet.

Though this is a simple example, I'm sure you can immediately see how this kind of custom output can be extremely useful in debugging scripts -- especially scripts with custom code, variables, and match statements like we've discussed in previous articles.

Viewing Returned Pages After the Test Run

One of the most frequent questions I get while demonstrating or teaching VU scripting in Robot is "How do I see the pages to know if they came back correctly?" In fact, I remember asking that question myself not too awfully many years back. While the technique I'm about to share with you for viewing HTML pages returned during a performance test run after that test has been completed is a little tedious, it's quite simple and I use it all the time while debugging scripts. I suggest that you follow along by doing this exercise with me.

1. Launch Notepad and Internet Explorer, then return to TestManager.
2. Identify the page you want to view in the Test Log (generally by clicking a Timer Start or Timer End to view the timer name and verify it's the page you want).
3. Within that timer, there will be several Emulation Command and Env Variable Change entries. Click each one and look at the Virtual Tester Associated Data tab in the resulting dialog box to find the emulation command containing the main HTML of the page, normally starting with <html> (see Figure 8). This is generally (though not necessarily) the second emulation command after the Timer Start entry. The first emulation command after the Timer Start entry is usually the request and the second is usually the response containing the main HTML.

Figure 8: Virtual Tester Associated Data tab for an HTML page

4. Click somewhere inside the display box containing the HTML, then press CTRL+A to select all of the text and CTRL+C to copy it.
5. Navigate to Notepad and click the text area. If there's text there, press CTRL+A to select the text (you can omit this step if this is a new file), then press CTRL+V to paste the text from the Test Log into Notepad.
6. Press CTRL+S to save the file. When you do this, be sure to save it with a .htm or .html extension. I usually just name the file tmp.htm and save it to the root directory of my C:\ drive.
7. Navigate to Internet Explorer. On the menu bar, choose File > Open. Browse to the file you just saved and click OK. The Web page (minus graphics) will appear in your Web browser.

Figure 9 shows part of a PeopleSoft 8.4 data entry screen that was viewed in this fashion.

Figure 9: Web page generated from the Virtual Tester Associated Data

This page isn't very pretty, but I'm sure you can see its usefulness. Even if you don't know PeopleSoft, you can see that this display gives you enough information to tell if (1) the correct page is being displayed, (2) the fields are being populated with the correct data, and (3) an error message is being displayed on the page.

As a rule of thumb, I generally find pages that generate a Fail result (or that I suspect of being incorrect) and view those pages first. Often such a page isn't very informative, as it typically shows "404 -- File Not Found" or something similar. Then I start moving up the Test Log one page at a time looking at the immediate previous pages and searching for possible causes of the error. Very often, the page or two immediately preceding an error that generates a Fail in the Test Log holds the actual cause of the error.

Take the example of entering bad data into a field. If your application validates data and you enter bad data and then try to force the application to bring up a subsequent page, it will return an error instead. This is correct behavior on the part of your application but looks like an application error. While you could detect this error by trying to find it in the code or viewing the information directly in the Test Log, it's a lot easier to see the error in a browser. For instance, in Figure 9 you can see that the "From Period" and "To Period" values are both 1. Having the same values in both fields causes a 500 error on the next page. Listing 1 is a (small) snippet of the code surrounding the bad value in the script.

re2unx23_fhlmc_com_13 = http_request \["gl_nVis~2.007"\] str_url, (ssl), "POST /psc/"+ str_instance +"/EMPLOYEE/ERP/c/FM_GL_RPT.FM_RUNSQR_FMGL0035.GBL HTTP/" "1.1\r\n" "Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, applicat" "ion/vnd.ms-powerpoint, application/vnd.ms-excel, application/msword, */" "*\r\n" "Referer: "+ str_referer +"\r\n" "Accept-Language: en-us\r\n" "Accept-Encoding: gzip, deflate\r\n" "Content-Type: application/x-www-form-urlencoded\r\n" "User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; T312461)" "\r\n" "Host: "+ str_url +"\r\n" "Content-Length: 669\r\n" "Connection: Keep-Alive\r\n" "Cache-Control: no-cache\r\n" "Cookie: SignOnDefault=c05182; re2unx23.fhlmc.com-7620-PORTAL-PSJSESSION" "ID=PeOThAJvP5eASjef; http%3a%2f%2fre2unx23.fhlmc.com%3a7620%2fpsp%2fgde" "v8%2femployee%2ferp%2frefresh=list:; ExpirePage=http://re2unx23.fhlmc.c" "om:7620/psp/"+ str_instance +"/; PS_TOKEN=AAAApAECAwQAAQAAAAACvAAAAAAAAAAsAARTaGRyAg" "BObQgAOAAuADEAMBQnKJy2eec4oR+w+pMfCRv20gaknwAAAGQABVNkYXRhWHicHYs7CoAwF" "AQnUazEmxiS+L2A2klAewtrb+jh3LgP5sEyCzzGFiUGxb6ZNTeegcBMrFjY2RoSBysnl4rU" "E6V4sZUWxMgo5pmj03dkx/1WvkntB0d/CvA=; PS_TOKENEXPIRE=Tue Nov 26 10:38:4" "9 GMT-05:00 2002; SignOnDefault=c05182\r\n" "\r\n" "ICType=Panel&ICElementNum=0&ICStateNum="+itoa(int_icstate)+ "&ICAction=%23ICSave&ICXPos=0&" "ICYPos=0&ICFocus=&ICChanged=1&ICFind=&PRCSRUNCNTL_LANGUAGE_CD=ENG&GL_SQ" "R_ADJP_ACCOUNTING_PERIOD%240=0&RUN_GL_LEDRPT_BUSINESS_UNIT=CFMAC&RUN_GL" "_LEDRPT_LEDGER=ACTUALS&RUN_GL_LEDRPT_FISCAL_YEAR=2002&RUN_GL_LEDRPT_CUR" "RENCY_CD=USD&RUN_GL_LEDRPT_PERIOD_FROM=1&RUN_GL_LEDRPT_PERIOD_TO=1&RUN_" "GL_LEDRPT_SHOW_ERRORS_ONLY%24chk=N&RUN_GL_LEDRPT_SHOW_DETAIL_JRNL%24chk" "=Y&RUN_GL_LEDRPT_SHOW_DETAIL_JRNL=Y&RUN_GL_LEDRPT_DISPLAY_26_DIGITS%24c" "hk=N&RUN_GL_LEDRPT_FM_CD_COMBTN_OWNR=JV&SEQ_NBR_C%240=&CF_SELECT_OPT%24" "chk%240=Y&CF_SELECT_OPT%240=Y&CF_SUB_TOTAL_OPT%24chk%240=Y&CF_SUB_TOTAL" "_OPT%240=Y&CHARTFIELD_VALUE%240="; Listing 1: Segment of HTML request script for Web page in Figure 9 Would you rather try to find the bad value in this script, or in a browser window? If you want to get fancy, you can use the output_html procedure in Listing 2 to actually save the HTML files independently while the script is running. The procedure creates files named by the user ID followed by the command ID of the request that generated the response and saves the contents of the _response (if it's in HTML) into that file. First you'll want to create a new directory called pagefiles in the TMS_Scripts directory. Here's the resulting path: \[Drive\]:\\[RepositoryName\]\TestDatastore\DefaultTestScriptDatastore\ TMS_Scripts\pagefiles

Then you'll put this procedure either immediately following the #include line or in an included header file.

#include <VU.h>

proc output_html()
{
   if ((match('html', _response)) || (match('HTML', _response))) {
      z = open ("pagefiles\\"+_cmd_id+"_"+itoa(_uid)+".html", "w");
      fprintf (z, "%s", _response);
      close(z);
   }
}

Listing 2: The output_html procedure

In your script, you'll want to call the output_html procedure after every http_nrecv command:

http_nrecv ["perftes~003"] 100 %% ;   /* Internally Generated */
output_html();

Now after you run your script, you can navigate to the ./pagefiles/ directory and double-click any file to view the page (or frame) in a browser window. Please note that neither the procedure nor the Test Log assembles the frames into complete pages or includes graphics files in the pages you'll be viewing. Don't waste any time wondering if this is an error.

Viewing pages during the test run

I had been wondering when TestStudio would have the capability to view VU scripts while they're running. Then my office buddy, Chris Walters, sent me an e-mail with some files and a list of instructions. I followed the instructions, and there I was watching my script run real-time in a browser! The instructions went something like this:

1. Create a directory called perflive in the TMS_Scripts directory:
   [Drive]:\[RepositoryName]\TestDatastore\DefaultTestScriptDatastore\TMS_Scripts\perflive
2. Copy these files into it: options.html, perfpage1.html, perfview.html. (The files can be found in the bpt5.zip file.)
3. Copy the view procedure in Listing 3 into your script (or header file, as we've done previously).

#include <VU.h>

proc view()
{
   if ((match('html', _response)) || (match('HTML', _response))) {
      string filename;
      sprintf(&filename, "perflive\\perfpage%d.html", _uid);
      file = open(filename, "w");
      fprintf (file, "%s", _response);
      close (file);
   }
}

Listing 3: The view procedure

4. Call view immediately following every http_nrecv in your script:

   http_nrecv ["perftes~003"] 100 %% ;   /* Internally Generated */
   view();

5. Compile the script.
6. Open perfview.html in Internet Explorer.
7. Run your script.
8. Watch the Web browser.

The final page I saw looked something like Figure 10.

Figure 10: Sample view of perfview.html

It's pretty self-explanatory, really. Now all you have to do is decide which user you want to monitor, type in the user number, and watch this user go. This is obviously very helpful in debugging scripts as well as visualizing errors in the application. If you use this in conjunction with the Test Log and/or the output_html procedure, you can both see what's happening as it happens and review the code or view the HTML in a browser at your leisure after the completion of the test execution.

I did add one fancy thing to Chris's code after using it for a while. I wanted to be able to make the view and/or the output_html calls optional and easy to enable or disable before any test execution for any script. Immediately following the common declarations in the script, I added two integers that I use as flags, view_flag and output_flag. See Listing 4.

push Timeout_val = Min_tmout;
push Think_avg = 0;
int view_flag = 1;     /* 1 for view during script execution, 0 for do not view */
int output_flag = 1;   /* 1 for output during script execution, 0 for do not save output files */

Listing 4: Flag declarations

Then I changed all the view and output_html procedure calls to be part of if statements:

if (view_flag == 1) view();
if (output_flag == 1) output_html();

This way all I have to do is change the value of the flags (from 1 to 0 or vice versa) to either enable or disable the desired functionality based on my needs for a particular test run.

Stepping Through Scripts

There's one last debugging feature I'd like to present to complete our discussion -- stepping through scripts. If you're familiar with GUI scripting in Robot, you're probably very used to stepping through scripts line by line as a debugging technique. VU scripting has that same functionality, but it's rarely used because you can't really see what the response from the emulation command was to know if you've just stepped into (or right past) the error. This functionality has new value when used in combination with our view procedure.

To step through a script with the view procedure enabled, simply set up your script for view and execute it with perfview.html open in a browser as we did earlier in the article. As soon as TestManager finishes initializing your script, choose Monitor > Suspend Test Run from the menu bar. See Figure 11.

Figure 11: Monitor menu options in TestManager

Then choose Test Script from the Monitor menu to display the Test Script View. From there you can resume running your script and suspend it again after resuming, or step through the script using either the single-step or multi-step option. See Figure 12.

Figure 12: Test Script View in TestManager

For our purposes, we want to click Single Step (which causes the script to execute the next emulation command, then wait for our input), then check our browser with perfview.html. As we continue to click and check, we see that now we have complete control over when the script takes the next step. We can see the emulation command that's being executed and the returned HTML all at the same time. Once again, I'm sure I don't have to spell out how useful this combination is in determining whether an error is due to the script or the application, and if it's due to the script, how easy this makes it to narrow down the possible causes of that script error.

Summing it up

Historically, it's often been difficult to determine the actual root cause of a script failure from the obvious symptoms. Thinking about some of the common causes for failure can help when trying to track down that root cause. Coupling that thought process with the added ability to view the relevant Web pages in a browser both in real time and subsequent to a test run makes the process significantly easier.

Beyond performance testing part 6: Interpreting scatter charts

Scott Barber, Performance testing consultant, AuthenTec

Summary: Scatter charts are by far the single most powerful visual evaluation tool at a performance engineer's disposal. Learn here how to create and analyze this type of chart and greatly shorten your search for bottlenecks.

Scatter charts are by far the single most powerful visual evaluation tool at a performance engineer's disposal. Wise use of this simple-seeming type of chart will greatly shorten your search for those often-elusive bottlenecks. This article, the sixth in the "Beyond Performance Testing" series, is an expanded discussion of the scatter chart introduced in Part 6 of the "User Experience, Not Metrics" series. So far, this is what we've covered in this series:

• Part 1: Introduction
• Part 2: A performance engineering strategy
• Part 3: How fast is fast enough?
• Part 4: Accounting for user abandonment
• Part 5: Determining the root cause of script failures

This article is intended for individuals who either report on or analyze performance-related data. I want to preface it by noting that Edward Tufte's seminar and books are the foundation for my thoughts on graphically presenting information. If you aren't familiar with his work, I highly recommend that you visit his Web site. In the "Ask E.T." forum on his site, he has stated: "The purpose of analytical displays of information is to assist thinking about evidence." The scatter chart, especially the overlaid version I'll show you near the end of this article, is the best aid I know of to thinking about the evidence provided by a performance test.

Creating and reading scatter charts -- a refresher

Let's start this discussion with a review of the basics. If you're already familiar with how to create and read a basic scatter chart, feel free to jump down to the next section, "Analyzing basic scatter charts." If you'd like a quick refresher, keep reading.

Creating a scatter chart in TestManager

You can view a basic scatter chart right in the IBM® Rational® TestManager software quite simply. After any test execution, click the Resp vs. Time button. Once a chart is displayed, choose View > Settings from the menu bar to get the window shown in Figure 1.

Figure 1: Selecting the scatter graph type

Click the Scatter radio button in the Graph Type list and then click OK. I recommend trying out the different Graph Style options available in this window to see which versions of the chart work best for you. I often edit graph labels on the tab of the same name and select specific command IDs to view separately, but we'll discuss that further below.

Recreating a scatter chart in Excel

Details on recreating the basic scatter chart in Excel are given in Part 6 of the "User experience, not metrics" series. Here's a summary of the steps:

1. Copy the data from the first three rows (Cmd ID, Ending TS, and Response) of the table below the scatter chart in TestManager and paste that data into an Excel worksheet.
2. Highlight columns B and C (Ending TS and Response), then choose Insert > Chart from the menu bar. On the Standard Types tab, in the "Chart types" list, select XY (Scatter), and click Finish.
3. Customize the chart using the options available in Excel.

We'll be adding to this basic chart later in the article.

Reading a scatter chart

To refresh your memory about how to read a scatter chart, I'll describe what you see in Figure 2.

Figure 2: Basic scatter chart

Vertically on the left side is the response time in milliseconds. Across the bottom of the chart is the time since the start of the test run, also in milliseconds. Each red dot represents a command ID or timer measurement. Each measurement is placed on the chart at the intersection of the time since the beginning of the test execution at which the timer completed and the length of time that timer captured (that is, how long it took to receive a response). In other words, in Figure 2, each red dot represents a page. The point where the dot lines up on the y-axis (vertical) is the number of milliseconds it took that page to load, and the point where the dot lines up on the x-axis (horizontal) is the number of milliseconds from the start of the test until that page load was complete. Thus, the dots appear sequentially in time from left to right.

Analyzing basic scatter charts

Analyzing any kind of scatter chart, basic or not, isn't a simple process. Nor is it easy to explain in an article. I do a half-day interactive seminar on creating and analyzing scatter charts, and that's just an introduction. It took my mentors about a year and a half to convince me of the value of these charts and teach me how to glean information from them. Still, what follows should be enough to convey the most important concepts: Use scatter charts to identify patterns in performance over time. Once you've identified a pattern visually, use all available resources to determine what caused the pattern.

I'll say a little bit about the analysis process and what you're looking for just to get you started. Then I'll give examples of some of the most common types of patterns I've encountered. But as you can imagine, the number of possible patterns we could run across in a scatter chart, and the number of possible causes for those patterns, is virtually unlimited. Again, the main thing for you to take away from this is that analyzing scatter charts is really about identifying patterns, determining what symptoms these represent, speculating about the causes of those symptoms, and finding a way to validate those speculations.

The analysis process

To have a scatter chart to analyze, we must first execute a performance test. While that may seem obvious, there are some important points to make about this initial step. To be able to analyze a scatter chart with any reasonable level of effectiveness, we must have a thorough understanding of the test we're executing. While I may be able to look at a scatter chart from a test that I know nothing about and successfully guess what the cause of the visible pattern was, it's still just a guess, no matter how often I may guess correctly. That may count as good intuition, but it doesn't qualify as analysis. Some of the things you need to be intimately familiar with to begin effective analysis of a scatter chart are as follows:

• user arrival rate and distribution
• user delay times and distributions
• activities being performed and distribution of those activities
• single-user (and multiuser if available) performance benchmarks for each activity
• number and type of architectural tiers in the system under test

This isn't an all-inclusive list, but as you'll see in the sections to follow, knowledge of each of these items can change the conclusions drawn about a particular chart dramatically.

The next part of the analysis process is all about pattern detection. Detection of patterns (and technically antipatterns) is the cornerstone of all types of performance analysis, not just scatter chart analysis. In the case of scatter charts, there are several types of visual patterns that we want to evaluate. I'll demonstrate these with examples below. The types of patterns we'll be looking for include the following:

• patterns by time slice -- Are response times different during the beginning, middle, and/or end of the test execution?
• patterns by activity -- Are the response times for a particular activity consistent throughout the test, or do they vary according to some pattern?
• patterns by groups of activities -- What groups of activities have response times following similar patterns?
• combinations of the above

These visual patterns are representations of the symptoms we want to analyze. The fact that these patterns are identified visually is what gives the scatter chart its power. While we could scan through tables of thousands, or even millions, of datapoints or create mathematical algorithms to try to detect those patterns, the human eye/brain combination is orders of magnitude quicker and more accurate than any algorithm could be at this task.

Once a pattern or point of interest has been identified visually, the next step is to isolate that pattern and remove the remaining "chart noise." In this context, chart noise includes all of the datapoints that are of no interest because they look like we expect them to. Removing the chart noise allows us to more clearly evaluate the pattern we're interested in.

The first thing I always do to minimize chart noise is to include only user-defined timers (that is, page load times represented by the timers I've inserted) and not every collected component response (that is, page component times represented by command IDs) in the scatter chart. During analysis, I'll often eliminate timers from the scatter chart that are extremely consistent and not adding value to my analysis. To do this in TestManager, simply return to the Report Output Settings window, go to the Select Command IDs tab, and ensure that only the command IDs and/or timers that you want to view are included in the list box on the right. See Figure 3.

Figure 3: Selecting the timers and/or command IDs to be viewed

If you export your data to Excel, you can even make each timer appear as a different color, as we discussed in Part 8 of the "User Experience, Not Metrics" series. This can also be useful in pattern detection, but it's a very time-consuming manual process.

That's enough theory. Now let's look at some examples of patterns you might encounter in a scatter chart and how to interpret what they mean.

A "good" pattern

We'll start with what I call a "good" pattern. This is a good pattern not because the results have met any requirements or performance is good, but rather because there are no performance anomalies to explore. For example, look at the chart in Figure 4 and you'll see that the performance is consistent throughout the test. If you were to look at the performance report output table, you would expect to see very small standard deviations on response times.

Figure 4: A "good" pattern

When you see a chart with a pattern like this one, you'll want to verify that your test run was essentially error free and then move on to comparing actual loads, response times, and component metrics with requirements and expectations and benchmark measurements. A good pattern actually provides little value in terms of analysis, but it provides a lot of value in validating your test.

A "banding" pattern

A "banding" pattern is one that displays obvious horizontal bands of response times. These bands almost always correspond to individual pages. In Figure 5, there are three obvious bands. The top band corresponds to users logging on to the system and displaying the home page. The middle band corresponds to the initial page that displays the areas where users enter and submit their user names and passwords. The bottom band corresponds to logging off from the home page. I was able to determine these things by (1) knowing that those were the only three activities being performed in this test (a benchmark of the logon activity) and (2) using the scatter chart filtering options to display the timers one at a time.

Figure 5: A "banding" pattern

In this case, a banding pattern was what I was hoping for, although the space between the bands was greater than desired, identifying an area for performance improvement. This scatter chart demonstrated that the logon function took about 2.5 seconds under this particular load (considered "very good" by the client). It also demonstrated that the home page took only about .5 seconds to load (also considered "very good" by the client). More important, though, it showed that the logon function took about 2 seconds longer than it took to load a static page. Since we knew that the logon function required the Web server to send a request to the authentication server to verify the user's credentials, we were able to speculate (and later confirm) that this round trip from Web server to authentication server took 2 seconds. This was determined to be "too long" by the client and became an area where tuning took place.

As you can imagine, instrumenting your environment to collect that "inter-tier" response time is often not insignificant (we'll discuss how to do that in detail in Part 9 of this series). Analyzing the scatter chart allowed us to be reasonably certain that doing this instrumentation would be a valuable exercise.

An "outlier" pattern

The "outlier" pattern is demonstrated in Figure 6. I defined outliers and discussed how to eliminate them from your analyses in Part 6 of "User experience, not metrics," so I won't spend much time on the topic here, but there are some comments I believe should be added for completeness.

Figure 6: An "outlier" pattern

As you can see, Figure 6 clearly has two datapoints that are candidates to be eliminated as outliers. But before we eliminate these datapoints, we should make some effort to determine why the response times were so far outside of the pattern for the test. The first thing we should do is reexecute the test under as close to identical conditions as possible to see if the datapoints appear consistently. If the candidate outliers are reproducible, it's likely that they aren't true outliers.

The reason I mention this is that Figure 6 actually represents the same test as Figure 5, but with a heavier load. When I executed the test several times, I found that I regularly got one or two outliers, but they were often in different places during the test run. Upon further analysis, I found that I had one set of incorrect user credentials in my datapool. This candidate outlier was actually performing correctly in response to the improper data I had entered. In this case, analyzing the scatter chart showed us several things:

• We had to double-check our datapool.
• Authentication failure took significantly longer than successful authentication.
• Failing to authenticate a second time (with the same user name) took less time than failing the first time (that is, the credentials for that user name were being cached on the server side even when authentication failed, which was counter to the stated requirements in this case).

In this case, my initial supposition about those two points being statistical outliers was incorrect, but my research to validate that supposition led to findings I may not have come across otherwise.

A "caching" pattern

Another common pattern for a scatter chart is what I refer to as the "caching" pattern. The test that resulted in the scatter chart in Figure 7 was consistent from start to finish, but the chart makes it appear as if something very different from the remainder of the test occurred at the beginning of the test.

Figure 7: A "caching" pattern

At the start of the test, we see very clustered, (relatively) slow response times even though we have well-distributed requests. Running this test a second time resulted in a "good" pattern. The explanation for this is quite simple. After the server had been rebooted, the server-side cache had been cleared and none of the JSP pages had been compiled. As a result, all the users who requested a page before it was compiled or cached had to wait for that activity to be completed before the page could be served. This is quite common and should be accounted for in at least some of your performance testing if your servers will ever be rebooted, or have their cache cleared, in production without some type of precaching and/or precompiling mechanism in place.

A "classic slowdown" pattern

Figure 8 depicts what I call a "classic slowdown" pattern.

Figure 8: A "classic slowdown" pattern

This chart shows a steady-state workload and response time at the start of the run, with response times getting increasingly poor until they start to improve again. In this particular case, it turns out that every activity that accessed the application server started to slow down. What happened was that the application server became overutilized (both CPU and memory) under peak load. The response times began to improve as more users completed their transactions and exited the application (that is, as the total load decreased). This is typical behavior. While oftentimes your test won't run long enough at lower loads at the back end to see the subsequent increase in responsiveness, it's typically there. In the "Creating and reading overlaid scatter charts" section we'll demonstrate how to correlate CPU and memory utilization with response vs. time scatter charts.

A "stacking" pattern

The "stacking" pattern is one to keep a close eye out for. When a chart demonstrates this pattern, the test is likely to be invalid. That doesn't mean that you can't get a plethora of valuable information from a chart with this type of pattern, but it does mean that you have to analyze it differently. Let's start by looking at Figure 9.

Figure 9: A "stacking" pattern

In this chart the results appear to start out fine but then begin to display a spiking pattern shortly before 1,000,000 milliseconds into the test. You can see that pattern most clearly at around 2,000,000 milliseconds, where the dots form a pattern resembling sharks' teeth or a jagged mountain range. This pattern is extremely uncommon in real-life situations. What it represents is all the virtual users "stacking up" and requesting the same objects at almost exactly the same time.

In Part 2 of "User experience, not metrics," we talked about the importance of simulating realistic (variable) think times for individual users to avoid load scenarios that are equivalent to "putting ten users in a room with ten identical computers with a coach yelling 'On your mark . . . get set . . . Click "Home Page" . . . wait . . . wait . . . Click "Page1."'" The stacks on the chart in Figure 9 indicate that this is exactly the scenario that we've created. This generally happens for one of two reasons:

• We didn't model our think times and/or abandonment correctly.
• At some point something happened to the application that caused all of the users to stack up.

In the case of this chart, the stacking was due to the second reason. The way we can tell is by locating the first stack (see Figure 10).

Figure 10: First stack from Figure 9

In Figure 10 you can see that the test ran for more than 500,000 milliseconds (8 minutes) before the first stack occurred. Our test scripts averaged about 10 minutes, so we were more than halfway through our first iteration when this occurred, and thus it wasn't the model. Rather, we had exploited a database connectivity bottleneck that caused users to stack up on one another. Once they stacked up, our delays weren't random enough to overcome it, so the remainder of the test simulated those coached testers we mentioned above.

So why am I making such a big deal about this? The value of this test run ends at the top of the first stack, because everything that happened after that wouldn't happen in real life. Real users would abandon, hit the Refresh button, start over, and so forth. Our scripts simply aren't sophisticated enough to instantly morph from expected case to "weird exception case" in the middle of a test execution. Don't waste your time figuring out what happened after the first stack. Figure out (and fix) whatever caused the first stack, then run your test again. That should result in a chart with a pattern other than the stacking one.

A "compound" pattern

There's one more pattern we should discuss before we move on to consider the overlaid scatter chart, and that's a "compound" pattern. Basically, a compound pattern is any combination of the patterns we've already talked about and/or of the patterns we haven't talked about. Figure 11 depicts a test run with "caching" at the front, a "good" run for a period after that, then a mostly "classic slowdown" toward the midway point. We also have what appears to be a "stack" at about the 600,000-millisecond mark. This may or may not be of interest since the pattern doesn't repeat itself, and it's obviously after other things have started happening that deserve further research.

Figure 11: A "compound" pattern

We can try to determine the causes of the compound pattern we see in this chart by overlaying various resource measurements on the chart. That's what I'm going to show you how to do next.

Creating and reading overlaid scatter charts

What I'm calling overlaid scatter charts are really just basic scatter charts with other data such as CPU and memory utilization -- basically any measurement that could be collected by a program such as Perfmon (the Microsoft Windows resource performance monitoring program) -- overlaid on them. Creating one of these charts often takes much more effort, but this type of chart is very valuable in many circumstances. You'll see why when I show some examples of overlaying various resource measurements on our compound pattern after briefly describing how to create overlaid scatter charts in TestManager and Excel.

Creating overlaid scatter charts in TestManager

To collect resource statistics using TestManager, you need to have Rational TestAgent installed on the machine whose resources you want to monitor. This has proven to be a major stumbling block for the clients of mine who simply won't allow anything to be loaded onto their Web, application, and/or database servers. If that's your situation, feel free to skip to the next section. The basic steps to collecting resource statistics and displaying them in TestManager are as follows:

1. Install the correct version of Rational TestAgent on each machine you want to collect statistics about.
2. Ensure that the Info Server command is included in your script(s) and is pointed to the correct machines.
3. Ensure that when you run your suite, the "Monitor resources" check box is checked in the Run Suite window (see Figure 12).

Figure 12: Checking "Monitor resources"

4. After test execution, run the Resp vs. Time report with the scatter option selected, right-click the scatter chart, and choose Show Resource (Figure 13).

Figure 13: Choosing Show Resource

5. In the dialog window that appears, check the computers and resources you want to have overlaid on the scatter chart (Figure 14).

Figure 14: Checking the resources to be shown

Once you've done all that, you'll have a scatter chart with overlaid resources. Figure 15 shows a simple scatter chart created using this method.

Figure 15: TestManager scatter chart with resource overlay

I intentionally made this chart simple so we could review how to read the chart without too many distractions. The blue line with yellow datapoints represents the percentage of local computer CPU total time used and the purple line with green datapoints represents the percentage of local computer memory used. A scale labeled "Resource Usage" has been added on the right to read the two new lines against, while the scatter chart dots are still read against the scale on the left. This particular chart shows us that the CPU on the local computer (the master station) gets utilized up to 60% while driving this particular test and that the memory usage stays pretty steady -- not very exciting news, but easy to read and understand.

Creating overlaid scatter charts in Excel

If you can't put Rational TestAgent on your servers or you want to overlay some data that you can't collect with Rational TestAgent, you're going to have to build the chart yourself in Excel. The very simplified version of this process is as follows:

1. Copy in your timer and response data from TestManager as described earlier in this article, under "Recreating a scatter chart in Excel."
2. Copy your resource utilization data into new rows, then sort by time from start of the test execution. See Figure 16.

Figure 16: Example Excel table for overlaid scatter chart

3. Highlight the entire table, then choose Insert > Chart from the menu bar. On the Standard Types tab, under Chart Types, select XY (Scatter).
4. Right-click any datapoint representing something other than page response time and choose Format Data Series (Figure 17) to get the Format Data Series window.

Figure 17: Choosing Format Data Series

5. On the Axis tab in the Format Data Series window, click the "Secondary axis" radio button (Figure 18).

Figure 18: Checking the "Secondary axis" option

6. Repeat steps 4 and 5 for each set of datapoints other than page response time.

Once you've completed those basic steps, you'll have to customize the chart using Excel's standard options. I commonly add a trend line to the resource utilization datapoints.

Examples of overlaid scatter charts

Now we can get back to the compound pattern shown in Figure 11. As I mentioned, we'll overlay various resource measurements on that scatter chart to try to determine why response times slowed down approximately halfway through the test. The first place I looked was to the Web servers in the load-balanced system under test. While I looked at many different statistics, I chose to show the scatter chart overlaid with CPU utilization data for both Web servers. See Figure 19.

Figure 19: Scatter chart overlaid with Web server CPU utilization data

In this case we can see that neither Web server's CPU utilization went above 50%, and it wasn't at a peak during the period with the results in question, so this wasn't the problem. Next, I checked the application server CPU utilization. See Figure 20.

Figure 20: Scatter chart overlaid with application server CPU utilization data

In this figure we can see that the application server hovered around 80% CPU utilization throughout much of the center part of the test. This percentage isn't necessarily bad, and since CPU utilization reached 80% long before the poor performance started, I decided to look at one more resource, the CPU queue length. See Figure 21.

Figure 21: Scatter chart overlaid with application server queue length data

Here we see that the CPU queue went above zero for the first time at the first indication of a performance problem, then continued to read above zero intermittently for the next section of the test. This was the actual cause of the poor performance in this test. After the queue rose above zero for the first time, the application server was unable to completely recover until the load was reduced near the end of the test.

While this analysis could have been done without the use of these overlaid scatter charts, the overlays made what was going on much easier to see. And with very little explanation, even stakeholders with little technical background can see the problem visually. This alone is well worth the time spent to create a graph like this one.

Summing it up

Scatter charts are extremely powerful analysis tools. No other tool puts more information in front of performance engineers in a way that they can process as quickly. Scatter charts are also a great way to display critical results to stakeholders in a way they can conceptually, if not technically, understand immediately. The keys to analyzing these charts are understanding both your system and your test, and identifying and determining the causes of patterns.

Beyond performance testing part 7: Identifying the critical failure or bottleneck

Scott Barber, Performance testing consultant, AuthenTec

Summary: How do you gather and present information about potential bottlenecks in the application you're testing in a way that will enable you to work collaboratively with the development team to solve the problem? The tips and rules offered here will help.

So you found an odd pattern in your scatter chart that appears to be a bottleneck. What do you do now? How do you gather enough information to refute the inevitable response, "The application is fine, your tool/test is wrong"? And how do you present that information conclusively up front so you can get right down to working collaboratively with the development team to solve the problem? Those are the questions we'll be addressing in this article. In addition, I'll be giving you eight rules about bottlenecks that I've found to be both significant and useful during my tenure as a performance test engineer.

This kicks off a four-article theme I call "finding bottlenecks to tune." We've already explored the entry-level analysis that points us in the direction of a bottleneck or failure, and I've given some hints on how to ferret them out. Now it's time to get under the hood. By the conclusion of Part 10, you should be confident in your ability to work with the development team to identify and exploit these areas of concern in a way that adds significant value to the overall development process. So far, this is what we've covered in this series:

• Part 1: Introduction
• Part 2: A performance engineering strategy
• Part 3: How fast is fast enough?
• Part 4: Accounting for user abandonment
• Part 5: Determining the root cause of script failures
• Part 6: Interpreting scatter charts

This article is intended for mid- to senior-level performance testers and members of the development team who work closely with performance testers. If you haven't already read Parts 5 and 6 of this series, I suggest you do so before reading this article.

What exactly is a bottleneck?

"If there were no mystery left to explore, life would get rather dull, wouldn't it?" -- Sydney Buchman (www.quotablequotes.net) Through years of experience in detecting bottlenecks, explaining bottlenecks, and teaching others how to do the same, I've learned something very important. It's not safe to assume that most people know what a bottleneck is. Even if you've been doing performance testing for a long time and use the term bottleneck regularly in everyday speech, I recommend that you don't skip this section, particularly since the distinction between a system failure, a slow spot, and a bottleneck will be central to our discussions throughout the rest of this series. The dictionary version Here's how the Webster's Millennium™ Dictionary of English (Lexico Publishing Group, 2003) defines bottleneck: n: a narrowing that reduces the flow through a channel \[syn: constriction\] v 1: slow down or impede by creating an obstruction; "His laziness has bottlenecked our efforts to reform the system" 2: become narrow, like a bottleneck; "Right by the bridge, the road bottlenecks" This definition is understandable and hints at the origin of the term (referring to the narrow part of a jar or bottle) but is most useful to us if we note what it doesn't say as well as what it does say: "reduces the flow," not "ceases the flow" "slow down or impede by creating an obstruction," not "stop by creating an obstruction" "become narrow," not "become impassable" The reason this is important is that it's the basis for the distinction between a performance bottleneck and a performance-related failure that we'll explore below. I summarize that point as ... Scott's first rule of bottlenecks: A bottleneck is a slowdown, not a stoppage. A stoppage is a failure. Something else the dictionary definition doesn't mention that will become relevant to us is load or volume. The fact that the definition above says nothing about the cause of a bottleneck suggests that it shouldn't be assumed that a bottleneck exists only under load or volume. In my experience, most folks assume that bottlenecks don't exist unless a certain "trigger load" is applied to a system. This is both contrary to the definition of bottleneck and often untrue when applied to software systems, thus bringing us to ... Scott's second rule of bottlenecks: Bottlenecks don't only exist under load. The hydrodynamics version Next let's take a look at bottlenecks from a hydrodynamics perspective. I have a BS degree in civil engineering, which means that I took roughly 16 credits of hydraulics and hydrodynamics in college long before I became a performance test engineer. What I've realized since then is that a useful comparison can be made, at least conceptually, between the flow of water through a pipe system and the flow of activity through a software system. Figure 1 shows the simplest possible version of a bottleneck in a pipe (you may notice that at first glance, the drawing resembles a bottle). Without getting into complex formulas, you can see that more water can flow through the section of pipe on the left than on the right, given a constant pressure, over time. The arrows in this diagram depict velocity, or the speed that the water is actually moving through the pipe; the shorter the arrow, the slower the flow of water.

Figure 1: A simple pipe bottleneck What you see here is that the water moves faster through the narrower section of pipe. This concept seems counterintuitive to most people at first, but the explanation is really rather simple. For the same total volume of water to move through a narrower section of pipe, that water has to move faster to make room for the water in the wider section of the pipe. This is a classic example of a queue. To illustrate the concept of a queue, think now of that pipe holding sand instead of water. Each grain of sand in the wider part of the pipe must stop and wait for an opening in the narrower section of the pipe, and thus moves very slowly until it reaches the "release point," roughly where the narrower section of the pipe begins. Once it reaches the release point, that grain of sand starts moving much faster. That "stop and wait" period is a queue. The bottleneck is the cause of the queue; it's not the queue itself. What's important to note here is that the place where the pipe narrows is the bottleneck, but the sand (or water) actually moves most slowly right before the pipe begins to narrow. This brings us to ... Scott's third rule of bottlenecks: The symptoms of the bottleneck are (virtually) never observed at the actual location of the bottleneck. In hydraulics, there's another useful concept: the "critical" bottleneck, defined as the one bottleneck that unless resolved will dictate the flow characteristics of a system. In Figure 2, you can see three sets of obstacles restricting the flow of water through the pipe. It's easy to see that obstacle 2 is restricting the flow the most. In this case obstacle 2 is the critical bottleneck, meaning that removing obstacles 1 and 3 won't actually improve the flow of water through the pipe.

Figure 2: Multiple bottlenecks

More simply put ... Scott's fourth rule of bottlenecks: The critical bottleneck is the one bottleneck along a particular user path that, when removed, will improve both performance and the ability to find other bottlenecks. Exploring critical bottlenecks introduces us to the concept of multiple paths through a system. When you extend a system beyond a single pipe into a closed system, you often add alternate paths through that system. Figure 3 is an example of a closed hydraulics system.

Figure 3: A closed hydraulics system The difficulty of detecting bottlenecks in a system increases nearly exponentially with the number of possible paths through that system. Glancing at Figure 3, you can see that if there were a bottleneck in the pipe on the right side, the water could flow through the pipe in the center instead. This could lead to the appearance of a bottleneck in the center pipe, even though the bottleneck isn't there (see the third rule). Thus, it's important to remember ... Scott's fifth rule of bottlenecks: If you have multiple paths through a system and think there's a bottleneck, you should isolate each path and evaluate it separately. In the system depicted in Figure 3, you can see some items other than pipes -- pumps, valves, and a reservoir. If you think of the pipes as your network and the other items as your hardware (Web server, routers, and so forth), you quickly come up with ... Scott's sixth rule of bottlenecks: The bottleneck is more likely to be found in the hardware than in the network, but the network is easier to check. The analogy between a closed hydraulics system and a Web-based application can actually go a lot further, but I think that's enough for now. The software version When people started using the term bottleneck, the concept of software hadn't even been invented. That fact alone should make us realize that the term probably has a special meaning when applied to software. Often the term bottleneck is used to refer to anything perceived to be slow in a software system, but this use of the term is imprecise and should be avoided. For instance, suppose one page on a Web site has several large graphic images on it. When a user loads this page, it may take a long time. But unless downloading the graphics causes some other activity in the system to slow down, it's not a bottleneck; it's just a slow page, or what I call a "slow spot." The following definition was taken from "Load Testing for eConfidence": A bottleneck is a point in a Web application where congestion and delay occur, slowing down the processing of requests and causing users to experience unacceptable service delays. The key to this definition is the word congestion. The next rule summarizes this example and definition. Scott's seventh rule of bottlenecks: Unless other activities and/or users are affected by the observed slowness or its cause, it's not a bottleneck but a slow spot. We'll discuss why this is significant in the next section.
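Before moving on to those distinctions, it may help to see the queueing idea behind the pipe analogy in code. The toy simulation below is purely my own illustration (it has nothing to do with any test tool): requests flow through a two-stage pipeline whose second stage is slower, and the wait builds up in front of the slow stage while the slow stage itself never stops -- the first and third rules in miniature.

```python
# Toy illustration of Scott's first and third rules: the slow stage (the "narrow
# pipe") never stops, but the queue -- and therefore the observed delay -- builds
# up in front of it, not inside it.
def simulate(num_requests=20, arrival_gap=1.0, stage1_time=0.5, stage2_time=1.5):
    stage1_free = 0.0
    stage2_free = 0.0
    for i in range(num_requests):
        arrival = i * arrival_gap
        start1 = max(arrival, stage1_free)          # wait for stage 1 if busy
        stage1_free = start1 + stage1_time
        start2 = max(stage1_free, stage2_free)      # wait for stage 2 if busy
        stage2_free = start2 + stage2_time
        wait_for_stage2 = start2 - stage1_free      # time spent queued upstream
        print(f"req {i:2d}: queued {wait_for_stage2:5.1f}s before the slow stage, "
              f"total {stage2_free - arrival:5.1f}s end to end")

simulate()
```

Every request still completes (a slowdown, not a stoppage), but the queue in front of the slower stage grows with each arrival, which is where an observer would see the symptom.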

Failure vs. bottleneck vs. slow spot In the course of defining bottleneck, we've made some distinctions that I'd like to spend a little more time on. The first is the distinction between a bottleneck and a failure. I think I've made it clear that a bottleneck is a slowdown, not a stoppage -- meaning that the expected outcome is eventually achieved, regardless of how long it takes. For example, if you wait a long time but the requested Web page does eventually display properly, you've encountered a slow spot or a bottleneck. If, however, you wait and eventually are presented with an error page instead of the requested Web page, this is a failure. The interesting twist to this distinction is that sometimes a very minor change can transform the failure back into a bottleneck. Consider the example above. It's entirely possible that in the second situation, a time-out (failure) occurred due to a Web server setting. Changing that setting and reexecuting your test may result in all activities being completed successfully but taking an unacceptable amount of time and slowing down all users (bottleneck). For our purposes, anytime an error occurs, whether caused by a bottleneck or not, that error is a failure (you may prefer to call it a bug, defect, issue, or area of interest) and should be reported as such. When that failure causes other users to be unable to complete their tasks in the expected manner, that's a critical failure. The main difference between a bottleneck and a slow spot is that a bottleneck has widely felt performance effects. A single large graphic can cause an annoying slow spot that may need to be resolved, but unless there are just a ton of people downloading that graphic (bottleneck caused by a popular activity) or your Web server is underpowered (bottleneck caused by insufficient infrastructure), it's just a slow spot with no real effect on the rest of the system. I'm making a big deal of these distinctions because as we go through this group of articles about bottlenecks, we'll continually find failures and slow spots while we're chasing bottlenecks, and we'll need to distinguish among them to be able to take appropriate action.
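To keep those distinctions straight while you're analyzing results, the decision logic boils down to three questions. The little helper below is just shorthand for the definitions above -- the function and its arguments are my own invention, not part of any tool.

```python
# Shorthand for the distinctions above: a stoppage is a failure, a slowdown that
# drags other users or activities down with it is a bottleneck, and an isolated
# slowdown is a slow spot.
def classify(completed: bool, acceptably_fast: bool, affects_others: bool) -> str:
    if not completed:
        return "failure (critical failure if other users are blocked)"
    if acceptably_fast:
        return "acceptable"
    return "bottleneck" if affects_others else "slow spot"

# Examples drawn from the text:
print(classify(completed=False, acceptably_fast=False, affects_others=True))   # time-out -> failure
print(classify(completed=True,  acceptably_fast=False, affects_others=False))  # big graphic -> slow spot
print(classify(completed=True,  acceptably_fast=False, affects_others=True))   # congestion -> bottleneck
```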

Identifying bottleneck suspects There are at least as many ways of identifying bottleneck suspects as there are people who've ever observed something being slow when working on a computer. It's our job as performance test engineers to identify as many of those suspects as possible and then sort, categorize, verify, test, exploit, and potentially help resolve them. Let's discuss some common ways to identify bottleneck suspects. For now, let's not worry about whether these suspects might turn out to be failures or slow spots instead of bottlenecks. Examine response vs. time charts/tables If you're already executing performance tests, the most obvious place to look for bottleneck suspects is the response vs. time charts and tables. I'm assuming that by now you're familiar with these charts and tables, but if you'd like a refresher, Parts 6, 7, 8, and 9 of the "User experience, not metrics" series discuss them in detail. By looking at the default charts that are displayed at the end of a test execution in Test Manager, you'll immediately be able to see which timers or command IDs are noticeably slower than the others. Every one of these is a bottleneck suspect. Additionally, every timer or command ID that has a very large standard deviation (for example, a mean time that's much smaller than the 90th percentile time) is a suspect. While it's more likely that each of these is a symptom, a slow spot, or a failure, they're all worth noting and evaluating further. If you've executed several tests under different loads, you could create the response time by test execution chart (described in "User experience, not metrics," Part 9) to see if you have any load-related suspects. In Figure 4, for example, we see that performance seems to degrade significantly when there are more than 150 users and when slower connection speeds are emulated. These are examples of strong bottleneck suspects.

Figure 4: Response time by test execution chart Study scatter charts Since the previous article in this series (Part 6) is devoted entirely to scatter charts, we won't spend much time on them here. In case you hadn't guessed it from reading Part 6, scatter charts are my favorite analysis tool, and the ease of identifying bottleneck suspects using them is one of the reasons. Simply put, any pattern that shows more than one dot (outlier) outside of your predefined acceptable performance levels is a potential bottleneck. The most likely suspects are patterns like the classic slowdown and banding patterns with bands above your acceptable performance level. Caching patterns are also good places to look, but as before, stacking patterns are more likely to result from bad test models or system failures. Rely on personal observation Personal observation is one of your best tools for identifying suspected bottlenecks. As you're creating scripts, you're using the application. You get to "feel" what performance is like, and you get a good idea of what types of activities cause the application to perform differently (better or worse) before you ever execute your first load test. These observations are extremely valuable, not only as a method of validating your scripts but also as a way to identify bottleneck suspects. Don't assume that your tool is better at detecting bottlenecks than you are. The ultimate users of the system are people, not load-generation tools; that in itself makes your opinion (based on observation) more valuable than the numbers the tool reports. Listen to third-party comments Ultimately, other people will start using the system -- generally while testing it. These folks will find all kinds of failures, slow spots, and bottlenecks, whether they realize it or not. It's important to talk to them and even observe them periodically to see and hear what they think of the system from a performance perspective. A simple comment like "I don't remember that search taking that long in the last version" is a big red flag that there may be a bottleneck hiding somewhere in the search activity. The best part about that flag is that the search may have seemed pretty fast to you, since you may never have used the last version. Don't assume that your personal observation will result in the same suspects as the observations of a casual user.
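Looping back to the response vs. time screen described above: if you can export your timer results from TestManager (or any other tool) to something like CSV, that screen is easy to automate. The sketch below is hypothetical -- the file name, column names, and thresholds are all assumptions you'd replace with your own.

```python
# Sketch: flag timers whose 90th percentile is far above their mean, or whose
# 90th percentile exceeds the acceptable response-time goal.
# Assumes an exported CSV (hypothetical): timer_name,response_s (one row per sample).
import csv
import statistics
from collections import defaultdict

GOAL_S = 8.0        # assumed acceptable 90th-percentile response time
SPREAD_RATIO = 2.0  # assumed "large spread" threshold (p90 vs. mean)

samples = defaultdict(list)
with open("timer_results.csv", newline="") as f:
    for row in csv.DictReader(f):
        samples[row["timer_name"]].append(float(row["response_s"]))

for timer, values in sorted(samples.items()):
    if len(values) < 2:
        continue                                   # need at least two samples
    mean = statistics.mean(values)
    p90 = statistics.quantiles(values, n=10)[-1]   # 90th percentile
    if p90 > GOAL_S or p90 > SPREAD_RATIO * mean:
        print(f"SUSPECT {timer}: mean={mean:.2f}s  p90={p90:.2f}s  n={len(values)}")
```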

Confirming suspects

After identifying a list of bottleneck suspects, the next step is to confirm them. Confirming a suspect won't necessarily confirm that the suspect is a bottleneck but rather will only verify that you've encountered some kind of performance issue (that is, either a bottleneck, a slow spot, or a failure) that warrants further research. In some cases you'll know at the time of confirmation which it is, and in other cases you won't know until much later in the process. The key to confirming suspects is the ability to reproduce the results both exactly and manually. Until you can do at least one of those two things, the suspect is unconfirmed; and although unconfirmed suspects aren't necessarily invalid, they're generally given lower priority than confirmed suspects. Reproducing results with similar tests, with minimalist tests, and with not-so-similar tests can also offer clues to help you distinguish bottlenecks from other types of performance problems. Reproduce results exactly The single most important requirement for confirming a bottleneck suspect is the ability to reproduce the results that you or others have identified as indicating the suspect -- that is, the symptoms. If the symptoms can't be reproduced, it's often the case that the observed condition that led to identifying the suspect was caused by something unrelated to the test. For instance, while I'm building my scripts, I often end up with a whole list of bottleneck suspects from observation that I dismiss a week later when I can't reproduce them. The reason I dismiss them so easily is that I'm often developing my scripts against a development environment that's in a state of flux. If I can't reproduce my observation in the test environment, I do make a note to myself, but I've found that it's generally safe to assume that the development environment isn't stable enough to put much faith in my findings there. This is just one example. I'm not recommending blindly dismissing everything you observe in a development environment. What I'm saying is use common sense. If you know that the developers are refreshing the database, promoting code, and rebooting servers multiple times a day, you can feel pretty confident that your suspect bottlenecks are suspect. Reproduce results manually If the suspected bottlenecks were observed while you were using a tool, you need to do whatever you can to reproduce the symptoms of the potential bottleneck manually. It's always possible that your test is causing a symptom that real users wouldn't encounter. Validate the accuracy of your scripts before considering a suspect bottleneck that was detected by a script to be confirmed. Even then, you'll want to try to reproduce that suspected bottleneck manually both while no one else is on the system and while the test is executing with the load at which the suspect bottleneck was first detected. The ability to observe the symptoms under one or both of those scenarios confirms a bottleneck suspect. If the suspected bottleneck was identified through a third-party comment, try to reproduce the symptoms yourself. If you can't reproduce the symptoms, try to get the person who made the comment to reproduce the symptoms for you. If you and the person who made the comment have trouble reproducing the symptoms, take the time to try to determine what other factors could have contributed to the observation -- for instance, the application being run on a different environment or a patch being applied that day. 
Reproduce results with similar tests Without beating a dead horse, if you observe symptoms of a bottleneck using tools, be sure you can reproduce those symptoms with the same or similar tests -- preferably with some variances, such as time of day, load, varying data, or additional activities that seem to be unrelated to the symptoms. The ability to reproduce the symptoms in similar situations is a strong indicator that the issue deserves further research and is therefore a confirmed bottleneck suspect. Reproduce results with minimalist tests While you're confirming your suspects, you should try to reproduce the symptoms with the simplest test (manual or automated) possible. For instance, try to reproduce the symptoms without load, or without performing any other activities while logged in as that user. It's not absolutely necessary to be able to recreate the symptoms of the suspected bottleneck with a minimalist test for that suspect to be confirmed, but it will answer one of the first questions that the stakeholders are bound to ask and will aid in your ability to demonstrate the suspect.
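A minimalist test doesn't have to involve the load tool at all. The sketch below simply times one suspect activity by itself, with no load on the system; the URL and the acceptable-response threshold are placeholders for whatever activity and goal apply in your project.

```python
# Minimal, load-free reproduction check: time one suspect activity by itself.
# The URL and threshold are placeholders -- substitute the activity that showed
# the symptoms in your environment.
import time
import urllib.request

SUSPECT_URL = "http://test-env.example.com/search?title=t"   # hypothetical
THRESHOLD_S = 5.0                                             # hypothetical goal

for attempt in range(10):
    start = time.perf_counter()
    with urllib.request.urlopen(SUSPECT_URL, timeout=60) as resp:
        resp.read()                      # drain the response so timing is complete
        status = resp.status
    elapsed = time.perf_counter() - start
    flag = "SLOW" if elapsed > THRESHOLD_S else "ok"
    print(f"attempt {attempt}: HTTP {status} in {elapsed:.2f}s  {flag}")
    time.sleep(1)                        # brief pause between single-user requests
```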

Reproduce results with not-so-similar tests Just like reproducing results with minimalist tests, reproducing results or symptoms with not-so-similar tests will help you demonstrate the existence of the suspect. It's also a big step toward identifying the differences among a slow spot, a failure, and a bottleneck. For example, a not-so-similar test may show that searching for a book as well as searching for a store near you on a retail site are both slow. If only searching for a store near you were slow, you might be tempted to think that something specific to that search was slow (slow spot), but knowing that both types of searches are slow may lead you to think that the database is poorly tuned (bottleneck).

Reporting confirmed suspects I hinted at the importance of effectively reporting confirmed suspects earlier in this article. My experience shows that reporting suspected performance issues (failures, slow spots, and bottlenecks) is tricky business. You're often met with skepticism, disbelief, defensiveness, or dismissiveness... sometimes from different stakeholders in the same meeting! The first thing to remember is that you're not alone. Every performance test engineer who's ever reported a suspected issue has faced this. The second thing to remember is that if you've followed the approach outlined above, you have a confirmed, reproducible suspect to report. If you report it well, no one will be able to refute that it's a valid suspect. On the other hand, if you present your suspects poorly, overstate or understate them, or don't report them at all, they may never get addressed. It's our job as performance test engineers to ensure that these suspects get taken seriously and addressed appropriately. Over the next few paragraphs I'll share with you some hints that I've found useful when reporting suspected bottlenecks. Report verbally "Once long, long ago, in a galaxy far, far away," I was on my first performance-testing project as the performance test lead. I had developed and executed some tests that I was really proud of. I started analyzing the results of one test and found something. I was smart enough to execute that test again to verify that I could repeat it. As soon as I saw that I could repeat the pattern I'd found, I picked up the phone and called the lead architect. Me: "I just ran some tests -- you have a memory leak." Architect: "Your tests are wrong -- there's no memory leak." Me: "I can reproduce it, you monitor the box, I'll rerun the test." Architect: "OK, but there's no memory leak." Fifteen minutes later I called back. Me: "See, I told you there's a memory leak." Architect: "Huh? Memory usage hasn't changed. I was about to call you to ask if you'd started your test yet." Me: "What?!? The site is down -- I just checked it manually. Are we looking at the same instance?!?" Architect: "There's only one instance and the site is fine. Try it again." Me: "OK . . . What did you do? How did you get it back up?" Architect: "Nothing. Double-check your tests -- there's no memory leak." As you might imagine, that story could go on for a long, long time. In the end, after I'd lost virtually all of my credibility, we found that the pattern I was seeing was caused by a temporary license (a limited number of concurrent

connections) being installed on one of our servers. As it turned out, I had a completely valid confirmed bottleneck suspect that I reported poorly. My mistake was deciding that I knew what the problem was instead of calling the lead architect and saying, "Hey, I'm getting some odd responses when I run tests with ten or more virtual users. It seems to have some symptoms like a memory leak, but I can't tell for sure what it is. When you have a chance, can you come down and take a look?" This was one of the biggest lessons of my career as a performance test engineer, now summarized as ... Scott's eighth rule of bottlenecks: When reporting bottleneck suspects, don't assume you know the cause, just report the symptoms. More specifically, I've found the following advice to be useful when reporting suspected bottlenecks:

• Don't report suspected bottlenecks in a way that implies fault.
• Do describe all of the symptoms you've identified, not just the one you think is most relevant.
• Don't speculate on the cause of the bottleneck, even if you think you know what it is.
• Do describe all of the ways you've found to cause the symptoms.
• Don't get defensive when challenged -- it really might be the fault of your test.
• Do be prepared to support your claims.

Report visually Most of the time, stakeholders will want to see charts and graphs demonstrating the symptoms of the suspected bottleneck. Since I devoted Parts 6 through 10 of the "User experience, not metrics" series and Part 6 of this series to discussing how to display and interpret data visually, I won't discuss here which types of charts and tables are best to use. I will say that you should spend some time finding the best way to visually depict those symptoms and have those charts and/or tables ready to show when you report the suspected bottleneck. Having the data available visually will almost always shift the focus of conversation from "Your test is wrong" to "Hmm ... I wonder what would cause this odd pattern," and that's exactly what you're hoping for. At this stage your goal is to show the developers that you want to work with them, and the more information you can give them to help them draw their own conclusions the more they'll want to work with you to get even more information later. Report via demonstration Finally, no matter how smooth your words are or how convincing your charts and tables are, some folks will only believe you if you demonstrate the symptoms to them -- or if you allow them to experience the symptoms themselves. Much like writing a good defect report for a functional test case, you should always have a step-by-step process prepared that others can follow to reproduce the symptoms on demand. It's even better if the process doesn't involve the test tool. Having this step-by-step process will save your credibility every time -- especially if you remember my eighth rule and report only symptoms.

Is it time to tune? Each of the four "finding bottlenecks to tune" articles will end with the question "Is it time to tune?" In this section we'll discuss when to jump out of the bottleneck detection and exploitation cycle and jump into the bottleneck resolution cycle. While it may seem that all I've shown you how to do in this article is identify symptoms of potential issues (failures, slow spots, or bottlenecks) and report them, this is actually when most tuning begins. More times than I would have thought, when I report bottleneck suspects someone in the room responds, "Oops. I know what that is. Scott, I'll call you in a few hours and ask you to rerun your tests. I'm pretty sure this will be fixed." As far as I'm concerned as a performance test engineer, this is the ideal situation. Nine times out of ten, I go back to my desk and work on something else for an hour or two, the developer calls, I rerun the test, and that suspected performance issue is gone. Some companies or project managers will ask you to document that. I actually recommend

against documenting any performance issue that takes less than a day to resolve, but you'll obviously have to follow the guidance of your organization. The bottom line is that while you're reporting the symptoms of your suspected performance issues, you'll generally find yourself engaged in conversations about the cause of the symptoms. If there's consensus as to both the cause of the symptoms and how it should be resolved, the attempt should be made to resolve that issue (that is, to tune) immediately. If either the cause or the resolution is unclear, you'll want to modify your tests to focus on resolving the issue. This is the topic of Part 8, "Modifying Tests to Focus on Failure or Bottleneck Resolution." There I'll show you how to further categorize performance issues into failures, slow spots, and bottlenecks, and how to isolate symptoms with the intent of gaining more information about the cause.

Summing it up

Identifying symptoms of performance issues is the first step toward actually improving the performance of a system. In fact, simply identifying and reporting the symptoms is often enough to lead to performance improvement. Remember, however, that at this stage you shouldn't speculate as to the cause of the identified issue, and you should be on the lookout for the characteristic differences between failures, slow spots, and bottlenecks. In many cases, you won't know which it is until you've taken another step or two in the process, but you should always watch for the telltale signs.

Beyond performance testing part 8: Modifying tests to focus on failure or bottleneck resolution

Scott Barber, Performance testing consultant, AuthenTec

Summary: This installment of the popular performance testing series explores how to design and build or modify tests to focus on the failure or bottleneck identified.

Now that we can conclusively reproduce the bottleneck, slow spot, or failure and the stakeholders agree that it's an issue worth addressing, what next? In this article we're going to explore how to design and build or modify our tests to focus more explicitly on the item of interest. Since we've identified only the functional symptoms of the performance issue so far, this focusing exercise will be critical to the process of determining what's causing those symptoms, thus starting us down the path toward resolving them. This is the second of four articles on the theme I call "finding bottlenecks to tune," where we're taking a step beyond just performance testing and beginning to explore how to add real value to the development team. So far, this is what we've covered in this series:

• Part 1: Introduction
• Part 2: A performance engineering strategy
• Part 3: How fast is fast enough?
• Part 4: Accounting for user abandonment
• Part 5: Determining the root cause of script failures
• Part 6: Interpreting scatter charts
• Part 7: Identifying the critical failure or bottleneck

This article is intended for mid- to senior-level performance testers and members of the development team who work closely with performance testers. If you haven't yet read Parts 5, 6, and 7 of this series, I suggest you do so before reading this article.

What the development team needs to know "The ability to focus attention on important things is a defining characteristic of intelligence." -- Robert J. Shiller, Irrational Exuberance After you report the symptoms of suspected performance issues you've identified, your developers may recognize the symptoms and be able to resolve them in short order. But if not, they're going to need more information, some of it in the form of metrics. If you're a longtime reader of mine, you know that I generally try to avoid talking exclusively about metrics and like to pay at least as much attention to user experience, since metrics aren't the whole story. In the case of chasing performance issues, though, you eventually get to the point where metrics are needed for evaluation. With that said, I'll list some questions that point to the kinds of information and/or metrics that can help developers identify and/or isolate a performance issue. Which related user activities produce the same symptoms? The very first thing developers ask once they acknowledge that the symptoms you're reporting are real is "How did you get that to happen?" This is followed closely by "Is there any other way to make that happen?" Sometimes these questions are easy to answer. To repeat an example from Part 7, if you find out that searching for a book and searching for a store near you on a retail site are both slow, this allows the developers to narrow the scope of things they'll need to evaluate. Sometimes these questions aren't so easy to answer, and you'll need to ask the developers to help you determine which activities are related from their perspective so you can evaluate them. For example, you may learn that the only way those two searches are related is that they both have tables in the database, in which case testing other searches is less likely to produce the same symptoms. You may also learn that they share all the same code, and parameters are just passed into the generateSQL class, in which case you'll want to know all of the activities that pass parameters into the generateSQL class so you can see if they cause the same symptoms. The point is, the developers know which custom functions, classes, servers, tables, and so forth an activity touches. You generally won't. You'll often need the developers' assistance to determine which activities are related before you start modifying your tests. Working with them, you should be able to identify which related activities demonstrate the same symptoms. Which other activities are affected by the bottleneck? The developers will also want to know which other activities display any symptoms during the test that created the critical symptom. For example, they'll want to know it if the search page is very slow when searching by t and other users who are trying to search at the same time receive an internal server error. This is critical as it helps them identify potential causes of the symptoms. It's also not always easy to detect, so ask the developers which related activities they suggest you explore for more information. What were the load characteristics of the test yielding the symptoms? The next thing the developers want to know is what the load characteristics of the tests yielding the symptoms were. This isn't just information like "100 users were accessing the system during the test that yielded the symptoms." The developers need to know things like these:

• How many users were performing the activity before and during the appearance of the symptoms?
• What was their distribution in time (arrival rate)?
• What were other users doing before and during the appearance of the symptoms?
• With how few users can you observe the same symptoms?
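These numbers usually have to be reconstructed from timestamps in your results. As one hedged example -- the input format here is invented, so adapt it to whatever your tool can export -- the sketch below derives the arrival rate and peak concurrency of the activity in question from a list of request start and end times.

```python
# Sketch: derive arrival rate and peak concurrency from request start/end times.
# The input format is invented -- adapt it to whatever your tool can export.
# Each tuple is (start_seconds, end_seconds) for one execution of the activity.
requests = [(0.0, 2.1), (1.5, 3.0), (2.0, 7.5), (2.2, 9.0), (4.0, 5.1), (4.5, 12.0)]

duration = max(end for _, end in requests) - min(start for start, _ in requests)
arrival_rate = len(requests) / duration            # activities started per second

# Peak concurrency: sweep start/end events in time order.
events = [(start, +1) for start, _ in requests] + [(end, -1) for _, end in requests]
active = peak = 0
for _, delta in sorted(events):
    active += delta
    peak = max(peak, active)

print(f"{len(requests)} executions over {duration:.1f}s "
      f"(arrival rate {arrival_rate:.2f}/s), peak concurrency {peak}")
```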

In case you may have missed it, the answers to those questions are metrics. These metrics are specifically useful in this context. Still, they aren't the whole story you're telling, just some quantifying factors in the story. What data did you use to create the symptoms?

We know that both the volume and the complexity of data accessed can have a huge effect on performance. For example, searching for all books whose titles contain the letter t will certainly put more stress on the system than searching for the title Lessons Learned in Software Testing. The developers will want to know what data you used as input values to create the symptoms, and they'll likely have other data they'd like you to try. We'll discuss test variances more below. What's the configuration of the environment you're testing? When I start testing, I'm almost always told, "The environment is ready and it mirrors production." As soon as I find a bottleneck suspect, though, someone almost always says, "What environment were you testing against? That doesn't really match production. We need to change the settings to . . . " While you may never know the exact configuration of the environment you're testing, your developers will certainly need to know. The best you can do is help them by sharing the information you have. What other metrics do the developers want you to collect? On top of all that, the development team will normally ask you for a laundry list of metrics that you may not even understand, let alone know how to collect. It seems that on every project I get asked for at least one metric that I've never heard of. Don't be daunted by this. Simply ask the developers to help you identify and capture the metric they think will aid them in understanding or diagnosing the performance issue. I've never had a developer react poorly to "I'm sorry, I really don't know how to collect that data; could you help me?" If the developers don't give you a list of other things they'd like to know, ask them. I've found that if they're not asking for more information, it's often a sign that they don't have much faith in the results you're presenting.

How to design tests to get that information Our next step is to design the tests that will get us the information the developers are looking for. This is usually not the difficult part; the difficult part is often the following step, which is creating the tests. When you're designing the tests, don't worry about how you'll develop them with the tools you have available. Thinking about the capabilities of the tools at your disposal while designing will almost always lead you down the road of designing tests that are easy to implement, instead of tests that will provide immediate value. You don't want to go down that road. Ask yourself "what if . . . ?" questions The first thing to do when trying to come up with tests that focus on a particular set of symptoms is to ask yourself, What would happen to these symptoms if . . .

• I eliminated all other user activity?
• I added more user activity?
• I used different data?
• I changed the load characteristics?
• I changed the delay times?
• I tested from multiple IP addresses?
• I used a different navigation path to get to this activity?

There are probably hundreds more questions you could ask, but this is a good start. Based on your answers, you can decide which tests to design first. Maybe you're more interested in trying out different data than multiple IP addresses based on your symptoms. Some of the "What if . . . ?" questions you can answer right away with a quick manual test, while others will require specific tests built to confirm or deny your suspicions. Ask developers to speculate I'm always surprised by the number of folks who argue that the tester should know best and thus the developers shouldn't be asked to speculate about which tests will help them identify and diagnose performance issues. I believe that the test engineer should know best how to detect potential issues, but when it comes to exploiting an issue so that it

can be diagnosed, experts on that particular system are needed. Ask your developers to speculate or guess what other tests will provide helpful information, and then do everything you can to provide those tests. They're often exactly the right ones. As mentioned earlier, you can also ask the developers what metrics would help them diagnose the suspects, and then ask yourself and/or them what test will provide that metric. These are often the most difficult tests to design and develop. Evaluate commands with slow responses Another good source of information about how to design appropriate tests comes right from TestManager after a test execution. We looked at TestManager when we were looking for bottleneck suspects. Now that we have suspects, we should return to TestManager and look at the individual commands that had slow responses. Each one of those commands is related to a specific requested item. Once you identify that item, you can search your log file for other instances of that item and add those to the list of things to test. We've talked about the different components of that process before, but I'll summarize the process here. Say you look at the performance report output in TestManager and notice a command that had a particularly slow response. For instance, in Figure 1, you see that the response to command GL_Jour~2.017 took 25.29 seconds, much longer than the responses to other commands.

Figure 1: Identifying a slow command response in the report output

You can then find that command ID in the test log and look at the General tab of the Log Event window for more information on that command ID. See Figure 2.

Figure 2: Finding the command ID in the test log

Clicking the Virtual Tester Associated Data tab shows that the item being received was:

POST /psc/GDV01/EMPLOYEE/ERP/c/PROCESS_JOURNALS.JOURNAL_ENTRY_IE.GBL

See Figure 3.

Figure 3: Finding the object of the command ID in the test log

Now that you've identified the object related to the slow response time, you can search the entire log file to see what other activities call that object. You'll remember from Part 5 that the log file is the d00 file located at:

[Drive]:\[RepositoryName]\TestDatastore\TMS_Builds\[BuildName]\[SubBuild]\[TestRun]\perfdata\

and that we open the log file using a text editor such as Notepad. Finding other instances of objects related to the symptoms will certainly provide some insight into the cause of the symptoms, or at least point to some other things to test. There's a caveat to that, though. Often the problem turns out to be with a previous request/receive pair. If the previous receive returns unexpected data or an unrecognized failure, it may cause subsequent request/receive pairs to fail. You would evaluate this in the same way that you evaluate script failures, as discussed in Part 5. This doesn't mean that the problem is necessarily a script failure -- only that the process of finding the offending command is the same. Typically, if it's a previous command that's causing a symptom later on, that turns out to be a failure rather than a slow spot or a bottleneck, but not always. Think in terms of distinguishing failures, slow spots, and bottlenecks Another thing to think about when designing tests to focus on suspects and symptoms is how you can design tests to distinguish whether the observed performance issue is actually a failure, a slow spot, or a bottleneck. Many of the considerations for test designs that we've already discussed in this article will help make the distinction, but it's always

a good idea when designing a test to ask yourself, Will this test help me determine if this issue is a failure, a slow spot, or a bottleneck? When the answer to that question is no, you should follow it up with, Will another test I've designed help me make this determination, or should I design a new test to do this? Visualize and prioritize Finally, once you've asked yourself and the developers all those questions and done some research on your own, you'll have a whole list in your head of potential tests to create. The thing is, you'll probably be given only a matter of days to track down information about these issues, not weeks. You simply won't have time to develop and execute all of those tests. To pick the right tests to develop, you may want to do what I do, which is to visualize and prioritize. This is actually just a quick-and-dirty way to organize your thoughts about this list of tests you've just come up with. All I do is fill out a grid like the one in Figure 4 to keep my thoughts straight and help me decide which tests to develop first.

Figure 4: Sample "Visualize and Prioritize" grid This is normally something I sketch on a whiteboard, and the column heads are different almost every time. It really doesn't matter how you keep track of the tests you come up with, as long as you have a way to remember one idea when the next one hits you and have a list to return to later after you've developed the first several tests and not found what you were hoping to find.

How to build the tests Now we get to the heart of the matter. How do you build tests to collect this next level of information using the IBM® Rational® TestStudio software and/or other tools? Unfortunately, there's no cookbook answer. Every piece of information is found in a different way, and even that changes from application to application, platform to platform, and development style to development style. The best I can do is outline some basic techniques and suggest some circumstances where they're most useful. In Part 9 we'll discuss in more detail how to use these tests in combination with other resources at your disposal to conclusively identify the cause of the performance issue. Modify existing tests The quickest and easiest way to gather more information about a particular issue is to use your existing tests. Thinking through the information you'll likely be interested in, the following modifications are ones that can be created quickly and often have large payback -- especially in combination with one another -- in terms of gathering that information:



• Eliminate all activities from your test suite that aren't necessary to cause the symptoms and reexecute under various load conditions. This will help you pinpoint the parameters that lead to the symptoms plus distinguish between a failure, a bottleneck, and a slow spot. One thing to consider is extending the system timeouts to help determine if a symptom is a failure or not.
• Reexecute using different data. For instance, if you're doing a search, do a test with a set of data that returns a small number of items, and then another with a dataset that returns a large number of items, or maybe even all of the items. You may also want to execute a test that loops through a large number of potential data items to see if there may be some pattern to the symptoms. It's possible that you could find that only searches for items that begin with, say, the letter b are causing the symptoms (unlikely, but not unheard of).
• Try various load characteristics (a sketch of such a sweep follows this list). Don't worry about whether the test reflects reality. Try faster and slower arrival rates, longer and shorter user delays, larger and smaller loads, larger and smaller percentages of users performing the activity displaying the symptoms. These variances will normally help bracket the symptoms. Maybe the symptoms appear only when more than five people do an overlapping search regardless of what other volumes of activities are occurring.
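To make that kind of bracketing repeatable, I find it helps to drive the variations from a small grid rather than editing the test by hand each time. The sketch below is tool-agnostic and entirely hypothetical -- run_scenario() is a stub standing in for however you actually launch a test run and collect its 90th-percentile response time.

```python
# Sketch: sweep load characteristics and data sets from a small grid instead of
# editing tests by hand. run_scenario() is a placeholder for however you launch
# a test run (a TestStudio suite, a script, a harness) and return the measured p90.
import itertools

USER_COUNTS  = [25, 50, 100, 150, 200]     # larger and smaller loads
ARRIVAL_GAPS = [0.5, 2.0, 5.0]             # seconds between virtual-user starts
DATASETS     = ["small_result_set", "large_result_set"]

def run_scenario(users, arrival_gap, dataset):
    """Placeholder stub: replace with a real call into your tool or harness.
    Returns a fake 90th-percentile response time so the sketch runs end to end."""
    return 0.05 * users / arrival_gap + (3.0 if dataset == "large_result_set" else 1.0)

results = []
for users, gap, dataset in itertools.product(USER_COUNTS, ARRIVAL_GAPS, DATASETS):
    p90 = run_scenario(users, gap, dataset)
    results.append((users, gap, dataset, p90))
    print(f"{users:4d} users, gap {gap:3.1f}s, {dataset:16s} -> p90 {p90:.2f}s")

# Sorting the grid by p90 makes it easier to see where the symptoms first appear.
results.sort(key=lambda r: r[-1])
```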

In Part 9 we'll discuss what data other than response times in TestManager to monitor during these tests. Create new tests Sometimes you'll have to record and develop new tests to accomplish the kinds of variance just discussed. There are other situations where you'll want to record and develop new tests as well. For example, to exercise activities that you and/or the development team have identified as related but that weren't included in your initial test suite, new tests may be needed. This is by far the most common reason to create new tests at this phase in the testing effort. Executing tests on related functionality, both individually and collectively with the tests known to generate symptoms, will generally distinguish between slow spots and bottlenecks. New tests may also be needed if there's more than one way to accomplish a task that's been identified as a performance issue. For instance, on one application I tested, searching for a particular customer on the "account maintenance" screen took nearly four times as long as searching for the same customer on the "customer maintenance" screen. At first we'd only tested the "account maintenance" screen because the customer-related functionality was intended to be identical. It wasn't until we finally created a new test to evaluate the related functionality on the "customer maintenance" page that we were able to track down the problem by comparing the SQL generated by the two pages. Sometimes you may want to create a new test to try out a straight-line path to the symptoms even though you have an existing test that's meant to do the same thing. If you're having a particularly difficult time determining the cause of the symptoms, this cause may just be hiding in your script. Rerecord the simplest possible script (no splits, minimal datapools, no abandonment, and so forth) and see if you can recreate the symptoms. If not, do a close comparison of your scripts. Use test harnesses I've seen and been part of lots of debates about what a test harness is. Instead of opening up that debate here, let's agree that in this article series, the term test harness means any helper application or application modification created for the purpose of making it easier to use Rational TestStudio to collect information about a performance issue. Test harnesses can be used in many situations. For instance, in the example above with the "account maintenance" and "customer maintenance" screens, we built a test harness to help us evaluate the problem. The test harness was a simple Web page with an input box and a Submit button. We recorded a script that entered various SQL statements into that text box and clicked the Submit button. The Submit button bypassed most of the application we were actually testing and sent the SQL straight to the database. This test harness allowed us to quickly eliminate the database as the cause of the issue without going through the whole battery of tests. It's unlikely that you'll be the one developing test harnesses. You'll have to work closely with your development staff to create them. You should consider having test harnesses built whenever you can't find another way to isolate a piece of information even though it seems like you should be able to get it using TestStudio. Often, once you start discussing test harnesses with your developers, they'll have ideas for many test harnesses that will provide response time information they wouldn't easily be able to obtain otherwise.
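For what it's worth, the SQL pass-through harness described above takes very little code. The sketch below is only an illustration of the idea, using Python's standard library and an sqlite3 stand-in database -- the real harness in that story was a page the development team built against the project's own database, so treat every name here as an assumption, and never expose something like this outside an isolated test environment.

```python
# Illustrative test harness: a bare page with a text box and Submit button that
# sends the submitted SQL straight to a database and reports how long it took.
# sqlite3 and all names here are stand-ins for the project's real database.
# Do not expose this outside an isolated test environment.
import sqlite3
import time
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs

FORM = b"<form method='POST'><textarea name='sql'></textarea><input type='submit'></form>"

class HarnessHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(FORM)

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        sql = parse_qs(self.rfile.read(length).decode())["sql"][0]
        start = time.perf_counter()
        with sqlite3.connect("harness_demo.db") as conn:   # stand-in database
            rows = conn.execute(sql).fetchall()
        elapsed = time.perf_counter() - start
        body = f"{len(rows)} rows in {elapsed:.3f}s".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(FORM + b"<pre>" + body + b"</pre>")

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), HarnessHandler).serve_forever()
```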

Is it time to tune?

Most of the time, after conducting new or modified tests that focus on the symptoms of confirmed performance suspects, the development team will have enough information to start their tuning exercise. If tuning starts at this point, you probably won't be involved again until the developers believe they've solved the problem, at which time you should reexecute all of the tests that previously revealed symptoms. In general, you should recommend tuning in cases where your tests determine that the suspect was actually a failure or a slow spot rather than a bottleneck. In the case of bottlenecks or inconclusive results, the topics we'll discuss in Part 9 will likely be helpful. Fixing failures or slow spots, or even deciding to accept the performance of a slow spot, may not technically be considered tuning, but they're modifications to either the application or the performance acceptance criteria that affect the performance-testing effort.

Summing it up

Focusing tests on performance issues is normally a critical step in determining the actual cause of a performance issue and ultimately tuning it. In most cases, modifying or creating focused tests isn't technically difficult; rather, it's an exercise in determining which tests or modifications will provide the highest-value information in the time you have to collect it. These tests need to be created and executed quickly and efficiently to be truly useful to the development team. In Part 9 we'll discuss other methods for collecting additional information related to these focused tests.

Beyond performance testing part 9: Pinpointing the architectural tier of the failure or bottleneck

Scott Barber, Performance testing consultant, AuthenTec

Summary: Even if you've compiled a lot of information about your failure, slow spot, or bottleneck, you simply can't tell where it lives until you evaluate the offending activity or activities by physical or logical tier. This article shows you how.

By now, you've compiled a fairly substantial catalog of information about your failure, slow spot, or bottleneck, but if you're still reading, none of that information has actually told you where the bottleneck or failure lives. It's not fair to assume, for example, that the database needs to be tuned just because a query returns slowly. It could be, in fact, that the code that creates the request is stuck in a near-infinite loop. You simply can't tell until you evaluate the offending activity or activities by physical or logical tier. This is the third of four articles on the theme I call "finding bottlenecks to tune," where we're taking the step beyond just performance testing and beginning to explore how to add real value to the development team. So far, this is what we've covered in this series:

• Part 1: Introduction
• Part 2: A performance engineering strategy
• Part 3: How fast is fast enough?
• Part 4: Accounting for user abandonment
• Part 5: Determining the root cause of script failures
• Part 6: Interpreting scatter charts
• Part 7: Identifying the critical failure or bottleneck
• Part 8: Modifying tests to focus on failure or bottleneck resolution

This article is intended for mid- to senior-level performance testers and members of the development team who work closely with performance testers. If you haven't read Parts 5, 6, 7, and 8 of this series, I suggest you do so before reading this article.

A refresher on n-tier architecture

"All parts should go together without forcing. You must remember that the parts you are reassembling were disassembled by you. Therefore, if you can't get them together again, there must be a reason. By all means, do not use a hammer." -- IBM maintenance manual, 1925 Before we can really dig into chasing bottlenecks to and into a specific tier, we should spend a few minutes reviewing some n-tier architecture basics. If you're comfortable with logical and physical architectures, feel free to skip to the next section ("Capturing metrics by tier"). One of the things that confused me early in my performance-testing career was the difference between the logical and the physical architecture of a system. I remember one meeting where the developers were talking about the "authentication server." I walked over to the network diagram and asked, "Which of these machines is the authentication server?" In a dismissive tone I was told, "None of them." Not easily discouraged, I asked, "Then where is the authentication server?" To which a developer replied, "It's on Web1 and Web4." If that response confuses you as much as it confused me at the time, the rest of this section is for you. Logical architecture Architecture used to be easy. Either you had a client/server (two-tier) application or you had a Web-based application (normally three-tier). During the early days of three-tier architectures, the tiers often corresponded to physical machines (as shown in Figure 1) whose roles were defined as follows:

• Client tier (the user's machine) -- Presents requested data.
• Presentation tier (the Web server) -- Handles all business logic and serves data to the client.
• Data storage tier (the database server) -- Maintains data used by the system, typically in a relational database.

The machine that made up the presentation tier came to be known as the Web server because it ran the software used to "serve Web pages."

Figure 1: Three-tier logical architecture At first, as architectures became more complex, individual machines were added whenever a new tier was needed. Later, tiers began to be made up of clusters of machines that served the same role. See Figure 2.

Figure 2: N-tier logical architecture The truth of the matter is that no one actually uses the term file storage tier. They refer to that functionality as "the file server," for the same reason that the presentation tier became synonymous with "Web server" for Web-based applications. The key to understanding a logical architecture is simply this: In a logical architecture, each tier contains a unique set of functionality that's logically separated from the other tiers. But even if a tier is commonly referred to with the word server, it's not safe to assume that every tier lives on its own dedicated machine. Physical architecture So, you may ask, what does the actual physical environment look like? That's an important question when it comes to performance testing -- and one that most developers and stakeholders find hard to believe matters to the performance test engineer. The paradigm that most stakeholders and developers hold to is that "testers don't need to know anything but how to access the system from the client machine." This is simply not true when it comes to performance testing. Be persistent and patient in your quest for information. Over time, they'll come to understand. I've called this section "Physical architecture," but that's actually one of the least-used terms for what we're talking about. Probably the most-used term is environment (that is, the test environment or hardware environment); it may also be called the network architecture. Whichever name your organization uses, what we're referring to here is represented in diagrams where actual, physical, tangible computers are shown and labeled with the roles they play and the other actual, physical, tangible computers they talk to. Figure 3 is the physical architecture of the system we looked at logically in Figure 2.

Figure 3: N-tier physical architecture Figure 3 is very similar to the diagram I was looking at when I asked the question "Where's the authentication server?" I'm sure you now understand my confusion a little better, since there's no machine in Figure 3 labeled "Authentication Server." Instead of trying to explain verbally how the authentication server relates to the physical architecture, let me simply redraw Figure 3 with some additional labeling. See Figure 4.

Figure 4: N-tier physical architecture with logical overlay

What we see here is that most logical tiers consist of more than one physical machine (often called clusters). We also see that the machines that make up the presentation tier (Web1 through Web4) are all serving double duty as either an authentication tier server or a file storage tier server. As it turns out, it's just about as common for a logical tier to be spread over several machines as it is for a physical machine to host the functionality of more than one logical tier.

Speaking intelligently with your development team

The purpose of this section has been to help you speak more intelligently with your development team. Our brief discussion of architecture and visual representations of architecture barely scratches the surface of what I would classify as "stuff that's useful for a performance test engineer to know about architecture." Someday I may write more articles on this topic, but in the meantime, if this is an area you feel weak in, I suggest that you ask your developers to recommend their favorite design and/or architecture books, since they're the ones you want to communicate best with. Be persistent with your questions; don't stop asking if things don't make sense. Try to use their language but don't be afraid to use different terms to clarify meaning, and when words fail, draw pictures.

Capturing metrics by tier

I hope you now have a solid understanding of what a tier (both logical and physical) is and can see why it's important to evaluate performance tier by tier. Through our discussions about bottlenecks it should be apparent that the end-to-end response time can never be shorter than the time spent in the slowest tier, no matter how little time is spent in the other tiers. It should also be clear that if you can't identify which tier is holding up progress, tuning becomes an exercise in trial and error. So the next question is, How do you figure out which tier is causing the issue? That's what we'll discuss here.

Capturing resource utilization by tier with IBM® Rational® TestManager

The first set of metrics we're going to capture on a tier-by-tier basis is resource utilization statistics. In Part 6 of this series I outlined the process for capturing these statistics machine by machine using the IBM® Rational® TestManager software in the section titled "Creating overlaid scatter charts in TestManager." Please refer to that article for step-by-step instructions. You need to have Rational TestAgent installed on each machine whose resources you want to monitor. Figure 5 shows the resources that can be monitored using TestManager. While these are the most commonly monitored resources, they're by no means the only ones that can be monitored.

Figure 5: Options for monitoring resource utilization in TestManager

Capturing resource utilization by tier with other methods

If you're unable to install the agent software on the machines you want to monitor, or if the resource you want to monitor isn't available in the list shown in Figure 5, you'll have to monitor resources using another method. I briefly mentioned some of those methods when I discussed the component performance chart in Part 8 of the "User experience, not metrics" series. There I said that most operating systems come with resource-monitoring software, like Perfmon for Microsoft and PerfMeter for Solaris. There are countless resource-monitoring tools like these made by third-party vendors. It's usually best to ask your developers/administrators which resource-monitoring tools they're using, and use those. The challenge when using a resource-monitoring tool other than the one that comes with TestManager is correlating your results. There are two ways to correlate the data:

• Have someone watch the utilization rates during the test run and make note of abnormal readings along with the time they were noticed, to compare with the test log after the test.
• Have your administrators log the resource data as the test is running. Then import the data of interest from both TestManager and the resource monitor logs into a spreadsheet program like Excel, line up the start times, and create your own charts and graphs (this is how the overlaid scatter charts were created for Part 6 of this series). A minimal logging sketch follows this list.
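If no separate monitoring tool is available at all, even a small homegrown script can act as the logger in the second approach. The following is a minimal sketch, not part of TestManager or any Rational product, that samples CPU and memory once per second and writes time-stamped rows to a CSV file for later correlation; it assumes the third-party psutil package is installed, and the file name, interval, and duration are arbitrary choices for illustration.

    # resource_logger.py -- illustrative stopgap resource logger (not a Rational tool).
    # Assumes the third-party psutil package is installed.
    import csv
    import time

    import psutil

    SAMPLE_INTERVAL_SECONDS = 1       # arbitrary sampling interval for this sketch
    OUTPUT_FILE = "resource_log.csv"  # arbitrary output file name

    def log_resources(duration_seconds=600):
        """Sample CPU and memory utilization once per interval for the test duration."""
        with open(OUTPUT_FILE, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["timestamp", "cpu_percent", "memory_percent"])
            end_time = time.time() + duration_seconds
            while time.time() < end_time:
                writer.writerow([
                    time.time(),                        # epoch seconds, used later to line up start times
                    psutil.cpu_percent(interval=None),  # CPU utilization since the previous call
                    psutil.virtual_memory().percent,    # physical memory in use
                ])
                f.flush()
                time.sleep(SAMPLE_INTERVAL_SECONDS)

    if __name__ == "__main__":
        log_resources()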

Figure 6 shows a small segment of the spreadsheet table used to create the overlaid scatter charts in Part 6 of this series. This is similar to the example Excel table shown in Figure 16 in that article but is a sample from a slightly different set of measurements.

Figure 6: Sample resource utilization spreadsheet table

To populate this table requires these steps (a scripted version of steps 2 through 5 is sketched after the list):

1. Copy the data from the table in the Response vs. Time report output into Excel.
2. Convert the "Ending TS" column to "Time into Test" by subtracting the value in the first row from the values in all subsequent rows in the column and then dividing by 1000 (to convert to seconds).
3. Copy the time stamp and resource measurements from the third-party tool into Excel.
4. Convert the time stamp to "Time into Test." This process will vary greatly based on how your tool logged the time stamp.
5. Mesh the two data sets together so that the "Time into Test" values are sequential.
6. Generate the desired chart based on the table.
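If you'd rather script the conversion and merge than do it by hand in Excel, here's a rough Python sketch of steps 2 through 5. The file names, column names, and the assumption that the TestManager export uses millisecond time stamps while the resource log uses epoch seconds are all hypothetical; adjust them to match what your tools actually produce.

    # merge_metrics.py -- illustrative sketch of steps 2 through 5 (assumed file layouts).
    import csv

    def load_rows(path, ts_column, value_columns, ts_divisor=1000.0):
        """Read a CSV export and return a list of (time stamp in seconds, values) tuples."""
        rows = []
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                ts = float(row[ts_column]) / ts_divisor
                rows.append((ts, {c: row[c] for c in value_columns}))
        return rows

    def to_time_into_test(rows):
        """Convert absolute time stamps to 'Time into Test' by subtracting the first time stamp."""
        start = rows[0][0]
        return [(ts - start, values) for ts, values in rows]

    # Hypothetical exports -- adjust names, columns, and divisors to your actual tools.
    response_rows = load_rows("response_vs_time.csv", "Ending TS", ["Response Time"])
    resource_rows = load_rows("resource_log.csv", "timestamp",
                              ["cpu_percent", "memory_percent"], ts_divisor=1.0)

    # Mesh the two data sets so the "Time into Test" values are sequential.
    merged = sorted(to_time_into_test(response_rows) + to_time_into_test(resource_rows),
                    key=lambda pair: pair[0])
    for time_into_test, values in merged:
        print(round(time_into_test, 1), values)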

Capturing response time by tier with IBM® Rational® TestStudio

One of the most common criticisms of load-generation tools is that you're unable to tell where the reported time was spent -- for instance, in the database or on the Web server -- without additional research. This isn't a criticism unique to the IBM® Rational® TestStudio software; none of the mainstream load-generation tools are able to tell you right at test execution which tier the time was spent in. It is possible to capture response times tier by tier using TestStudio, but this isn't an insignificant undertaking. Still, if you have a strong suspicion that a particular tier contains a bottleneck or are confident that you want to isolate your load tests to a specific tier, you'll want to do it. Because this method is so specific to the environment you're testing and the tier you want to isolate, I'll illustrate by example rather than trying to come up with a list of "if-then" rules that would be sure to miss some quirk of the application you're actually testing. Let's assume, then, that we have a system with a simple architecture where each physical machine represents a logical tier, and there's a single load-generation machine (master station) as shown in Figure 7.

Figure 7: Basic load-testing environment

Now, let's further assume that through testing we've determined that only transactions that interact with the database cause symptoms of poor performance. We've further established that these aren't failures, the symptoms span multiple activities, and the entire system is affected by the symptoms. By monitoring resources, we've found that the database often shows 100% CPU utilization and runs out of memory, and that there's often a queue of requests to enter the database under loads significantly below the target load. Based on this, we decide with our team that we want to test just the database server under load and eliminate the Web server response times from the equation.

There are actually two ways to do this using Rational TestStudio. The first way involves either building a test harness to access the database or writing custom scripts by hand (that is, not using recording) to send SQL commands to the database. While the latter is possible, it's rarely done, due largely to the time and energy required for such an effort. If we decide to go this route, we'll want to configure our environment simply, as shown in Figure 8.

Figure 8: Data storage tier isolated for a load test
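To give a feel for what the hand-coded first approach might look like in Figure 8's configuration, here's a minimal sketch of a script that sends a SQL statement straight to the database and records the response time. It isn't a Rational script, and the driver (psycopg2 for PostgreSQL), connection string, and query are all made up for the example; the point is simply that a few dozen lines can time database requests without going through the Web server.

    # db_harness.py -- illustrative hand-coded database harness (assumed driver, DSN, and query).
    import time

    import psycopg2  # assumes a PostgreSQL database and the third-party psycopg2 driver

    DSN = "dbname=appdb user=loadtest host=db1"         # hypothetical connection string
    SQL = "SELECT * FROM orders WHERE status = 'OPEN'"  # hypothetical query captured from the Web tier

    def timed_query(connection, sql):
        """Run one query and return (rows returned, elapsed seconds)."""
        start = time.time()
        with connection.cursor() as cursor:
            cursor.execute(sql)
            rows = cursor.fetchall()
        return len(rows), time.time() - start

    if __name__ == "__main__":
        conn = psycopg2.connect(DSN)
        for i in range(100):  # repeat the request to get a distribution rather than a single sample
            count, elapsed = timed_query(conn, SQL)
            print(f"iteration={i} rows={count} seconds={elapsed:.3f}")
        conn.close()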

A second way is to use Master Station2 to capture the traffic against the database generated by the load test as it's being executed by Master Station1 (see Figure 9). This does require additional licenses but will give us a recorded script to edit that contains the entire load being placed on the database in a way that's easy to play back and evaluate. In this case, every command ID will represent a request sent to the database by the Web server. Executing these scripts and reviewing the response times for these command IDs will show us conclusively how much of the end-to-end test time is being spent in the database. It will also show us exactly how much time each request takes. This information is almost always what the database administrator needs in order to find and/or tune the issue.

Figure 9: Capturing load-test traffic against the data storage tier

Here are the steps we would follow to capture load-test traffic against the database server in our example scenario:

1. Configure a second master station on the same subnet as either the Web server or the database server (in this case, the database server subnet is preferred).
2. Configure the second master station for network recording between the Web server and the database server.
3. From the Robot menu bar, choose Tools > Session Record Options. Click the Method tab and select "Network recorder" (see Figure 10).

Figure 10: Session Record Options window, Method tab

4. Click the Method: Network tab and then click the Manage Computers button (see Figure 11).

Figure 11: Session Record Options window, Method: Network tab

5. In the Manage Computers window, click New (see Figure 12).

Figure 12: Manage Computers window

6. In the Computer Properties window, fill out the information about the database server (see Figure 13). You'll likely need to get this information from a systems administrator. The name is any name you assign; the network name is the actual machine name or IP. Click the Ping button to ensure the master station can communicate with the server, then click OK.

Figure 13: Computer Properties window

7. Follow steps 4, 5, and 6 for the Web server.

8. Return to the Method: Network tab and select the database server as the server machine and the Web server as the client machine (see Figure 14), then click OK.

Figure 14: Method: Network tab with client and server options selected

9. Ensure that no other people or systems are accessing the database and that you have the correct protocols selected, then start recording on Master Station2.
10. Launch the load test containing the transactions you want to capture on Master Station1.
11. Stop recording after the load test has completed executing on Master Station1.
12. View, edit, play back, and analyze the new script against the database server.

While the steps seem straightforward enough, this method is actually pretty complicated. Some things to remember before and while attempting this method are as follows:

• You'll have to use network or proxy recording on Master Station2. Network is generally easier, if the tiers you're interested in are on the same subnet and on a Rational-supported network configuration. Proxy recording is generally considered to be difficult to configure.
• You may not have the proper license for the communication protocol between the tiers you want to isolate, which means you'll have to either interpret socket traffic or obtain the proper protocol license.
• Editing these scripts, even with a supported protocol, is often complicated because you might have no realistic way of knowing which client-side activity generated a request. For example, I once tested an application where each client-side activity was generating two identical database requests. This type of testing didn't help to track that down.
• You'll be collecting response times for only one tier per test; you won't be collecting the response times for each tier during the same test.

Capturing response time by tier with other methods

There are several other ways to capture response time by tier, but they all involve either third-party tools or instrumenting your system to collect (log) data. If you don't already have a third-party tool, I highly recommend that you get one as a complement to TestStudio; however, I feel compelled to caution you that these tools are generally very specific to your application architecture. For instance, some of these tools will collect information only on J2EE applications, others only on .NET platforms. I suggest you do a Web search on the phrases "performance monitoring," "application performance management," "performance analysis tool," and/or "performance profiler" and compare the tools available to your specific application.

If you don't have, and won't be getting, any third-party tool to complement TestStudio, there's one other way to capture response time by tier. This method involves close coordination with your developers and administrators and is also very specific to your application. The basic steps are as follows (a log-parsing sketch follows the list):

1. Identify the tier(s) you want to capture response times for.
2. Work with your developers and administrators to configure logging on those identified tiers to capture the time stamp of the arrival and/or departure of the transaction(s) you're interested in.
3. Ensure that all computers in the system and the load-generation machines have their clocks synchronized.
4. Execute the load test.
5. Parse the arrival and departure time stamps from the log files.
6. Correlate those time stamps with the end-to-end response times from TestManager using a spreadsheet program like Excel.
7. Convert those time stamps to response times -- generally by averages -- and put them into charts and graphs.
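Step 5 is where most of the fiddly work lives. The sketch below shows one hypothetical way to turn arrival and departure time stamps from a single tier's log into per-transaction response times. The log format (an ISO time stamp, an event name, and a transaction ID separated by commas) is invented for the example, so the parsing will need to be adapted to whatever your developers actually log.

    # tier_times.py -- illustrative parser for one tier's log (assumed log format).
    # Assumed line format:  2004-04-21T10:15:32.125,ARRIVE,txn-0042
    #                       2004-04-21T10:15:33.010,DEPART,txn-0042
    from datetime import datetime

    def tier_response_times(log_path):
        """Return {transaction_id: seconds between ARRIVE and DEPART} for one tier's log."""
        arrivals = {}
        durations = {}
        with open(log_path) as log:
            for line in log:
                stamp, event, txn_id = line.strip().split(",")
                ts = datetime.fromisoformat(stamp)
                if event == "ARRIVE":
                    arrivals[txn_id] = ts
                elif event == "DEPART" and txn_id in arrivals:
                    durations[txn_id] = (ts - arrivals.pop(txn_id)).total_seconds()
        return durations

    if __name__ == "__main__":
        for txn, seconds in sorted(tier_response_times("web_tier.log").items()):
            print(txn, round(seconds, 3))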

This isn't a simple process, but it does provide lots of useful information. I'll show you a sample table and graph I created using this method (see Figures 15 and 16).

Figure 15: Response-time-by-tier base table

Figure 15 is a spreadsheet table containing the base data needed to extract a response-time-by-tier graph. The basic steps to create this table are as follows:

1. Copy the data from the table in the Response vs. Time report output into Excel.
2. Convert the "Ending TS" column to "Time into Test" by subtracting the value in the first row from the values in all subsequent rows in the column and then dividing by 1000 (to convert to seconds).
3. Copy the time stamps and labels from the log files into Excel. This process will vary greatly based on how your tool logged the time stamp.
4. Convert the time stamp to "Time into Test." This process will vary greatly based on how your tool logged the time stamp.
5. Subtract the "Time into Test" value from the time stamp.
6. Mesh the two data sets together so that the "Time into Test" values are sequential and grouped by timer/label.
7. Generate the desired chart or graph based on the table (the averaging behind a graph like Figure 16 is sketched after these steps).
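Once the base table exists, producing the averages behind a graph like Figure 16 is just a grouping step. A tiny sketch, assuming the meshed rows are available as (timer or label, seconds) pairs:

    # Average response time per timer/label -- the aggregation behind a chart like Figure 16.
    from collections import defaultdict

    def average_by_label(rows):
        """rows is an iterable of (label, seconds) pairs; returns {label: average seconds}."""
        totals = defaultdict(lambda: [0.0, 0])
        for label, seconds in rows:
            totals[label][0] += seconds
            totals[label][1] += 1
        return {label: total / count for label, (total, count) in totals.items()}

    # Example usage with made-up numbers:
    sample = [("web server", 4.1), ("web server", 4.3), ("app server", 0.6), ("database", 1.2)]
    print(average_by_label(sample))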

Figure 16 is the graph created from the averages of the values in the table in Figure 15.

Figure 16: Response-time-by-tier graph

If you look closely at Figure 16, it becomes clear that about 8 seconds of the 10.16-second response time is being "lost" in the Web server (4 seconds each way). In this case, further research showed that the actual problem was a misconfigured router on the Web server's subnet that was imposing an artificial 4-second delay.

A note about performance-monitoring tools

As I mentioned above, there are hundreds of third-party tools available to assist with the capture of resource utilization statistics and response time by tier. There are fewer, but still many, third-party tools that provide a combination of these functions. These are commonly known as performance-monitoring tools. As an example of the kinds of tools available, let's take a look at Tivoli systems and applications monitoring solutions from IBM. If you aren't familiar with this family of solutions, it's officially described this way:

"IBM Tivoli systems and applications monitoring solutions enable you to deploy a single monitoring solution for most or all of the resources in your environment. This allows your system monitors to share a common reporting engine, graphical user interface, and data repository. Building on the science of Autonomic computing, the IBM Tivoli suite of monitoring tools gives you several new capabilities such as the ability to automatically correct many component-level problems before they occur. It can also identify the persistence of problem conditions and feed key operating metrics into other layers of your management technology system."

Tivoli, like almost all of the other performance-monitoring solutions, doesn't deliver with a robust load-generation component, but using it with a load generator like TestStudio greatly enhances the performance-engineering process. The white paper titled "IBM Tivoli Monitoring Solutions for Performance and Availability" does a good job of explaining some of what this particular solution has to offer. It's beyond the scope of this article to go into details about what Tivoli, or any other performance-monitoring solution, can add to your performance-engineering process, but I encourage you and your development team to jointly research a performance-monitoring tool that fits your needs. If your organization conducts performance-engineering exercises often, the time you'll save by obtaining and using one of these tools will quickly outweigh its cost.

Interpreting tier-specific metrics

Often, tier-specific metrics leave little doubt as to their meaning, but as you saw in the "response time by tier" example, even these detailed metrics may not hold all of the answers. Now I'll share the methods I've found most useful, individually and collectively, to interpret tier-level metrics.

Look for the obvious

First and foremost, look for the obvious. In our "response time by tier" example, the obvious was that the Web server (presentation tier, more precisely) was eating up 4 seconds every time data passed through it. In the overlaid scatter chart shown in Figure 20 in Part 6 of this series, it was obvious that the CPU utilization of the application server peaked above acceptable levels shortly before the poor performance began. These are the kinds of clues we're looking for. Unfortunately, in both cases these were still symptoms, not causes. In both cases, those symptoms gave me and the development team ideas about where to look next.

Consult the development team

Once you either find some obvious abnormalities, symptoms, or clues or realize that you haven't found any, you should contact your development team and discuss what those findings mean. If you found no clues, maybe it means that the metrics you collected weren't the right ones, or that there wasn't enough load on the system, or that you eliminated a trigger event when you modified your tests. You probably won't know which (if any) of these is the case without help from your development team. In the cases where you do find clues, the development team definitely wants to be involved. These clues are what point to either the next round of tests or to what they'll find themselves tuning in the next hours, days, or weeks. The point is, when you get this far into your performance testing, you and the development team really form a consolidated performance-testing-and-tuning team. Most development teams aren't used to working like this, so it's up to you to be the team leader and ensure that there's constant two-way communication about tests, results, clues, and suspicions. More than half the time, I find that I'm able to track down a bottleneck not by my keen insight or superior testing knowledge, but rather by listening to developers when they say things like "I wonder if . . . ," "Did you try . . . ?" or "What if we . . . ?" You'll also often find that after you show the results to your development team, one of them will come back to you later and say, "I found it," when you didn't even know he or she was looking for it.

Change your test to prove your theory

Once you see your tier-specific results, it'll be almost impossible for you not to form theories about what caused those results. This would be like reading a mystery novel and not trying to guess "Who done it?" before the final chapter -- you just can't do it. Instead of trying to wait for the last chapter, I recommend embracing those theories and changing your test to prove or disprove them immediately. Once again, you'll likely need the assistance of the development team, but by this point you should have a good working relationship with them. Besides, most developers I've worked with really enjoy this part of the performance-testing process. Honestly, I have to agree with them. To me this is the fun part; it offers the same excitement a treasure hunt did when I was a kid . . . "I wonder what we'll find if we follow all the clues correctly?!?"

Is it time to tune?

Once again, we come to the key question, "Is it time to tune?" By now, we'll have successfully pinpointed the bottleneck about 75% of the time. If you and the development team haven't gathered enough information at this point to be able to tune the system, you have a pretty elusive bottleneck. You may have noticed that in some of the examples we've been following, we aren't ready to tune. In these cases we still have to develop specific tests to exploit the symptoms before we can resolve them. That's our topic in Part 10.

Summing it up

In this article we looked at some ways to isolate symptoms and metrics by logical and/or physical tier of the system. This process isn't always easy, but it does add a significant amount of information to what we already know about our bottleneck symptoms. It's critical to build a close relationship with your development team as you dive deeper and deeper into the application, if you haven't already. They'll be your best tool for collecting information about, and ultimately finding and tuning, performance bottlenecks.

Beyond performance testing part 10: Creating a test to exploit the failure or bottleneck

Scott Barber, Performance testing consultant, AuthenTec

Summary: You know what the bottleneck is functionally and where it is architecturally, but you still need to track down the cause. This article shows you how to identify causes by building very specific tests to exploit a bottleneck or failure.

Now that you know what the bottleneck is functionally and where it is architecturally, you're ready to track down the cause. If you've made it this far, none of your other tests have isolated the bottleneck sufficiently to resolve it. That's what exploiting bottlenecks is all about. According to Roget's II: The New Thesaurus (Third Edition, 1995), to exploit is "to put into action or use: actuate, apply, employ, exercise, implement, practice, use, utilize." You exploit a bottleneck by building very specific tests that exercise the weakness in the system as an aid to the tuning effort.

This article concludes the four-article group on the theme "finding bottlenecks to tune." This is the last step down the tuning path where the performance test engineer serves as the lead. By the conclusion of this article, you should be confident in your ability to work with the development team to identify and exploit areas of concern in a way that adds significant value to the overall development process. So far, this is what we've covered in this series:

• Part 1: Introduction
• Part 2: A performance engineering strategy
• Part 3: How fast is fast enough?
• Part 4: Accounting for user abandonment
• Part 5: Determining the root cause of script failures
• Part 6: Interpreting scatter charts
• Part 7: Identifying the critical failure or bottleneck
• Part 8: Modifying tests to focus on failure or bottleneck resolution
• Part 9: Pinpointing the architectural tier of the failure or bottleneck

This article is intended for mid- to senior-level performance testers and members of the development team who work closely with performance test engineers. If you haven't read Parts 5, 6, 7, 8, and 9 of this series, I suggest you do so before reading this article.

Why exploit identified bottlenecks?

Inevitably, whenever I get to this point in a training course I'm asked, "If we know where the bottleneck is, why do we need to exploit it? Isn't that redundant?" The truth is that it's only redundant if the development team already knows what they need to tune and how to tune it. More often, just identifying the tier isn't enough. To explain why this is so, let's return to our hydrodynamics analogy from Part 7, in which we compared the flow of activity through a software system to the flow of water through a pipe system.

Figure 1 is a simplistic representation of what the inside of a tier might look like if it were a hydraulics system. The pipe that represents our network comes into the tier from the top left. Once the water leaves that pipe it enters a pool with various pipes exiting the bottom. This represents requests entering a processing queue where there are a limited number of processing units (likely threads) to handle those requests. Which "exit pipe" the request flows through is based on the type of request that's being made. Notice that the exit pipes are of various sizes and may or may not be open at a given point in time.

Figure 1: The tier as a hydraulics system

Without delving too deeply into the different possibilities for request processing, suffice it to say that any given tier can have more or fewer processes (pools) for a request to go through, depending on the specific request and/or the design of your system. The number, size, and availability of "exit pipes" from these processes can have a significant effect on the overall performance of the system. I'm sure you can see that just pointing to the tier and saying, "The bottleneck's in there" probably isn't good enough. To tune the system, we often need to help developers narrow the focus down to the specific process or even to the parameters (inputs and outputs) of that process (symbolized by an exit pipe). That's what we do when we exploit bottlenecks.

Ways to exploit identified bottlenecks

In Part 8, I discussed how to modify tests to focus on bottleneck resolution. Now you're going to modify existing tests again and/or generate new ones to get more information about exactly what's causing the bottleneck in the tier you've identified. I'll explain how to exploit bottlenecks for tuning by finding bounds conditions, breakpoints, and resource constraints.

Find bounds conditions

One of the ways to exploit bottlenecks is to execute tests that focus on identifying bounds conditions rather than running under expected normal conditions. These bounds conditions are a little different from the bounds conditions we test during functional testing. We're not talking about testing to see if an input field accepts numbers with more than six digits correctly. We're talking about testing the bounds of performance -- for example, seeing how the system performs when executing only searches that return excessively large amounts of data, like searching for all book titles that include the letter t, or executing an extremely high volume of searches. These types of tests will often show results that allow us to say more than just "This search seems slow." Under these extreme conditions we look for information like the following:

• How many searches can I do before the memory starts to rise above 80% utilized on the database server?
• How many rows of data must I be requesting before the system returns a time-out message?
• How many times can this activity be conducted in a 10-minute period before all of the available threads are consumed?

Each of these facts tells us something about bounds conditions. In both functional and performance testing, unexpected behavior tends to occur under these conditions. In performance testing, these unexpected behaviors often point us to the actual cause of the observed symptoms under expected usage conditions.

On a recent project, we found that after we applied SSL to our Web site all of the pages slowed down by about 30%. While we expected the login activity to slow down, we didn't expect subsequent pages to be slower. At first we thought that the login process itself was slowing down the entire Web server, so we created a "login only" test and monitored the resources on the Web server (where the logical authentication tier resided). This revealed that logging in under load was not the problem, so we decided to exploit the bottleneck instead by looking for the bound where performance degraded. It turned out we didn't have to look far.

We started by limiting our test to a single user logging in, navigating to the search screen, searching, and logging out, and we monitored the authentication tier through full logging. When we evaluated that log, we found that every page was checking with the authentication tier to see if the user had permission to access that page rather than simply getting the ACL (access control list) from a client-side cookie as intended. Once the developers saw that the problem was with the retrieve permissions process in the authentication tier, they were able to resolve the problem in less than an hour. The performance of all pages improved to what it had been before SSL was applied, and login gained back 50% of the performance it had lost when SSL was applied.
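One way to hunt for bounds like these, outside of whatever your load tool offers, is a simple ramp that keeps increasing the offered load until a monitored resource crosses a threshold. The sketch below is purely illustrative: run_search() and db_memory_percent() are hypothetical stand-ins for however you drive the activity and read the metric in your environment, and the 80% threshold simply echoes the first question in the list above.

    # bounds_ramp.py -- illustrative bounds-finding ramp (all helpers are hypothetical stand-ins).
    import time

    def run_search(concurrent_users):
        """Stand-in: drive the search activity at the given concurrency for a short period."""
        time.sleep(0.1)  # placeholder so the sketch runs as written

    def db_memory_percent():
        """Stand-in: return current memory utilization on the database server."""
        return 40.0  # placeholder value

    def find_memory_bound(threshold=80.0, step=5, max_users=500):
        """Increase concurrency until database memory passes the threshold; return that concurrency."""
        for users in range(step, max_users + 1, step):
            run_search(users)
            observed = db_memory_percent()
            print(f"users={users} db_memory={observed:.0f}%")
            if observed > threshold:
                return users
        return None  # bound not reached within max_users

    if __name__ == "__main__":
        print("memory bound reached at concurrency:", find_memory_bound())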

Find breakpoints

Deliberately causing the application to fail by running it under conditions even more extreme than the ones under which it shows symptoms of poor performance is another method of exploiting a bottleneck. Such breakpoints, where the bottleneck becomes a failure, are often uncovered while searching for and testing at bounds conditions. Like finding bounds conditions, determining the point at which the system fails due to an extreme performance case will likely point to a cause. Information about breakpoints will most often be found not in TestManager but rather in the application server logs. Breakpoints are commonly identified by error messages being returned, system or browser time-outs occurring, and/or nothing returning at all (that is, the page just sitting there forever). Any of these conditions can yield valuable information to the developer who's trying to help track down and tune the bottleneck.

I also used this method on a recent project. In this case we had determined that reports seemed to slow down dramatically under load. Through all of our monitoring, we were unable to track down the reason. Monitoring the report server seemed to point to the database returning data slowly, but monitoring the database showed the requests coming back quickly. We finally decided to just increase the number of requested reports until we received an error message. After increasing the reporting load significantly, we did receive an error message -- indicating an overflow error. While this error didn't make sense to most of the team, it did make sense to the administrator of the report server. From that error message she was able to determine that when the report server received a request for a report, it was sending a request to get the data from the database and putting that data into a single processing queue. This meant that the data for all of the reports was stacking up and only one report was being generated at a time, where the actual intent was for this server to have five parallel processes and not just one. After a few calls to support, the administrator was able to configure the report server to handle the five parallel processes, and our problem was resolved.

Find resource constraints

We discussed how to monitor resource utilization in Part 9. During this monitoring, you and your development team should be looking for resource utilization that's above the expected volume and/or is above the recommended usage for that particular resource. If adding stress (such as adding additional high-volume searches) pushes resource utilization to a higher-than-expected rate, this may indicate that the activity being tested isn't managing that resource adequately during less-stressful times, either. The inadequate resource utilization may not be obvious during low-stress situations but may still be the cause of the symptoms. Only by exploiting the bottleneck by intensifying it can we find out for sure if resource utilization is the cause. The most common example is memory utilization. Under large loads, one (or more) of your servers is likely to experience memory utilization consistently greater than 80%. Once this number grows to more than 80%, performance almost always suffers. In these cases, it's up to the developers and architects to determine if the application is managing memory poorly, if configuration settings need to be adjusted, or if more memory is required.

Handing off leadership to the development team

You may have noticed that the farther down the trail of chasing bottlenecks you go, the more and more closely you're working with the development team. Interacting with the development team is crucial to the process of building tests to exploit bottlenecks. You'll very rarely have a deep enough understanding of the system to build tests and collect data at this level on your own. In cases where you're able to exploit the bottleneck simply by modifying test data, inputs, and load, the development team is still critical in the results interpretation stage. As your tests get narrower and narrower, and closer and closer to the actual code, the development team becomes increasingly critical in the test development stage. The development team is also normally where the best guesses come from as to what tests to develop to try to exploit a particular bottleneck, not just how to develop them, as the examples below will show.

This is the point of transition from the testing phase where the performance test engineer leads and the development team assists to the phase where the development team leads and the performance test engineer assists. It's important to explain to the development team that now you're helping them, not the other way around, and that you're going to exploit bottlenecks with the intent of helping them find the root cause of the symptom, not just to ferret out more symptoms. Be available to the development team and be open to building and executing tests on a moment's notice that you may not completely understand (though it's still a good idea to ask questions and gain understanding throughout the process). And don't be discouraged if developers start digging more independently at this point in the process.

Following the development team's lead: Example 1

As an illustration of the crucial role the development team can play at this point, let's return to an example from Part 6. Figure 11 there, reproduced as Figure 2 below, is a scatter chart depicting a test where the response times experienced a significant slowdown about halfway through the test execution. As you may recall, the chart shows a test run with "caching" at the front, a "good" run for a period after that, then a mostly "classic slowdown" toward the midway point.

Figure 2: A scatter chart showing a slowdown midway through a test

In an attempt to determine the cause of that slowdown, we looked at several common resource statistics associated with the servers involved. We found that the CPU utilization of the application server reached unacceptable levels shortly before the response times increased (see Figure 3 below, a reproduction of Figure 20 in Part 6). In Part 6, I mentioned that we then decided to monitor the CPU queue length for that same test. The reason I stress we is that it was the developer's idea to look specifically at that metric, which wasn't among those that I initially recommended.

Figure 3: The scatter chart overlaid with application server CPU utilization data

Monitoring the CPU queue length resulted in the chart in Figure 4 below (a reproduction of Figure 21 in Part 6), which showed a direct correlation between the queue length and the poor performance. I can't say whether I would have looked at that metric eventually, but for whatever reason, I wasn't planning to look at it initially. That open communication between me and the developer saved at least one extra step and revealed the actual cause of the poor performance in this test.

Figure 4: The scatter chart overlaid with application server queue length data

Incidentally, the test that generated those results was a test that had been created to exploit what we thought was a database bottleneck. The initial symptoms had been that activities writing to the database were slow. Building tests that exploited that activity and monitoring various resources allowed us to track the actual cause to code processing in the application server.

Following the development team's lead: Example 2

Another example of following the development team's lead is the one illustrated in Figure 16 of Part 9, reproduced as Figure 5 below, where looking at response times by tier revealed that the Web server seemed to be "eating" 4 seconds every time a request went through it. When I reported that finding it seemed ridiculous to both me and the development team, so we developed some more tests.

Figure 5: Response-time-by-tier graph

The first thing we did was put some graphics of various sizes on the Web server -- 1 KB, 10 KB, 100 KB, and 1000 KB. I then manually wrote four test scripts in the IBM® Rational® Robot software to retrieve each of those graphics and time the response, and I executed these four scripts 100 times each. Looking at the results, I found something very interesting. The 1 KB graphic always returned in roughly 4.1, 8.1, or 12.1 seconds. The other graphics returned in roughly the same amount of time -- for instance, the 100 KB graphic returned in roughly 4.3, 8.3, or 12.3 seconds. In all four cases, about 60% of the responses returned in a little more than 4 seconds, 30% returned in a little more than 8 seconds, and the remaining 10% returned in a little more than 12 seconds.

Having no idea what those measurements meant, we then added logging to the Web server, where we time-stamped the arrival of the request and the departure of the first byte of the response. In all cases, that measurement was well under .01 seconds, indicating that the problem wasn't actually in the Web server at all. We then contacted our network administrators, since the only piece left was the network between the load-generation machine and the Web server. First they put a sniffer on the subnet of the load-generation machine and validated that our results matched what was appearing on the network. On a wild hunch, we then moved the network sniffer to the subnet containing the Web server. When we compared those numbers, we found that the round trip for the requests/responses on that subnet didn't have the "4-second steps," as we'd come to call them. Some analysis showed that the only things between those two subnets were some passive hubs and a router. After that, an expert on configuring that particular model of router was called in and reconfigured the router so it was no longer imposing the artificial 4-second delay.
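For readers who don't have Robot handy, the same experiment can be approximated with a short script. This sketch times repeated GET requests for each graphic using Python's standard library; the URLs are hypothetical, and it measures the time to download the complete response, so the numbers are only meant to be compared with each other, not with what the Robot scripts reported.

    # graphic_timer.py -- illustrative re-creation of the "fetch each graphic 100 times" experiment.
    import statistics
    import time
    import urllib.request

    # Hypothetical URLs for the four test graphics placed on the Web server.
    GRAPHICS = [
        "http://webserver.example.com/perf/1kb.gif",
        "http://webserver.example.com/perf/10kb.gif",
        "http://webserver.example.com/perf/100kb.gif",
        "http://webserver.example.com/perf/1000kb.gif",
    ]

    def time_fetch(url):
        """Return the seconds taken to download one graphic completely."""
        start = time.time()
        with urllib.request.urlopen(url) as response:
            response.read()
        return time.time() - start

    if __name__ == "__main__":
        for url in GRAPHICS:
            samples = [time_fetch(url) for _ in range(100)]
            print(url, "median:", round(statistics.median(samples), 2),
                  "max:", round(max(samples), 2))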

Moving into a different kind of testing

All through the "User Experience, Not Metrics" series and up until Part 8 of this series, we focused on what could be categorized as "black box" performance tests -- that is, tests created without reference to the source code or other information about the internals of the product. In the words of Cem Kaner, senior author of Bad Software and Lessons Learned in Software Testing and professor of computer sciences at the Florida Institute of Technology, "the black box tester consults external sources for information about how the product runs (or should run), what the product-related risks are, what kinds of errors are likely and how the program should handle them, and so on." This is in contrast to what might be called "white box" testing, which Kaner defines as "testing with thorough knowledge of the code." In one discussion, Kaner goes on to say, "The programmer might be the person who does this. I've seen members of independent test groups do this type of testing. Some risks that are invisible to the black box tester aren't too hard to see in the source, such as weak error handling, a weak model of interrupt-triggering events, or excessive coupling of different parts of the program. The test groups that do this type of work usually specialize one or a few people who do nothing but read the source code looking for interesting / risky areas and then design thorough tests to exploit those risks."

In Part 8, when we began designing our new tests in interaction with the development team, we started getting into the area of tests that could be classified as "gray box" tests. According to Kaner, design of gray box tests is educated by information about the code or the program operation of a kind that would normally be out of view of the tester. Kaner makes the point that the distinction between black box, white box, and gray box testing is in the thinking of the tester. Thinking that's focused neither on the usage-related world external to the program nor on the source of the program but is more focused on the technical relationship between the program and the system is what he refers to as gray box testing.

Whether or not you like those particular terms, I'm certain you'll agree that at this level of bottleneck detection and tuning we've moved from user experience (usage-related) tests to tests that are focused on the technical relationship between the program and the system. It's often the case that exploiting bottlenecks isn't going to be done simply by modifying user-centric load-generation scripts. As it turns out, most of the tests we as performance testers conduct are gray box tests. While we begin designing our tests thinking about how users interact with the system, we then start thinking about how the system works and modify our initial design accordingly. For example, we may add a script that runs a particular report simply because we know that it accesses data from a particularly large table in the database. The fact that we decide to create that script based on the design of the database makes it a gray box test.

Knowing when to put the load-generation tool away

Load-generation tools can see only so far into the application. No matter how good your tests and analysis are, you'll sometimes have to dig into the application with your development team, often all the way to the code level. You could say that this is where you cross the line from gray box testing into white box testing. I'm not aware of a single load-generation tool on the market today that's designed for white box testing. Because of this, one of the best ways you can assist the development team at this level is with tools that complement your load-generation tool.

An example of a tool at your disposal is the test harness and custom (handwritten) script method that I mentioned in Part 9 to access the database directly. This method can be used to access virtually any component of the application, right down to an individual line of code. Most of the time the test harness is built by the developer to complement the performance test engineer's individual skills and scripting preferences to exploit a very specific area of the application that the developer wants to be able to test in a repeatable way. Often, response time measurements from these tests are embedded in application logs that the developer reviews herself, and thus the performance test engineer rarely sees the results. This is completely natural. By this point, the developer is leading and you're assisting. In this case you're assisting by providing input into the system in a way that the developer either can't do or would find prohibitively difficult, and the developer is doing the analysis. This process is sometimes even thought of as collaborative unit testing.

If you can't exploit bottlenecks using test harnesses and hand-coded scripts, it's probably time for a third-party tool to complement the IBM® Rational® TestStudio software. As I mentioned in Part 9, there are many such tools on the market, and they're very specific to the application architecture. The tools I'm referring to are often classified as code analyzers, runtime analyzers, code profilers, or even performance profilers. As an example of the kinds of tools that are available, let's take a look at IBM Rational's runtime analysis suite, IBM® Rational® PurifyPlus. If you're not already familiar with this product, it's officially described this way:

"Rational PurifyPlus is a complete set of runtime analysis tools designed for improving application reliability and performance. PurifyPlus combines memory error and leak detection, application performance profiling, and code coverage analysis into a single, complete package. Together, these functions help developers ensure the highest reliability and performance of their software from its very first release."

Many third-party tools, including PurifyPlus, are made to work independently of the load-generation tool but can be even more helpful when used in conjunction with it. You can add a lot of value by knowing how to use one or more of these third-party tools in conjunction with the load-generation and/or bottleneck-focused scripts you've already created. The bottom line is that familiarizing yourself with at least one of these tools will help you work more effectively with the development team.

Is it time to tune?

Now it's time to tune. If you've followed all the steps outlined in the last four articles, even to the point of obtaining and using a performance profiler, you've no doubt identified the cause of the performance symptoms and shared that information with your development team. As I've mentioned, tuning is an iterative process. Be prepared to reexecute all of the tests you've created to pinpoint and exploit the bottleneck, in reverse order, on request from the developer. This is done to ensure that the symptoms have been resolved all the way back to the expected user loads and that no new symptoms have arisen as a result of the tuning effort.


Summing it up

This concludes the four-article theme "finding bottlenecks to tune." In this group of articles I've discussed detecting performance suspects, distinguishing between failures, slow spots, and bottlenecks, and how to track down performance bottlenecks to a level of detail great enough for developers to tune them. I've stressed the increasing levels of interaction with the development team throughout this theme. Without a good relationship with the development team, it's unlikely that you'll ever be certain of anything more concrete than suspects, symptoms, and hunches. Working together, you and your development team should be able to detect and tune bottlenecks quickly and efficiently. In the next four articles I'll discuss more advanced areas where the performance test engineer can serve in a support role, to include tuning.

Beyond performance testing part 11: Collaborative tuning

Scott Barber, Performance testing consultant, AuthenTec

Summary: The step beyond performance testing is performance tuning. The performance tester isn't typically involved in this process but should be, for the reasons detailed in this article. Also detailed are ways you can get more involved in a team approach.

In the course of the previous six articles in this series you've learned how to identify, confirm, exploit, and in some cases even resolve failures, slow spots, and bottlenecks. Traditionally, this is as far as performance testing ever goes, but the title of this series is "Beyond Performance Testing." The step beyond performance testing is performance tuning. Tuning is the process by which all the remaining unresolved failures, slow spots, and bottlenecks are improved, fixed, or mitigated. The performance tester isn't typically involved in this process, but it's my belief that this is a mistake. For this reason, I'm starting another four-article theme intended to help you, the performance tester, become an integral part of what should now be the performance testing and tuning team. This article begins the theme by discussing the advantages of this approach, the roles and contributions of the members of this team during the tuning process, and the ways you can get more involved in a team approach. So far, this is what we've covered in this series:

• Part 1: Introduction
• Part 2: A performance engineering strategy
• Part 3: How fast is fast enough?
• Part 4: Accounting for user abandonment
• Part 5: Determining the root cause of script failures
• Part 6: Interpreting scatter charts
• Part 7: Identifying the critical failure or bottleneck
• Part 8: Modifying tests to focus on failure or bottleneck resolution
• Part 9: Pinpointing the architectural tier of the failure or bottleneck
• Part 10: Creating a test to exploit the failure or bottleneck

This article is intended for mid- to senior-level performance testers and members of the development team who work closely with performance test engineers. If you haven't read Parts 5, 6, 7, 8, 9, and 10 of this series, I suggest you do so before reading this article.

Why tune collaboratively?

If you're new to performance testing, you may be surprised to hear that there's often a lot of resistance to the idea of collaborative tuning. A common argument is that there should be a clear division of tasks between performance testing and performance tuning. This stems from the QA-versus-development mindset that prevails in the software development industry as a whole. While I believe this is a problem in general, I think it's an even bigger problem for performance testing than it is for other types of testing because division of tasks decreases efficiency and keeps so many potential gains from being realized. In the following sections, I'll discuss the potential advantages of tuning collaboratively and the disadvantages of not doing so.

Bringing tester and developer mindsets together

We've known and documented for years that testers and developers think differently. Most of the time this is a good thing, but sometimes it results in an "us versus them" attitude. In general, the difference in mindsets can be summarized as follows:

• Testers tend to look for ways to make the application perform incorrectly in every possible situation.
• Developers tend to try to make the application perform correctly in the situations in which they envision the application being used.

This difference in thought is what leads to developers responding to reported defects with the statement "But no one would ever really do that!" Testers know that no matter how unlikely a user action is, someone will do it, and that risk needs to be addressed. Often during functional or system testing, this difference leads to confrontations about what should or shouldn't be fixed, with hard feelings resulting on one side or the other based on a manager's decision about whether the defect gets fixed. However, when it comes time to start tuning for performance, you want these two thought processes working together collaboratively, not clashing as adversaries.

The best way to build a collaborative relationship in performance testing is to introduce the performance tester to the development team early in the project as a "resource." Often long before the first build, the developers will have parts of the application built and will want to know how they perform -- essentially, they'll want to conduct performance unit testing. This is where the performance tester comes in. The developers typically don't have access to the tools or the tool expertise to conduct these performance unit tests. The performance tester can build these tests on the spot and help resolve many performance issues even before the first official test. In this way, the tester becomes a collaborative part of what will later be the performance-tuning team.

Looking at the big-picture view alongside the detail view

Developers are focused on the details during performance tuning. This is a good thing, since the performance issues generally live in the details of the application. In contrast, performance testers focus on the big-picture, end-user experience. This is also a good thing, as the end user should always have an advocate in the development process. These differences only become a problem when the two views are considered one at a time and not side by side. For example, a developer may fix a particular symptom (such as slow login) but damage the overall performance of the system (by making every page authenticate instead of downloading the access list during login, for example). With a separation of tasks, the developer may not know that the fix had a significant side effect (making every page except login slower) until the next build is released to the performance tester. Worse, if they're not working together when the performance tester gets the build, the tester will have no idea where to look for potential side effects (that is, may only retest the login page and not know to test all the other pages). If they're working together, the performance tester can build specific tests on the fly to help understand the effects of the changes as they happen. That way the developer can keep the detail focus, the tester can keep the end-user focus, and both can be confident that they're getting quality feedback.

Streamlining the tuning cycle

Usually, here's how the tuning cycle goes: the performance tester identifies an issue, the developers run off and fix it, the developers send a new build to the performance tester, the tester finds more side effects than solutions and sends the build back to the developers, and so on. This is all well and good, except that when the performance tester and the developers aren't seen as part of a collaborative testing and tuning team, each step in this cycle often takes days. As a result, it's common for a week to go by between issue identification and side-effect identification. This might not sound bad to you until I mention that most issues take less than a day to resolve. So where do the other four days go? The following timeline typifies what happens in my experience when tuning isn't a collaborative process:

Day 1 AM -- Tester identifies issue.
Day 1 PM -- Tester schedules meeting to demonstrate issue to development team.
Day 2 PM -- Tester has meeting to demonstrate issue to development team.
Day 3 AM -- Developer assigned to resolve the issue, who inevitably wasn't in the meeting, comes to tester's desk for personal demo.
Day 4 AM -- Developer believes issue is resolved and requests that a special build be promoted into the test environment.
Day 5 AM -- Tester retests application and finds side effects.
Day 5 PM -- Tester schedules meeting to discuss side effects.
...
Day X -- Developer or configuration manager promotes performance-driven modifications to test environment with next scheduled build for testing.

It's obvious how inefficient this process is. When the tester works together with the developers as a collaborative team, the timeline looks more like this:

Day 1 AM -- Tester identifies issue and calls developer.
Day 1 AM -- Developer comes to tester's desk for personal demo.
Day 2 AM -- Developer believes issue is resolved and asks tester to validate this in the development environment.
Day 2 PM -- Tester retests application, finds side effects, and calls developer.
Day 2 PM -- Developer comes to tester's desk for personal demo.
...
Day X -- Developer or configuration manager promotes validated performance-driven modifications to test environment with next scheduled build for final validation.

As you can see, this team approach leads to a drastically shortened tuning cycle. It eliminates several intermediate steps that cause unnecessary additional work for individuals who would prefer brief progress updates. The feedback loop is now roughly one day instead of one week long. The value of this time savings is incalculable when you consider that most projects execute their first collective performance test two weeks before "go live" day. With the typical approach, the team gets a chance to fix only one performance issue and one set of side effects. With the collaborative approach, there's the opportunity to address five performance issues and their side effects at no additional cost, all as a result of team building.

The testing and tuning team

I've referred thus far to the performance tester and the developer, but there are several other key players on the performance testing and tuning team. Your team may include slightly different members or titles, but the roles I outline here are common in my experience.

Project manager

The project manager has overall responsibility for the project, including development, testing, schedule, budget, and personnel.

Performance-related focus

The project manager is ultimately responsible for the performance of the application as delivered. Sometimes the project manager is the direct supervisor of the performance test engineer, and other times a subordinate of the project manager fills that role. In either case, the project manager should have a direct relationship and establish continuing one-to-one communication with the performance test engineer.

Contribution to collaborative tuning

The project manager has to believe in and promote a collaborative testing and tuning team or it generally won't happen. This person sets the overall tone for the testing and tuning exercise. If the project manager doesn't actively support the collaborative effort, the typical tuning cycle outlined earlier is sure to result.

Lead developer/architect

The lead developer/architect is generally the top technical person on the project, with overall responsibility for the development effort. In most cases, this individual reports to the project manager. The other developers, system administrators, database administrators (DBAs), and so on report to this person. Occasionally the performance test engineer will also report to the lead developer.

Performance-related focus

The lead developer is ultimately responsible for delivering the application with the required performance. This person is also responsible for ensuring that the development team and the performance test engineer communicate effectively.

Contribution to collaborative tuning

The lead developer/architect typically assigns tasks to the other developers, so it's up to this person to make performance testing and tuning a priority. If this person doesn't give the appropriate developers the time and freedom to participate in the testing and tuning activity, this activity will usually end up getting overlooked. The lead developer/architect is also generally the individual who must be convinced to bring in outside experts when needed.

Other developers/administrators/DBAs

Other developers, system administrators, and DBAs are generally responsible for developing, creating, and/or maintaining one specific component of the application. They generally report to the lead developer/architect, at least for their tasks related to the project at hand. I'll use the term developers to refer to members of this group as a whole. The developers generally begin a project assuming that the performance test engineer is "just a tester." Until you establish yourself as a member of their team for this part of the project, they probably won't work with you very much.

Performance-related focus

Each member of this group is responsible for the performance of her or his particular focus area and, ideally, the performance of the integration with other developers' focus areas.

Contribution to collaborative tuning

These individuals are the ones who actually tune the application to meet the performance goals. They're also the ones who are best served by having personal relationships and tight feedback loops with the performance test engineer. They're at the core of the collaborative testing and tuning team.

Test manager

The test manager is responsible for managing the testing effort as a whole and reports to the project manager. In many cases, the testers, business analysts, and configuration managers report to the test manager.

Performance-related focus

In most cases, the test manager serves as the supervisor of the performance test engineer and is therefore responsible for the overall performance-testing effort. In my opinion, this often causes a conflict of interest. Many of the test managers I've worked for over the years have been very focused on the typical tuning approach outlined earlier, demanding that all performance results go through them to the project manager, who then passes the information to the lead architect to assign a developer to the reported issue. In this scenario, the reverse is often true as well. When the developer requests a specific test to validate performance, the request has to go through the lead architect to the project manager and then to the test manager before it reaches the performance test engineer. In my experience, many test managers find it difficult to adopt the collaborative tuning approach. For this reason, I recommend sitting with the test manager, the project manager, and the lead developer/architect long before performance testing actually starts, in order to define roles.

Contribution to collaborative tuning

In the best case, the test manager is the first-level manager who helps get support and resources for the performance test engineer. In most cases, this person serves as the process, schedule, and documentation enforcer.

Performance test engineer

By this point in the series, I think you've got a pretty clear picture of the performance test engineer's role. In summary, this is typically one person who's responsible for the following:

• developing the overall performance test strategy
• collecting and quantifying the performance requirements
• determining and documenting the user community model(s)
• creating scripts representing the user community model(s)
• executing the scripts and analyzing the results
• working with the developers as part of the collaborative testing and tuning team

Performance-related focus

The focus of the performance test engineer is simply to design, develop, execute, analyze, and report on the results of performance tests that accurately represent conditions likely to occur in production, and to assist in the tuning effort when these results fall short of requirements and/or expectations.

Contribution to collaborative tuning

The chief contribution of the performance test engineer consists of the tests themselves, the interpreted results of those tests, and recommendations based on those results.

Outside experts

Outside experts obviously aren't a permanent part of the team. Sometimes, however, a performance issue is uncovered that no permanent member of the team has the skills to resolve. Outside experts are typically treated as consultants to the project, even if they're employees of the company.

Performance-related focus

Outside experts are focused on the very specific issue they were retained to resolve. They tend to be retained to configure and tune new third-party software or hardware.

Contribution to collaborative tuning

The outside expert contributes a performance improvement for a specific issue that the testing and tuning team detected but was unable to resolve on its own.

How to get more involved

Unless you're extremely lucky, you're not currently part of a well-formed testing and tuning team. If you're not, the following suggestions will help you become a part of, or develop, that team. If you're lucky and are already a member of such a team, you may want to take personal inventory and see if you could make improvements in any of these areas.

Know the technologies

The single most important thing you can do to gain the respect of the development staff and start building a collaborative environment is to know about the technologies involved in the application you're working on. Being able to participate in technical discussions and ask intelligent questions will immediately promote you from "just a tester" to a "techy." For example, I became a "techy" about three weeks into a recent project when I asked, "Does it really add a lot of value to SSL-enable this application when it will be deployed on a secure network and all the data transmitted across the network will be in compressed format?" The question led to a lengthy discussion about performance versus real security versus policy. The outcome of the discussion wasn't particularly important. What was important was that I presented a technical issue intelligently and was able to support my position and generally contribute to the technical aspects of the conversation. That was the point in that project when I ceased being "just a tester."

Attend meetings

As you might imagine, I would never have had the opportunity to participate in the discussion I just described if I didn't attend meetings, particularly design meetings. If you're thinking, "But meetings are a waste of time!" I generally agree with you. The key is to attend the right meetings. I try to avoid administrative meetings if at all possible. However, I always try to attend meetings where the developers will be. I've found that I can learn more about the application I'm testing during an hour-long architectural review meeting than I can in an hour of reading documentation or exploration. As a side note, when you do attend meetings, make sure that you participate in the meeting. Attending as a bump on a log doesn't help establish you as a "techy."

Educate your team

The next most important thing you can do is educate your entire project team about performance testing. Most folks involved in the software development process know very little about performance testing, either conceptually or technically. I have a series of one-hour workshops that I like to do for a new project team. I find that the more of these I'm able to present, the more collaborative the effort becomes. Here are the titles of those workshops:

• "Performance Engineering Overview" (general audience)
• "Collecting and Quantifying Performance Requirements" (general audience)
• "Determining the User Community Model(s) to Be Tested" (general audience)
• "Understanding Protocols and Script Creation" (technical audience)
• "Interpreting Basic Reports" (general audience)
• "Collecting and Analyzing Performance Data" (technical audience)
• "Performance Unit Testing" (technical audience)

While all of these topics are important and valuable, the ones that pay the largest dividends are "Understanding Protocols and Script Creation" and "Collecting and Analyzing Performance Data." In my experience, the vast majority of developers come to these sessions thinking they already know what I'm going to say, and leave saying, "Wow! This is really cool! I've got a bunch of ideas on how this can help me. When do you have time to talk about it with me?" As soon as I hear the words "Wow, this is really cool!" I know I've laid the groundwork for a collaborative testing and tuning team.

Ask questions

I've mentioned it briefly in other articles, but I want to stress it here: ask smart questions about things you don't know. A healthy thirst for knowledge will establish you as a person who's interested in the details of the application, not just the surface-level application interface. Many developers assume that testers don't understand or care about how the application works as long as it passes the documented test cases. It's critically important to demonstrate that as a performance test engineer you're interested in knowing, and capable of understanding, how the application works.

Offer assistance

You have skills and tools at your disposal that the development team doesn't have and that can assist them in a variety of ways, some of which you'll probably never even think of. Regardless of who your direct supervisor is, you should make your skills and tools available to the development team. Sometimes you can record a script in a few minutes that provides feedback that would otherwise take them hours, days, or longer to collect. One example is load balancer configuration testing. If you have an environment available to generate load from multiple IPs, in a matter of minutes you can create a script that will validate the configuration of the load balancer. Without that script, the only other way to test that configuration may involve a large number of coordinated manual users in a test that could never be repeated exactly.

Be available and approachable

As a performance test engineer, you probably work in a room by yourself with a whole bunch of computers in it, on the other side of the building from the rest of your project team. While this is a blessing in some ways, it tends to lead to your being the person who's "out of sight, out of mind." Don't allow that to happen. Once people start forgetting that you're part of the project team, you can't be effective there. Even if you're in a different part of the building or in a different building entirely, ensure that you're available and approachable. If you're doing all of the things I've discussed in this article, you're already doing this. I list this suggestion separately mostly to remind you not to allow yourself to become isolated.

Tear down the QA-versus-development barrier

All in all, you're trying to tear down the perceived barrier between QA and development. This barrier has been built up in most organizations over a long period of time for a variety of reasons, not the least of which is that the two groups have very different skill sets. As a performance test engineer, you probably have a skill set that's a hybrid of those skills typically associated with each group. The bottom line is that if you don't work with both groups, you'll never be able to do your whole job. It's up to you to work on both sides of this barrier and deal with any side effects of not fully aligning with one group or the other. Don't expect someone else to do it for you.

Summing it up

Collaborative testing and tuning teams are very powerful when it comes to detecting and resolving performance-related issues. However, these types of teams are still rare and often resisted. By involving yourself with the developers and honing your technical skills, you should be able to establish this kind of collaboration and ultimately realize the value of this team approach. In the next three articles, we'll look at specific things you can do to assist with testing and tuning in common areas of poor performance.

Beyond performance testing Part 12: Testing and tuning on common tiers
Scott Barber, Performance testing consultant, AuthenTec

Summary: Web, application, and database servers are common tiers related to testing and tuning. This article explores typical causes of poor performance on these tiers and describes things you can do to assist in testing and tuning these areas.

This is the second of four articles on our final theme in this series, which I call "the performance testing and tuning team." As we've discussed throughout this series, performance testers can provide exponentially more value to the primary tuners if they have a solid working knowledge of the systems and technologies they're testing. Accordingly, this article and the final two explore common areas of poor performance and describe things you can do to assist in the testing and tuning of these areas. While these three articles won't make you a tuning expert, I hope they'll make you more aware of some of the things that you can do to add value to the team. These final three articles will also introduce many topics in passing that you may want to research in more detail. Where appropriate, I'll recommend reference material for further research. Here's what we've covered so far in this series:

• Part 1: Introduction
• Part 2: A performance engineering strategy
• Part 3: How fast is fast enough?
• Part 4: Accounting for user abandonment
• Part 5: Determining the root cause of script failures
• Part 6: Interpreting scatter charts
• Part 7: Identifying the critical failure or bottleneck
• Part 8: Modifying tests to focus on failure or bottleneck resolution
• Part 9: Pinpointing the architectural tier of the failure or bottleneck
• Part 10: Creating a test to exploit the failure or bottleneck
• Part 11: Collaborative Tuning

This article focuses on Web, application, and database servers, which are common tiers related to testing and tuning. All Web-based systems have a Web or presentation tier, and most Web-based systems that are at all interactive, such as e-commerce sites, include some kind of application or business logic tier and a data storage tier. As we discussed in Part 9, tiers don't always correlate with physical machines but are logical distinctions that often correlate with hardware and/or software applications. Here we discuss specific issues related to these tiers. This article is intended for mid- to senior-level performance testers and members of the development team who work closely with performance test engineers. You should have read at least Parts 9 and 11 before reading this article; it would be helpful to have read Parts 5-8 and Part 10 as well.

The Web or Presentation Tier

For a Web-based system, the Web or presentation tier is generally the least complicated tier and the easiest to both test and tune. Of course, this doesn't imply that it should be ignored. At least half of the Web applications I've been involved in testing have required some testing and tuning specific to the Web server, and often some very simple Web server-specific testing can have profound effects on the production system. Some of this is included in the kinds of performance testing we've already discussed, but there are also other ways to test particular attributes of a Web server that can add significant value to the analysis and tuning processes.

Testing

During every performance test that you execute against a Web application measuring end-to-end response time, you exercise the Web server. With very rare exceptions, all of the requests sent by your user experience scripts are sent to the Web server. Analyzing those individual requests and response times to identify pages that are loading more slowly than expected is a great way to start testing this tier, and it's one way to determine whether more Web server-specific testing would be worthwhile. Instead of looking at your timers (page load times), look at the response times for individual commands, each of which represents a particular interaction with the Web server. You'll usually find that most commands have very short response times. The few that have longer times may be worth looking at more closely. (A small sketch for automating this first pass follows the list below.) The first thing to look at is whether those commands trigger activities that take place exclusively on the Web server or whether they trigger activities on other tiers before a response is sent back to the client. I discussed how to determine which actual GET or POST is related to a command ID in Part 8 of this series (see the section entitled "Evaluate Commands with Slow Responses"). Loading graphics and static HTML pages generally exercises only the Web server, so those are the specific activities we're interested in. If one of these activities contributes significantly to the "too slow" total page load time, the Web server is a viable place to look for tuning opportunities. Some specific indications that the Web server is a significant contributor to slow page load times are as follows:

• small graphics (under 50 KB) returning slowly
• all graphics returning slowly
• header files returning slowly
• a significant and sudden increase in response times at a specific and identifiable load below what the Web server was expected to handle
• significant resource utilization (that is, CPU or memory usage) at loads that the Web server was expected to handle
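If you want to automate that first pass over command-level response times, a short script can do the sorting for you. This is only a sketch under assumptions: it presumes you've exported per-command results to a CSV file, and the file name, column names, and threshold are hypothetical placeholders rather than the output of any particular test tool.

# flag_slow_commands.py -- minimal sketch, not tied to any particular test tool.
# Assumes a hypothetical CSV export with columns: command_id, response_ms
import csv
from collections import defaultdict
from statistics import median

THRESHOLD_MS = 500  # arbitrary cutoff; adjust to your own requirements

def flag_slow_commands(path="results.csv", threshold_ms=THRESHOLD_MS):
    times = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            times[row["command_id"]].append(float(row["response_ms"]))
    # Report each command whose median response time exceeds the threshold.
    slow = {cmd: median(vals) for cmd, vals in times.items()
            if median(vals) > threshold_ms}
    for cmd, med in sorted(slow.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{cmd}: median {med:.0f} ms over {len(times[cmd])} samples")
    return slow

if __name__ == "__main__":
    flag_slow_commands()

Whatever tool you use, the point is the same: rank the individual commands, not just the page timers, before deciding where to dig.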

If none of these indications apply, it's unlikely that the Web server is the bottleneck, and therefore it probably doesn't deserve additional attention at this time. If one or more of these indications apply, it may make sense to develop some specific tests targeted against the Web server. My approach to this is as follows. I first ask a developer to put these files on the Web server and provide me with a valid URL to access them from the machine I'm using to record scripts:

• an HTML document containing only unformatted text (approximately one screen's worth)
• graphics files of various sizes (1 KB, 10 KB, 50 KB, and 100 KB are common, but use what you have easily available as long as you note the actual file sizes)
• one HTML document for each graphic, containing only that graphic
• one HTML document containing all of the graphics

Once this is done (which should take someone with full access to the Web server less than an hour), simply record a script that requests each of these objects sequentially (you can put each request in a timer for easy reporting) and make appropriate modifications for delays. Because there won't be any links to follow, you'll have to record this script by physically typing the URL from the developer into the browser's address bar. Before running this script, you'll want to disable caching on the Web server and on the client/script. This will force the script to get the actual file each time. Then execute the script, looped 10 or more (I prefer 100 when time permits) times with a single user. This will give you various single-user response times to use as a comparison, both to one another and to future tests. While this test is executing, the Web server administrator will likely want to monitor resource usage on the Web server. If the test doesn't provide enough insight, you can reexecute it with increasing numbers of users. While this test has little to no realism, it isolates the Web server in a way that helps determine if it's a bottleneck and provides information that may be critically helpful in tuning. Remember that it may be helpful to turn on, or increase the level of, logging on the Web server.
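For a quick sanity check outside your load-testing tool, the same single-user probe can be sketched in a few lines of Python. This isn't a substitute for the recorded script described above: the URLs are hypothetical placeholders for the files your developer staged, and you still need caching disabled on the server and client as noted.

# static_asset_probe.py -- minimal single-user probe of Web-tier-only objects.
# URLs are placeholders; substitute the files staged on your Web server.
import time
import urllib.request

URLS = [
    "http://webserver.example.com/perftest/text_only.html",
    "http://webserver.example.com/perftest/image_1k.gif",
    "http://webserver.example.com/perftest/image_10k.gif",
    "http://webserver.example.com/perftest/image_50k.gif",
    "http://webserver.example.com/perftest/image_100k.gif",
]
ITERATIONS = 10  # 100 or more gives smoother numbers when time permits

def fetch(url):
    req = urllib.request.Request(url, headers={"Cache-Control": "no-cache"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
    return (time.perf_counter() - start) * 1000, len(body)

for url in URLS:
    samples = []
    for _ in range(ITERATIONS):
        elapsed_ms, size = fetch(url)
        samples.append(elapsed_ms)
    print(f"{url}: {size} bytes, "
          f"min {min(samples):.0f} ms, max {max(samples):.0f} ms, "
          f"avg {sum(samples)/len(samples):.0f} ms")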

Note that if all Web server transactions seem to be slow and there doesn't seem to be anything tuned poorly on the Web server, you may want to ping the Web server from your client. The latency may actually be in the network, not the Web server. We'll discuss this in detail in Part 13 of this series.

Tuning

Tuning the Web server is generally the responsibility of the system administrator, but it may help for you to know some common things that may need to be tuned. There are several good books and resources about tuning parameters for each of the countless Web servers on the market, such as Apache, IIS, iPlanet, and Netscape Enterprise Server. One example is the Web site dedicated to configuring or tuning Apache. In my experience, these are the things that commonly need to be tuned on a Web server if it turns out to be the cause of the bottleneck:

• number of processes
• number of server threads
• allowable open/persistent connections
• session duration
• caching
• logging
• operating system parameters (that is, virtual memory settings and such)
• hardware resources (that is, disk I/O, RAM, and the like)
• partition configurations
• other programs/processes/software on the same machine

This isn't an exhaustive list, just some common places to look as you get started in your tuning process. For further information, I recommend two books: Web Performance Tuning: Speeding Up the Web by Patrick Killelea and Speed Up Your Site: Web Site Optimization by Andrew B. King. These books are great introductions to tuning various components of Web-based applications. Web Performance Tuning contains specific advice regarding hardware, operating systems, and software common in 1998. While some of the specific information may be outdated, the concepts presented in this book are still valid. Speed Up Your Site is about optimizing the code for individual Web pages. This is great information when you find that your download times are slow, or that your bandwidth utilization is high even though your server response times are fast.

The Application or Business Logic Tier

The application tier of a Web-based system is generally the tier that contains the business logic. This can be thought of as the brains of the application, where input taken from the user (via the Web server) is processed in some way. For instance, this is where search parameters are converted into valid SQL statements to be passed to the database, or shipping costs and applicable taxes are calculated for an e-commerce site. Testing and tuning this tier is significantly more complex than testing and tuning the Web or presentation tier.

Testing

Determining whether tests specific to the application tier are required or useful follows basically the same process as for the Web tier. Let the user experience tests you've already created exercise the application tier while you observe resource utilization. The system administrator can assist with that. If you (or the administrator) see that resources seem to be stretched too thin, you can start limiting or modifying your existing user experience tests to narrow your search for the offending pages and/or requests. If there are no suspect resource utilization metrics, it's time to look more closely at your page load times. If a particular subset of pages is exhibiting unacceptable slowness, you'll need to determine what's unique about them. You can identify the underlying GET or POST of each command that has a slow response in order to evaluate whether these pages have requests for certain objects (like graphics) in common. If this doesn't point conclusively to something known to be handled by a different tier (like an excessively large graphic or file download), your next step is to find out if the offending request is accessing and causing activity on the application tier. This is often a collaborative process with the development team. Once you determine that slow pages do access the application tier, you should start the bottleneck detection process discussed in detail in Parts 7-10 of this series. Unlike for the Web tier, where you'll be able to recommend tests to the development team that might add value, you'll likely have to depend on them to turn on logging during your tests of the application tier and to tell you which tests will be most helpful. Often these tests will require the techniques for collecting data by tier discussed in Part 9.

Tuning

The task of tuning the application tier is often shared by the tester, the system administrator, and the development team. It's often a balancing act involving modification and configuration of the hardware, the operating system, the application server software, and custom code. As the performance tester, you can expect to be asked to create complex tests and to execute them many, many times as you try out different combinations of configurations and modifications. While you most likely won't know enough to recommend many different configuration options, one important way you can contribute is to take notes and document the results of making changes (a minimal note-keeping sketch follows the list below). This is actually extremely useful to the administrators and developers, and it keeps you actively involved with the team. The following is a generic list of things to think about when the application tier has been identified as the bottleneck:

• hardware resources (that is, disk I/O, RAM, and so on)
• operating system parameters (that is, virtual memory settings and such)
• logging
• management of processes and/or threads
• business logic algorithms
• custom code
• partition configurations
• other programs/processes/software on the same machine
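As for the note-keeping mentioned before the list, even something as simple as appending one line per test run to a shared file is enough. Here's a minimal sketch; the file name and field layout are arbitrary choices, not a prescribed format.

# tuning_log.py -- append a one-line record per test run so configuration
# changes and their measured effects stay paired. Field names are arbitrary.
import csv
from datetime import datetime

LOG_FILE = "tuning_log.csv"  # hypothetical shared location

def log_run(change_description, avg_response_ms, notes=""):
    with open(LOG_FILE, "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.now().isoformat(timespec="seconds"),
            change_description,
            f"{avg_response_ms:.0f}",
            notes,
        ])

# Example: record the effect of a hypothetical thread-pool change.
log_run("app server thread pool 25 -> 50", 1840,
        "login page improved, search unchanged")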

Once again, there are countless books and resources dedicated to configuration and tuning of specific application servers, such as IBM WebSphere, JBoss, and BEA WebLogic. One example of a book dedicated to configuring or tuning the IBM WebSphere application server on a specific hardware platform is Java and WebSphere Performance on IBM iSeries Servers by IBM Redbooks.

The Data Storage Tier

Data storage tiers very often hide significant bottlenecks that don't present themselves until performance testing. DBAs typically test and tune databases for expected single-user scenarios, whereas it's the multiuser scenarios and/or data variance involved in performance testing that commonly bring database bottlenecks to light.

Testing

Determining whether the database is a bottleneck starts in exactly the same way as for the application tier. First look at the slow page loads and find out whether those pages involve database access. If they do, ask the DBA to monitor the database during tests. Make certain that the DBA understands which data you're using and the sequencing of the activities in the test. Often you'll be asked to vary your testing to help the DBA monitor and determine the cause of the slowness. A common request will be to vary the load or the test configuration to determine if the bottleneck is related to multiple users accessing the same row or table.

You may find yourself using some of the techniques from Part 9 to isolate the data storage tier. This will help you determine if the bottleneck is actually on that tier or just appears to be because of a problem like inefficient SQL generation on another tier.

Tuning

Database tuning can be extremely complex and varies dramatically from one database to another. Most enterprise databases have so many interrelated configuration options that there are literally millions of configurations possible, so it's virtually impossible to follow the scientific method of making one change at a time. Even an experienced DBA often has to follow an educated trial-and-error process. You can help most by designing tests that are easily reexecuted, monitoring end-to-end response times, and taking notes -- just like for the application tier. Here are some common areas to think about when the database turns out to be the bottleneck:

• hardware resources (that is, disk I/O, RAM, and so on)
• operating system parameters (that is, virtual memory settings and such)
• logging
• query optimization
• indexes (see the sketch after this list)
• stored procedures
• row and/or table locking
• complex joins versus temporary tables
• partition configurations
• other programs/processes/software on the same machine
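To illustrate why indexes appear on nearly every database tuning checklist, here's a tiny self-contained demonstration using SQLite. It isn't meant to model your production database -- the table, row counts, and query are invented -- but the before-and-after timing pattern is the one you'll typically see when a frequently filtered column gains an index.

# index_demo.py -- small illustration (using SQLite) of the effect of an index
# on a repeated lookup. Production databases differ in detail, not in pattern.
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, i % 5000, i * 0.1) for i in range(200_000)])
conn.commit()

def timed_lookup(label):
    start = time.perf_counter()
    for cust in range(0, 5000, 50):  # repeat the lookup to get a measurable time
        conn.execute("SELECT COUNT(*) FROM orders WHERE customer_id = ?",
                     (cust,)).fetchone()
    print(f"{label}: {(time.perf_counter() - start) * 1000:.1f} ms")

timed_lookup("full-table scans (no index)")
conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")
timed_lookup("indexed lookups")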

Some commonly used databases are DB2, Oracle, Sybase SQL Server, and Microsoft SQL Server. Doing a Yahoo search on "DB2 tuning" yielded more than 121,000 entries. I scrolled through the first 200 and found every one to be potentially useful to someone wanting to tune that particular database. "Oracle tuning" yielded more than 250,000 hits. As you can see, a lot has been written about database tuning, so don't expect to become an expert in this area overnight.

Summing It Up

Testing and tuning common tiers is generally an iterative, collaborative process. There are countless possible areas to be configured and/or tuned. Through experience, you'll come to recognize what some symptoms indicate. But no matter how experienced you become, you'll find that you can help the most by developing and executing tests with the idea of providing value to the developers and administrators and by keeping a notebook of changes and associated results. You can increase your ability to contribute to test creation and tuning by researching published tuning tips specific to the hardware and software being used by the application you're testing.

Beyond performance testing part 13: Testing and tuning load balancers and networks
Scott Barber, Performance testing consultant, AuthenTec

Summary: Load balancers and networks shouldn't be causing performance problems or bottlenecks, but if they are, some configuration changes will usually remedy the problem. This article shows you what to look for and what to do.

In this installment of our final theme in this series, which I call "the performance testing and tuning team," I'll discuss testing and tuning load balancers and networks. These are generally pretty easy to test, but the testing methods are fundamentally different from those discussed in Part 12 for the common tiers. Load balancers and networks shouldn't actually be causing performance problems or bottlenecks, but if they are, some configuration changes will usually remedy the problem. Once again, this article won't turn you into an expert on these topics, but I hope it will help you make a greater contribution to the testing and tuning team when faced with bottlenecks in these areas. Here's what we've covered so far in this series:

• Part 1: Introduction
• Part 2: A performance engineering strategy
• Part 3: How fast is fast enough?
• Part 4: Accounting for user abandonment
• Part 5: Determining the root cause of script failures
• Part 6: Interpreting scatter charts
• Part 7: Identifying the critical failure or bottleneck
• Part 8: Modifying tests to focus on failure or bottleneck resolution
• Part 9: Pinpointing the architectural tier of the failure or bottleneck
• Part 10: Creating a test to exploit the failure or bottleneck
• Part 11: Collaborative Tuning
• Part 12: Testing and Tuning Common Tiers

This article is intended for mid- to senior-level performance testers and members of the development team who work closely with performance test engineers. You should have read at least Parts 9 and 11 before reading this article; it would be helpful to have read Parts 5-8 and Part 10 as well.

Load Balancers

Load balancers are conceptually quite simple. They take the incoming load of client requests and distribute that load across multiple server resources. When configured correctly, a load balancer rarely causes a performance problem, but my experience shows that load balancers are often not configured correctly, and this results in very poor performance. The only way to ensure that a load balancer is configured properly is to test it under load before it's put into use in the production system. The bottom line is that if the load balancer isn't speeding up your site or increasing the volume it can handle, it's not doing its job properly and needs to be reconfigured. Before I describe how to test load balancers, I'll refresh your memory of the basic concepts behind load balancing and load balancers.

A Refresher on Load Balancers

Let's start by reviewing where the load balancer fits in the physical architecture. As you can see in Figure 1, the load balancer is generally the last stop a request makes before reaching a Web server.

Figure 1: Typical physical architecture with load balancer

This diagram shows the load balancer as being separate from the firewall. Sometimes the same physical device serves as any combination of firewall, proxy server, router, and/or load balancer. Sometimes the load balancer is an actual hardware device with the software embedded (sometimes referred to as a content switch), while other times the load balancer is just software installed on a machine of your choosing. On top of the different hardware and software configurations, there are many different ways that a load balancer distributes load. The simplest is known as the round-robin method. With this method, the load balancer simply takes each incoming request and sends it to the next Web server. For instance, if Figure 1 represented our actual environment, the first request to reach the load balancer would be directed to Web server 1, the second to server 2, the third to server 3, the fourth to server 4, and the fifth back to server 1. The round-robin form of load balancing completely ignores the concept of user sessions, so during a single session one user could be passing requests through several different Web servers. For an e-commerce application, this won't work. For these types of applications, on sites that use sessions, the load balancer needs to be able to identify a particular user and keep that user pointed to the same Web server throughout the entire session. Even this doesn't ensure a balanced load on each server. If all of the users assigned to Web server 1 spent hours using the application and all the other users spent just minutes, Web server 1 could get overloaded while the other servers went underutilized. That's why many load balancers have dozens of load-balancing options and algorithms. They may balance by total traffic volume, they may monitor the resource utilization on each Web server to decide which server is least utilized at the moment the next request is received, or they may exercise any number of other possibilities. For more information about a few currently popular hardware-based load balancers (including the balancing algorithms they support), see the Network Computing site.
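To make the round-robin method concrete, here's a minimal sketch of that dispatch logic. The server names are placeholders, and a real content switch layers health checks, session affinity, and weighting on top of something like this.

# round_robin.py -- the simplest load-balancing policy: each request goes to
# the next server in the list, wrapping around at the end.
from itertools import cycle

servers = cycle(["web1", "web2", "web3", "web4"])  # placeholder server names

def dispatch(request_id):
    target = next(servers)
    print(f"request {request_id} -> {target}")
    return target

for i in range(1, 6):
    dispatch(i)
# Request 5 wraps back around to web1, just as described above.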

Testing

If you've modeled a realistic workload, your performance test is going to be the best indicator of whether the load balancer is configured in the best possible way for your site. In fact, the only other way to see if the load balancer is configured properly is to release it into production and monitor the traffic.

In order to exercise the load balancer in a way that's useful for determining if it's configured properly, you'll need to make your virtual testers appear to be coming from unique IP addresses (since most load balancers identify users by IP address). If the load balancer doesn't recognize your virtual users as being unique users, it's not likely to balance the load properly. There are only two ways to make your virtual testers appear to be coming from unique IP addresses: use a lot of agents or enable IP aliasing through TestManager. Having lots of agents isn't always practical, so that leaves IP aliasing. In case you're not familiar with it, IP aliasing allows many IP addresses to be assigned to the same physical system. Every virtual tester can be assigned a different IP address to realistically emulate your user community (although you don't necessarily need one IP address per virtual user, as I'll explain in a minute). The requests generated by these virtual testers receive responses back from the Web server with timing characteristics and validation recorded intact. As explained by the TestManager Users Guide, if IP aliasing is enabled, the TestManager software on each computer (local or agent) queries the system for all available IP addresses at the beginning of a run. Each suite scheduled to run on that computer is assigned an IP address from that list, in round-robin fashion. If a computer has more virtual testers than IP addresses, an IP address is assigned to multiple virtual testers. If a computer has fewer virtual testers than IP addresses, some IP addresses aren't used. This approach optimizes the distribution of IP addresses regardless of the number of virtual testers scheduled on a computer and frees you from having to match IP addresses to specific virtual testers. IP aliasing takes effect only if you're running HTTP test scripts and your system administrator has configured your system for IP aliasing. While this configuration may not be technically difficult, it requires a knowledge of which IP addresses are both valid and available on the network you're testing from. For Windows NT, the administrator sets up the IP addresses on any particular computer with the Settings > Control Panel > Network > Protocols > TCP/IP Protocol > Properties > Advanced > IP Addresses > Add button. For UNIX, this can be done with the ifconfig command line utility. But you don't need to worry about any of that. All you need to do is to enable IP aliasing through TestManager. If you've already created your suite, simply choose Suite > Edit Runtime from the menu bar and check the box as shown in Figure 2.

Figure 2: Enabling IP aliasing through TestManager
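The round-robin assignment of IP aliases described above is easy to picture with a small sketch. This is only an illustration of the idea, not TestManager's actual implementation, and the addresses are placeholders.

# ip_alias_assignment.py -- illustrative round-robin pairing of virtual testers
# with the IP aliases available on a load-generating machine.
available_ips = ["192.168.10.11", "192.168.10.12", "192.168.10.13"]  # placeholders

def assign_ips(num_virtual_testers, ips):
    # If there are more testers than IPs, addresses are reused;
    # if there are more IPs than testers, some addresses go unused.
    return {f"vtester_{n:03d}": ips[n % len(ips)]
            for n in range(num_virtual_testers)}

for tester, ip in assign_ips(5, available_ips).items():
    print(tester, "->", ip)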

To test a load balancer, you don't necessarily need one IP address for every virtual tester, as mentioned above, but I recommend having at least four times the number of IP addresses as you have Web servers behind the load balancer. In the case of the environment depicted in Figure 1, I would recommend a minimum of sixteen IP addresses (four IPs times four Web servers) via any combination of agents and/or IP aliasing. Make certain you talk to the administrator in charge of configuring the load balancer. That person will be better able to tell you what a realistic sampling size will be for your environment. Generally, you'll want to test a load balancer under a fairly high load. It often makes sense to do some performance testing and tuning on a single Web server (and associated back-end components) before adding the load balancer and other "legs" of the load-balanced environment (see Figure 3) to ensure that the rest of the system can handle the load that you'll be applying to the load balancer. If the rest of the application is unable to handle the load, you won't be able to effectively determine whether performance issues observed during the test are related to the load balancer or something else in the system.

Figure 3: Single "leg" of a load-balanced environment (indicated by red lines)

Most teams believe they can stop testing after they've tested this single leg. They assume that if the application server has more than half its resources still available and if the database and report servers have more than 75% of their resources available when the Web server reaches its target load, adding the rest of the servers in this environment with a load balancer will simply multiply that target load by four. This is absolutely a faulty assumption. Just because one leg can handle a certain load doesn't mean that more legs will be able to handle more load. The actual environment must be tested. Even under ideal conditions, adding legs rarely, if ever, results in simple multipliers of performance volume. Remember when testing a load-balanced environment to have someone monitoring not only the load balancer but also each Web server to ensure that it's handling its share of the load. Most of this type of testing isn't about response time but rather about ensuring that the load balancer is configured correctly and determining the overall load the entire load-balanced system can handle. Ensure that your testing and monitoring reflect that focus.

Tuning

This section could probably be titled "Configuring" rather than "Tuning." Since the majority of popular load balancers today come embedded on their own hardware, there's very little hardware tuning to be done. It's often the case, however, that additional configuring is required to ensure that all of the available Web servers are receiving an even portion of the generated load. It may be that the type of load that's being generated requires a different balancing algorithm than the one the load balancer is currently using. This is generally up to the administrator who handles the load balancer to decide. It's our job simply to provide an accurate load and to help monitor the Web servers to give the administrator the information necessary to make any configuration changes that improve the distribution of the load.

Networks

You may remember that in Part 7 I asserted some rules for bottlenecks. One of those rules was: "The bottleneck is more likely to be found in the hardware than in the network, but the network is easier to check." I felt confident in making this assertion for two reasons. First, you don't need to have accurate performance scripts or a working application to test the network, and second, virtually all networks have an administrator, and that administrator has tools to help you test the network.

Testing

The simplest way to start testing a network is to "ping" the remote server from your client. To use the ping utility in Windows, simply go to the Start menu, choose Programs > Accessories > Command Prompt, and type "ping [URL or IP to ping]" at the command line. Your results will look something like this:

C:\WINDOWS>ping www.testsite.com

Pinging www.testsite.com [17.2.240.92] with 32 bytes of data:

Reply from 17.2.240.92: bytes=32 time<10ms TTL=63
Reply from 17.2.240.92: bytes=32 time=10ms TTL=63
Reply from 17.2.240.92: bytes=32 time<10ms TTL=63
Reply from 17.2.240.92: bytes=32 time<10ms TTL=63

Ping statistics for 17.2.240.92:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 5ms, Maximum = 10ms, Average = 7ms

As you can see, the ping utility sent 32 bytes of data to www.testsite.com. This was done four times. One time, the round trip took 10 milliseconds (ms); the other times, it took less, for an average round trip time of 7 ms. In this case, we can see that the network between this client and www.testsite.com is very fast. If this ping were slow, it would be time to contact your network administrator. The other DOS utility that can be useful in testing networks is tracert (short for trace route). Using tracert at the command line for our test site yielded the following results:

C:\WINDOWS>tracert www.testsite.com

Tracing route to www.testsite.com [17.2.240.92] over a maximum of 30 hops:

  1     1 ms   <10 ms     1 ms  10.215.3.1
  2     4 ms     2 ms     2 ms  129.71.200.254
  3     3 ms     4 ms     4 ms  129.71.200.1
  4     4 ms     4 ms     3 ms  129.71.254.1
  5     8 ms     8 ms    10 ms  207.68.7.18
  6    32 ms    34 ms    41 ms  209.158.31.249
  7    43 ms    37 ms    41 ms  205.171.24.85
  8    51 ms    58 ms    47 ms  205.171.5.233
  9    59 ms    58 ms    58 ms  205.171.30.10
 10    56 ms    45 ms    43 ms  205.171.30.14
 11    46 ms    51 ms    41 ms  38.7.135.1

Trace complete.

This trace shows that a request took 11 hops to get from my location to the destination. Adding up the durations of all these hops, I come up with roughly 300 ms, or 0.3 seconds. Based on this information, I can pretty safely assume that about a third of a second of the time required for each request to reach its destination is due to network latency. That could really add up for a complex Web page. Additionally, if this were a test that I did from inside the firewall, 11 would be an excessive number of internal hops and would indicate a problem that should be addressed. You really don't want every HTTP request bouncing around on your internal network for 11 hops before getting to your Web server! Both of these simple tests were done under no load. It's often useful to run these same tests during your test execution to see what effect executing the test has on the results. If these tests yield odd results, or if you've conducted these tests as well as other tests and can't find the slowdown, it's probably time to get the network administrator involved. The network administrator will normally have tools to show how much network bandwidth is being used, what the collision rate is, and what the resource utilization rates are on network components (like routers). The admin should even be able to "sniff" the network to get a clearer picture of what's going on. Find out what tools the admin has available and what kinds of load you should generate to best help the admin either track down bottlenecks on the network or eliminate the network from the list of potential bottleneck suspects.

Tuning

Tuning the network or network components is almost entirely in the hands of the network administrator. The best you can do is be at the admin's disposal to execute tests to verify any changes this person makes, or to develop a specific test to help pinpoint a problem. Even if you're a networking expert, it's unlikely that you have the specific information about the network the system under test is configured on to be able to help. From where we sit, it's nearly impossible to tell whether the slow spot we're encountering is a router, a routing table, a proxy server, a firewall, or an overutilized network segment. Occasionally, the network can become congested, and that's a major (and expensive) problem to fix. Luckily, most network administrators are aware when the network is getting full, so organizations can plan in advance.
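If you'd rather capture the "repeat these checks during test execution" suggestion above in a log than watch a console, a small script can sample ping on a schedule while the test runs. The host name, interval, and sample count below are placeholders, and the sketch assumes the standard ping command is on the path.

# ping_sampler.py -- sample network round-trip behavior while a load test runs.
# Host, interval, and sample count are placeholders; adjust to your test window.
import subprocess
import sys
import time
from datetime import datetime

HOST = "www.testsite.com"   # replace with your Web server
INTERVAL_SECONDS = 60
SAMPLES = 30                # roughly the length of your test execution

count_flag = "-n" if sys.platform.startswith("win") else "-c"

for _ in range(SAMPLES):
    result = subprocess.run(["ping", count_flag, "4", HOST],
                            capture_output=True, text=True)
    stamp = datetime.now().isoformat(timespec="seconds")
    # Keep only the summary line; the full output is noisy in a log.
    summary = [line for line in result.stdout.splitlines()
               if "Average" in line or "avg" in line]
    print(stamp, summary[-1].strip() if summary else "no summary line found")
    time.sleep(INTERVAL_SECONDS)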

Summing It Up

I've shown you a few simple things you can do to determine whether the load balancer is correctly configured and whether the network is causing any delays in response time. In both cases, making any changes in response to your findings is up to the administrator involved. Your role is to accurately model the load and monitor what goes on during your testing. As I've implied before, sometimes being a helper is as good as being a hero.
