This article is provided courtesy of STQE, the software testing and quality engineering magazine.
Tools & Automation

Three Web load testing blunders, and how to avoid them
by Alberto Savoia

Trade secrets from a Web testing expert

QUICK LOOK
■ Why concurrent users is a misleading metric
■ The impact of user abandonment
■ Accurately analyzing results
Load testing has rapidly become one of the top QA priorities for companies with mission-critical Web sites. How many users will your site be able to serve while still maintaining acceptable response times? That's an indispensable piece of information for planning marketing campaigns, estimating IT budgets, and basic delivery of service. And yet practically all Web site load tests are seriously flawed, because they all seem to make mistakes that have a huge impact on the accuracy of the test and the reliability of the results. Let's look at three of the biggest and most common Web load testing blunders, and how to avoid them.
1. Misunderstanding Concurrent Users

The first blunder centers on the widespread use of the concept of concurrent users to quantify loads and design load tests. I find it amazing that concurrent users is the prevailing metric when it comes to describing a Web load, because the approach is riddled with obvious problems. So many problems, in fact, that I could probably write a book about it; but fortunately for you, the space constraints of this article force me to be concise. In that spirit, I'll focus on the main problem with concurrent users: the number of concurrent users shouldn't be seen as input for a load test run, but as the result of a number of factors. And yet whenever you read a load testing plan, you probably see something like this: "The Web site will be tested with a load of 1,000 concurrent users."

To explain why this is the wrong way to look at things, let's do a simple thought experiment. Let's assume that three users, Alan, Betty, and Chris, visit a financial Web site to get stock quotes on three consecutive days, and that each of them plans to get three different stock quotes. None of our three users knows the others, so their actions on each day are completely independent and unsynchronized, like those of most Web users.

On Monday, Alan starts his session at 12:00, Betty at 12:01, and Chris at 12:02, and each of their sessions consists of four page requests (Home Page→Quote 1→Quote 2→Quote 3). After receiving each page, each user will spend 10 seconds looking at it before requesting the next page (in load testing parlance, this is called think time). If the Web site takes 5 seconds to respond to each page request, Alan's, Betty's, and Chris's sessions will not overlap. Alan will have received and read his four pages before Betty's session starts, and Betty will be done with her session before Chris starts his (see Figure 1). In this case, it's accurate to say that Alan's, Betty's, and Chris's sessions are not concurrent.
Let's now assume that by lunchtime on Tuesday the site has slowed down a little bit. Now instead of taking 5 seconds, each page request takes 10 seconds, but the time it takes the user to read each page remains constant. Each session will now last 80 seconds instead of 60. Alan's and Betty's sessions will overlap, and so will Betty's and Chris's. In this case we will have some periods of time with one user and some periods of time with two concurrent users.

On Wednesday, the Web site slows down even more. Now instead of taking 10 seconds, each page request takes 30 seconds. Each session will now last 160 seconds. Alan's and Betty's sessions will overlap, as will Betty's and Chris's, and for a while all three sessions will overlap, resulting in three concurrent users.
[FIGURE 1: Users' concurrency influenced by page request times. Session timelines for Alan, Betty, and Chris between 12:00 and 12:03 on Monday, Tuesday, and Wednesday, showing when each user is receiving or reading a page and when sessions overlap.]

As Figure 1 illustrates, the number of concurrent users is not a measure of load. The load was identical in all three cases: three users, starting a minute apart, viewing four pages each with 10 seconds of think time per page. The number of concurrent users was a result: a measure of the Web site's ability to handle a specific load. A slower Web site resulted in more concurrent users.

When it comes to measuring Web site scalability, it turns out that the number of concurrent users is not even that useful as an output result of load testing. If the Web site is somewhat slow, the number of concurrent users increases. If it's really slow, a lot of real users will abandon it, thus reducing the number of concurrent users. (More on user abandonment in the next section.) On the other hand, if a Web site is very fast, sessions will complete more quickly, also reducing the number of concurrent users. You see the problem? The bottom line is that concurrent users is a dangerously misleading metric that can be misused in so many ways that it's practically guaranteed to give you questionable results.

So what should you use in its place? For describing an input load, my favorite metric is user sessions started per hour. This metric offers a major advantage: the number of user sessions started per hour is a constant, unaffected by the performance of the Web site under test. If you've launched a big marketing campaign and expect to draw a peak of 10,000 user sessions per hour to your Web site, those users will come to the Web site and request the first page, regardless of whether or not your site can handle them. Whether they complete their sessions or not, however, depends on the Web site's ability to support the load with acceptable response time, and that's precisely what you want to find out with a load test, isn't it?
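To make the point concrete, here is a minimal Python sketch (not from any real test tool; the arrival pattern, think time, and response times are made-up parameters) that takes sessions started per hour as the fixed input and computes peak concurrency as an output. Run with the same load but slower response times, it produces more concurrent users, just as in the Monday/Tuesday/Wednesday example. Abandonment is deliberately ignored here, since it's covered in the next section.

```python
import random

def peak_concurrency(sessions_per_hour, pages_per_session=4,
                     think_time=10.0, response_time=5.0, seed=1):
    """Peak number of simultaneously active sessions during one hour.

    The input load (sessions started per hour) stays fixed; only the
    site's response time varies.  Concurrency comes out as a result.
    """
    random.seed(seed)
    # Unsynchronized arrival times within one hour, in seconds.
    starts = [random.uniform(0, 3600) for _ in range(sessions_per_hour)]
    session_length = pages_per_session * (response_time + think_time)

    # Sweep start/end events in time order and track the maximum overlap.
    events = [(t, 1) for t in starts] + [(t + session_length, -1) for t in starts]
    active = peak = 0
    for _, delta in sorted(events):
        active += delta
        peak = max(peak, active)
    return peak

if __name__ == "__main__":
    for rt in (5, 10, 30):   # same load, three different site speeds
        print(f"{rt:>2}s per page -> peak concurrent users:",
              peak_concurrency(sessions_per_hour=1000, response_time=rt))
```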
2. Miscalculating User Abandonment

Let's move on to another constantly overlooked load testing factor that has a huge impact on load testing results: user abandonment. Have you ever left a Web site because its pages were loading too slowly? Unless you have the patience of a Benedictine monk, I am sure you have. And since many people seem to have the attention span of a chipmunk when they're on the Internet, abandoned sessions are extremely common. Considering that the magnitude of user abandonment is going to be quite high at the load levels that are likely to be used in a load test, and that this abandonment will have a significant impact on the resulting load, I find it very surprising that most Web load tests don't even attempt to simulate this very common behavior with any degree of realism.

You should simulate user abandonment as realistically as possible. If you don't, you'll be creating a type of load that will never occur in real life, and creating bottlenecks that might never happen with real users. At the same time, you will be ignoring one of the most important load testing results: the number of users that might abandon your Web site due to poor performance. In other words, your test might be quite useless.
To explain why this is, let's perform another simple thought experiment. Let's assume that you want to test your Web site at a load of 10,000 concurrent users (just testing; I hope you've banished concurrent users from your vocabulary). As I was saying, let's assume you want to test your Web site at a load of 10,000 user session starts per hour. Let's also assume that if the home page response time is less than 5 seconds, no users will abandon the Web site because of performance. We'll say that as home page response time increases farther and farther beyond 5 seconds, more and more users will abandon. For example: 30% will abandon between 5 and 10 seconds, 45% between 10 and 15 seconds, and so on. Let's also assume that each complete user session consists of four pages.

In one scenario, let's assume that the Web site under test is able to handle the 10,000 user sessions per hour with a home page response time consistently below 5 seconds. In this case, all 10,000 users will complete their sessions and the Web site will have served a load of 40,000 pages.

In another scenario, let's assume that the Web site is not as scalable. When it's confronted with a load level of 10,000 user session starts per hour, the response time increases to 15 seconds per page. What happens in this case is not as straightforward. Initially, as the performance deteriorates, some users will start to abandon; but since this abandonment reduces the load, the performance will start improving again. As performance improves, fewer additional users will abandon, at least until the load increases again and the cycle repeats itself.

It's clear that in this example not all sessions will conclude happily; a number of people will abandon their sessions. That's a very important result, a critical piece of information that your load test should help you discover. After all, aren't you doing a load test to ensure that the Web site can serve a specific number of users at a specific performance level? User abandonment is a very important metric and a clear indication that the Web site is not able to operate satisfactorily at that load level.

Considering how critical abandonment is, it's surprising that most load tests are designed to use scripts that simulate abandonment only in extreme cases. (I most commonly hear of 60- to 120-second timeouts. Unfortunately, I don't know anybody in my immediate and extended family with that kind of patience; actually, I don't know anybody in my area code with that kind of patience.)

So, how do you implement realistic user abandonment? Here's a simple approach you can use to get started. When you write your load testing scripts, determine what the acceptable page response times would be for each type of page, and what the likely abandonment rates are going to be when they are exceeded. Then program each simulated user script to terminate if the response time for a page exceeds its pre-specified threshold. Table 1 shows a sample matrix in which to map out the possibilities.
TABLE 1: User abandonment matrix (estimated abandonment rates for different page types)

Page Type             % Abandonment   % Abandonment   % Abandonment   % Abandonment
                      (0–5 sec)       (5–10 sec)      (10–15 sec)     (15–20 sec)
Home Page                  0%              30%             45%             75%
Stock Quote                0%              15%             25%             45%
Stock Transaction          0%               0%              0%             15%
Account Information        0%               5%             15%             35%
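As a rough sketch of the "terminate when a threshold is exceeded" approach, the following Python fragment drives one simulated session from an abandonment matrix modeled on Table 1. The fetch_page function, the probabilistic interpretation of the percentages, and the simulated response-time distribution are illustrative assumptions rather than part of any particular load testing tool.

```python
import random

# Abandonment matrix modeled on Table 1: for each page type, the probability
# that a user abandons when the page's response time falls in a given range
# (upper bound in seconds, probability).  Treating the percentages as
# per-page abandonment probabilities is an illustrative assumption.
ABANDONMENT = {
    "home":        [(5, 0.00), (10, 0.30), (15, 0.45), (20, 0.75)],
    "stock_quote": [(5, 0.00), (10, 0.15), (15, 0.25), (20, 0.45)],
    "transaction": [(5, 0.00), (10, 0.00), (15, 0.00), (20, 0.15)],
    "account":     [(5, 0.00), (10, 0.05), (15, 0.15), (20, 0.35)],
}

def abandonment_probability(page_type, response_time):
    """Look up the abandonment probability for a page's response time."""
    for upper_bound, probability in ABANDONMENT[page_type]:
        if response_time <= upper_bound:
            return probability
    return 1.0  # slower than the worst bucket: assume everyone gives up

def fetch_page(page_type):
    """Placeholder for a real, timed HTTP request; returns seconds elapsed.

    A real script would issue the request and time it (for example with
    time.perf_counter()); here we just draw a fake response time.
    """
    return random.expovariate(1 / 6.0)  # ~6-second average with a long tail

def run_session(pages=("home", "stock_quote", "stock_quote", "stock_quote")):
    """Run one simulated session; return (completed, pages_served)."""
    for served, page_type in enumerate(pages, start=1):
        response_time = fetch_page(page_type)
        if random.random() < abandonment_probability(page_type, response_time):
            return False, served      # user gave up after this page
    return True, len(pages)

if __name__ == "__main__":
    random.seed(0)
    results = [run_session() for _ in range(1000)]
    completed = sum(1 for ok, _ in results if ok)
    pages_served = sum(served for _, served in results)
    print(f"completed sessions: {completed}/1000, pages served: {pages_served}")
```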
This matrix takes into account the fact that people expect home pages to load very quickly, but are more tolerant of pages that they assume require more work from the servers (e.g., completing a stock transaction). This kind of table is actually a great way to get your peers and management to formally discuss and document the performance expectations for your Web site. As we will see a little later, this type of model also lets you create much better load testing reports, since you will be providing information that's significantly more meaningful and relevant than what you'd be able to deliver without a simulation of user abandonment.

Accurately Simulating Abandonment

Whenever I talk about user abandonment, people agree with its importance, and with the goal of simulating it as realistically as possible. But how can they do that? They have no idea what the user abandonment behavior might be for their Web site, or what percentages they should put in their user abandonment matrix. This is a very valid concern that, fortunately, can be addressed in several ways, depending on how accurate and realistic you want to be.

If you want to be as accurate and realistic as possible, you could set up your Web site to redirect a percentage of your visitors to a slower mirror version of the Web site, one that's identical to the main site except that it's artificially slowed down. This is not as complicated as it sounds, and it can be accomplished in a number of ways; the simplest method might be to add to each page some simple JavaScript code whose only purpose is to add an artificial delay of several seconds before displaying the content of the page.

Let's walk through a very simple example of how you might use this approach. Assume that you've set up your Web site so that 90% of the traffic is sent to the regular server, while the other 10% is routed to a server identical to the first, but in which the home page has been artificially slowed down by, say, 5 seconds. Run your Web site in this configuration for a few hours, or a few days, until you have enough sessions to make the results statistically significant. (I would set a minimum threshold of 1,000 user sessions before drawing any conclusions.) After this period of time, using a log file analyzer on both the regular server and the slowed-down server, take a look at what percentage of sessions requested the home page and then proceeded no further (i.e., all the sessions that abandoned after the home page; I call them home-alone sessions). If the percentage of home-alone sessions is 6% for the regular server and, say, 20% for the slowed-down server, you must conclude that, since everything else was equal, the increased abandonment of 14% had to be caused by impatient users not putting up with an additional delay of 5 seconds on the home page.

This approach does have a downside: it requires some effort, and you will have caused some inconvenience for 10% of your Web site visitors (and may have lost a few of them). But for some Web sites the cost of this type of experimentation is easily justified if it leads to a better understanding of customer behavior, an understanding which can then be applied to maximize the success of the overall Web site. Frederic Haubrich, the Chief Web and Technical Officer for the Web site Hooked on Phonics™ (www.hop.com), for example, regularly tests new Web site designs and navigation options on a small percentage of users to determine whether the new design increases or decreases the percentage of sessions that result in a purchase. New 40KB graphics for a home page might look great, but will the improved aesthetics compensate for the additional loading time?

The investment required for this approach is appropriate for a large, mission-critical Web site, where just 1% abandonment may mean losing big annual revenue; but it may be harder to justify for smaller sites. In such cases, your best bet is to make some educated guesses about the low and high abandonment rates for the various pages. You can be pretty certain, for example, that if a home page takes 30 seconds to load, a lot of people will not put up with it; so you can set the abandonment range with a low (best case) of 20% and a high (worst case) of, say, 50%. You can then run one load test with the best-case numbers and one with the worst-case numbers to get a rough idea of what the range of abandonment might be. Your goal is not to get these percentages exactly right, but to recognize and document your users' expectations and behavior.

This abandonment is not only an interesting result in itself; it also has a major impact on which parts of your Web site will get stressed under real conditions. If you have a very slow home page, for example, most real users will not continue their sessions and therefore will not put any load on the rest of the Web site. In this case, if you don't realistically simulate home page abandonment, you will apply an improbable and disproportionate load to the rest of the Web site. This improbable load might create an improbable bottleneck, and you might get stuck fixing a virtual problem that might never have occurred naturally, while ignoring the slow home page that is causing massive abandonment.

It's important to realize that even the most primitive abandonment model is a giant leap in realism when compared to the commonly used 60- or 120-second timeouts. And since there's a lot of money involved, there is no doubt that our understanding of Internet user expectations and behavior is going to increase dramatically in the next few years, allowing us to create increasingly accurate load models. (See this article's Sticky-Notes for more information.)
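As one way to compute the home-alone percentage described above, here is a simplified Python sketch. The log format, the file names, and the use of a client identifier as a stand-in for real session tracking are assumptions for illustration; a production log file analyzer would rely on session cookies and a proper log parser.

```python
from collections import defaultdict

def home_alone_percentage(log_path, home_path="/index.html"):
    """Percentage of sessions that requested the home page and nothing else.

    Assumes a simplified access log with one request per line:
        <client-id> <requested-path>
    Grouping requests by client id stands in for real session tracking.
    """
    requests_by_client = defaultdict(list)
    with open(log_path) as log:
        for line in log:
            parts = line.split()
            if len(parts) >= 2:
                client, path = parts[0], parts[1]
                requests_by_client[client].append(path)

    sessions = list(requests_by_client.values())
    if not sessions:
        return 0.0
    home_alone = sum(1 for paths in sessions if paths == [home_path])
    return 100.0 * home_alone / len(sessions)

if __name__ == "__main__":
    # Hypothetical log files for the regular and the slowed-down server.
    regular = home_alone_percentage("regular_server.log")
    slowed = home_alone_percentage("slow_server.log")
    print(f"home-alone sessions, regular server:     {regular:.1f}%")
    print(f"home-alone sessions, slowed-down server: {slowed:.1f}%")
    print(f"abandonment attributable to the delay:   {slowed - regular:.1f}%")
```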
3. Over-Averaging Page Response Times

Even if you manage to design, develop, and execute an incredibly realistic load test, you have one last opportunity to mess things up, really mess them up, when you analyze and report the results. As you might expect, load tests generate loads of data, and all this data can be processed, mangled, diced, and sliced in a number of ways by using, misusing, and abusing statistics to produce reports that can look very pretty but fail to shed light on what really matters. (Before you think that I have something against statistics, let me reassure you that I am a big fan of this discipline…at least 62.3% of the time. And even though 73.7% of statistics are usually made up on the spot, the remaining 26.3% are potentially very useful.)

When it comes to reporting load testing results, the greatest opportunity for voluntary, or involuntary, misuse of statistics is related to average page response time (APRT). Typically, the main objective of a load test is to determine the scalability of a Web site, an important question if you don't want to lose potential customers due to performance problems (i.e., slow-loading pages). Unfortunately, by averaging page response times, you run the risk of masking serious performance problems that will impact your users. Let me show you how, by using a final thought experiment.

Let's assume that you run a test with a load of 10,000 session starts per hour and you get an APRT for the home page of 4 seconds. Since your chart tells you that at less than 5 seconds you will experience negligible abandonment, you are in pretty good shape, right? Well, maybe yes, maybe no; this single piece of data does not tell you much. You could have an APRT of 4 seconds even though, at that load level, the Web site performance would be unacceptable for a large percentage of users. How? Let's consider three cases:

1. One way you can get an APRT of 4 seconds is if each home page was returned in approximately 4 seconds.

2. Another way to get an APRT of 4 seconds is if 5,000 of the home pages were returned in approximately 2 seconds and the other 5,000 in approximately 6 seconds.

3. Yet another way of getting an APRT of 4 seconds is if 9,000 users experience a 1-second response time, and 1,000 users experience a 31-second response time.
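A quick numerical check of these three cases (using hypothetical samples that match the descriptions above) confirms that the averages are identical while the high percentiles and the share of slow pages differ dramatically:

```python
from statistics import mean, quantiles

# Hypothetical response-time samples matching the three cases above (seconds).
cases = {
    "case 1": [4.0] * 10_000,
    "case 2": [2.0] * 5_000 + [6.0] * 5_000,
    "case 3": [1.0] * 9_000 + [31.0] * 1_000,
}

for name, samples in cases.items():
    p95 = quantiles(samples, n=100)[94]   # 95th percentile
    over_5s = 100 * sum(t > 5 for t in samples) / len(samples)
    print(f"{name}: mean={mean(samples):.1f}s  "
          f"95th percentile={p95:.1f}s  pages over 5s={over_5s:.0f}%")
```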
In the first case, all users should be happy, since they all experienced a response time below 5 seconds. In the second case, the APRT is the same, but half the users are experiencing a 6-second page response time, a lag that you know will cause some abandonment in a real-world situation. In the last case, even though the APRT is still the same 4 seconds, you have a thousand users with a truly unacceptable response time of 31 seconds, which points to a potentially serious performance problem and massive abandonment. Unfortunately, this danger sign was well hidden by the averaging process.
Another way that APRT distorts otherwise valid load test results is when the response times from different types of pages are carelessly combined into a single APRT. Let's assume that you run a load test and you get an APRT of 4 seconds for all pages. Pretty good, no? Well, possibly not. The home page response time may have been 30 seconds (which can easily happen when Web designers get carried away with fancy graphics), while all the other pages loaded in 2 or 3 seconds. In this case the APRT might look good, but, as you should know by now, most real users would never have gotten past that home page.

Here's a simple way to remember that averages can be very misleading. The next time you hear an average number, remember this: I could put one of your feet in a bucket of icy cold (0 degrees Celsius) water and the other one in a bucket of boiling (100 degrees Celsius) water and tell you that, on average, your feet are in a nice, warm, cozy 50-degree bath.

Overcoming the Problem of Averages

So, how do you get around the problems associated with the APRT? Fortunately, there are several ways:

1. First of all, make sure that you report different APRTs for different types of pages.

2. Augment the APRT number with other statistical information, such as the standard deviation or the median; however, first make sure that you take into account the statistical competence of your audience. How many people will be able to really understand what your statistics mean, or how to act on them? (For example, "At a load of 6,000 user sessions per hour, the average APRT for the home page was 4.9 seconds with a standard deviation of 2.3 seconds and a median value of 4.8 seconds.")

3. Show a chart with not only the average values, but the minimum and maximum values as well, as seen in Figure 2.

[FIGURE 2: Chart showing average, minimum, and maximum page response times against load level (sessions started per hour).]

4. Forgo the average altogether and report the percentage of pages returned within a specific, relevant time limit (e.g., "At a load of 6,000 user sessions per hour, 63% of the product information pages were returned in under 5 seconds.").

5. Present a histogram of the page response time distribution, as illustrated in Figure 3.

[FIGURE 3: Histogram of page response times at a load of 6,000 sessions per hour, showing the percentage of pages returned in each one-second interval from 1 to 10 seconds.]

All of these methods enrich the APRT and will help highlight any performance abnormality. But unfortunately, they still deal with pretty dry metrics to which most people cannot relate. Case in point, this conversation between two people who speak different languages:

QA project leader: Thirty-seven percent of the users experienced a home page response time greater than five seconds.

VP of Sales and Marketing: How interesting. But…so what? How do I use this number? Is it good or bad? What should it be, ideally?

Using the table and model described in the previous section, you can turn those numbers into meaningful information, showing not only the performance problems, but the impact they might have on the business. My favorite approach is to make page response time just one of the result metrics, and instead focus on user satisfaction and potential abandonment. The chart in Figure 4 is an example of how you might report the results, showing how changes in performance will impact the user experience and, potentially, your business. This will greatly complement the over-simplified (and potentially misleading) data based on page response time.

[FIGURE 4: User abandonment rates, showing the number of completed and abandoned sessions at load levels from 2,000 to 10,000 sessions started per hour.]
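To show how several of these remedies might be produced from raw measurements, here is an illustrative Python sketch that reports per-page-type summaries, the percentage of pages under a threshold, and a crude text histogram. The response-time samples are randomly generated stand-ins, and the 5-second threshold simply mirrors the home page row of Table 1; none of the numbers come from a real test.

```python
import random
from statistics import mean, median, pstdev, quantiles

def summarize(page_type, times, threshold=5.0):
    """Print the alternatives to a bare average for one page type."""
    p95 = quantiles(times, n=100)[94]
    under = 100 * sum(t <= threshold for t in times) / len(times)
    print(f"{page_type}: mean={mean(times):.1f}s  median={median(times):.1f}s  "
          f"stdev={pstdev(times):.1f}s  95th pct={p95:.1f}s  "
          f"under {threshold:.0f}s={under:.0f}%")

def histogram(times, bucket=1.0, width=40):
    """Print a crude text histogram of response times."""
    counts = {}
    for t in times:
        counts[int(t // bucket)] = counts.get(int(t // bucket), 0) + 1
    for b in sorted(counts):
        bar = "#" * (width * counts[b] // len(times))
        print(f"{b * bucket:4.0f}-{(b + 1) * bucket:.0f}s {bar}")

if __name__ == "__main__":
    random.seed(0)
    # Made-up measurements: most pages are fast, with a slow tail.
    home = [random.lognormvariate(1.2, 0.5) for _ in range(10_000)]
    quote = [random.lognormvariate(0.8, 0.4) for _ in range(10_000)]
    summarize("home page  ", home)
    summarize("stock quote", quote)
    print("\nHome page response time distribution:")
    histogram(home)
```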
Conclusion

Concurrent users, session timeouts, and average page response time are three of the most fundamental concepts in load testing, and three concepts that are regularly misunderstood, misused, and misrepresented, leading to potentially misleading load testing results. When you adopt concurrent users as a load testing input parameter and fail to account for user abandonment, you run the risk of creating loads that are highly unrealistic and improbable. As a result, you may be confronted with bottlenecks that might never occur under real circumstances. Risks abound at the other end of the load testing cycle, too: improper use of simple averages in the analysis phase might easily obscure very serious performance problems.

Some of the solutions we've looked at here are very simple to implement, while others require substantially more work. In the end, it's going to be up to you as a quality assurance professional to determine how realistic and accurate your load tests, and their results, have to be for your particular situation.

Alberto Savoia (alberto.savoia@keynote.com) is Chief Technologist of Keynote's load testing division, and has also served as founder and CTO of Velogic, General Manager of SunTest, and Director of Software Research at Sun Microsystems Laboratories. His sixteen-year career has been focused on applying scientific methodology and rigor to software testing.

Editor's note: This is the second of Alberto Savoia's three Web-related articles for STQE. His first article, "Web Load Test Planning," appeared in the March/April 2001 issue. His third article in the series will appear in the July/August 2001 issue.