Introduction to Software Testing

Software testing is a vital part of the software lifecycle. To understand its role, it is instructive to review the definitions of software testing in the literature. Among alternative definitions of testing are the following:

"... the process of exercising or evaluating a system or system component by manual or automated means to verify that it satisfies specified requirements or to identify differences between expected and actual results ..." (ANSI/IEEE Standard 729, 1983).

"... any activity aimed at evaluating an attribute or capability of a program or system and determining that it meets its required results. Testing is the measurement of software quality ..." (Hetzel, W., The Complete Guide to Software Testing, QED Information Sciences Inc., 1984).

"... the process of executing a program with the intent of finding errors ..." (Myers, G. J., The Art of Software Testing, Wiley, 1979).

Of course, none of these definitions claims that testing shows that software is free from defects. Testing can show the presence, but not the absence, of problems. According to Humphrey [1], software testing is defined as 'the execution of a program to find its faults'. Thus, a successful test is one that finds a defect. This sounds simple enough, but there is much to consider when we want to do software testing. Besides finding faults, we may also be interested in testing performance, safety, fault-tolerance or security. Testing often becomes a question of economics. For large projects, more testing will usually reveal more bugs. The question then becomes when to stop testing, and what is an acceptable level of bugs. This is the question of 'good enough software'. It is important to remember that testing assumes that requirements are already validated.
Basic Methods

White Box Testing
White box testing is performed to reveal problems with the internal structure of a program. This requires the tester to have detailed knowledge of the internal structure. A common goal of white-box testing is to ensure that test cases exercise every path through a program. A fundamental strength that all white box testing strategies share is that the entire software implementation is taken into account during testing, which facilitates error detection even when the software specification is vague or incomplete. The effectiveness or thoroughness of white-box testing is commonly expressed in terms of test or code coverage metrics, which measure the fraction of code exercised by test cases.

Black Box Testing

Black box tests are performed to assess how well a program meets its requirements, looking for missing or incorrect functionality. Functional tests typically exercise code with valid or nearly valid input for which the expected output is known. This includes concepts such as 'boundary values', illustrated in the sketch below. Performance tests evaluate response time, memory usage, throughput, device utilization, and execution time. Stress tests push the system to or beyond its specified limits to evaluate its robustness and error handling capabilities. Reliability tests monitor system response to representative user input, counting failures over time to measure or certify reliability.
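To make the notion of boundary values concrete, here is a minimal sketch of black-box test cases for a hypothetical order_discount function whose (invented) specification accepts quantities from 1 to 100 and applies a discount from 50 units upward; the function, its limits, and the threshold are assumptions for illustration only.

```python
# Hypothetical function under test: the assumed specification says the
# quantity must be between 1 and 100 inclusive, with a discount from 50 up.
def order_discount(quantity):
    if quantity < 1 or quantity > 100:
        raise ValueError("quantity out of range")
    return 0.10 if quantity >= 50 else 0.0

# Black-box boundary-value cases: values at, just inside, and just outside
# the specified limits, plus the point where the behaviour changes.
def test_boundaries():
    assert order_discount(1) == 0.0       # lower boundary of the valid range
    assert order_discount(100) == 0.10    # upper boundary of the valid range
    assert order_discount(49) == 0.0      # just below the discount threshold
    assert order_discount(50) == 0.10     # at the discount threshold
    for bad in (0, 101):                  # just outside the valid range
        try:
            order_discount(bad)
            assert False, "expected ValueError"
        except ValueError:
            pass

test_boundaries()
```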
Testing Levels

Different Levels of Test

Testing occurs at every stage of system construction. The larger the piece of code in which a defect must be detected, the harder and more expensive it is to find and correct that defect. The different levels of testing reflect that testing, in the general sense, is not a single phase of the software lifecycle; it is a set of activities performed throughout the entire software lifecycle. In considering testing, most people think of the activities described in Figure 1. The activities after Implementation are normally the only ones associated with testing. Software testing must be considered before implementation, as is suggested by the input arrows into the testing activities.
Figure 1: V-Shaped Life Cycle

The following paragraphs describe the testing activities from the 'second half' of the software lifecycle.

Unit Testing

Unit testing exercises a unit in isolation from the rest of the system. A unit is typically a function or a small collection of functions (libraries, classes) implemented by a single developer. The main characteristic that distinguishes a unit is that it is small enough to test thoroughly, if not exhaustively. Developers are normally responsible for testing their own units, and these are normally white box tests. The small size of units allows a high level of code coverage. It is also easier to locate and remove bugs at this level of testing.
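As a sketch of what such a unit test might look like, the example below exercises a small, hypothetical classify function in isolation using Python's standard unittest module; the test cases are chosen so that every branch of the unit is executed. The function and its thresholds are invented for illustration.

```python
import unittest

# Small, hypothetical unit under test: classify a temperature reading.
def classify(celsius):
    if celsius < 0:
        return "freezing"
    elif celsius < 25:
        return "moderate"
    else:
        return "hot"

class ClassifyTest(unittest.TestCase):
    # One test per branch gives full branch coverage of this small unit.
    def test_freezing(self):
        self.assertEqual(classify(-5), "freezing")

    def test_moderate(self):
        self.assertEqual(classify(10), "moderate")

    def test_hot(self):
        self.assertEqual(classify(30), "hot")

if __name__ == "__main__":
    unittest.main()
```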
Integration Testing

One of the most difficult aspects of software development is the integration and testing of large, untested sub-systems. The integrated system frequently fails in significant and mysterious ways, and it is difficult to fix.

Integration testing exercises several units that have been combined to form a module, subsystem, or system. Integration testing focuses on the interfaces between units, to make sure the units work together. The nature of this phase is certainly 'white box', as we must have a certain knowledge of the units to recognize if we have been successful in fusing them together in the module.

There are three main approaches to integration testing: top-down, bottom-up, and 'big bang'. Top-down combines, tests, and debugs top-level routines that become the test 'harness' or 'scaffolding' for lower-level units. Bottom-up combines and tests low-level units into progressively larger modules and subsystems. 'Big bang' testing is, unfortunately, the prevalent integration test 'method': waiting for all the module units to be complete before trying them out together.

The two systematic approaches compare as follows (from [1]):

Bottom-up
• Major features: Allows early testing aimed at proving the feasibility and practicality of particular modules. Modules can be integrated in various clusters as desired. Major emphasis is on module functionality and performance.
• Advantages: No test stubs are needed. It is easier to adjust manpower needs. Errors in critical modules are found early.
• Disadvantages: Test drivers are needed. Many modules must be integrated before a working program is available. Interface errors are discovered late.
• Comments: At any given point, more code has been written and tested than with top-down testing. Some people feel that bottom-up is a more intuitive test philosophy.

Top-down
• Major features: The control program is tested first. Modules are integrated one at a time. Major emphasis is on interface testing.
• Advantages: No test drivers are needed. The control program plus a few modules forms a basic early prototype. Interface errors are discovered early. Modular features aid debugging.
• Disadvantages: Test stubs are needed. The extended early phases dictate a slow manpower buildup. Errors in critical modules at low levels are found late.
• Comments: An early working program raises morale and helps convince management that progress is being made. It is hard to maintain a pure top-down strategy in practice.
Integration tests can rely heavily on stubs or drivers. Stubs stand in for unfinished subroutines or sub-systems. A stub might consist of a function header with no body, or it may read and return test data from a file, return hard-coded values, or obtain data from the tester. Stub creation can be a time-consuming part of testing. The cost of drivers and stubs in the top-down and bottom-up testing methods is what drives the use of 'big bang' testing. This approach waits for all the modules to be constructed and tested independently; when they are finished, they are integrated all at once. While this approach is very quick, it frequently reveals more defects than the other methods. These errors have to be fixed, and as we have seen, errors that are found 'later' take longer to fix. In addition, like bottom-up, there is really nothing that can be demonstrated until later in the process.
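As an illustration of the kind of stub described above, the sketch below stands in for an unfinished pricing subsystem by returning hard-coded values so that a calling unit can be integration-tested top-down; the function names and prices are hypothetical.

```python
# Stub for a hypothetical, unfinished pricing subsystem.
# The real implementation might query a pricing database; the stub
# simply returns hard-coded test data.
def get_price(product_id):
    canned_prices = {"A100": 9.99, "B200": 24.50}
    return canned_prices.get(product_id, 1.00)

# Unit being integrated: it calls the pricing interface.
def order_total(items):
    return sum(get_price(pid) * qty for pid, qty in items)

# Top-down integration test of order_total against the stub.
assert abs(order_total([("A100", 2), ("B200", 1)]) - 44.48) < 1e-9
```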
External Function Testing

The 'external function test' is a black box test to verify that the system correctly implements specified functions. This phase is sometimes known as an alpha test. Testers run tests that they believe reflect the end use of the system.

System Testing

The 'system test' is a more robust version of the external function test, and may also be referred to as an alpha test. The essential difference between 'system' and 'external function' testing is the test platform. In system testing, the platform must be as close as possible to production use in the customers' environment, including factors such as hardware setup and database size and complexity. By replicating the target environment, we can more accurately test 'softer' system features (performance, security and fault-tolerance). Because of the similarities between the test suites in the external function and system test phases, a project may leave one of them out. It may be too expensive to replicate the user environment for the system test, or we may not have enough time to run both.

Acceptance Testing

An acceptance (or beta) test is an exercise of a completed system by a group of end users to determine whether the system is ready for deployment. Here the system will receive more realistic testing than in the 'system test' phase, as the users have a better idea of how the system will be used than the system testers.

Regression Testing

Regression testing is an expensive but necessary activity performed on modified software to provide confidence that changes are correct and do not adversely affect other system components. Four things can happen when a developer attempts to fix a bug; three of them are bad, and one is good:

                        New Bug    No New Bug
Successful Change       Bad        Good
Unsuccessful Change     Bad        Bad

Because of the high probability that one of the bad outcomes will result from a change to the system, it is necessary to do regression testing. It can be difficult to determine how much re-testing is needed, especially near the end of the development cycle. Most industrial testing is done via test suites: automated sets of procedures designed to exercise all parts of a program and to show defects. While the original suite could be used to test the modified software, this might be very time-consuming. A regression test selection technique chooses, from an existing test set, the tests that are deemed necessary to validate modified software. There are three main groups of test selection approaches in use:

• Minimization approaches seek to satisfy structural coverage criteria by identifying a minimal set of tests that must be rerun.
• Coverage approaches are also based on coverage criteria, but do not require minimization of the test set. Instead, they seek to select all tests that exercise changed or affected program components (a sketch of this idea follows the list).
• Safe approaches attempt instead to select every test that will cause the modified program to produce different output than the original program.
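As a hedged sketch of the coverage-based selection idea, the fragment below assumes we already have a mapping from each test to the program components it exercises, and it selects the tests that touch any changed component; the test names and component names are invented for illustration.

```python
# Assumed coverage data: which components each test exercises.
coverage = {
    "test_login":    {"auth", "session"},
    "test_checkout": {"cart", "payment"},
    "test_profile":  {"auth", "profile"},
}

def select_regression_tests(changed_components, coverage):
    """Coverage-based selection: rerun every test that exercises a
    changed or affected component."""
    changed = set(changed_components)
    return [name for name, comps in coverage.items() if comps & changed]

# If only the authentication code changed, two of the three tests are rerun.
print(select_regression_tests(["auth"], coverage))
# ['test_login', 'test_profile']
```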
An interesting approach to limiting test cases is based on whether we can confine testing to the "vicinity" of the change. (For example, if I put a new radio in my car, do I have to do a complete road test to make sure the change was successful?) A new breed of regression test theory tries to identify, through program flows or reverse engineering, where boundaries can be placed around modules and subsystems. These graphs can determine which tests from the existing suite may exhibit changed behavior on the new version.

Regression testing has been receiving more attention as corporations focus on fixing the 'Year 2000 Bug'. The goal of most Y2K efforts is to correct the date-handling portions of a system without changing any other behavior. A new 'Y2K' version of the system is compared against a baseline original system. With the obvious exception of date formats, the behavior of the two versions should be identical. This means that not only do they do the same things correctly, they also do the same things incorrectly: a non-Y2K bug in the original software should not have been fixed by the Y2K work.

A frequently asked question about regression testing is 'The developer says this problem is fixed. Why do I need to re-test?', to which the answer is 'The same person probably told you it worked in the first place'.

Installation Testing

The testing of full, partial, or upgrade install/uninstall processes.

Completion Criteria

There are a number of different ways to determine that the test phase of the software lifecycle is complete. Some common examples are:

• All black-box test cases have been run.
• White-box test coverage targets have been met.
• The rate of fault discovery falls below a target value.
• A target percentage of all faults in the system has been found.
• The measured reliability of the system achieves its target value (e.g. mean time to failure).
• Test phase time or resources are exhausted.
When we begin to talk about completion criteria, we move naturally into a discussion of software testing metrics.
Metrics

Goals

As stated above, the major goal of testing is to discover errors in the software. A secondary goal is to build confidence that the system will work without error when testing does not reveal any errors. What, then, does it mean when testing does not detect any errors? Either the software is of high quality, or the testing process is of low quality. We need metrics on our testing process if we are to tell which is the right answer. As with all domains of the software process, there are hosts of metrics that can be used in testing. Rather than discuss the merits of specific measurements, it is more important to know what they are trying to achieve. Three themes prevail:
• Quality Assessment: what percentage of defects is captured by our testing process, and how many remain?
• Risk Management: what is the risk related to remaining defects?
• Test Process Improvement: how long does our testing process take?
Quality Assessment

An important question in the testing process is "when should we stop?" The answer is when system reliability is acceptable or when the gain in reliability cannot compensate for the testing cost. To answer either of these concerns we need a measurement of the quality of the system.

The most commonly used means of measuring system quality is defect density:

defect density = number of defects / system size

where system size is usually expressed in thousands of lines of code (KLOC). Although it is a useful indicator of quality when used consistently within an organization, there are a number of well documented problems with this metric; the most common relate to inconsistent definitions of defects and of system size. Defect density also accounts only for defects that are found in-house or over a given amount of operational field use. Other metrics attempt to estimate how many defects remain undetected.

A simple case of defect estimation is based on "error seeding". We assume the system has X errors and artificially seed it with S additional errors. After testing, we have discovered Tr 'real' errors and Ts seeded errors. If we assume (a questionable assumption) that the testers find the same percentage of seeded errors as real errors, we can calculate X:

S / (X + S) = Ts / (Tr + Ts)

X = S * ((Tr + Ts) / Ts - 1)
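A small numeric sketch of this estimate, using invented counts:

```python
def estimate_real_errors(seeded, seeded_found, real_found):
    """Estimate the total number of real errors X from S seeded errors,
    assuming testers find the same fraction of seeded and real errors:
    X = S * ((Tr + Ts) / Ts - 1), which simplifies to S * Tr / Ts."""
    return seeded * real_found / seeded_found

# Illustrative numbers: 20 seeded errors, 10 of them found, 35 real errors found.
# Half the seeded errors were found, so the estimate is 70 real errors in total.
print(estimate_real_errors(seeded=20, seeded_found=10, real_found=35))  # 70.0
```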
For example, if we find half the seeded errors, then the number of 'real' defects found represents half of the total defects in the system. Estimating the number and severity of undetected defects allows informed decisions on whether the quality is acceptable or additional testing is cost-effective. It is very important to consider maintenance costs and redevelopment efforts when deciding on the value of additional testing.

Risk Management

Metrics involved in risk management measure how important a particular defect is (or could be). These measurements allow us to prioritize our testing and repair cycles. A truism is that there is never enough time or resources for complete testing, making prioritization a necessity.

One approach is known as Risk Driven Testing, where Risk has a specific meaning. The failure of each component is rated by Impact and Likelihood. Impact is a severity rating, based on what would happen if the component malfunctioned. Likelihood is an estimate of how probable it is that the component would fail. Together, Impact and Likelihood determine the Risk for the piece: the higher the rating on each scale, the higher the overall risk posed by defects in the component. With a rating scale, this might be represented visually:

[Figure: a 4x4 grid plotting Impact (vertical axis, 1-4) against Likelihood (horizontal axis, 1-4).]
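One simple way to act on these ratings is to combine them into a single score and test the highest-scoring components first. The sketch below multiplies impact by likelihood, which is one common convention rather than anything prescribed by the text, and the components and ratings are invented.

```python
# Hypothetical components rated on 1-4 scales for impact and likelihood.
components = [
    {"name": "payment engine",  "impact": 4, "likelihood": 2},
    {"name": "report printer",  "impact": 1, "likelihood": 4},
    {"name": "session manager", "impact": 3, "likelihood": 3},
]

# Simple risk score: impact x likelihood; higher scores are tested first.
for c in sorted(components, key=lambda c: c["impact"] * c["likelihood"], reverse=True):
    print(c["name"], c["impact"] * c["likelihood"])
# session manager 9, payment engine 8, report printer 4
```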
The relative importance of likelihood and impact will vary from project to project and from company to company.

A system-level measurement for risk management is the Mean Time To Failure (MTTF). Test data sampled from realistic beta testing is used to find the average time until system failure, and this data is extrapolated to predict overall uptime and the expected time the system will be operational. Sometimes measured with MTTF is the Mean Time To Repair (MTTR), which represents the expected time until the system will be repaired and back in use after a failure is observed. Availability, obtained by calculating MTTF / (MTTF + MTTR), is the probability that a system is available when needed. While these are reasonable measures for assessing quality, they are more often used to assess the risk (financial or otherwise) that a failure poses to a customer or, in turn, to the system supplier.
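A quick numeric sketch of the availability calculation, with made-up figures:

```python
def availability(mttf_hours, mttr_hours):
    """Availability = MTTF / (MTTF + MTTR): the probability that the
    system is available when needed."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Illustrative figures: a failure every 400 hours on average,
# and 2 hours on average to restore service.
print(round(availability(400, 2), 4))  # 0.995
```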
Process Improvement

It is generally accepted that to achieve improvement you need a measure against which to gauge performance. To improve our testing processes we need the ability to compare the results of one process with another. Popular measures of the testing process report:

• Effectiveness: the number of defects found and successfully removed divided by the number of defects presented.
• Efficiency: the number of defects found in a given time.
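A minimal sketch, using invented counts, of how these two measures might be computed:

```python
def effectiveness(found_and_removed, presented):
    """Fraction of the defects presented to the test process that were
    found and successfully removed."""
    return found_and_removed / presented

def efficiency(found, hours_spent):
    """Defects found per unit of testing time."""
    return found / hours_spent

# Illustrative figures: 45 of 60 known defects removed over 120 hours of testing.
print(effectiveness(45, 60))   # 0.75
print(efficiency(45, 120))     # 0.375
```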
It is also important to consider system failures reported from the field by customers. If a high percentage of customer-reported defects were not revealed in-house, it is a significant indicator that the testing process is incomplete. A good defect reporting structure will allow defect types and origins to be identified. We can use this information to improve the testing process by altering and adding test activities to improve our chances of finding the defects that are currently escaping detection. By tracking our test efficiency and effectiveness, we can evaluate the changes made to the testing process. Testing metrics give us an idea of how reliable our testing process has been at finding defects, and are a reasonable indicator of its performance in the future. It must be remembered that measurement is not the goal; improvement through measurement, analysis and feedback is what is needed.