Problems and Prospects in Quantifying Software Maintainability

JARRETT ROSENBERG
[email protected]
Sun Microsystems, 2550 Garcia Avenue, Mountain View, CA 94043
In one form or another, quantifying the maintainability of software has been attempted for decades, in pursuit of two goals:

• predicting the likelihood of corrective maintenance,
• predicting the difficulty of maintenance activities, corrective or otherwise.
Despite the extensive activity in this area, we are not much further along than when we started. This paper looks at some of the reasons why, and at the factors necessary for future progress.

Goal #1: Predicting Likelihood of Corrective Maintenance

Predicting the likelihood of defects in source code has long been a major focus of software metrics research (Melton, 1996). Two factors must be considered in any such prediction: characteristics of the source code itself (including the design it embodies), and characteristics of the testing process applied to the code up to the point when the prediction is made. Let us examine each of these in turn.

Source Code Characteristics

Some thirty years of research has yielded two basic classes of metrics: size metrics (most famously, lines of code and the Halstead metrics) and complexity metrics (most famously, McCabe's cyclomatic complexity metric). Size metrics do predict likelihood of defects, but in precisely the same way as any exposure variable does: the more source code, the more defects. If an application's source code were divided into equal-sized files, size would no longer predict the likelihood of defects per file. For this reason, size metrics are useful only as covariates in evaluating the effect of other, non-size metrics; that is to say, predictions must be adjusted for size to be valid.
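To make the two classes of metrics concrete, the following sketch computes both a size metric (lines of code per function) and a rough approximation of McCabe's cyclomatic complexity, using Python's standard ast module. It is illustrative only: the set of "decision" node types and the helper names are choices made for this sketch rather than a validated measurement instrument, and any study using such numbers would still need to adjust for size as argued above.

    import ast

    # Node types counted as decision points; a rough stand-in for the branch
    # constructs McCabe's metric counts. Requires Python 3.8+ for end_lineno.
    DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                      ast.BoolOp, ast.IfExp)

    def cyclomatic_complexity(func_node):
        """Approximate McCabe complexity: one plus the number of decision points."""
        return 1 + sum(isinstance(node, DECISION_NODES)
                       for node in ast.walk(func_node))

    def metrics_per_function(source):
        """Return (name, lines_of_code, complexity) for every function definition."""
        results = []
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                loc = node.end_lineno - node.lineno + 1
                results.append((node.name, loc, cyclomatic_complexity(node)))
        return results

    if __name__ == "__main__":
        # A tiny example program to measure.
        sample = (
            "def classify(x):\n"
            "    if x < 0:\n"
            "        return 'negative'\n"
            "    for _ in range(3):\n"
            "        x += 1\n"
            "    return 'non-negative'\n"
        )
        for name, loc, cc in metrics_per_function(sample):
            print(f"{name}: LOC={loc}, cyclomatic complexity={cc}")

Even this toy example shows why the two classes are distinct: a function can grow in lines of code without adding a single decision point, and vice versa.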
Although complexity metrics are somewhat correlated with size metrics (since it is difficult for a module to be very complex without also being somewhat large), factor analyses typically show them to be distinct from the latter. Of the many different complexity metrics (cf. Zuse, 1991), cyclomatic complexity is the one most frequently cited as associated with the presence of defects (Shepperd and Ince, 1993; Rosenberg, 1996a). The association is nevertheless weak and sometimes missing altogether.

There are a variety of reasons for the failure to find a compelling connection between source code characteristics and defects: for example, the various studies are done quite differently, on different samples of source code (often in different languages). Misunderstandings about measurement are also rampant (Fenton, 1991; Briand et al., 1996). Yet the fundamental reason is that researchers have failed to appreciate just how important the distinction between syntactic and semantic defects is. Syntactic defects, from typographic errors to abuse of language features like goto, are easy to detect and measure. As a result, such syntactic defects are increasingly caught by development tools, from compilers to memory-leak checkers.¹ This reduction in syntactic defects both reduces the effectiveness of syntactic predictions and heightens the role of semantically based defects, yet semantically based metrics are few and rarely studied. This is a pity, because semantic errors are the "deep" ones, due either to a conceptual oversight or to a fundamental misunderstanding of the problem or its solution. Even when they are not the most numerous, they are the hardest to find and fix (see the review of studies in Roper, 1994).

If there are to be predictive source code characteristics, then, they are likely to be semantically based metrics rather than the syntactic ones currently in use. This does not mean that source code analysis is pointless, however, merely that it will have to be much more sophisticated. For example, metrics based on the results of program slicing (Weiser, 1984; Gallagher and Lyle, 1991) may give a better indication of complexity than any current syntactic metric.

Test Process Characteristics

While the software metrics community has been trying to predict the occurrence of defects from syntactic features of the code itself, the testing community has been independently trying to assess the presence or absence of defects from characteristics of the testing process the source code has undergone. Here again a basic distinction has been made but not fully appreciated: the distinction between structural and functional testing. Both are necessary, but they have very different roles. Since, by definition, only a sample of possible inputs can be used to test software, stochastically based functional testing is the only way to rigorously and objectively assess the reliability of the software. Since infrequently used and/or safety-critical parts of the code cannot be adequately tested by functional means, structural testing must be used there instead.

Unfortunately, just as the ease of syntactic measurement has distracted metrics researchers, the ease of structural testing compared to stochastic testing has distracted testing researchers into unhelpful comparisons of different coverage methods and partitioning strategies (Hamlet and Taylor, 1990). Even though there is no evidence that degree of test coverage is a cause rather than a consequence of a thorough test program, increasing
test coverage is usually taken as the goal of test process improvement. This misdirection of effort leaves unaddressed the critical issues of functional testing, such as the construction of valid operational profiles and the automation of the test oracle. The basic principle still stands, however: the type and amount of testing carried out on the source code must also be factored into the equation predicting likelihood of defects. Complex code that has been heavily tested may actually be less likely to contain defects than simple code that has not been extensively tested. Exactly what the relationship is remains unknown, since to my knowledge no one has done the experiment.

Goal #2: Predicting the Difficulty of Maintenance

Beyond the likelihood that corrective maintenance may need to be performed, we would like to be able to tell from the source code how difficult it will be to make a change, for whatever reason. Difficulty is measured both in the amount of time needed to make the change and in the probability of successfully making the change on the first attempt. Again we are faced with essentially semantic problems rather than syntactic ones.

Code Intelligibility

The intelligibility or comprehensibility of code is obviously a major factor in successfully making changes to it. While various tools have been developed to visualize code and thus aid its comprehension (e.g., Eick, 1992), the real issues are cognitive, not technical. Psychologists have become increasingly interested in programming as a domain in which to explore theories of learning and cognition (as in the series of symposia on empirical studies of programmers, e.g., Cook et al., 1993), but they are unfortunately rarely sufficiently acquainted with industrial software development and developers to make a major impact. Conversely, computer scientists grappling with the issue of program comprehension (as in the IEEE symposia on program comprehension) seem unaware of the existence of cognitive psychology. This lack of awareness of both halves of the puzzle has led each side to focus on what it knows best, leaving unaddressed some of the most basic issues, such as how the factors of programmer skill, domain knowledge, language features, and development environment interact to allow the creation and comprehension of intelligible code.

Code Modularity

The more modular the code, the less likely a change to it will create a ripple effect involving changes, especially unforeseen changes, elsewhere in it. The challenge, then, is to create valid measures of modularity which can guide the process of "software change impact analysis" (Bohner and Arnold, 1996). Such measures are partly syntactic and partly semantic, and will require some sophisticated tools to be effective. Program slicing techniques may also be valuable here, as metrics based on them could indicate how far-reaching the effects of one module are. The advent of object-oriented programming languages creates new
opportunities, and new challenges, for defining and assessing modularity. While object-oriented metrics (Lorenz and Kidd, 1994) and object-oriented design principles (Lakos, 1996) have been proposed, there is little empirical work based on them.
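As one rough illustration of how even a purely syntactic modularity signal might be collected, the sketch below counts each module's "fan-out": how many other modules in the same project it imports. The directory name and helper functions are hypothetical, and import counts are at best a crude proxy for the partly semantic coupling measures called for above.

    import ast
    from pathlib import Path

    def imported_modules(path):
        """Return the set of top-level module names imported by a Python file."""
        tree = ast.parse(Path(path).read_text())
        names = set()
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                names.add(node.module.split(".")[0])
        return names

    def fan_out(project_dir):
        """Map each module in the project to the other project modules it imports."""
        files = {p.stem: p for p in Path(project_dir).glob("*.py")}
        return {name: sorted(imported_modules(p) & (set(files) - {name}))
                for name, p in files.items()}

    if __name__ == "__main__":
        # "src" is a placeholder for a real project directory.
        for module, deps in sorted(fan_out("src").items()):
            print(f"{module}: fan-out {len(deps)} -> {deps}")

A slice-based or class-level coupling measure would require far more machinery, but its results could be reported and compared in the same form.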
Code Decay

Just as the testing history of source code must be taken into account in predicting its likelihood of defects, the amount of modification the code has undergone must be taken into account in assessing its ease of modifiability. The phenomenon of "code decay" discussed by other participants in this workshop, whereby continually modified code becomes progressively harder to comprehend and maintain, is a kind of limiting factor in the natural history of programs. Understanding and predicting this phenomenon is one of the central issues of maintainability.
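A minimal sketch of the kind of modification history such an assessment would draw on, assuming the code's history is kept in a git repository, is to count how many times each file has been changed. Change frequency by itself is only a crude proxy for decay, but it is exactly the sort of history that, the argument runs, must be factored in.

    import subprocess
    from collections import Counter

    def change_counts(repo_dir):
        """Count, for each file path, the number of commits that touched it."""
        log = subprocess.run(
            ["git", "log", "--name-only", "--pretty=format:"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        ).stdout
        return Counter(line.strip() for line in log.splitlines() if line.strip())

    if __name__ == "__main__":
        # "." is a placeholder for the repository under study.
        for path, n in change_counts(".").most_common(10):
            print(f"{n:4d} changes  {path}")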
Future Prospects

Given that there has been so little progress to date, there would be little hope for future progress if past trends were to continue. I am nevertheless optimistic, because much of the lack of progress has been due to a lack of appreciation of how much other disciplines are needed to solve these problems. The unique aspects of software seem to have convinced software engineers that they have little to learn from other disciplines, with a consequent tendency to sporadically and poorly re-invent methods that have been known for decades in other domains. Slowly an awareness is growing of the need for participation by psychologists, statisticians, and traditional industrial engineers in solving software engineering problems (see, for example, NRC, 1996, and Rosenberg, 1996b). There are two factors that I believe are critical to success in quantifying maintainability:
• A genuinely scientific approach. There needs to be a serious emphasis on formulating hypotheses and conducting properly designed empirical research to test those hypotheses. A large body of publicly available source code and related data (e.g., defect data, configuration data) is needed to support the normal scientific practices of re-analysis and replication; far too much of current work is based on either toy systems or proprietary data, and it will be hard to build a science under those conditions. Moreover, many of the basic phenomena under study need to be more fully defined: for example, while it is generally accepted that programmer characteristics such as expertise are critical factors in software quality and productivity, there are as yet no generally accepted definitions or measuring instruments for them. Without such standard measures, there is no way to empirically demonstrate those effects, let alone compare experiments and accumulate a body of knowledge about them. Finally, until computer scientists are trained in research methodology and data analysis, there needs to be an awareness that experts in these areas should be consulted when doing empirical research.
• Active interdisciplinary collaboration with other fields. Few of software engineering's critical problems can be solved without the active involvement of other disciplines, whether they be methodologists such as statisticians, or those with critical theories and results, such as cognitive psychologists or industrial engineers. Such collaborations have already started to occur, and they need to be actively supported; interdisciplinary work is not easy, but progress without it is simply not possible.
Notes

1. Indeed, Hatton (1995) makes the claim that even a defect-causing language like C can be made safe to use if tools are used to catch the stereotypic errors the language induces.
References

Bohner, S., and Arnold, R., eds. 1996. Software Change Impact Analysis. Los Alamitos, CA: IEEE Computer Society Press.
Briand, L., El Emam, K., and Morasca, S. 1996. On the application of measurement theory in software engineering. Empirical Software Engineering 1(1): 61–81.
Cook, C., Scholtz, J., and Spohrer, J., eds. 1993. Empirical Studies of Programmers: Fifth Workshop. Norwood, NJ: Ablex.
Fenton, N. 1991. Software Metrics: A Rigorous Approach. London: Chapman and Hall.
Gallagher, K., and Lyle, J. 1991. Using program slicing in software maintenance. IEEE Trans. Software Eng. 17(8): 751–761.
Hamlet, R., and Taylor, R. 1990. Partition testing does not inspire confidence. IEEE Trans. Software Eng. 16(12): 1402–1411.
Hatton, L. 1995. Safer C. NY: McGraw-Hill.
Lakos, J. 1996. Large Scale C++ Software Design. Reading, MA: Addison-Wesley.
Lorenz, M., and Kidd, J. 1994. Object-Oriented Software Metrics. Englewood Cliffs, NJ: Prentice-Hall.
Melton, A., ed. 1996. Software Measurement. London: International Thomson Computer Press.
National Research Council (NRC). 1996. Statistical Software Engineering. Committee on Applied and Theoretical Statistics, Board on Mathematical Sciences. Washington, D.C.: National Academy Press.
Roper, M. 1994. Software Testing. NY: McGraw-Hill.
Rosenberg, J. 1996a. Linking internal and external quality measures. Proceedings of the Ninth International Software Quality Week. San Francisco, California, May 1996.
Rosenberg, J. 1996b. Software testing as acceptance sampling. Proceedings of the Fourteenth Pacific Northwest Software Quality Conference. Portland, Oregon, Oct. 1996.
Shepperd, M., and Ince, D. 1993. Derivation and Validation of Software Metrics. Oxford: Clarendon Press.
Weiser, M. 1984. Program slicing. IEEE Trans. Software Eng. 10(6): 352–357.
Zuse, H. 1991. Software Complexity: Measures and Methods. NY: De Gruyter.