Statistical Methods for Rater Agreement
Recep ÖZCAN
[email protected] http://recepozcan06.blogcu.com/
2009
INDEX

1. Statistical Methods for Rater Agreement
   1.0 Basic Considerations
   1.1 Know the goals
   1.2 Consider theory
   1.3 Reliability vs. validity
   1.4 Modeling vs. description
   1.5 Components of disagreement
   1.6 Keep it simple
       1.6.1 An example
   1.7 Recommended Methods
       1.7.1 Dichotomous data
       1.7.2 Ordered-category data
       1.7.3 Nominal data
       1.7.4 Likert-type items
2. Raw Agreement Indices
   2.0 Introduction
   2.1 Two Raters, Dichotomous Ratings
   2.2 Proportion of overall agreement
   2.3 Positive agreement and negative agreement
   2.4 Significance, standard errors, interval estimation
       2.4.1 Proportion of overall agreement
       2.4.2 Positive agreement and negative agreement
   2.5 Two Raters, Polytomous Ratings
   2.6 Overall Agreement
   2.7 Specific agreement
   2.8 Generalized Case
   2.9 Specific agreement
   2.10 Overall agreement
   2.11 Standard errors, interval estimation, significance
3. Intraclass Correlation and Related Method
   3.0 Introduction
   3.1 Different Types of ICC
   3.2 Pros and Cons
       3.2.1 Pros
       3.2.2 Cons
   3.3 The Comparability Issue
4. Kappa Coefficients
   4.0 Summary
5. Tests of Marginal Homogeneity
   5.0 Introduction
   5.1 Graphical and descriptive methods
   5.2 Nonparametric tests
   5.3 Bootstrapping
   5.4 Loglinear, association and quasi-symmetry modeling
   5.5 Latent trait and related models
6. The Tetrachoric and Polychoric Correlation Coefficients
   6.0 Introduction
       6.0.1 Summary
   6.1 Pros and Cons: Tetrachoric and Polychoric Correlation Coefficients
       6.1.1 Pros
       6.1.2 Cons
   6.2 Intuitive Explanation
7. Detailed Description
   7.0 Introduction
   7.1 Measurement Model
   7.2 Using the Polychoric Correlation to Measure Agreement
   7.3 Extensions and Generalizations
       7.3.1 Examples
   7.4 Factor analysis and SEM
       7.4.1 Programs for tetrachoric correlation
       7.4.2 Programs for polychoric and tetrachoric correlation
       7.4.3 Generalized latent correlation
8. Latent Trait Models for Rater Agreement
   8.0 Introduction
   8.1 Measurement Model
   8.2 Evaluating the Assumptions
   8.3 What the Model Provides
9. Odds Ratio and Yule's Q
   9.0 Introduction
   9.1 Intuitive explanation
   9.2 Yule's Q
   9.3 Log-odds ratio
   9.4 Pros and Cons: the Odds Ratio
       9.4.1 Pros
       9.4.2 Cons
   9.5 Extensions and alternatives
       9.5.1 Extensions
       9.5.2 Alternatives
10. Agreement on Interval-Level Ratings
   10.0 Introduction
   10.1 General Issues
   10.2 Rater Association
   10.3 Rater Bias
   10.4 Rating Distribution
   10.5 Rater vs. Rater or Rater vs. Group
   10.6 Measuring Rater Agreement
   10.7 Measuring Rater Association
   10.8 Measuring Rater Bias
   10.9 Rater Distribution Differences
   10.10 Using the Results
   10.11 The Delphi Method
   10.12 Rater Bias
   10.13 Rater Association
   10.14 Distribution of Ratings
   10.15 Discussion of Ambiguous Cases
1. Statistical Methods for Rater Agreement

1.0 Basic Considerations

In many fields it is common to study agreement among ratings of multiple judges, experts, diagnostic tests, etc. We are concerned here with categorical ratings: dichotomous (Yes/No, Present/Absent, etc.), ordered-categorical (Low, Medium, High, etc.), and nominal (Schizophrenic, Bi-Polar, Major Depression, etc.) ratings. Likert-type ratings--intermediate between ordered-categorical and interval-level ratings--are also considered.

There is little consensus about which statistical methods are best for analyzing rater agreement (we will use the generic words "raters" and "ratings" here to include observers, judges, diagnostic tests, etc. and their ratings/results). To the non-statistician, the number of alternatives and the lack of consistency in the literature are no doubt cause for concern. This site aims to reduce confusion and help researchers select appropriate methods for their applications.

Despite the many apparent options for analyzing agreement data, the basic issues are very simple. Usually there are one or two methods best suited to a particular application. But it is necessary to clearly identify the purpose of the analysis and the substantive questions to be answered.

1.1 Know the goals

The most common mistake made when analyzing agreement data is not having an explicit goal. It is not enough for the goal to be "measuring agreement" or "finding out if raters agree." There is presumably some reason why one wants to measure agreement. Which statistical method is best depends on this reason.

For example, rating agreement studies are often used to evaluate a new rating system or instrument. If such a study is being conducted during the development phase of the instrument, one may wish to analyze the data using methods that identify how the instrument could be changed to improve agreement. However, if an instrument is already in its final format, the same methods might not be helpful.

Very often agreement studies are an indirect attempt to validate a new rating system or instrument. That is, lacking a definitive criterion variable or "gold standard," the accuracy of a scale or instrument is assessed by comparing its results when used by different raters. Here one may wish to use methods that address the issue of real concern--how well do ratings reflect the true trait one wants to measure? In other situations one may be considering combining the ratings of two or more raters to obtain evaluations of suitable accuracy. If so, again, specific methods suitable for this purpose should be used.

1.2 Consider theory

A second common problem in analyzing agreement is the failure to think about the data from
the standpoint of theory. Nearly all statistical methods for analyzing agreement make assumptions. If one has not thought about the data from a theoretical point of view, it will be hard to select an appropriate method.

The theoretical questions one asks do not need to be complicated. Even simple questions help--for example: is the trait being measured really discrete, like presence/absence of a pathogen, or is it really continuous and merely divided into discrete levels (e.g., "low," "medium," "high") for convenience? If the latter, is it reasonable to assume that the trait is normally distributed? Or is some other distribution plausible?

Sometimes one will not know the answers to these questions. That is fine, too, because there are methods suitable for that case also. The main point is to be inclined to think about data in this way, and to be attuned to the issue of matching method and data on this basis.

These two issues--knowing one's goals and considering theory--are the main keys to successful analysis of agreement data. Following are some other, more specific issues that pertain to the selection of methods appropriate to a given study.

1.3 Reliability vs. validity

One can broadly distinguish two reasons for studying rating agreement. Sometimes the goal is to estimate the validity (accuracy) of ratings in the absence of a "gold standard." This is a reasonable use of agreement data: if two ratings disagree, then at least one of them must be incorrect. Proper analysis of agreement data therefore permits certain inferences about how likely a given rating is to be correct. Other times one merely wants to know the consistency of ratings made by different raters. In some cases, the issue of accuracy may even have no meaning--for example, ratings may concern opinions, attitudes, or values.

1.4 Modeling vs. description

One should also distinguish between modeling and describing agreement. Ultimately, there are only a few simple ways to describe the amount of agreement: for example, the proportion of times two ratings of the same case agree, the proportion of times raters agree on specific categories, the proportions of times different raters use the various rating levels, etc. The quantification of agreement in any other way inevitably involves a model about how ratings are made and why raters agree or disagree. This model is either explicit, as with latent structure models, or implicit, as with the kappa coefficient. With this in mind, two basic principles are evident:

• It is better to have a model that is explicitly understood than one which is only implicit and potentially not understood.
• The model should be testable.

Methods vary with respect to how well they meet these two criteria.
1.5 Components of disagreement

Consider that disagreement has different components. With ordered-category (including dichotomous) ratings, one can distinguish between two different sources of disagreement. Raters may differ: (a) in their definition of the trait itself; or (b) in their definitions of specific rating levels or categories.

A trait definition can be thought of as a weighted composite of several variables. Different raters may define or understand the trait as different weighted combinations. For example, to one rater Intelligence may mean 50% verbal skill and 50% mathematical skill; to another it may mean 33% verbal skill, 33% mathematical skill, and 33% motor skill. Thus their essential definitions of what the trait means differ. Similarity in raters' trait definitions can be assessed with various estimates of the correlation of their ratings, or analogous measures of association.

Category definitions, on the other hand, differ because raters divide the trait into different intervals. For example, by "low skill" one rater may mean subjects from the 1st to the 20th percentile. Another rater, though, may take it to mean subjects from the 1st to the 10th percentile. When this occurs, rater thresholds can usually be adjusted to improve agreement. Similarity of category definitions is reflected as marginal homogeneity between raters. Marginal homogeneity means that the frequencies (or, equivalently, the "base rates") with which two raters use the various rating categories are the same.

Because disagreement on trait definition and disagreement on rating category widths are distinct components of disagreement, with different practical implications, a statistical approach to the data should ideally quantify each separately.

1.6 Keep it simple

All other things being equal, a simpler statistical method is preferable to a more complicated one. Very basic methods can reveal far more about agreement data than is commonly realized. For the most part, advanced methods are complements to, not substitutes for, simple methods.

1.6.1 An example

To illustrate these principles, consider the example of rater agreement on screening mammograms, a diagnostic imaging method for detecting possible breast cancer. Radiologists often score mammograms on a scale such as "no cancer," "benign cancer," "possible malignancy," or "malignancy." Many studies have examined rater agreement in applying these categories to the same set of images.

In choosing a suitable statistical approach, one would first consider theoretical aspects of the data. The trait being measured, degree of evidence for cancer, is continuous. So the actual rating levels would be viewed as somewhat arbitrary discretizations of the underlying trait. A reasonable view is that, in the mind of a rater, the overall weight of evidence for cancer is an aggregate composed of various physical image features and weights attached to each feature. Raters may vary in terms of which features they notice and the weights they associate with each.
One would also consider the purpose of analyzing the data. In this application, the purpose of studying rater agreement is not usually to estimate the accuracy of ratings by a single rater. That can be done directly in a validity study, which compares ratings to a definitive diagnosis made from a biopsy. Instead, the aim is more to understand the factors that cause raters to disagree, with an ultimate goal of improving their consistency and accuracy.

For this, one should separately assess whether raters have the same definition of the basic trait (that is, whether different raters weight the various image features similarly) and whether they have similar widths for the various rating levels. The former can be accomplished with, for example, latent trait models. Moreover, latent trait models are consistent with the theoretical assumptions about the data noted above. Raters' rating category widths can be studied by visually representing raters' rates of use of the different rating levels and/or their thresholds for the various levels, and by statistically comparing them with tests of marginal homogeneity. Another possibility would be to examine whether some raters are biased, such that they make generally higher or lower ratings than other raters. One might also note which images are the subject of the most disagreement and then try to identify the specific image features that cause the disagreement.

Such steps can help one identify specific ways to improve ratings. For example, raters who seem to define the trait much differently than other raters, or who use a particular category too often, can have this pointed out to them, and this feedback may promote their making ratings in a way more consistent with other raters.

1.7 Recommended Methods

This section suggests statistical methods suitable for various levels of measurement, based on the principles outlined above. These are general guidelines only--it follows from the discussion that no one method is best for all applications. But these suggestions will at least give the reader an idea of where to start.

1.7.1 Dichotomous data

Two raters
• Assess raw agreement, overall and specific to each category.
• Use Cohen's kappa: (a) from its p-value, establish that agreement exceeds that expected under the null hypothesis of random ratings; (b) interpret the magnitude of kappa as an intraclass correlation. If different raters are used for different subjects, use the Scott/Fleiss kappa instead of Cohen's kappa. Alternatively, calculate the intraclass correlation directly instead of a kappa statistic.
• Use McNemar's test to evaluate marginal homogeneity.
• Use the tetrachoric correlation coefficient if its assumptions are sufficiently plausible.
• Possibly test association between raters with the log odds ratio.
Multiple raters
• Assess raw agreement, overall and specific to each category.
• Calculate the appropriate intraclass correlation for the data. If different raters are used for each subject, an alternative is the Fleiss kappa.
• If the trait being rated is assumed to be latently discrete, consider use of latent class models.
• If the trait being rated can be interpreted as latently continuous, latent trait models can be used to assess association among raters and to estimate the correlation of ratings with the true trait; these models can also be used to assess marginal homogeneity.
• In some cases latent class and latent trait models can be used to estimate the accuracy (e.g., sensitivity and specificity) of diagnostic ratings even when a 'gold standard' is lacking.
1.7.2 Ordered-category data

Two raters

• Use weighted kappa with Fleiss-Cohen (quadratic) weights; note that quadratic weights are not the default in SAS, and you must specify (WT=FC) with the AGREE option in PROC FREQ. Alternatively, estimate the intraclass correlation.
• Ordered rating levels often imply a latently continuous trait; if so, measure association between the raters with the polychoric correlation or one of its generalizations.
• Test overall marginal homogeneity using the Stuart-Maxwell test or the Bhapkar test. Test (a) for differences in rater thresholds associated with each rating category and (b) for a difference in the raters' overall bias using the respectively applicable McNemar tests.
• Optionally, use graphical displays to visually compare the proportion of times raters use each category (base rates).
• Consider association models and related methods for ordered-category data (see Agresti A., Categorical Data Analysis, New York: Wiley, 2002).

Multiple raters

• Estimate the intraclass correlation.
• Test for differences in rater bias using ANOVA or the Friedman test.
• Use latent trait analysis as a multi-rater generalization of the polychoric correlation. Latent trait models can also be used to test for differences among raters in individual rating category thresholds.
• Graphically examine and compare rater base rates and/or thresholds for the various rating categories.
• Alternatively, consider each pair of raters and proceed as described for two raters.
1.7.3 Nominal data

Two raters

• Assess raw agreement, overall and specific to each category.
• Use the p-value of Cohen's unweighted kappa to verify that raters agree more than chance alone would predict. Often (perhaps usually), disregard the actual magnitude of kappa here; it is problematic with nominal data because ordinarily one can neither assume that all types of disagreement are equally serious (unweighted kappa) nor choose an objective set of differential disagreement weights (weighted kappa). If, however, it is genuinely true that all pairs of rating categories are equally "disparate," then the magnitude of Cohen's unweighted kappa can be interpreted as a form of intraclass correlation.
• Test overall marginal homogeneity using the Stuart-Maxwell test or the Bhapkar test. Test marginal homogeneity relative to individual categories using McNemar tests.
• Consider use of latent class models. Another possibility is use of loglinear, association, or quasi-symmetry models.

Multiple raters

• Assess raw agreement, overall and specific to each category.
• If different raters are used for different subjects, use the Fleiss kappa statistic; again, as with nominal data/two raters, attend only to the p-value of the test unless one has a genuine basis for regarding all pairs of rating categories as equally "disparate."
• Use latent class modeling. Conditional tests of marginal homogeneity can be made within the context of latent class modeling.
• Use graphical displays to visually compare the proportion of times raters use each category (base rates).
• Alternatively, consider each pair of raters individually and proceed as described for two raters.
1.7.4 Likert-type items

Very often, Likert-type items can be assumed to produce interval-level data. (By a "Likert-type item" here we mean one where the format clearly implies to the rater that rating levels are evenly spaced, such as the following.)

    lowest                                          highest
      |-------|-------|-------|-------|-------|-------|
      1       2       3       4       5       6       7
                 (circle the level that applies)

Two raters

• Assess association among raters using the regular Pearson correlation coefficient.
• Test for differences in rater bias using the t-test for dependent samples.
• Possibly estimate the intraclass correlation.
• Assess marginal homogeneity as with ordered-category data.
• See also the methods listed in the section Methods for Likert-type or interval-level data.

Multiple raters

• Perform a one-factor common factor analysis; examine/report the correlation of each rater with the common factor (for details, see the section Methods for Likert-type or interval-level data).
• Test for differences in rater bias using two-way ANOVA models.
• Possibly estimate the intraclass correlation.
• Use histograms to describe raters' marginal distributions.
• If greater detail is required, consider each pair of raters and proceed as described for two raters.
2. Raw Agreement Indices

2.0 Introduction

Much neglected, raw agreement indices are important descriptive statistics. They have unique common-sense value. A study that reports only simple agreement rates can be very useful; a study that omits them but reports complex statistics may fail to inform readers at a practical level. Raw agreement measures and their calculation are explained below. We examine first the case of agreement between two raters on dichotomous ratings.

2.1 Two Raters, Dichotomous Ratings

Consider the ratings of two raters (or experts, judges, diagnostic procedures, etc.) summarized by Table 1:
Table 1. Summary of dichotomous ratings by two raters

                        Rater 2
                   +       -      total
    Rater 1   +    a       b       a+b
              -    c       d       c+d
           total  a+c     b+d       N
The values a, b, c and d here denote the observed frequencies for each possible combination of ratings by Rater 1 and Rater 2.

2.2 Proportion of overall agreement

The proportion of overall agreement (po) is the proportion of cases for which Raters 1 and 2 agree. That is:

    po = (a + d) / (a + b + c + d) = (a + d) / N.    (1)
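To make the arithmetic concrete, here is a minimal Python sketch (not part of the original text) that computes po from hypothetical cell counts:

    # Hypothetical 2x2 counts: a = both raters '+', b = Rater 1 '+'/Rater 2 '-',
    # c = Rater 1 '-'/Rater 2 '+', d = both raters '-'
    a, b, c, d = 40, 9, 6, 45

    N = a + b + c + d
    po = (a + d) / N          # Eq. (1): proportion of overall agreement
    print(f"po = {po:.3f}")   # -> po = 0.850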
This proportion is informative and useful but, taken by itself, has limitations. One is that it does not distinguish between agreement on positive ratings and agreement on negative ratings. Consider, for example, an epidemiological application where a positive rating corresponds to a positive diagnosis for a very rare disease--one, say, with a prevalence of 1 in 1,000,000. Here we might not be much impressed if po is very high--even above .99. This result would be due almost entirely to agreement on disease absence; we are not directly informed as to whether diagnosticians agree on disease presence.

Further, one may consider Cohen's (1960) criticism of po: that it can be high even with hypothetical raters who randomly guess on each case according to probabilities equal to the observed base rates. In this example, if both raters simply guessed "negative" the large majority of times, they would usually agree on the diagnosis. Cohen proposed to remedy this by comparing po to a corresponding quantity, pc, the proportion of agreement expected from raters who randomly guess. As described on the kappa coefficients page, this logic is questionable; in particular, it is not clear what advantage there is in comparing an actual level of agreement, po, with a hypothetical value, pc, which would occur under an obviously unrealistic model. A much simpler way to address this issue is described immediately below.

2.3 Positive agreement and negative agreement

We may also compute observed agreement relative to each rating category individually. Generically, the resulting indices are called the proportions of specific agreement (Spitzer & Fleiss, 1974). With binary ratings, there are two such indices, positive agreement (PA) and negative agreement (NA). They are calculated as follows:

    PA = 2a / (2a + b + c);    NA = 2d / (2d + b + c).    (2)
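Continuing the hypothetical counts from the sketch above, PA and NA of Eq. (2) are computed as:

    a, b, c, d = 40, 9, 6, 45          # same hypothetical 2x2 counts as above
    PA = 2 * a / (2 * a + b + c)       # Eq. (2): positive agreement
    NA = 2 * d / (2 * d + b + c)       # Eq. (2): negative agreement
    print(f"PA = {PA:.3f}, NA = {NA:.3f}")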
PA, for example, estimates the conditional probability that, given that one of the raters (randomly selected) makes a positive rating, the other rater will also do so.

A joint consideration of PA and NA addresses the potential concern that, when base rates are extreme, po is liable to chance-related inflation or bias. Such inflation, if it exists at all, would affect only the more frequent category. Thus, if both PA and NA are satisfactorily large, there is arguably less need or purpose in comparing actual to chance-predicted agreement using a kappa statistic. But in any case, PA and NA provide more information relevant to understanding and improving ratings than a single omnibus index (see Cicchetti and Feinstein, 1990).

2.4 Significance, standard errors, interval estimation
2.4.1 Proportion of overall agreement

Statistical significance. In testing the significance of po, the null hypothesis is that raters are independent, with their marginal assignment probabilities equal to the observed marginal proportions. For a 2×2 table, the test is the same as a usual test of statistical independence in a contingency table. Any of the following could potentially be used:

• a test of a nonzero kappa coefficient
• a test of a nonzero log-odds ratio
• a Pearson chi-squared (X²) or likelihood-ratio chi-squared (G²) test of independence
• the Fisher exact test
• a test of fit of a loglinear model with main effects only

A potential advantage of a kappa significance test is that the magnitude of kappa can be interpreted as approximately an intraclass correlation coefficient. All of these tests, except the last, can be done with SAS PROC FREQ.

Standard error. One can use standard methods applicable to proportions to estimate the standard error and confidence limits of po. For a sample size N, the standard error of po is:

    SE(po) = sqrt[po(1 - po)/N]    (3.1)
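A sketch of Eq. (3.1), together with the Wald confidence limits described in the next paragraph (same hypothetical counts; the 1.96 multiplier corresponds to a 95% range):

    import math

    a, b, c, d = 40, 9, 6, 45
    N = a + b + c + d
    po = (a + d) / N
    SE_po = math.sqrt(po * (1 - po) / N)   # Eq. (3.1)
    CL = po - 1.96 * SE_po                 # lower Wald limit
    CU = po + 1.96 * SE_po                 # upper Wald limit
    print(f"SE(po) = {SE_po:.4f}, 95% CI = ({CL:.3f}, {CU:.3f})")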
One can alternatively estimate SE(po) using resampling methods, e.g., the nonparametric bootstrap or the jackknife, as described in the next section.

Confidence intervals. The Wald or "normal approximation" method estimates confidence limits of a proportion as follows:

    CL = po - SE × zcrit    (3.2)
    CU = po + SE × zcrit    (3.3)

where SE here is SE(po) as estimated by Eq. (3.1), CL and CU are the lower and upper confidence limits, and zcrit is the z-value associated with a confidence range with coverage probability crit. For a 95% confidence range, zcrit = 1.96; for a 90% confidence range, zcrit = 1.645. When po is either very large or very small (and especially with small sample sizes), the Wald method may produce confidence limits less than 0 or greater than 1; in this case better approximate methods (see Agresti, 1996), exact methods, or resampling methods (see below) can be used instead.

2.4.2 Positive agreement and negative agreement

Statistical significance. Logically, there is only one test of independence in a 2×2 table; therefore if PA significantly differs from chance, so too would NA, and vice versa. Spitzer and Fleiss (1974) described kappa tests for specific rating levels; in a 2×2 table there are two such "specific kappas," but both have the same value and statistical significance as the overall kappa.

Standard errors.
As shown by Mackinnon (2000, p. 130), asymptotic (large-sample) standard errors of PA and NA are estimated by the following formulas:

    SE(PA) = sqrt[4a(c + b)(a + c + b)] / (2a + b + c)^2    (3.4)
    SE(NA) = sqrt[4d(c + b)(d + c + b)] / (2d + b + c)^2    (3.5)

Alternatively, one can estimate standard errors using the nonparametric bootstrap or the jackknife. These are described with reference to PA as follows:

• With the nonparametric bootstrap (Efron & Tibshirani, 1993), one constructs a large number of simulated data sets of size N by sampling with replacement from the observed data. For a 2×2 table, this can be done simply by using random numbers to assign simulated cases to cells with probabilities a/N, b/N, c/N and d/N (with large N, however, more efficient algorithms are preferable). One then computes the proportion of positive agreement for each simulated data set--which we denote PA*. The standard deviation of PA* across all simulated data sets estimates the standard error SE(PA). (A worked sketch follows the confidence-interval list below.)
• The delete-1 jackknife (Efron, 1982) works by calculating PA for four alternative tables, where 1 is subtracted in turn from each of the four cells of the original 2×2 table. A few simple calculations then provide an estimate of the standard error SE(PA).
• The delete-1 jackknife requires less computation, but the nonparametric bootstrap is usually considered more accurate.
Confidence intervals.

• Asymptotic confidence limits for PA and NA can be obtained as in Eqs. (3.2) and (3.3), substituting PA and NA for po and using the asymptotic standard errors given by Eqs. (3.4) and (3.5).
• Alternatively, the bootstrap can be used. Again, we describe the method for PA. As with bootstrap standard error estimation, one generates a large number (e.g., 100,000) of simulated data sets, computing an estimate PA* for each one. Results are then sorted by increasing value of PA*. Confidence limits of PA are obtained with reference to the percentiles of this ranking. For example, the 95% confidence range of PA is estimated by the values of PA* that correspond to the 2.5 and 97.5 percentiles of this distribution.
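A minimal sketch of the bootstrap procedures just described, assuming numpy is available (cell counts hypothetical; the number of replicates is kept modest here):

    import numpy as np

    rng = np.random.default_rng(0)
    a, b, c, d = 40, 9, 6, 45
    N = a + b + c + d

    def positive_agreement(a, b, c, d):
        return 2 * a / (2 * a + b + c)

    # Nonparametric bootstrap: draw B simulated 2x2 tables by assigning N cases
    # to cells with probabilities a/N, b/N, c/N, d/N
    B = 10_000
    tables = rng.multinomial(N, [a / N, b / N, c / N, d / N], size=B)
    PA_star = np.array([positive_agreement(*t) for t in tables])

    SE_PA = PA_star.std(ddof=1)                    # bootstrap estimate of SE(PA)
    lo, hi = np.percentile(PA_star, [2.5, 97.5])   # 95% percentile confidence limits
    print(f"SE(PA) = {SE_PA:.4f}, 95% CI = ({lo:.3f}, {hi:.3f})")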
An advantage of bootstrapping is that one can use the same simulated data sets to estimate not only the standard errors and confidence limits of PA and NA, but also those of po or any other statistic defined for the 2×2 table. A SAS program to estimate the asymptotic standard errors and asymptotic confidence limits of PA and NA has been written. For a free standalone program that supplies both bootstrap and asymptotic standard errors and confidence limits, please email the author. Readers are referred to Graham and Bull (1998) for fuller coverage of this topic, including a comparison of different methods for estimating confidence intervals for PA and NA.

2.5 Two Raters, Polytomous Ratings

We now consider results for two raters making polytomous (either ordered-category or purely nominal) ratings. Let C denote the number of rating categories or levels. Results for the two raters may be summarized as a C × C table such as Table 2.
Table 2. Summary of polytomous ratings by two raters

                          Rater 2
                 1      2     ...     C     total
    Rater 1  1  n11    n12    ...    n1C     n1.
             2  n21    n22    ...    n2C     n2.
             .   .      .     ...     .       .
             C  nC1    nC2    ...    nCC     nC.
          total n.1    n.2    ...    n.C      N
Here nij denotes the number of cases assigned rating category i by Rater 1 and category j by Rater 2, with i, j = 1, ..., C. When a "." appears in a subscript, it denotes a marginal sum over the corresponding index; e.g., ni. is the sum of nij for j = 1, ..., C, or the row marginal sum for category i; n.. = N denotes the total number of cases.

2.6 Overall Agreement

For this design, po is the sum of the frequencies on the main diagonal of table {nij} divided by the sample size, or

    po = (1/N) SUM(i=1 to C) nii    (4)

Statistical significance

• One may test the statistical significance of po with Cohen's kappa. If kappa is significant/nonsignificant, then po may be assumed significant/nonsignificant, and vice versa. Note that the numerator of kappa is the difference between po and the level of agreement expected under the null hypothesis of statistical independence.
• The parametric bootstrap can also be used to test statistical significance. This is like the nonparametric bootstrap already described, except that samples are generated from the null hypothesis distribution. Specifically, one constructs many--say 5000--simulated samples of size N from the probability distribution {πij}, where
    πij = ni. n.j / N    (5)

and then tabulates overall agreement, denoted p*o, for each simulated sample. The po for the actual data is considered statistically significant if it exceeds a specified percentage (e.g., 95%) of the p*o values. If one already has a computer program for nonparametric bootstrapping, only slight modifications are needed to adapt it to perform a parametric bootstrap significance test.

Standard error and confidence limits. Here the standard error and confidence intervals of po can again be calculated with the methods described for 2×2 tables.

2.7 Specific agreement

With respect to Table 2, the proportion of agreement specific to category i is:

    ps(i) = 2nii / (ni. + n.i)    (6)
Statistical significance

Eq. (6) amounts to collapsing the C × C table into a 2×2 table relative to category i, considering this category a 'positive' rating, and then computing the positive agreement (PA) index of Eq. (2). This is done for each category i successively. In each reduced table one may perform a test of statistical independence using Cohen's kappa, the odds ratio, or chi-squared, or use a Fisher exact test.

Standard errors and confidence limits

• Again, for each category i, we may collapse the original C × C table into a 2×2 table, taking i as the 'positive' rating level. The asymptotic standard error formula Eq. (3.4) for PA may then be used, and the Wald confidence limits given by Eqs. (3.2) and (3.3) may be computed.
• Alternatively, one can use the nonparametric bootstrap to estimate standard errors and/or confidence limits. Note that this does not require a successive collapsing of the original table.
• The delete-1 jackknife can be used to estimate standard errors, but this does require successive collapsings of the C × C table.
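The following sketch (with a made-up 3×3 table; numpy assumed available) computes Eq. (4) and the category-specific agreement of Eq. (6):

    import numpy as np

    # Hypothetical 3x3 table of counts: rows = Rater 1, columns = Rater 2
    n = np.array([[20,  5,  1],
                  [ 4, 30,  6],
                  [ 2,  3, 29]])

    N = n.sum()
    po = np.trace(n) / N                       # Eq. (4): overall agreement
    row, col = n.sum(axis=1), n.sum(axis=0)    # ni. and n.i
    ps = 2 * np.diag(n) / (row + col)          # Eq. (6): one value per category
    print(f"po = {po:.3f}", " ps(i) =", np.round(ps, 3))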
2.8 Generalized Case

We now consider generalized formulas for the proportions of overall and specific agreement. They apply to binary, ordered-category, or nominal ratings and permit any number of raters, with potentially different numbers of raters or different raters for each case.

2.9 Specific agreement
Let there be K rated cases indexed by k = 1, ..., K. The ratings made on case k are summarized as:

    {njk} (j = 1, ..., C) = {n1k, n2k, ..., nCk}

where njk is the number of times category j (j = 1, ..., C) is applied to case k. For example, if a case k is rated five times and receives ratings of 1, 1, 1, 2, and 2, then n1k = 3, n2k = 2, and {njk} = {3, 2}. Let nk denote the total number of ratings made on case k; that is,

    nk = SUM(j=1 to C) njk.    (7)

For case k, the number of actual agreements on rating level j is

    njk (njk - 1).    (8)

The total number of agreements specifically on rating level j, across all cases, is

    S(j) = SUM(k=1 to K) njk (njk - 1).    (9)

The number of possible agreements specifically on category j for case k is equal to

    njk (nk - 1)    (10)

and the number of possible agreements on category j across all cases is:

    Sposs(j) = SUM(k=1 to K) njk (nk - 1).    (11)

The proportion of agreement specific to category j is equal to the total number of agreements on category j divided by the total number of opportunities for agreement on category j, or

    ps(j) = S(j) / Sposs(j).    (12)

2.10 Overall agreement

The total number of actual agreements, regardless of category, is equal to the sum of Eq. (9) across all categories, or

    O = SUM(j=1 to C) S(j).    (13)
The total number of possible agreements is

    Oposs = SUM(k=1 to K) nk (nk - 1).    (14)

Dividing Eq. (13) by Eq. (14) gives the overall proportion of observed agreement, or

    po = O / Oposs.    (15)
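A minimal sketch of Eqs. (7)-(15), assuming the data are given as a matrix of per-case counts njk (rows = cases, columns = categories; the counts here are hypothetical):

    import numpy as np

    # njk: rows are cases k, columns are categories j; each entry is the number
    # of times category j was assigned to case k (raters may vary by case)
    njk = np.array([[3, 2, 0],
                    [0, 4, 1],
                    [2, 0, 2],
                    [1, 1, 3]])

    nk = njk.sum(axis=1)                               # Eq. (7): ratings per case
    S = (njk * (njk - 1)).sum(axis=0)                  # Eq. (9): agreements per category
    S_poss = (njk * (nk - 1)[:, None]).sum(axis=0)     # Eq. (11): possible agreements
    ps = S / S_poss                                    # Eq. (12)

    O = S.sum()                                        # Eq. (13)
    O_poss = (nk * (nk - 1)).sum()                     # Eq. (14)
    po = O / O_poss                                    # Eq. (15)
    print("ps(j) =", np.round(ps, 3), f" po = {po:.3f}")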
2.11 Standard errors, interval estimation, significance

The jackknife or, preferably, the nonparametric bootstrap can be used to estimate standard errors of ps(j) and po in the generalized case. The bootstrap is uncomplicated if one assumes cases are independent and identically distributed (iid). In general, this assumption will be acceptable when:

• the same raters rate each case, and either there are no missing ratings or ratings are missing completely at random;
• the raters for each case are randomly sampled and the number of ratings per case is constant or random; or
• in a replicate rating (reproducibility) study, each case is rated by the procedure the same number of times, or else the number of replications for any case is completely random.
In these cases, one may construct each simulated sample by repeated random sampling with replacement from the set of K cases. If cases cannot be assumed iid (for example, if ratings are not missing at random, or, say, a study systematically rotates raters), simple modifications of the bootstrap method--such as two-stage sampling--can be made.

The parametric bootstrap can be used for significance testing. A variation of this method, patterned after the Monte Carlo approach described by Uebersax (1982), is as follows:

    Loop through s, where s indexes simulated data sets
        Loop through all cases k
            Loop through all ratings on case k
                For each actual rating, generate a random simulated rating, chosen such that:
                    Pr(Rating category = j | Rater = i) = base rate of category j for Rater i.
                (If rater identities are unknown, or for a reproducibility study, the total base rate for category j is used.)
            End loop through case k's ratings
        End loop through cases
        Calculate p*o and p*s(j) (and any other statistics of interest) for sample s.
    End main loop

The significance of po, ps(j), or any other statistic calculated is determined with reference to the distribution of corresponding values in the simulated data sets. For example, po is significant at the .05 level (1-tailed) if it exceeds 95% of the p*o values obtained for the simulated data sets.

References

Agresti A. An introduction to categorical data analysis. New York: Wiley, 1996.

Cicchetti DV, Feinstein AR. High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 1990, 43, 551-558.

Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 1960, 20, 37-46.

Cook RJ, Farewell VT. Conditional inference for subject-specific and marginal agreement: two families of agreement measures. Canadian Journal of Statistics, 1995, 23, 333-344.

Efron B. The jackknife, the bootstrap and other resampling plans. Philadelphia: Society for Industrial and Applied Mathematics, 1982.

Efron B, Tibshirani RJ. An introduction to the bootstrap. New York: Chapman and Hall, 1993.

Fleiss JL. Measuring nominal scale agreement among many raters. Psychological Bulletin, 1971, 76, 378-381.

Fleiss JL. Statistical methods for rates and proportions, 2nd ed. New York: John Wiley, 1981.

Graham P, Bull B. Approximate standard errors and confidence intervals for indices of positive and negative agreement. Journal of Clinical Epidemiology, 1998, 51(9), 763-771.

Mackinnon A. A spreadsheet for the calculation of comprehensive statistics for the assessment of diagnostic tests and inter-rater agreement. Computers in Biology and Medicine, 2000, 30, 127-134.
Spitzer R, Fleiss J. A re-analysis of the reliability of psychiatric diagnosis. British Journal of Psychiatry, 1974, 341-347.

Uebersax JS. A design-independent method for measuring the reliability of psychiatric diagnosis. Journal of Psychiatric Research, 1982-1983, 17(4), 335-342.
3. Intraclass Correlation and Related Method

3.0 Introduction

The Intraclass Correlation (ICC) assesses rating reliability by comparing the variability of different ratings of the same subject to the total variation across all ratings and all subjects. The theoretical formula for the ICC is:

    ICC = σ²(b) / [σ²(b) + σ²(w)]    [1]

where σ²(w) is the pooled variance within subjects, and σ²(b) is the variance of the trait between subjects. It is easily shown that σ²(b) + σ²(w) equals the total variance of ratings--i.e., the variance of all ratings, regardless of whether they are for the same subject or not. Hence the interpretation of the ICC as the proportion of total variance accounted for by between-subject variation.

Equation [1] would apply if we knew the true values, σ²(w) and σ²(b). But we rarely do, and must instead estimate them from sample data. For this we wish to use all available information; this adds terms to Equation [1]. For example, σ²(b) is the variance of true trait levels between subjects. Since we do not know a subject's true trait level, we estimate it from the subject's mean rating across the raters who rate the subject. Each mean rating is subject to sampling variation--deviation from the subject's true trait level, or its surrogate, the mean rating that would be obtained from a very large number of raters. Since the actual mean ratings are often based on only two or a few ratings, these deviations are appreciable and inflate the estimate of between-subject variance.

We can estimate the amount of this extra, error variation and correct for it. If all subjects have k ratings, then for the Case 1 ICC (see definition below) the extra variation is estimated as (1/k) s²(w), where s²(w) is the pooled sample estimate of within-subject variance. When all subjects have k ratings, s²(w) equals the average variance of the k ratings of each subject (each calculated using k - 1 as denominator). To get the ICC we then:

1. estimate σ²(b) as [s²(b) - s²(w)/k], where s²(b) is the variance of subjects' mean ratings;
2. estimate σ²(w) as s²(w); and
3. apply Equation [1].

For the various other types of ICCs, different corrections are used, each producing its own equation. Unfortunately, these formulas are usually expressed in their computational form--with terms arranged in a way that facilitates calculation--rather than their derivational form, which would make clear the nature and rationale of the correction terms.
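As an illustration of the estimation steps above (a sketch only; the ratings are hypothetical and every subject is assumed to have the same number k of ratings), the Case 1 ICC for a single rating can be computed directly:

    import numpy as np

    # Rows = subjects, columns = k ratings per subject (raters need not be the same)
    ratings = np.array([[3, 4, 3],
                        [1, 2, 2],
                        [5, 4, 5],
                        [2, 2, 3],
                        [4, 5, 4]], dtype=float)
    k = ratings.shape[1]

    s2_w = ratings.var(axis=1, ddof=1).mean()      # pooled within-subject variance, s^2(w)
    s2_means = ratings.mean(axis=1).var(ddof=1)    # variance of subjects' mean ratings, s^2(b)
    sigma2_b = s2_means - s2_w / k                 # step 1 (can be negative in small samples)
    icc_case1 = sigma2_b / (sigma2_b + s2_w)       # steps 2-3: apply Equation [1]
    print(f"Case 1 ICC (single rating) = {icc_case1:.3f}")

With balanced data this moment-based calculation agrees with the usual one-way ANOVA form of the Case 1 ICC.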
3.1 Different Types of ICC

In their important paper, Shrout and Fleiss (1979) describe three classes of ICC for reliability, which they term Case 1, Case 2 and Case 3. Each Case applies to a different rater agreement study design.

    Case 1: Raters for each subject are selected at random.
    Case 2: The same raters rate each case. These are a random sample.
    Case 3: The same raters rate each case. These are the only raters.
Case 1. One has a pool of raters. For each subject, one randomly samples from the rater pool k different raters to rate this subject. Therefore the raters who rate one subject are not necessarily the same as those who rate another. This design corresponds to a 1-way Analysis of Variance (ANOVA) in which Subject is a random effect, and Rater is viewed as measurement error.

Case 2. The same set of k raters rate each subject. This corresponds to a fully-crossed (Rater × Subject), 2-way ANOVA design in which both Subject and Rater are separate effects. In Case 2, Rater is considered a random effect; this means the k raters in the study are considered a random sample from a population of potential raters. The Case 2 ICC estimates the reliability of the larger population of raters.

Case 3. This is like Case 2--a fully-crossed, 2-way ANOVA design. But here one estimates the ICC that applies only to the k raters in the study. Since this does not permit generalization to other raters, the Case 3 ICC is not often used.

Shrout and Fleiss (1979) also show that for each of the three Cases above, one can use the ICC in two ways:

• to estimate the reliability of a single rating, or
• to estimate the reliability of the mean of several ratings.

For each of the Cases, then, there are two forms, producing a total of six different versions of the ICC.

3.2 Pros and Cons

3.2.1 Pros
• Flexible. The ICC, and more broadly, ANOVA analysis of ratings, is very flexible. Besides the six ICCs discussed above, one can consider more complex designs, such as a grouping factor among raters (e.g., experts vs. nonexperts), or covariates. See Landis and Koch (1977a,b) for examples.

• Software. Software to estimate the ICC is readily available (e.g., SPSS and SAS). Output from most any ANOVA software will contain the values needed to calculate the ICC.
• Reliability of mean ratings. The ICC allows estimation of the reliability of both single and mean ratings. "Prophecy" formulas let one predict the reliability of mean ratings based on any number of raters.

• Combines information about bias and association. An alternative to the ICC for Cases 2 and 3 is to calculate the Pearson correlation between all pairs of raters. The Pearson correlation measures association between raters, but is insensitive to rater mean differences (bias). The ICC decreases in response to both lower correlation between raters and larger rater mean differences. Some may see this as an advantage, but others (see Cons) as a limitation.

• Number of categories. The ICC can be used to compare the reliability of different instruments. For example, the reliability of a 3-level rating scale can be compared to the reliability of a 5-level scale (provided they are assessed relative to the same sample or population; see Cons).

3.2.2 Cons
• Comparability across populations. The ICC is strongly influenced by the variance of the trait in the sample/population in which it is assessed. ICCs measured for different populations might not be comparable. For example, suppose one has a depression rating scale. When applied to a random sample of the adult population, the scale might have a high ICC. However, if the scale is applied to a very homogeneous population--such as patients hospitalized for acute depression--it might have a low ICC. This is evident from the definition of the ICC as σ²(b) / [σ²(b) + σ²(w)]. In both populations above, σ²(w), the variance of different raters' opinions of the same subject, may be the same. But the between-subject variance, σ²(b), may be much smaller in the clinical population than in the general population. Therefore the ICC would be smaller in the clinical population. The same instrument may thus be judged "reliable" or "unreliable," depending on the population in which it is assessed. This issue is similar to, and just as much a concern as, the "base rate" problem of the kappa coefficient. It means that:

  1. One cannot compare ICCs for samples or populations with different between-subject variance; and
  2. The often-reproduced table which shows specific ranges for "acceptable" and "unacceptable" ICC values should not be used.

  For more discussion of the implications of this topic, see The Comparability Issue below.
• Assumes equal spacing. To use the ICC with ordered-category ratings, one must assign the rating categories numeric values. Usually categories are assigned values 1, 2, ..., C, where C is the number of rating categories; this assumes all categories are equally wide, which may not be true. An alternative is to assign ordered categories numeric values from their cumulative frequencies via probit (for a normally distributed trait) or ridit (for a rectangularly distributed trait) scoring; see Fleiss (1981).

• Association vs. bias. The ICC combines, or some might say confounds, two ways in which raters differ: (1) association, which concerns whether the raters understand the meaning of the trait in the same way, and (2) bias, which concerns whether some raters' mean ratings are higher or lower than others'. If a goal is to give feedback to raters to improve future ratings, one should distinguish between these two sources of disagreement. For discussion of alternatives that separate these components, see the Likert Scale page of this website.

• Reliability vs. agreement. With ordered-category or Likert-type data, the ICC discounts the fact that we have a natural unit for evaluating rating consistency: the number or percent of agreements on each rating category. Raw agreement is simple, intuitive, and clinically meaningful. With ordered-category data, it is not clear why one would prefer the ICC to raw agreement rates, especially in light of the comparability issue discussed below. A good idea is to report reliability using both the ICC and raw agreement rates.
3.3 The Comparability Issue

Above it was noted that the ICC is strongly dependent on the trait variance within the population for which it is measured. This can complicate comparisons of ICCs measured in different populations, or generalization of results from a single population. Some suggest avoiding this problem by eliminating or holding constant the "problematic" term, σ²(b). Holding the term constant would mean choosing some fixed value for σ²(b) and using it in place of the different value estimated in each population. For example, one might pick as σ²(b) the trait variance in the general adult population--regardless of what population the ICC is measured in.
However, if one is going to hold σ²(b) constant, one may well question using it at all! Why not simply report as the index of unreliability the value of σ²(w) for a study? Indeed, this has been suggested, though it is not much used in practice. But if one is going to disregard σ²(b) because it complicates comparisons, why not go a step further and express reliability simply as raw agreement rates--for example, the percent of times two raters agree on the exact same category, and the percent of times they are within one level of one another?

An advantage of including σ²(b) is that it automatically controls for the scaling factor of an instrument. Thus (at least within the same population), ICCs for instruments with different numbers of categories can be meaningfully compared. Such is not the case with raw agreement measures or with σ²(w) alone. Therefore, someone reporting the reliability of a new scale may wish to include the ICC along with other measures if they expect later researchers to compare their results with those of a new or different instrument with fewer or more categories.
4. Kappa Coefficients

4.0 Summary

There is wide disagreement about the usefulness of kappa statistics to assess rater agreement. At the least, it can be said that (1) kappa statistics should not be viewed as the unequivocal standard or default way to quantify agreement; (2) one should be concerned about using a statistic that is the source of so much controversy; and (3) one should consider alternatives and make an informed choice.

One can distinguish between two possible uses of kappa: as a way to test rater independence (i.e., as a test statistic), and as a way to quantify the level of agreement (i.e., as an effect-size measure). The first use involves testing the null hypothesis that there is no more agreement than might occur by chance given random guessing; that is, one makes a qualitative, "yes or no" decision about whether raters are independent or not. Kappa is appropriate for this purpose (although to know that raters are not independent is not very informative; raters are dependent by definition, inasmuch as they are rating the same cases).

It is the second use of kappa--quantifying actual levels of agreement--that is the source of concern. Kappa's calculation uses a term called the proportion of chance (or expected) agreement. This is interpreted as the proportion of times raters would agree by chance alone. However, the term is relevant only under the condition of statistical independence of raters. Since raters are clearly not independent, the relevance of this term, and its appropriateness as a correction to actual agreement levels, is very questionable. Thus, the common statement that kappa is a "chance-corrected measure of agreement" is misleading. As a test statistic, kappa can verify that agreement exceeds chance levels. But as a measure of the level of agreement, kappa is not "chance-corrected"; indeed, in the absence of some explicit model of rater decision-making, it is by no means clear how chance affects the decisions of actual raters and how one might correct for it.

A better case for using kappa to quantify rater agreement is that, under certain conditions, it approximates the intraclass correlation. But this too is problematic in that (1) these conditions are not always met, and (2) one could instead directly calculate the intraclass correlation.
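For reference, here is a minimal sketch of how Cohen's kappa is computed from a square contingency table, showing the chance-agreement term pc discussed above (the table counts are hypothetical; numpy assumed available):

    import numpy as np

    # Hypothetical 3x3 table of counts: rows = Rater 1, columns = Rater 2
    n = np.array([[20,  5,  1],
                  [ 4, 30,  6],
                  [ 2,  3, 29]])

    N = n.sum()
    po = np.trace(n) / N                               # observed agreement
    pc = (n.sum(axis=1) / N) @ (n.sum(axis=0) / N)     # "chance" agreement from marginal base rates
    kappa = (po - pc) / (1 - pc)
    print(f"po = {po:.3f}, pc = {pc:.3f}, kappa = {kappa:.3f}")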
5. Tests of Marginal Homogeneity

5.0 Introduction

Consider symptom ratings (1 = low, 2 = moderate, 3 = high) by two raters on the same sample of subjects, summarized by a 3×3 table as follows:

Table 1. Summarization of ratings by Rater 1 (rows) and Rater 2 (columns).

            1      2      3
    1      p11    p12    p13    p1.
    2      p21    p22    p23    p2.
    3      p31    p32    p33    p3.
           p.1    p.2    p.3    1.0
Here pij denotes the proportion of all cases assigned to category i by Rater 1 and category j by Rater 2. (The table elements could as easily be frequencies.) The terms p1., p2., and p3. denote the marginal proportions for Rater 1--i.e., the total proportion of times Rater 1 uses categories 1, 2 and 3, respectively. Similarly, p.1, p.2, and p.3 are the marginal proportions for Rater 2. Marginal homogeneity refers to equality (lack of significant difference) between one or more of the row marginal proportions and the corresponding column proportion(s).

Testing marginal homogeneity is often useful in analyzing rater agreement. One reason raters disagree is that they have different propensities to use each rating category. When such differences are observed, it may be possible to provide feedback or improve instructions to make raters' marginal proportions more similar and improve agreement. Differences in raters' marginal rates can be formally assessed with statistical tests of marginal homogeneity (Barlow, 1998; Bishop, Fienberg & Holland, 1975, Ch. 8).

If each rater rates different cases, testing marginal homogeneity is straightforward: one can compare the marginal frequencies of different raters with a simple chi-squared test. However, this cannot be done when different raters rate the same cases--the usual situation in rater agreement studies; then the ratings of different raters are not statistically independent, and this must be accounted for. Several statistical approaches to this problem are available. Alternatives include:
• Nonparametric tests
• Bootstrap methods
• Loglinear, association, and quasi-symmetry models
• Latent trait and related models
These approaches are outlined here.

5.1 Graphical and descriptive methods

Before discussing formal statistical methods, non-statistical methods for comparing raters' marginal distributions should be briefly mentioned. Simple descriptive methods can be very useful. For example, a table might report each rater's rate of use for each category. Graphical methods are especially helpful. A histogram can show the distribution of each rater's ratings across categories. The following example is from the output of the MH program:

    [Figure: "Marginal Distributions of Categories for Rater 1 (**) and Rater 2 (==)"--a paired bar histogram over category levels 1-6; the x-axis is the category number or level, the y-axis is the proportion of cases (maximum shown, 0.304).]

Vertical or horizontal stacked-bar histograms are good ways to summarize the data. With ordered-category ratings, a related type of figure shows the cumulative proportion of cases below each rating level for each rater. An example, again from the MH program, is as follows:
    [Figure: "Proportion of cases below each level"--for each rater, the cumulative proportion of cases below rating levels 1-6, plotted on a 0-1 scale, one line per rater.]
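As a sketch of the kind of display described above (the marginal proportions are made up; matplotlib is assumed to be available):

    import numpy as np
    import matplotlib.pyplot as plt

    categories = np.arange(1, 7)                               # rating levels 1-6
    rater1 = np.array([0.30, 0.25, 0.18, 0.12, 0.09, 0.06])    # hypothetical base rates
    rater2 = np.array([0.22, 0.28, 0.20, 0.14, 0.10, 0.06])

    w = 0.4
    plt.bar(categories - w / 2, rater1, width=w, label="Rater 1")
    plt.bar(categories + w / 2, rater2, width=w, label="Rater 2")
    plt.xlabel("Rating category")
    plt.ylabel("Proportion of cases")
    plt.legend()
    plt.show()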
These are merely examples. Many other ways to graphically compare marginal distributions are possible.

5.2 Nonparametric tests

The main nonparametric test for assessing marginal homogeneity is the McNemar test. The McNemar test assesses marginal homogeneity in a 2×2 table. Suppose, however, that one has an N×N crossclassification frequency table that summarizes ratings by two raters for an N-category rating system. By collapsing the N×N table into various 2×2 tables, one can use the McNemar test to assess marginal homogeneity of each rating category. With ordered-category data one can also collapse the N×N table in other ways to test rater equality of category thresholds, or to test raters for overall bias (i.e., a tendency to make higher or lower ratings than other raters).

The Stuart-Maxwell test can be used to test marginal homogeneity between two raters across all categories simultaneously. It thus complements McNemar tests of individual categories by providing an overall significance value. MH, a computer program for testing marginal homogeneity with these methods, is available online.

These tests are remarkably easy to use and are usually just as effective as more complex methods. Because the tests are nonparametric, they make few or no assumptions about the data. While some of the methods described below are potentially more powerful, this comes at the price of making assumptions which may or may not be true. The simplicity of the nonparametric tests lends persuasiveness to their results. A mild limitation is that these tests apply only to comparisons of two raters. With more than two raters, of course, one can apply the tests for each pair of raters.

5.3 Bootstrapping

Bootstrap and related jackknife methods (Efron, 1982; Efron & Tibshirani, 1993) provide a very general and flexible framework for testing marginal homogeneity. Again, suppose one has an N×N crossclassification frequency table summarizing agreement between two raters on an N-category rating. Using what is termed the nonparametric bootstrap, one would repeatedly sample from this table to produce a large number (e.g., 500) of pseudo-tables, each with the same total frequency as the original table. Various measures of marginal homogeneity would be calculated for each pseudo-table; for example, one might calculate the difference between the row marginal proportion and the column marginal proportion for each category, or construct an overall measure of row vs. column marginal differences. Let d* denote such a measure calculated for a given pseudo-table, and let d denote the same measure calculated for the original table. From the pseudo-tables one can empirically calculate the standard deviation of d*, denoted s(d*). Let d' denote the true population value of d.
Assuming that d' = 0 corresponds to the null hypothesis of marginal homogeneity, one can test this null hypothesis by calculating the z value

    z = d / s(d*)

and determining the significance of the standard normal deviate z by the usual methods (e.g., a table of z-value probabilities).

The method above is merely an example; many variations are possible within the framework of bootstrap and jackknife methods. An advantage of bootstrap and jackknife methods is their flexibility. For example, one could potentially adapt them for simultaneous comparisons among more than two raters. A potential disadvantage of these methods is that the user may need to write a computer program to apply them (a minimal sketch is given below). However, such a program could also be used for other purposes, such as providing bootstrap significance tests and/or confidence intervals for various raw agreement indices.

5.4 Loglinear, association and quasi-symmetry modeling

If one is using a loglinear, association or quasi-symmetry model to analyze agreement data, one can adapt the model to test marginal homogeneity. For each type of model the basic approach is the same. First one estimates a general form of the model--that is, one without assuming marginal homogeneity; let this be termed the "unrestricted model." Next one adds the assumption of marginal homogeneity to the model. This is done by applying equality restrictions to some model parameters so as to require homogeneity of one or more marginal probabilities (Barlow, 1998). Let this be termed the "restricted model." Marginal homogeneity can then be tested using the difference G2 statistic, calculated as

    difference G2 = G2(restricted) - G2(unrestricted),

where G2(restricted) and G2(unrestricted) are the likelihood-ratio chi-squared model fit statistics (Bishop, Fienberg & Holland, 1975) calculated for the restricted and unrestricted models. The difference G2 can be interpreted as a chi-squared value and its significance determined from a table of chi-squared probabilities. The df are equal to the difference in df between the unrestricted and restricted models. A significant value implies that the rater marginal probabilities are not homogeneous.

An advantage of this approach is that one can test marginal homogeneity for one category, several categories, or all categories in a unified way. Another is that, if one is already analyzing the data with a loglinear, association, or quasi-symmetry model, the addition of marginal homogeneity tests may require relatively little extra work.
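As an illustration of the bootstrap approach of Section 5.3, the following minimal Python sketch resamples a two-rater table and computes the z value described there. The 3×3 counts, the choice of measure d (here, the row-minus-column marginal difference for category 1), and the number of replications are all hypothetical assumptions.

```python
# A minimal sketch of the nonparametric bootstrap test of marginal homogeneity (Section 5.3).
# The table and the choice of measure d are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
table = np.array([[20,  5,  2],
                  [ 8, 30,  6],
                  [ 1,  7, 21]], dtype=float)      # hypothetical 3x3 frequencies
n = int(table.sum())

def marg_diff(t, k=0):
    """Measure d: row marginal proportion minus column marginal proportion for category k."""
    p = t / t.sum()
    return p[k, :].sum() - p[:, k].sum()

d = marg_diff(table)

# Resample n cases from the observed cell proportions to form pseudo-tables.
probs = (table / n).ravel()
d_star = []
for _ in range(2000):
    counts = rng.multinomial(n, probs).reshape(table.shape)
    d_star.append(marg_diff(counts))

z = d / np.std(d_star, ddof=1)          # z = d / s(d*), as in the text
print("d =", round(d, 4), " z =", round(z, 2))
```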
A possible limitation is that loglinear, association, and quasi-symmetry models are only well-developed for the analysis of two-way tables. Another is that use of the difference G2 test typically requires that the unrestricted model fit the data, which sometimes might not be the case. For an excellent discussion of these and related models (including linear-by-linear models), see Agresti (2002).

5.5 Latent trait and related models

Latent trait models and related methods, such as the tetrachoric and polychoric correlation coefficients, can be used to test marginal homogeneity for dichotomous or ordered-category ratings. The general strategy is similar to that described for loglinear and related models. That is, one estimates both an unrestricted version of the model and a restricted version that assumes marginal homogeneity, and compares the two models with a difference G2 test. With latent trait and related models, the restricted models are usually constructed by assuming that the thresholds for one or more rating levels are equal across raters.

A variation of this method tests overall rater bias. That is done by estimating a restricted model in which the thresholds of one rater are equal to those of another plus a fixed constant. A comparison of this restricted model with the corresponding unrestricted model tests the hypothesis that the fixed constant, which corresponds to the bias of one rater relative to the other, is 0.

Another way to test marginal homogeneity using latent trait models is with the asymptotic standard errors of the estimated category thresholds. These can be used to estimate the standard error of the difference between the thresholds of two raters for a given category, and this standard error can then be used to test the significance of the observed difference.

An advantage of the latent trait approach is that it can be used to assess marginal homogeneity among any number of raters simultaneously. A disadvantage is that these methods require more computation than the nonparametric tests. If one is only interested in testing marginal homogeneity, the nonparametric methods might be a better choice. However, if one is already using latent trait models for other reasons, such as to estimate the accuracy of individual raters or to estimate the correlation of their ratings, one might also use them to examine marginal homogeneity; even in this case, though, it might be simpler to use the nonparametric tests of marginal homogeneity.

If there are many raters and categories, data may be sparse (i.e., there may be many possible patterns of ratings across raters with 0 observed frequencies). With very sparse data, the difference G2 statistic is no longer distributed as chi-squared, so standard methods cannot be used to determine its statistical significance.

References

Agresti A. Categorical data analysis. New York: Wiley, 2002.
Barlow W. Modeling of categorical agreement. In Armitage P, Colton T (eds.), The encyclopedia of biostatistics, pp. 541-545. New York: Wiley, 1998.
Bishop YMM, Fienberg SE, Holland PW. Discrete multivariate analysis: theory and practice. Cambridge, Massachusetts: MIT Press, 1975.
Efron B. The jackknife, the bootstrap and other resampling plans. Philadelphia: Society for Industrial and Applied Mathematics, 1982.
Efron B, Tibshirani RJ. An introduction to the bootstrap. New York: Chapman and Hall, 1993.
6. The Tetrachoric and Polychoric Correlation Coefficients

6.0 Introduction

This section describes the tetrachoric and polychoric correlation coefficients, explains their meaning and uses, gives examples and references, provides programs for their estimation, and discusses other available software. While the discussion is primarily oriented to rater agreement problems, it is general enough to apply to most other uses of these statistics.

A clear, concise description of the tetrachoric and polychoric correlation coefficients, including issues relating to their estimation, is found in Drasgow (1988). Olsson (1979) is also helpful. What distinguishes the present discussion is the view that the tetrachoric and polychoric correlation models are special cases of latent trait modeling. (This is not a new observation, but it is sometimes overlooked.) Recognizing this opens up important new possibilities. In particular, it allows one to relax the distributional assumptions which are the most limiting feature of the "classical" tetrachoric and polychoric correlation models.

6.0.1 Summary

The tetrachoric correlation (Pearson, 1900), for binary data, and the polychoric correlation, for ordered-category data, are excellent ways to measure rater agreement. They estimate what the correlation between raters would be if ratings were made on a continuous scale; they are, theoretically, invariant over changes in the number or "width" of rating categories. The tetrachoric and polychoric correlations also provide a framework that allows testing of marginal homogeneity between raters. Thus, these statistics let one separately assess both components of rater agreement: agreement on trait definition and agreement on definitions of specific categories.

These statistics make certain assumptions, however. With the polychoric correlation, the assumptions can be tested. The assumptions cannot be tested with the tetrachoric correlation if there are only two raters; in some applications, though, theoretical considerations may justify use of the tetrachoric correlation without a test of model fit.

6.1 Pros and Cons: Tetrachoric and Polychoric Correlation Coefficients

6.1.1 Pros

• These statistics express rater association in a familiar form--a correlation coefficient.
• They provide a way to separately quantify association and similarity of category definitions.
• They do not depend on the number of rating levels; results can be compared across studies where the number of rating levels differs.
• They can be used even if different raters have different numbers of rating levels.
• The assumptions can be easily tested for the polychoric correlation.
• Estimation software is routinely available (e.g., SAS PROC FREQ and PRELIS).
6.1.2 Cons

• Model assumptions not always appropriate--for example, if the latent trait is truly discrete.
• For only two raters, there is no way to test the assumptions of the tetrachoric correlation.
6.2 Intuitive Explanation

Consider the example of two psychiatrists (Raters 1 and 2) making a diagnosis for presence/absence of Major Depression. Though the diagnosis is dichotomous, we allow that depression as a trait is continuously distributed in the population.

[Figure 1 (draft). Latent continuous variable Y (depression severity), with a discretizing threshold t: cases below t are "not depressed," cases above t are "depressed."]

In diagnosing a given case, a rater considers the case's level of depression, Y, relative to some threshold, t: if the judged level is above the threshold, a positive diagnosis is made; otherwise the diagnosis is negative. Figure 2 portrays the situation for two raters. It shows the distribution of cases in terms of depression level as judged by Rater 1 and Rater 2.
Figure 2. Joint distribution (ellipse) of depression severity as judged by two raters (Y1 and Y2), with discretizing thresholds (t1 and t2).

a, b, c and d denote the proportions of cases that fall in each region defined by the two raters' thresholds. For example, a is the proportion below both raters' thresholds and therefore diagnosed negative by both. These proportions correspond to a summary of the data as a 2×2 cross-classification of the raters' ratings.

                  Rater 1
                  -       +
  Rater 2   -     a       b      a+b
            +     c       d      c+d
                 a+c     b+d      1

Figure 3 (draft). Crossclassification proportions for binary ratings by two raters.

Again, a, b, c and d in Figure 3 represent proportions (not frequencies). Once we know the observed cross-classification proportions a, b, c and d for a study, it is a simple matter to estimate the model represented by Figure 2. Specifically, we estimate the locations of the discretizing thresholds, t1 and t2, and a third parameter, rho, which determines the "fatness" of the ellipse. Rho is the tetrachoric correlation, or r*. It can be interpreted here as the correlation between judged disease severity (before application of thresholds) as viewed by Rater 1 and Rater 2.
The principle of estimation is simple: basically, a computer program tries various combinations of t1, t2 and r* until values are found for which the expected proportions for a, b, c and d in Figure 2 are as close as possible to the observed proportions in Figure 3. The parameter values that do so are regarded as (estimates of) the true population values.

The polychoric correlation, used when there are more than two ordered rating levels, is a straightforward extension of the model above. The difference is that there are more thresholds, more regions in Figure 2, and more cells in Figure 3. But again the idea is to find the values for the thresholds and r* that maximize the similarity between model-expected and observed cross-classification proportions.
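The estimation principle just described is easy to prototype. The following is a minimal sketch (assuming NumPy and SciPy are available; it is not TetMat, Tcorr, or any of the programs described later) that fixes the thresholds from the margins and then finds r* by maximum likelihood. The counts coincide with Example 1 of Section 7.3.1, so the output should be close to the estimates reported there.

```python
# A minimal sketch: ML estimation of the tetrachoric correlation for a 2x2 table,
# with thresholds fixed from the marginal proportions (two-step logic).
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import minimize_scalar

# Counts: rows = Rater 1 (-, +), columns = Rater 2 (-, +)
table = np.array([[40, 10],
                  [20, 30]], dtype=float)
n = table.sum()
p = table / n

# Thresholds from the marginal proportions of "negative" ratings
t1 = norm.ppf(p[0, :].sum())   # Rater 1 threshold
t2 = norm.ppf(p[:, 0].sum())   # Rater 2 threshold

def cell_probs(rho):
    """Model-expected cell proportions under a bivariate normal with correlation rho."""
    bvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    p_nn = bvn.cdf([t1, t2])                 # both below threshold: (-,-)
    p_np = norm.cdf(t1) - p_nn               # Rater 1 -, Rater 2 +
    p_pn = norm.cdf(t2) - p_nn               # Rater 1 +, Rater 2 -
    p_pp = 1.0 - p_nn - p_np - p_pn          # both +
    return np.array([[p_nn, p_np], [p_pn, p_pp]])

def neg_loglik(rho):
    return -np.sum(table * np.log(cell_probs(rho)))

res = minimize_scalar(neg_loglik, bounds=(-0.99, 0.99), method="bounded")
print("tetrachoric r* =", round(res.x, 4), " thresholds:", round(t1, 4), round(t2, 4))
```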
7. Detailed Description

7.0 Introduction

In many situations, even though a trait may be continuous, it may be convenient to divide it into ordered levels. For example, for research purposes, one may classify levels of headache pain into the categories none, mild, moderate and severe. Even for traits usually viewed as discrete, one might still consider continuous gradations--for example, people infected with the flu virus exhibit varying levels of symptom intensity.

The tetrachoric and polychoric correlation coefficients are appropriate when the latent trait that forms the basis of ratings can be viewed as continuous. We outline here the measurement model and assumptions for the tetrachoric correlation. The model and assumptions for the polychoric correlation are the same--the only difference is that there are more threshold parameters, corresponding to the greater number of ordered rating levels.

7.1 Measurement Model

We begin with some notation and definitions. Let:

X1 and X2 be the manifest (observed) ratings by Raters (or procedures, diagnostic tests, etc.) 1 and 2; these are discrete-valued variables;
Y1 and Y2 be latent continuous variables associated with X1 and X2; these are the pre-discretized, continuous "impressions" of the trait level, as judged by Raters 1 and 2;
T be the true, latent trait level of a case.

A rating or diagnosis of a case begins with the case's true trait level, T. This information, along with "noise" (random error) and perhaps other information unrelated to the true trait which a given rater may consider (unique variation), leads to each rater's impression of the case's trait level (Y1 and Y2). Each rater applies discretizing thresholds to this judged trait level to yield a dichotomous or ordered-category rating (X1 and X2). Stated more formally, we have

    Y1 = bT + u1 + e1,
    Y2 = bT + u2 + e2,

where b is a regression coefficient, u1 and u2 are the unique components of the raters' impressions, and e1 and e2 represent random error or noise. It turns out that unique variation and error variation behave more or less the same in the model, and the former can be subsumed under the latter. Thus we may consider the simpler model

    Y1 = b1T + e1,
    Y2 = b2T + e2.

The tetrachoric correlation assumes that the latent trait T is normally distributed. As scaling is arbitrary, we specify that T ~ N(0, 1). Error is similarly assumed to be normally distributed
(and independent both between raters and across cases). For reasons we need not pursue here, the model loses no generality by assuming that var(e1) = var(e2). We therefore stipulate that e1, e2 ~ N(0, sigma_e). A consequence of these assumptions is that Y1 and Y2 must also be normally distributed. To fix the scale, we specify that var(Y1) = var(Y2) = 1. It follows that b1 = b2 = b = the correlation of both Y1 and Y2 with the latent trait. We define the tetrachoric correlation, r*, as

    r* = b².

A simple "path diagram" may clarify this:

          b         b
    Y1 <----- T -----> Y2

Figure 4 (draft). Path diagram.

Here b is the path coefficient that reflects the influence of T on both Y1 and Y2. Those familiar with the rules of path analysis will see that the correlation of Y1 and Y2 is simply the product of their degrees of dependence on T--that is, b². As an aside, one might consider that the value of b is interesting in its own right, inasmuch as it offers a measure of the association of ratings with the true latent trait--i.e., a measure of rating validity or accuracy.

The tetrachoric correlation r* is readily interpretable as a measure of the association between the ratings of Rater 1 and Rater 2. Because it estimates the correlation that exists between the pre-discretized judgements of the raters, it is, in theory, not affected by (1) the number of rating levels, or (2) the marginal proportions for rating levels (i.e., the "base rates"). The fact that this association is expressed in the familiar form of a correlation is also helpful.

The assumptions of the tetrachoric correlation coefficient may be expressed as follows:

1. The trait on which ratings are based is continuous.
2. The latent trait is normally distributed.
3. Rating errors are normally distributed.
4. Var(e) is homogeneous across levels of T.
5. Errors are independent between raters.
6. Errors are independent between cases.

Assumptions 1-4 can alternatively be expressed as the assumption that Y1 and Y2 follow a bivariate normal distribution.
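The path-analysis claim that corr(Y1, Y2) = b² is easy to check by simulation. The following minimal sketch is not part of the original text; the value of b and the sample size are hypothetical.

```python
# A minimal simulation check: under Y1 = b*T + e1, Y2 = b*T + e2 with var(Y) = 1,
# the correlation of Y1 and Y2 equals b^2 (= r*).
import numpy as np

rng = np.random.default_rng(0)
n, b = 200_000, 0.8                       # hypothetical sample size and loading
sigma_e = np.sqrt(1.0 - b**2)             # chosen so that var(Y1) = var(Y2) = 1

T = rng.standard_normal(n)                # latent trait, T ~ N(0, 1)
Y1 = b * T + sigma_e * rng.standard_normal(n)
Y2 = b * T + sigma_e * rng.standard_normal(n)

print("empirical corr(Y1, Y2):", round(np.corrcoef(Y1, Y2)[0, 1], 3))
print("b^2 =", round(b**2, 3))            # the two values should agree closely
```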
We will assume that one has sufficient theoretical understanding of the application to accept the assumption of latent continuity. The second assumption--that of a normal distribution for T--is potentially more questionable. Absolute normality, however, is probably not necessary; a unimodal, roughly symmetrical distribution may be close enough. Also, the model implicitly allows for a monotonic transformation of the latent continuous variables. That is, a more exact way to express Assumptions 1-4 is that one can obtain a bivariate normal distribution by some monotonic transformation of Y1 and Y2.

The model assumptions can be tested for the polychoric correlation. This is done by comparing the observed numbers of cases for each combination of rating levels with those predicted by the model, using the likelihood-ratio chi-squared statistic, G2 (Bishop, Fienberg & Holland, 1975), which is similar to the usual Pearson chi-squared test (the Pearson chi-squared test can also be used; for more information on these tests, see the FAQ on testing model fit at the Latent Class Analysis web site). The G2 test is assessed by considering the associated p value, with the appropriate degrees of freedom (df). The df are given by

    df = RC - R - C,

where R is the number of levels used by the first rater and C is the number of levels used by the second rater. For example, with R = C = 4 the test has df = 16 - 4 - 4 = 8. As this is a "goodness-of-fit" test, it is standard practice to set the alpha level fairly high (e.g., .10). A p value below the alpha level is evidence against the model; a p value above it is consistent with adequate model fit. For the tetrachoric correlation R = C = 2, and there are no df with which to test the model. It is possible to test the model, though, when there are more than two raters.
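A minimal sketch of this fit test, assuming one already has the observed frequency table and the table of model-expected frequencies from a fitted polychoric model (both hypothetical inputs here):

```python
# A minimal sketch of the G2 goodness-of-fit test described above.
# Applies to the polychoric case (more than two levels), where df = RC - R - C > 0.
import numpy as np
from scipy.stats import chi2

def g2_fit(observed, expected):
    """Return (G2, df, p value) comparing observed and model-expected frequencies."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    mask = observed > 0                      # treat 0 * log(0) as 0
    g2 = 2.0 * np.sum(observed[mask] * np.log(observed[mask] / expected[mask]))
    r, c = observed.shape
    df = r * c - r - c
    return g2, df, chi2.sf(g2, df)

# Usage (hypothetical): g2, df, p = g2_fit(obs_table, fitted_table)
```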
7.2 Using the Polychoric Correlation to Measure Agreement

Here are the steps one might follow to use the tetrachoric or polychoric correlation to assess agreement in a study. For convenience, we refer mainly to the polychoric correlation, which includes the tetrachoric correlation as a special case.

i) Calculate the value of the polychoric correlation. For this a computer program, such as those described in the software section, is required.

ii) Evaluate model fit. The next step is to determine whether the assumptions of the polychoric correlation are empirically valid. This is done with the goodness-of-fit test described previously, which compares observed crossclassification frequencies to model-predicted frequencies. As noted, this test cannot be done for the tetrachoric correlation. PRELIS includes a test of model fit when estimating the polychoric correlation. It is unknown whether SAS PROC FREQ includes such a test.
iii) Assess the magnitude and significance of the correlation. Assuming that model fit is acceptable, the next step is to note the magnitude of the polychoric correlation. Its value is interpreted in the same way as a Pearson correlation. As the value approaches 1.0, more agreement on the trait definition is indicated. Values near 0 indicate little agreement on the trait definition.

One may wish to test the null hypothesis of no correlation between raters. There are at least two ways to do this. The first makes use of the estimated standard error of the polychoric correlation under the null hypothesis of r* = 0. At least for the tetrachoric correlation, there is a simple closed-form expression for this standard error (Brown, 1977). Knowing this value, one may calculate a z value as

    z = r* / sigma_r*(0),

where the denominator is the standard error of r* when r* = 0. One may then assess statistical significance by evaluating the z value in terms of the associated tail probabilities of the standard normal curve.

The second method is via a chi-squared test. If r* = 0, the polychoric correlation model is the same as the model of statistical independence. It therefore seems reasonable to test the null hypothesis of r* = 0 by testing the statistical independence model. Either the Pearson (X2) or likelihood-ratio (G2) chi-squared statistic can be used to test the independence model. The df for either test is (R - 1)(C - 1). A significant chi-squared value implies that r* is not equal to 0.

[I now question whether the above is correct. For the polychoric correlation, data may fail the test of independence even when r* = 0 (i.e., there may be some other kind of 'structure' to the data). If so, a better alternative would be to calculate a difference G2 statistic as G2(H0) - G2(H1), where G2(H0) is the likelihood-ratio chi-squared for the independence model and G2(H1) is the likelihood-ratio chi-squared for the polychoric correlation model. The difference G2 can be evaluated as a chi-squared value with 1 df. -- JSU, 27 Jul 00]

iv) Test equality of thresholds. Equality of thresholds between raters can be tested by estimating what may be termed a threshold-constrained polychoric correlation. That is, one estimates the polychoric correlation with the added constraint(s) that the threshold(s) of Rater 1 are the same as Rater 2's threshold(s). A difference G2 test is then made comparing the G2 statistic for this constrained model with the G2 for the unconstrained polychoric correlation model. The difference G2 statistic is evaluated as a chi-squared value with df = R - 1, where R is the number of rating levels (this test applies only when both raters use the same number of rating levels).
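Both checks in steps iii) and iv) reduce to a few lines once the fitted quantities are available. A minimal sketch, with the correlation, its null standard error, and the G2 values as hypothetical inputs taken from one's fitting software:

```python
# A minimal sketch of the significance tests in steps iii) and iv). All inputs are hypothetical.
from scipy.stats import norm, chi2

# Step iii), first method: z test of r* = 0
r_star = 0.42
se_r0 = 0.11                      # standard error of r* under r* = 0
z = r_star / se_r0
p_z = 2 * norm.sf(abs(z))         # two-sided p value

# Step iv): difference G2 test of equal thresholds, with R rating levels
g2_constrained = 18.7             # model with thresholds forced equal across raters
g2_unconstrained = 11.2
R = 4
diff_g2 = g2_constrained - g2_unconstrained
p_thresh = chi2.sf(diff_g2, df=R - 1)

print(f"z = {z:.2f}, p = {p_z:.4f};  difference G2 = {diff_g2:.2f}, p = {p_thresh:.4f}")
```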
7.3 Extensions and Generalizations

Here we briefly note some extensions and generalizations of the tetrachoric/polychoric correlation approach to analyzing rater agreement:

• Modifying latent distribution assumptions. When the assumption of latent bivariate normality is empirically or theoretically implausible, other distributional assumptions can be made. One may exploit the fact that the tetrachoric/polychoric correlation model is isomorphic with a latent trait model. Within the latter framework one can have non-normal latent trait distributions, including skewed and nonparametric distributions.

  o Skewed distributions. A separate page describes what might colloquially be called a "skewed tetrachoric or polychoric correlation," but which would more accurately be termed the latent correlation with a skewed latent distribution. That page also describes a simple computer program to implement the model for binary ratings.

  o Nonparametric distributions. Example 3 below describes an alternative approach based on a nonparametric latent trait distribution.

• Modifying measurement error assumptions. One can easily relax the assumptions concerning measurement error. Hutchinson (2000) described models where the variance of measurement error differs according to the latent trait level of the object being rated. In theory, one could also consider non-Gaussian distributions for measurement error.

• More than two raters. When there are more than two raters, the tetrachoric/polychoric correlation model generalizes to a latent trait model with normal-ogive (Gaussian cdf) response functions. Latent trait models can be used to (a) estimate the tetrachoric/polychoric correlation among all rater pairs; (b) simultaneously test whether all raters have the same definition of the latent trait; and (c) simultaneously test for equivalence of thresholds among all raters.
7.3.1 Examples

Example 1. Tetrachoric Correlation

Table 1 summarizes hypothetical ratings by two raters on presence (+) or absence (-) of schizophrenia.

             Rater 2
 Rater 1     -      +     Total
    -       40     10       50
    +       20     30       50
 Total      60     40      100

Table 1 (draft).

For these data, the tetrachoric correlation (std. error) is

    rho     0.6071  (0.1152)

which is much larger than the Pearson correlation of 0.4082 calculated for the same data. The thresholds (std. errors) for the two raters are estimated as

    Rater 1    0.0000  (0.1253)
    Rater 2    0.2533  (0.1268)

Example 2. Polychoric Correlation

Table 2 summarizes the number of lambs born to 227 ewes over two years. These data were previously analyzed by Tallis (1962) and Drasgow (1988). Tallis suggested that the number of lambs born is a manifestation of the ewe's fertility--a continuous and potentially normally distributed variable. Clearly the situation is more complex than the simple "continuous normal variable plus discretizing thresholds" assumptions allow for; we consider the data simply for the sake of a computational example.

 Lambs born       Lambs born in 1952
 in 1953       None      1      2    Total
 None            58     52      1     111
 1               26     58      3      87
 2                8     12      9      29
 Total           92    122     13     227

Table 2 (draft).

Drasgow (1988; see also Olsson, 1979) described two different ways to calculate the polychoric correlation. The first method, the joint maximum likelihood (ML) approach, estimates all model parameters--i.e., rho and the thresholds--at the same time. The second method, two-step ML estimation, first estimates the thresholds from the one-way marginal frequencies, then estimates rho, conditional on these thresholds, via maximum likelihood. For the tetrachoric correlation, both methods produce the same results; for the polychoric correlation, they may produce slightly different results.

The data in Table 2 were analyzed with the POLYCORR program (Uebersax, 2000). Application of the joint ML approach produces the following estimates (standard errors):

    rho                  0.4192  (0.0761)
    threshold 2, 1952   -0.2421  (0.0836)
    threshold 3, 1952    1.5938  (0.1372)
    threshold 2, 1953   -0.0297  (0.0830)
    threshold 3, 1953    1.1331  (0.1063)

With two-step estimation the results are:

    rho                  0.4199  (0.0747)
    threshold 2, 1952   -0.2397
    threshold 3, 1952    1.5781
    threshold 2, 1953   -0.0276
    threshold 3, 1953    1.1371

However, the G2 statistics testing model fit for the joint ML and two-step estimates are 11.54 and 11.55, respectively, each with 3 df. The corresponding p-values, less than .01, suggest poor model fit and implausibility of the polychoric model assumptions. Acceptable fit could possibly be obtained by considering a skewed latent trait distribution.

Example 3. Polychoric Correlation with Relaxed Distributional Assumptions

The data in Table 3, previously analyzed by Hutchinson (2000), summarize ratings on the health of 460 trees and shrubs by two raters. Rating levels denote increasing levels of plant health; i.e., 1 indicates the lowest level and 6 the highest level.

 Rating of          Rating of Rater 1
 Rater 2       1    2    3    4    5    6   Total
 1            30    1    0    0    0    0      31
 2             0   10    2    0    0    0      12
 3             0    4    8    3    1    0      16
 4             0    3    3   37    9    0      52
 5             0    0    1   25   71   49     146
 6             0    0    0    2   20  181     203
 Total        30   18   14   67  101  230     460

Table 3 (draft). Ratings of plant health by two judges.

The polychoric correlation for these data is .954 using joint estimation. However, there is reason to doubt the assumptions of the standard polychoric correlation model; the G2 model fit statistic is 57.33 on 24 df (p < .001). Hutchinson (2000) showed that the data can be fit by allowing measurement error variance to differ from low to high levels of the latent trait. Instead, we relax the assumption of a normally distributed latent trait.

Using the LLCA program (Uebersax, 1993a), a latent trait model with a nonparametric latent trait distribution was fit to the data. The distribution was represented as six equally-spaced locations (located latent classes) along a unidimensional continuum, with the density at each location (latent class prevalence) being estimated. Model fit, assessed by the G2 statistic, was 15.65 on 19 df (p = .660).

The LLCA program gave the correlation of each variable with the latent trait as .963. This value squared, .927, estimates what the correlation of the raters would be if they made their ratings on a continuous scale. This is a generalization of the polychoric correlation, though perhaps we should reserve that term for the latent bivariate normal case. Instead, we simply term this the latent correlation between the raters.
The distribution of the latent trait estimated by the model is as follows:

[Figure 5 (draft). Estimated latent trait distribution: the estimated density at six equally-spaced latent trait locations from -2.5 to 2.5; the distribution is markedly asymmetric.]

The shape suggests that the latent trait could be economically modeled with an asymmetric parametric distribution, such as a beta or exponential distribution.

7.4 Factor analysis and SEM

A separate web page has been added on the topic of factor analysis and SEM with tetrachoric and polychoric correlations.

7.4.1 Programs for tetrachoric correlation

TetMat is my free program to estimate a matrix of tetrachoric correlations. It also supplies other useful information such as one- and two-way marginal frequencies and rates, asymptotic standard errors of rho, p-values, confidence ranges, and thresholds. Provisions are made to smooth a potentially improper correlation matrix by the method of Knol and ten Berge (1989).

Tcorr is a simple utility for estimating a single tetrachoric correlation coefficient and its standard error. Just enter the frequencies of a fourfold table and get the answer. It also supplies threshold estimates.

Dirk Enzmann has written an SPSS macro to estimate a matrix of tetrachoric correlations. He also has a standalone version. Jim Fleming also has a program to estimate a matrix of tetrachoric correlations and optionally smooth a poorly conditioned matrix.

Brown's (1977) algorithm AS 116, a Fortran subroutine to calculate the tetrachoric correlation and its standard error, can be found at StatLib. Alternatively, you can download my program, Tcorr, above, which includes simple source code with an actual working version of Brown's subroutine.
TESTFACT is a very sophisticated program for item analysis using both classical and modern psychometric (IRT) methods. It includes provisions for calculating tetrachoric correlations.

7.4.2 Programs for polychoric and tetrachoric correlation

POLYCORR is a program I've written to estimate the polychoric correlation and its standard error using either joint ML or two-step estimation. Goodness-of-fit statistics and a lot of other information are also provided. Note: this program is just for a single pair of variables, or a few considered two at a time; it does not estimate a matrix of tetra- or polychoric correlations.

o Basic version. This handles square tables only (i.e., models where both items have the same number of levels).
o Advanced version. This allows non-square tables and has other advanced technical features, such as the ability to combine cells during estimation.

PRELIS. A useful program for estimating a matrix of polychoric or tetrachoric correlations is PRELIS. It includes a goodness-of-fit test for each pair of variables. Standard errors can be requested. PRELIS uses two-step estimation. Because it is supplied with LISREL, PRELIS is widely available; most university computation centers probably already have copies and/or site licenses.

Mplus can estimate a matrix of polychoric and tetrachoric correlations and estimate their standard errors. Two-step estimation is used. Features are similar to PRELIS/LISREL.

MicroFACT will estimate polychoric and tetrachoric correlations and standard errors. Provisions for smoothing an improper correlation matrix are supplied. There are no goodness-of-fit tests. A free student version that handles up to thirty variables can be downloaded. It also does factor analysis.

SAS. A single polychoric or tetrachoric correlation can be calculated with the PLCORR option of SAS PROC FREQ. Example:

    proc freq;
      tables var1*var2 / plcorr maxiter=100;
    run;

Joint estimation is used. The standard error is supplied, but not the thresholds. No goodness-of-fit test is performed. A SAS macro, %POLYCHOR, can construct a matrix of polychoric or tetrachoric correlations.

For tetrachoric correlations, if there is a single 0 frequency in the 2×2 crossclassification table for a pair of variables (see Figure 3 above), plcorr and %POLYCHOR may unnecessarily supply a missing-value result, at least if maxiter is left at the default value of 20. So far I have found this problem is avoided by setting maxiter higher, e.g., to 40, 50 or 100. (Increasing the value of maxiter should not
significantly increase run times). In any case, it's a good idea to check your SAS log, which will contain a message if estimation did not converge for any pair of variables. The macro is relatively slow (e.g., on a PC, a 50 x 50 matrix can take 5 minutes to estimate, and a 100 x 100 matrix four times as long).

SPSS. SPSS has no intrinsic procedure to estimate polychoric correlations. As noted above, Dirk Enzmann has written an SPSS macro to estimate a matrix of tetrachoric correlations.

R. John Fox has written an R program to estimate the polychoric correlation and its standard error with R. A goodness-of-fit test is performed. Another R program for polychorics has been written by David Duffy.

Stata. Stata's internal function for tetrachoric correlations is a very rough approximation (e.g., for an actual tetrachoric correlation of .5172, Stata reports .6169!) based on Edwards and Edwards (1984) and is unsuitable for many or most applications. A more accurate external module has been written by Stas Kolenikov to estimate a matrix of polychoric or tetrachoric correlations and their standard errors.

7.4.3 Generalized latent correlation

The glc program generalizes the tetrachoric correlation to estimate the latent correlation between binary variables assuming a skewed latent trait distribution. The skewed distribution is modeled as a mixture of two Gaussian distributions, the parameters of which the user supplies; that is, one specifies in advance the shape of the latent trait distribution, based on prior beliefs/knowledge. This program is much simpler to use than those described below. Several sets of data (summarized as a series of 2×2 tables) can be analyzed in a single run.

The LTMA program can similarly be used to estimate a generalized polychoric correlation, based on a latent trait mixture model (Uebersax & Grove, 1993). This is basically a fancier version of the glc program: (1) it handles ordered-categorical as well as dichotomous variables, and (2) it will estimate the shape of the latent trait distribution from the data (again, modeling it as a mixture of two component Gaussians).

The LLCA program can be used to estimate a polychoric correlation with nonparametric distributional assumptions. The latent trait is represented as a sequence of latent classes on a single continuum (Uebersax, 1993b). That is, the latent trait distribution is modeled as a "histogram," where the densities at each point are estimated, rather than as a continuous parametric distribution.

References

Bishop YMM, Fienberg SE, Holland PW. Discrete multivariate analysis: theory and practice. Cambridge, Massachusetts: MIT Press, 1975.
Brown MB. Algorithm AS 116: the tetrachoric correlation and its standard error. Applied Statistics, 1977, 26, 343-351.
Drasgow F. Polychoric and polyserial correlations. In Kotz L, Johnson NL (Eds.), Encyclopedia of statistical sciences, Vol. 7 (pp. 69-74). New York: Wiley, 1988.
Edwards JH, Edwards AWF. Approximating the tetrachoric correlation coefficient. Biometrics, 1984, 40, 563.
Harris B. Tetrachoric correlation coefficient. In Kotz L, Johnson NL (Eds.), Encyclopedia of statistical sciences, Vol. 9 (pp. 223-225). New York: Wiley, 1988.
Hutchinson TP. Kappa muddles together two sources of disagreement: tetrachoric correlation is better. Research in Nursing and Health, 1993, 16, 313-315.
Hutchinson TP. Assessing the health of plants: Simulation helps us understand observer disagreements. Environmetrics, 2000, 11, 305-314.
Joreskog KG, Sorbom D. PRELIS User's Manual, Version 2. Chicago: Scientific Software, Inc., 1996.
Knol DL, ten Berge JMF. Least-squares approximation of an improper correlation matrix by a proper one. Psychometrika, 1989, 54, 53-61.
Loehlin JC. Latent variable models, 3rd ed. Lawrence Erlbaum, 1999.
Olsson U. Maximum likelihood estimation of the polychoric correlation coefficient. Psychometrika, 1979, 44(4), 443-460.
Pearson K. Mathematical contributions to the theory of evolution. VII. On the correlation of characters not quantitatively measurable. Philosophical Transactions of the Royal Society of London, Series A, 1900, 195, 1-47.
Tallis GM. The maximum likelihood estimation of correlation from contingency tables. Biometrics, 1962, 342-353.
Uebersax JS. LLCA: Located latent class analysis. Computer program documentation, 1993a.
Uebersax JS. Statistical modeling of expert ratings on medical treatment appropriateness. Journal of the American Statistical Association, 1993b, 88, 421-427.
Uebersax JS. POLYCORR: A program for estimation of the standard and extended polychoric correlation coefficient. Computer program documentation, 2000.
Uebersax JS, Grove WM. A latent trait finite mixture model for the analysis of rating agreement. Biometrics, 1993, 49, 823-835.
8. Latent Trait Models for Rater Agreement

8.0 Introduction

Of all the methods discussed here for analyzing rater agreement, latent trait modeling is arguably the best method for handling ordered-category ratings. The latent trait model is intrinsically plausible. More than most other approaches, it applies a natural view of rater decision-making. If one were interested only in developing a good model of how experts make ratings--without concern for the subject of agreement--one could easily arrive at the latent trait model. The latent trait model is closely related to signal detection theory, modern psychometric theory, and factor analysis. The latent trait agreement model is also very flexible and can be adapted to the specific needs of a given study.

Given its advantages, it is surprising that this method is not used more often. The likely explanation is its relative unfamiliarity and a mistaken perception that it is difficult or esoteric. In truth, it is no more complex than many standard methods for categorical data analysis. The basic principles of latent trait models for rater agreement are sketched here. For more details, the reader may consult Uebersax (1992; oriented to non-statisticians) and Uebersax (1993; a technical exposition), or other references listed in the bibliography below. If there are only two raters, the latent trait model is the same as the polychoric correlation coefficient model.

8.1 Measurement Model

The essence of the latent trait agreement model is contained in the measurement model
    Y = bT + e,    (1)

where T is the latent trait level of a given case; Y is the perception or impression a given rater has of the case's trait level; b is a regression coefficient; and e is measurement error.

The latent trait is what the ratings are intended to measure--for example, disease severity, subject ability, or treatment effectiveness; this corresponds to the "signal" emitted by the case being rated. The term e corresponds to random measurement error or noise. The combined effect of T and e is to produce a continuous variable, Y, which is the rater's impression of the signal. These continuous impressions are converted to ordered-category ratings as the rater applies thresholds associated with the rating categories.
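To make the measurement model concrete, the following minimal simulation sketch (not from the original text; all parameter values are hypothetical) generates ordered-category ratings for two raters from equation (1) and cross-tabulates them:

```python
# A minimal simulation of the latent trait measurement model Y = bT + e followed by
# discretizing thresholds. All parameter values are hypothetical illustrations.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
T = rng.standard_normal(n)                          # latent trait, T ~ N(0, 1)

b = {"rater1": 0.85, "rater2": 0.75}                # hypothetical regression coefficients
thresholds = {"rater1": [-0.8, 0.3, 1.2],           # hypothetical category thresholds
              "rater2": [-0.5, 0.5, 1.0]}           # (4 ordered categories per rater)

ratings = {}
for r in ("rater1", "rater2"):
    e = rng.standard_normal(n) * np.sqrt(1 - b[r] ** 2)   # error scaled so var(Y) = 1
    Y = b[r] * T + e                                       # continuous impression
    ratings[r] = np.digitize(Y, thresholds[r]) + 1         # categories 1..4

# Cross-classification of the two raters' ratings
crosstab = np.zeros((4, 4), dtype=int)
for x1, x2 in zip(ratings["rater1"], ratings["rater2"]):
    crosstab[x1 - 1, x2 - 1] += 1
print(crosstab)
```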
Model parameters are estimated from observed data. The basic estimated parameters are: (1) parameters that describe the distribution of the latent trait in the sample or population; (2) the regression coefficient, b, for each rater; and (3) the threshold locations for each rater. Model variations may have more or fewer parameters. Parameters are estimated by a computer algorithm that iteratively tests and revises parameter values to find those which best fit the observed data; usually "best" means the maximum likelihood estimates. Many different algorithms can be used for this.

8.2 Evaluating the Assumptions

The assumptions of the latent trait model are very mild. (Moreover, it should be noted that the assumptions are tested by evaluating the fit of the model to the observed data.) The existence of a continuous latent trait, a simple additive model of "signal plus noise," and thresholds that map a rater's continuous impressions into discrete rating categories are all very plausible.

One has latitude in choosing the form of the latent trait distribution. A normal (Gaussian) distribution is most often assumed. If, as in many medical applications, this is considered unsuitable, one can consider an asymmetric distribution; this is readily modeled as, say, a beta distribution, which can be substituted for a normal distribution with no difficulty. Still more flexible are versions that use a nonparametric latent trait distribution. This approach models the latent trait distribution in a way analogous to a histogram, where the user controls the number of bars and each bar's height is optimized to best fit the data. In this way nearly any latent trait distribution can be well approximated.

The usual latent trait agreement model makes two assumptions about measurement error. The first is that it is normally distributed. The second is that, for any rater, measurement error variance is constant. Similar assumptions are made in many statistical models. Still, one might wish to relax them. Hutchinson (2000) showed how non-constant measurement error variance can easily be included in latent trait agreement models. For example, measurement error variance can be lower for cases with very high or low latent trait levels, or may increase from low to high levels of the latent trait.

8.3 What the Model Provides

The latent trait agreement model supplies parameters that separately evaluate the degree of association between raters and differences in their category definitions. The separation of these components of agreement and disagreement enables one to precisely target interventions to improve rater consistency.

Association is expressed as a correlation between each rater's impressions (Y) and the latent trait. A higher correlation means that a rater's impressions are more strongly associated with the "average" impression of all other raters. A simple statistical test permits assessment of the significance of rater differences in their correlation with the latent trait. One can also use the model to express association between a pair of raters as a correlation between one rater's
impressions and those of the other; this measure is related to the polychoric correlation coefficient.

Estimated rater thresholds can be displayed graphically. Their inspection, with particular attention given to the distance between successive thresholds of a rater, shows how raters may differ in the definition and widths of the rating categories. Again, these differences can be statistically tested. Finally, the model can be used to measure the extent to which one rater's impressions may be systematically higher or lower than those of other raters--that is, to test for the existence of rater bias.

References

Heinen T. Latent class and discrete latent trait models: Similarities and differences. Thousand Oaks, California: Sage, 1996.
Hutchinson TP. Assessing the health of plants: Simulation helps us understand observer disagreements. Environmetrics, 2000, 11, 305-314.
Johnson VE, Albert JH. Modeling ordinal data. New York: Springer, 1999.
Uebersax JS. A review of modeling approaches for the analysis of observer agreement. Investigative Radiology, 1992, 27, 738-743.
Uebersax JS. Statistical modeling of expert ratings on medical treatment appropriateness. Journal of the American Statistical Association, 1993, 88, 421-427.
Uebersax JS, Grove WM. A latent trait finite mixture model for the analysis of rating agreement. Biometrics, 1993, 49, 823-835.
9. Odds Ratio and Yule's Q

9.0 Introduction

The odds ratio is an important option for testing and quantifying the association between two raters making dichotomous ratings. It should probably be used more often with agreement data than it currently is. The odds ratio can be understood with reference to a 2×2 crossclassification table:

                  Rater 2
                  +       -
 Rater 1   +      a       b       a+b
           -      c       d       c+d
                 a+c     b+d     Total
By definition, the odds ratio, OR, is

         [a/(a+b)] / [b/(a+b)]
    OR = ----------------------,        (1)
         [c/(c+d)] / [d/(c+d)]

but this reduces to

         a/b
    OR = ---,                           (2)
         c/d

or, as OR is usually calculated,

         ad
    OR = --.                            (3)
         bc

The last equation shows that OR is equal to the simple crossproduct ratio of a 2×2 table.

9.1 Intuitive explanation

The concept of "odds" is familiar from gambling. For instance, one might say the odds of a particular horse winning a race are "3 to 1"; this means the probability of the horse winning is 3 times the probability of not winning. In Equation (2), both the numerator and the denominator are odds. The numerator, a/b, gives the odds of a positive versus negative rating by Rater 2 given that Rater 1's rating is positive. The denominator, c/d, gives the odds of a positive versus negative rating by Rater 2 given that Rater 1's rating is negative.
OR is the ratio of these two odds--hence its name, the odds ratio. It indicates how much the odds of Rater 2 making a positive rating increase for cases where Rater 1 makes a positive rating. This alone would make the odds ratio a potentially useful way to assess association between the ratings of two raters. However, it has some other appealing features as well. Note that

         a/b     a/c     d/b     d/c     ad
    OR = ---  =  ---  =  ---  =  ---  =  --.
         c/d     b/d     c/a     b/a     bc

From this we see that the odds ratio can be interpreted in various ways. Generally, it shows the relative increase in the odds of one rater making a given rating, given that the other rater made the same rating--the value is invariant regardless of whether one is concerned with a positive or negative rating, or which rater is the reference and which the comparison. The odds ratio can be interpreted as a measure of the magnitude of association between the two raters. The concept of an odds ratio is also familiar from other statistical methods (e.g., logistic regression).

9.2 Yule's Q

OR can be transformed to a -1 to 1 scale by converting it to Yule's Q (or a slightly different statistic, Yule's Y). Yule's Q is

        OR - 1
    Q = ------.
        OR + 1

9.3 Log-odds ratio

It is often more convenient to work with the log of the odds ratio than with the odds ratio itself. The formula for the standard error of log(OR) is very simple:

    SE[log(OR)] = sqrt(1/a + 1/b + 1/c + 1/d).

Knowing this standard error, one can easily test the significance of log(OR) and/or construct confidence intervals. The former is accomplished by calculating

    z = log(OR) / SE[log(OR)]

and referring to a table of the cumulative distribution of the standard normal curve to determine the p-value associated with z. Confidence limits for log(OR) are calculated as

    log(OR) ± zL × SE[log(OR)],

where zL is the z value defining the appropriate confidence limits, e.g., zL = 1.645 or 1.96 for a two-sided 90% or 95% confidence interval, respectively. Confidence limits for OR itself may be calculated as

    exp[ log(OR) ± zL × SE[log(OR)] ].
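These calculations are easy to script. A minimal sketch (the 2×2 counts are hypothetical; the layout follows the table above) computing OR, Yule's Q, the z test, and a confidence interval for OR:

```python
# A minimal sketch of the odds ratio, Yule's Q, and a confidence interval based on log(OR).
# The 2x2 counts are hypothetical; rows = Rater 1, columns = Rater 2.
import math

a, b, c, d = 30, 10, 20, 40                 # a = (+,+), b = (+,-), c = (-,+), d = (-,-)

OR = (a * d) / (b * c)                      # equation (3)
Q = (OR - 1) / (OR + 1)                     # Yule's Q
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)
z = math.log(OR) / se_log_or                # z test of log(OR) = 0

zL = 1.96                                   # two-sided 95% limits
ci = (math.exp(math.log(OR) - zL * se_log_or),
      math.exp(math.log(OR) + zL * se_log_or))

print(f"OR = {OR:.3f}, Q = {Q:.3f}, z = {z:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```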
Alternatives are to estimate confidence intervals by the nonparametric bootstrap (described on the raw agreement indices page) or to construct exact confidence intervals by considering all possible distributions of the cases in a 2×2 table.

Once one has used log(OR) or OR to assess association between raters, one may then also perform a test of marginal homogeneity, such as the McNemar test.

9.4 Pros and Cons: the Odds Ratio

9.4.1 Pros

• The odds ratio is very easily calculated.
• Software for its calculation is readily available, e.g., SAS PROC FREQ and SPSS CROSSTABS.
• It is a natural, intuitively acceptable way to express magnitude of association.
• The odds ratio is linked to other statistical methods.

9.4.2 Cons

• If the underlying trait is continuous, the value of OR depends on the level of each rater's threshold for a positive rating. That is not ideal, as it implies that the basic association between raters changes if their thresholds change. Under certain distributional assumptions (so-called "constant association" models), this problem can be eliminated, but the assumptions introduce extra complexity.
• While the odds ratio can be generalized to ordered-category data, this again introduces new assumptions and complexity. (See the loglinear, association, and quasi-symmetry models page.)
9.5 Extensions and alternatives

9.5.1 Extensions

• More than two categories. In an N×N table (where N > 2), one might collapse the table into various 2×2 tables and calculate log(OR) or OR for each. That is, for each rating category k = 1, ..., N, one would construct the 2×2 table for the crossclassification of Level k vs. all other levels for Raters 1 and 2, and calculate log(OR) or OR (see the sketch at the end of this section). This assesses the association between raters with respect to the Level-k vs. not-Level-k distinction. This method is probably more appropriate for nominal ratings than for ordered-category ratings. In either case, one might consider instead using loglinear, association, or quasi-symmetry models.

• Multiple raters. For more than two raters, a possibility is to calculate log(OR) or OR for all pairs of raters. One might then report, say, the average value and range of values across all rater pairs.

9.5.2 Alternatives

Given data by two raters, the following alternatives to the odds ratio may be considered:

• In a 2×2 table, there is a close relationship between the odds ratio and loglinear modeling. The latter can be used to assess both association and marginal homogeneity.
• Cook and Farewell (1995) presented a model that considers formal decomposition of a 2×2 table into independent components which reflect (1) the odds ratio and (2) marginal homogeneity.
• The tetrachoric and polychoric correlations are alternatives when one may assume that ratings are based on a latent continuous trait which is normally distributed. With more than two rating categories, extensions of the polychoric correlation are available with more flexible distributional assumptions.
• Association and quasi-symmetry models can be used for N×N tables, where ratings are nominal or ordered-categorical. These methods are related to the odds ratio.
• When there are more than two raters, latent trait and latent class models can be used. A particular type of latent trait model called the Rasch model is related to the odds ratio.
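As referenced above, a minimal sketch of the per-category collapse for an N×N table (the 3×3 counts are hypothetical):

```python
# A minimal sketch of the "more than two categories" extension: collapse an NxN table
# to Level-k vs. not-Level-k for each k and compute the log odds ratio. Counts are hypothetical.
import numpy as np

table = np.array([[20,  5,  2],
                  [ 8, 30,  6],
                  [ 1,  7, 21]], dtype=float)     # rows = Rater 1, columns = Rater 2

for k in range(table.shape[0]):
    a = table[k, k]                               # both raters use level k
    b = table[k, :].sum() - a                     # Rater 1 level k, Rater 2 other
    c = table[:, k].sum() - a                     # Rater 2 level k, Rater 1 other
    d = table.sum() - a - b - c                   # neither rater uses level k
    log_or = np.log((a * d) / (b * c))
    se = np.sqrt(1/a + 1/b + 1/c + 1/d)
    print(f"level {k + 1}: log(OR) = {log_or:.2f} (SE {se:.2f})")
```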
References

Either of the books by Agresti is an excellent starting point.

Agresti A. Categorical data analysis. New York: Wiley, 1990.
Agresti A. An introduction to categorical data analysis. New York: Wiley, 1996.
Bishop YMM, Fienberg SE, Holland PW. Discrete multivariate analysis: theory and practice. Cambridge, Massachusetts: MIT Press, 1975.
Cook RJ, Farewell VT. Conditional inference for subject-specific and marginal agreement: two families of agreement measures. Canadian Journal of Statistics, 1995, 23, 333-344.
Fleiss JL. Statistical methods for rates and proportions, 2nd ed. New York: John Wiley, 1981.
Khamis H. Association, measures of. In Armitage P, Colton T (eds.), The Encyclopedia of Biostatistics, Vol. 1, pp. 202-208. New York: Wiley, 1998.
Somes GW, O'Brien KF. Odds ratio estimators. In Kotz L, Johnson NL (eds.), Encyclopedia of statistical sciences, Vol. 6, pp. 407-410. New York: Wiley, 1988.
Sprott DA, Vogel-Sprott MD. The use of the log-odds ratio to assess the reliability of dichotomous questionnaire data. Applied Psychological Measurement, 1987, 11, 307-316.
10. Agreement on Interval-Level Ratings

10.0 Introduction

Here we depart from the main subject of our inquiry--agreement on categorical ratings--to consider interval-level ratings. Methods for analysis of interval-level rating agreement are better established than is true with categorical data. Still, there is far from complete agreement about which methods are best, and many, if not most, published studies use less than ideal methods.

Our view is that the basic premise outlined for analyzing categorical data--that there are different components of agreement and that these should be separately measured--applies equally to interval-level data. It is only by separating the different components of agreement that we can tell what steps are needed to improve agreement.

Before considering specific methods, it is helpful to consider a common question: When should ratings be treated as ordered-categorical and when as interval-level data? Some guidelines are as follows:

1. If the rating levels have integer anchors on a Likert-type scale, treat the ratings as interval data. By a Likert-type scale, it is meant that the actual form on which raters record ratings contains an explicit scale such as the following two examples:

      lowest                               highest
      level                                level
        1     2     3     4     5     6     7      (circle one)

      strongly                             strongly
      disagree                             agree
        1     2     3     4     5     6     7
       __    __    __    __    __    __    __      (check level that applies)

   These examples contain verbal anchors only for the two extreme levels, but there are other examples where each integer is anchored with a verbal label. The basic principle here is that the format itself strongly implies that raters should regard the rating levels as exactly or approximately evenly-spaced.
2. If rating categories have been chosen by the researcher to represent at least approximately equally-spaced levels, strongly consider treating the data as interval level. For example, for rating the level of cigarette consumption, one has latitude in
defining categories such as "1-2 cigarettes per day," "1/2 pack per day," "1 pack per day," etc. It is my observation, at least, that researchers instinctively choose categories that represent more or less equal increments in a construct of "degree of smoking involvement," justifying treatment of the data as interval level.

3. If the categories are few in number and the rating level anchors are chosen/worded/formatted in such a way that does not imply any kind of equal spacing of rating levels, treat the data as ordered categorical. This may apply even when response levels are labelled with integers--for example, response levels of "1. None," "2. Mild," "3. Moderate," and "4. Severe." Note that here one could as easily substitute the letters A, B, C and D for the integers 1, 2, 3 and 4.

4. If the ratings will, in subsequent research, be statistically analyzed as interval-level data, then treat them as interval-level data for the reliability study. Conversely, if they will be analyzed as ordered-categorical in subsequent research, treat them as ordered-categorical in the reliability study.

Some who are statistically sophisticated may insist that nearly all ratings of this type should be treated as ordered-categorical and analyzed with nonparametric methods. However, this view fails to consider that one may also err by applying nonparametric methods when ratings do meet the assumptions of interval-level data; specifically, by using nonparametric methods, significant statistical power may be lost.

10.1 General Issues

In this section we consider two general issues. The first is an explanation of three different components of disagreement on interval-level ratings. The second concerns the general strategy for examining rater differences.

Different causes may result in rater disagreement on a given case. With interval-level data, these various causes have effects that can be broadly grouped into three categories: effects on the correlation or association of raters' ratings, rater bias, and rater differences in the distribution of ratings.

10.2 Rater Association

In making a rating, raters typically consider many factors. For example, in rating life quality, a rater may consider separate factors of satisfaction with social relationships, job satisfaction, economic security, health, etc. Judgments on these separate factors are combined by the rater to produce a single overall rating. Raters may vary in what factors they consider. Moreover, different raters may weight the same factors differently, or they may use different "algorithms" to combine information on each factor to produce a final rating.

Finally, a certain amount of random error affects any rating process. A patient's symptoms may vary over time, raters may be subject to distractions, or the focus of a rater may vary. Because of such random error, we would not even expect two ratings by a single rater of the same case to always agree.

The combined effect of these issues is to reduce the correlation of ratings by different raters. (This can be easily shown with formulas and a formal measurement model.) Said another
way, to the extent that raters' ratings correlate less than 1, we have evidence that the raters are considering or weighting different factors, and/or that random error and noise affect the rating process.

When rater association is low, it implies that the study coordinator needs to apply training methods to improve the consistency of raters' criteria. Group discussion conferences may also be useful to clarify rater differences in their criteria, definitions, and interpretation of the basic construct.

10.3 Rater Bias

Rater bias refers to the tendency of a rater to make ratings generally higher or lower than those of other raters. Bias may occur for several reasons. For example, in clinical situations, some raters may tend to "overpathologize" or "underpathologize." Some raters may also simply interpret the calibration of the rating scale differently so as to make generally higher or lower ratings.

With interval-level ratings, rater bias can be assessed by calculating the mean rating of a rater across all cases that they rate. High or low means, relative to the mean of all raters, indicate positive or negative rater bias, respectively.

10.4 Rating Distribution

Sometimes an individual rater will have, when one examines all ratings made by the rater, a noticeably different distribution than the distribution of ratings for all raters combined. The reasons for this are somewhat more complex than is true for differences in rater association and bias. Partly it may relate to rater differences in what they believe is the distribution of the trait in the sample or population considered. At present, we mainly regard this as an empirical issue: examination of the distribution of ratings by each rater may sometimes reveal important differences. When such differences exist, some attempt should be made to reduce them, as they are associated with decreased rater agreement.

10.5 Rater vs. Rater or Rater vs. Group

In analyzing and interpreting results from a rater agreement study, and when more than two raters are involved, one often thinks in terms of a comparison of each rater with every other rater. This is relatively inefficient and, it turns out, often unnecessary. Most of the important information to be gained can be more easily obtained by comparing each rater to some measure of overall group performance. We term the former approach the Rater vs. Rater strategy, and the latter the Rater vs. Group strategy.

Though it is the more common, the Rater vs. Rater approach requires more effort. For example, with 10 raters, one needs to consider a 10 x 10 correlation matrix (actually, 45 correlations between distinct rater pairs). In contrast, a Rater vs. Group approach, which might, for example, instead correlate each rater's ratings with the average rating across all raters, would summarize results with only 10 correlations.

The prevalence of the Rater vs. Rater view is perhaps historical and accidental. Originally, most rater agreement studies used only two raters--so methods naturally developed for the analysis of rater pairs. As studies grew to include more raters, the same basic methods (e.g., kappa coefficients) were applied by considering all pairs of raters. What did not happen (as
seldom does when paradigms evolve gradually) is a basic re-examination and new selection of methods.

This is not to say that the Rater vs. Rater approach is always bad, or that the Rater vs. Group approach is always better. There is a place for both. Sometimes one wants to know the extent to which different rater pairs vary in their level of agreement; then the Rater vs. Rater approach is better. Other times one will wish merely to obtain information on the performance of each rater in order to provide feedback and improve rating consistency; then the Rater vs. Group approach may be better. (Of course, there is nothing to prevent the researcher from using both approaches.) It is important mainly that the researcher realize that they have a choice, and make an informed selection of methods.

10.6 Measuring Rater Agreement

We now direct attention to the question of which statistical methods to use to assess association, bias, and rater distribution differences in an agreement study.

10.7 Measuring Rater Association

As already mentioned, from the Rater vs. Rater perspective, association can be summarized by calculating a Pearson correlation (r) of the ratings for each distinct pair of raters. Sometimes one may wish to report the entire matrix of such correlations. Other times it will make sense to summarize the data as a mean, standard deviation, minimum and maximum across all pairwise correlations.

From a Rater vs. Group perspective, there are two relatively simple ways to summarize rater association. The first, already mentioned, is to calculate the correlation of each rater's ratings with the average of all raters' ratings (this generally presupposes that all raters rate the same set of cases or, at least, that each case is rated by the same number of raters). The alternative is to calculate the average correlation of a rater with every other rater--that is, to consider row or column averages of the rater x rater correlation matrix. It should be noted that correlations produced by the former method will be, on average, higher than those produced by the latter. This is because average ratings are more reliable than individual ratings. However, the main interest will be to compare different raters in terms of their correlation with the mean rating, which is still possible; that is, the raters with the highest/lowest correlations with one method will also be those with the highest/lowest correlations with the other.

A much better method, however, is factor analysis. With this method, one estimates the association of each rater with a latent factor. The factor is understood as a "proto-rating," or the latent trait of which each rater's opinions are an imperfect representation. (If one wanted to take an even stronger view, the factor could be viewed as representing the actual trait which is being rated.) The details of such an analysis are as follows:

• Using any standard statistical software such as SAS or SPSS, one uses the appropriate routine to request a factor analysis of the data. In SAS, for example, one would use PROC FACTOR.
• A common factor model is requested (not principal components analysis).
• A one-factor solution is specified; note that factor rotation does not apply with a one-factor solution, so do not request this.
• One has some latitude in the choice of estimation methods, but iterated principal factor analysis is recommended. In SAS, this is called the PRINIT method.
• Do not request communalities fixed at 1.0. Instead, let the program estimate communalities. If the program requests that you specify starting communality values, request that squared multiple correlations (SMC) be used.
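The steps above are given in terms of SAS PROC FACTOR. As a rough Python sketch only (using scikit-learn's FactorAnalysis, which estimates the common factor model by maximum likelihood rather than the iterated principal factor / PRINIT method recommended above), one might obtain one-factor loadings for a cases-by-raters matrix as follows; the data are simulated purely for illustration.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Simulated interval-level ratings: 50 cases (rows) rated by 5 raters
# (columns) on a 1-7 scale; each rating = latent trait + rater-specific noise.
n_cases, n_raters = 50, 5
trait = rng.normal(4.0, 1.2, size=n_cases)
ratings = np.column_stack(
    [np.clip(np.round(trait + rng.normal(0.0, 0.8, n_cases)), 1, 7)
     for _ in range(n_raters)]
)

# Standardize each rater's ratings so the one-factor loadings can be read as
# correlations of each rater's ratings with the common factor.
z = (ratings - ratings.mean(axis=0)) / ratings.std(axis=0)

fa = FactorAnalysis(n_components=1)   # one-factor solution; no rotation needed
fa.fit(z)
loadings = fa.components_[0]
if loadings.sum() < 0:                # the sign of a factor is arbitrary
    loadings = -loadings

for i, loading in enumerate(loadings, start=1):
    print(f"Rater {i}: loading on common factor = {loading:.2f}")
```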
In examining the results, two parts of the output should be considered. First are the loadings of each rater on the common factor. These are the same as the correlations of each rater's ratings with the common factor. They can be interpreted as the degree to which a rater's ratings are associated with the latent trait.

The latent trait or factor is not, logically speaking, the same as the construct being measured. For example, a patient's level of depression (the construct) is a real entity. On the other hand, a factor or latent trait inferred from raters' ratings is a surrogate--it is the shared perception or interpretation of this construct. It may be very close to the true construct, or it may represent a shared misinterpretation. Still, lacking a "gold standard," and if we are to judge only on the basis of raters' ratings, the factor represents our best information about the level of the construct. And the correlation of raters with the factor represents our best guess as to the correlation of raters' ratings with the true construct.

Within certain limitations, therefore, one can regard the factor loadings as upper-bound estimates for the correlation of ratings with the true construct--that is, upper-bound estimates of the validity of ratings. If a loading is very high, then we only know that the validity of this rater is below this number--not very useful information. However, if the loading is low, then we know that the validity of the rater, which must be lower, is also low. Thus, in pursuing this method, we are potentially able to draw certain inferences about rating validity--or, at least, lack thereof--from agreement data (Uebersax, 1988).

Knowledgeable statisticians and psychometricians recognize that the factor-analytic approach is appropriate, if not optimal, for this application. Still, one may encounter criticism from peers or reviewers who are perhaps overly attached to convention. Some may say, for example, that one should really use Cronbach's alpha or calculate the intraclass correlation with such data. One should not be overly concerned about such comments. (I recognize that it would help many researchers to be armed with references to published articles that back up what is said here. Such references exist and I'll try to supply them. In the meantime, you might try a literature search using keywords like "factor analysis" and "agreement" or "reliability.")

While on this subject, it should be mentioned that there has been recent controversy about using the Pearson correlation vs. the intraclass correlation vs. a new coefficient of concordance. (Again, I will try to supply references.) I believe this controversy is misguided. Critics are correct in saying that, for example, the Pearson correlation only assesses certain types of disagreement. For example, if, for two raters, one rater's ratings are consistently X units higher than the other rater's ratings, the two raters will have a perfect Pearson correlation, even though they disagree on every case. However, our perspective is that this is really a strength of the Pearson correlation. The goal should be to assess each component of rater agreement (association, bias, and distributional differences) separately. The problem with these other measures is precisely that they attempt to serve as omnibus indices that summarize all types of disagreement into a single number. In
so doing, they provide information of relatively little practical value; as they do not distinguish among different components of disagreement, they do not enable one to identify the steps necessary to improve agreement.

Here is a "generic" statement that one can adapt to answer any criticisms of this nature: "There is growing awareness that rater agreement should be viewed as having distinct components, and that these components should be assessed distinctly, rather than combined into a single omnibus index. To this end, a statistical modeling approach to such data has been advocated (Agresti, 1992; Uebersax, 1992)."

10.8 Measuring Rater Bias

The simplest way to express rater bias is to calculate the mean rating level made by each rater. To compare rater differences (Rater vs. Rater approach), the simplest method is to perform a paired t-test between each pair of raters. One may wish to perform a Bonferroni adjustment to control the overall (across all comparisons) alpha level. However, this is not strictly necessary, especially if one's aims are more exploratory or oriented toward informing "remedial" intervention.

Another possibility is a one-way Analysis of Variance (ANOVA), in which raters are viewed as the independent variable and ratings are the dependent variable. An ANOVA can assess whether there are bias differences among raters considering all raters simultaneously (i.e., this is related to the Rater vs. Group approach). If the ANOVA approach is used, however, one will still want to list the mean ratings for each rater, and potentially perform "post hoc" comparisons of each rater pair's means--this is more rigorous, but will likely produce results comparable to the t-test methods described above.

If a paper is to be written for publication in a medical journal, my suggestion would be to perform paired t-tests for each rater pair and to report which or how many pairs showed significant differences. Then one should perform an ANOVA simply to obtain an overall p-value (via an F-test) and mention this overall p-value. If the paper will be sent to, say, a psychology journal, it might be advisable to report results of a one-way ANOVA along with results of formal "post hoc" comparisons of each rater pair.

10.9 Rater Distribution Differences

It is possible to calculate statistical indices that reflect the similarity of one rater's rating distribution with that of another, or between each rater's distribution and the distribution for all ratings. However, such indices usually do not characterize precisely how two distributions differ--merely whether or not they do differ. Therefore, if this is of interest, it is probably more useful to rely on graphical methods. That is, one can graphically display the distribution of each rater's ratings, and the overall distribution, and base comparisons on these displays.
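To illustrate the bias analyses described in section 10.8 (this sketch is mine, not the original author's), the following Python code computes each rater's mean rating, paired t-tests for every distinct rater pair, and a one-way ANOVA treating rater as the independent variable. The ratings are simulated, with one rater given a deliberate upward bias.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated ratings: 40 cases (rows) rated by 4 raters (columns) on a
# 1-5 scale; rater 4 is given a deliberate upward bias.
ratings = np.clip(np.round(rng.normal(3.0, 1.0, size=(40, 4))), 1, 5)
ratings[:, 3] = np.clip(ratings[:, 3] + 1, 1, 5)

# Rater bias: mean rating of each rater across all cases
for i, mean in enumerate(ratings.mean(axis=0), start=1):
    print(f"Rater {i}: mean rating = {mean:.2f}")

# Rater vs. Rater: paired t-test for each distinct pair of raters
for i, j in combinations(range(ratings.shape[1]), 2):
    t, p = stats.ttest_rel(ratings[:, i], ratings[:, j])
    print(f"Raters {i + 1} vs. {j + 1}: t = {t:.2f}, p = {p:.3f}")

# One-way ANOVA with rater as the independent variable, for an overall p-value
f, p = stats.f_oneway(*(ratings[:, k] for k in range(ratings.shape[1])))
print(f"ANOVA: F = {f:.2f}, p = {p:.3f}")
```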
10.10 Using the Results

Often rater agreement data is collected during a specific training phase of a project. In other cases, there is not a formal training phase, but it is nonetheless expected that results can be used to increase the consistency of future ratings. Several formal and informal methods can be used to assist these tasks. Two are described here.

10.11 The Delphi Method

The Delphi Method is a technique developed at the RAND Corporation to aid group decision making. The essential feature is the use of quantitative feedback given to each participant. The feedback consists of a numerical summary of that participant's decisions, opinions, or, as applies here, ratings, along with a summary of the average decisions, opinions or ratings across all participants. The assumption is that, provided with this feedback, a rater will begin to make decisions or ratings more consistent with the group norm. The method is easily adapted to multi-rater, interval-level data paradigms. It can be used in conjunction with each of the three components of rater agreement already described.

10.12 Rater Bias

To apply the method to rater bias, one would first calculate the mean rating for each rater in the training phase. One would then prepare a figure showing the distribution of averages. Figure 1 is a hypothetical example for 10 raters using a 5-point rating scale.

        *  **    *   *    *   **    *    *<---you
   |----+----|----+----|----+----|----+----|
   1         2         3         4         5

        Distribution of Average Rating Level
                     Figure 1

A copy of the figure is given to each rater. Each copy is annotated to show the average for that rater, as shown in Figure 1.

10.13 Rater Association

A similar figure is used to give quantitative feedback on the association of each rater's ratings with those of the other raters. If one has performed a factor analysis of ratings, the figure would show the distribution of factor loadings across raters. If not, simpler alternatives are to display the distribution of the average correlation of each rater with the other raters, or the correlation of each rater's ratings with the average of all raters (or, alternatively, with the average of all raters other than that particular rater).
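As a minimal sketch (assuming a complete cases-by-raters matrix; the data are simulated for illustration), the following Python code computes two feedback quantities of the kind just described for each rater: the rater's mean rating and the correlation of the rater's ratings with the average of the remaining raters.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated ratings: 30 cases (rows) by 10 raters (columns) on a 1-5 scale
trait = rng.normal(3.0, 1.0, size=30)
ratings = np.column_stack(
    [np.clip(np.round(trait + rng.normal(0.0, 0.7, 30)), 1, 5)
     for _ in range(10)]
)

for k in range(ratings.shape[1]):
    own = ratings[:, k]
    # Average rating of all raters other than rater k (leave-one-out mean)
    others_mean = np.delete(ratings, k, axis=1).mean(axis=1)
    r = np.corrcoef(own, others_mean)[0, 1]
    print(f"Rater {k + 1:2d}: mean rating = {own.mean():.2f}, "
          f"correlation with the other raters' average = {r:.2f}")
```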
Once again, a specifically annotated copy of the distribution is given to each rater. Figure 2 is a hypothetical example for 10 raters.

   you-->*  *   *  *   *   *  *   *   *  *
   |----+----|----+----|----+----|----+----|----+----|
  .5        .6        .7        .8        .9       1.0

       Distribution of Correlation with Common Factor
                        Figure 2

10.14 Distribution of Ratings

Finally, one might consider giving raters feedback on the distribution of their ratings and how this compares with raters overall. For this, each rater would receive a figure such as Figure 3, showing the distribution of ratings for all raters and for the particular rater.
[Figure 3. Two side-by-side bar charts showing the percentage of ratings at each rating level (1-5): the left panel gives the distribution of ratings for all raters combined, and the right panel ("Your Distribution") gives the distribution for the individual rater receiving the figure.]
Use of figures such as Figure 3 might be considered optional, as, to some extent, this overlaps with the information provided in the rater bias figures. On the other hand, it may make a rater's differences from the group norm more clear.

10.15 Discussion of Ambiguous Cases

The second technique consists of having all raters, or pairs of raters, meet to discuss disagreed-on cases. The first step is to identify the cases that are the subject of most disagreement. If all raters rate the same cases, one can simply calculate, for each case, the standard deviation of the different ratings for that case. Cases with the largest standard deviations--say the top 10%--may be regarded as ambiguous cases. These cases may then be re-presented to the set of raters who
meet as a group, discuss features of these cases, and identify sources of disagreement. Alternatively, or if all raters do not rate the same cases, a similar method can be applied at the level of pairs of raters. That is, for each pair of raters I and J, the cases that are most disagreed on (cases with the greatest absolute difference between the rating by Rater I and the rating by Rater J) are reviewed by the two raters, who meet to discuss these and iron out differences.

References

Agresti, A. (1992). Modelling patterns of agreement and disagreement. Statistical Methods in Medical Research, 1, 201-218.

Uebersax, J. S. (1988). Validity inferences from interobserver agreement. Psychological Bulletin, 104, 405-416.

Uebersax, J. S. (1992). A review of modeling approaches for the analysis of observer agreement. Investigative Radiology, 27, 738-743.