Non Parametric For Finance

  • June 2020
  • PDF

This document was uploaded by user and they confirmed that they have the permission to share it. If you are author or own the copyright of this book, please report to us by using this DMCA report form. Report DMCA


Overview

Download & View Non Parametric For Finance as PDF for free.

More details

  • Words: 53,126
  • Pages: 78
Statistical Science 2005, Vol. 20, No. 4, 317–337 DOI 10.1214/088342305000000412 © Institute of Mathematical Statistics, 2005

A Selective Overview of Nonparametric Methods in Financial Econometrics Jianqing Fan

Abstract. This paper gives a brief overview of the nonparametric techniques that are useful for financial econometric problems. The problems include estimation and inference for instantaneous returns and volatility functions of time-homogeneous and time-dependent diffusion processes, and estimation of transition densities and state price densities. We first briefly describe the problems and then outline the main techniques and main results. Some useful probabilistic aspects of diffusion processes are also briefly summarized to facilitate our presentation and applications. Key words and phrases: Asset pricing, diffusion, drift, GLR tests, simulations, state price density, time-inhomogeneous model, transition density, volatility. tive methods such as statistics, probability and applied mathematics are essential tools to solve the quantitative problems in finance. To name a few, complex financial products pose new challenges on their valuation and risk management. Sophisticated stochastic models have been introduced to capture the salient features of underlying economic variables and to price derivatives of securities. Statistical tools are used to identify parameters of stochastic models, to simulate complex financial systems and to test economic theories via empirical financial data. An important area of financial econometrics is study of the expected returns and volatilities of the price dynamics of stocks and bonds. Returns and volatilities are directly related to asset pricing, proprietary trading, security regulation and portfolio management. To achieve these objectives, the stochastic dynamics of underlying state variables should be correctly specified. For example, option pricing theory allows one to value stock or index options and hedge against the risks of option writers once a model for the dynamics of underlying state variables is given. See, for example, the books on mathematical finance by Bingham and Kiesel [20], Steele [105] and Duffie [42]. Yet many of the stochastic models in use are simple and convenient ones to facilitate mathematical derivations and statistical inferences. They are not derived from any economics theory and hence cannot be expected to fit all financial data. Thus, while the pricing theory gives

1. INTRODUCTION

Technological innovation and trade globalization have brought us into a new era of financial markets. Over the last three decades, a large number of new financial products have been introduced to meet customers’ demands. An important milestone occurred in 1973 when the world’s first options exchange opened in Chicago. That same year, Black and Scholes [23] published their famous paper on option pricing and Merton [90] launched the general equilibrium model for security pricing, two important landmarks for modern asset pricing. Since then the derivative markets have experienced extraordinary growth. Professionals in finance now routinely use sophisticated statistical techniques and modern computational power in portfolio management, securities regulation, proprietary trading, financial consulting and risk management. Financial econometrics is an active field that integrates finance, economics, probability, statistics and applied mathematics. This is exemplified by the books by Campbell, Lo and MacKinlay [28], Gouriéroux and Jasiak [60] and Cochrane [36]. Financial activities generate many new problems, economics provides useful theoretical foundation and guidance, and quantitaJianqing Fan is Professor, Benheim Center of Finance and Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey 08544, USA (e-mail: [email protected]). 317

318

J. FAN

spectacularly beautiful formulas when the underlying dynamics is correctly specified, it offers little guidance in choosing or validating a model. There is always the danger that misspecification of a model leads to erroneous valuation and hedging strategies. Hence, there are genuine needs for flexible stochastic modeling. Nonparametric methods offer a unified and elegant treatment for such a purpose. Nonparametric approaches have recently been introduced to estimate return, volatility, transition densities and state price densities of stock prices and bond yields (interest rates). They are also useful for examining the extent to which the dynamics of stock prices and bond yields vary over time. They have immediate applications to the valuation of bond price and stock options and management of market risks. They can also be employed to test economic theory such as the capital asset pricing model and stochastic discount model [28] and answer questions such as if the geometric Brownian motion fits certain stock indices, whether the Cox– Ingersoll–Ross model fits yields of bonds, and if interest rate dynamics evolve with time. Furthermore, based on empirical data, one can also fit directly the observed option prices with their associated characteristics such as strike price, the time to maturity, risk-free interest rate, dividend yield and see if the option prices are consistent with the theoretical ones. Needless to say, nonparametric techniques will play an increasingly important role in financial econometrics, thanks to the availability of modern computing power and the development of financial econometrics. The paper is organized as follows. We first introduce in Section 2 some useful stochastic models for modeling stock prices and bond yields and then briefly outline some probabilistic aspects of the models. In Section 3 we review nonparametric techniques used for estimating the drift and diffusion functions, based on either discretely or continuously observed data. In Section 4 we outline techniques for estimating state price densities and transition densities. Their applications in asset pricing and testing for parametric diffusion models are also introduced. Section 5 makes some concluding remarks. 2. STOCHASTIC DIFFUSION MODELS

Much of financial econometrics is concerned with asset pricing, portfolio choice and risk management. Stochastic diffusion models have been widely used for describing the dynamics of underlying economic variables and asset prices. They form the basis of many

spectacularly beautiful formulas for pricing contingent claims. For an introduction to financial derivatives, see Hull [78]. 2.1 One-Factor Diffusion Models

Let St denote the stock price observed at time t. The time unit can be hourly, daily, weekly, among others. Presented in Figure 1(a) are the daily log-returns, defined as 







log(St ) − log S(t−1) ≈ St − S(t−1) /S(t−1) , of the Standard and Poor’s 500 index, a value-weighted index based on the prices of the 500 stocks that account for approximately 70% of the total U.S. equity (stock) market capitalization. The styled features of the returns include that the volatility tends to cluster and that the (marginal) mean and variance of the returns tend to be constant. One simplified model to capture the second feature is that 



log(St ) − log S(t−1) ≈ µ0 + σ0 εt , where {εt } is a sequence of independent normal random variables. This is basically a random walk hypothesis, regarding the stock price movement as an independent random walk. When the sampling time unit  gets small, the above random walk can be regarded as a random sample from the continuous-time process: (1)

d log(St ) = µ0 + σ1 dWt ,

where {Wt } is a standard √ one-dimensional Brownian motion and σ1 = σ0 / . The process (1) is called geometric Brownian motion as St is an exponent of Brownian motion Wt . It was used by Osborne [92] to model the stock price dynamic and by Black and Scholes [23] to derive their celebrated option price formula. Interest rates are fundamental to financial markets, consumer spending, corporate earnings, asset pricing, inflation and the economy. The bond market is even bigger than the equity market. Presented in Figure 1(c) are the interest rates {rt } of the two-year U.S. Treasury notes at a weekly frequency. As the interest rates get higher, so do the volatilities. To appreciate this, Figure 1(d) plots the pairs {(rt−1 , rt − rt−1 )}. Its dynamic is very different from that of the equity market. The interest rates should be nonnegative. They possess heteroscedasticity in addition to the mean-revision property: As the interest rates rise above the mean level α, there is a negative drift that pulls the rates down; while when the interest rates fall below α, there is a positive force that drives the rates up. To capture these two

319

A SELECTIVE OVERVIEW

F IG . 1. (a) Daily log-returns of the Standard and Poor’s 500 index from October 21, 1980 to July 29, 2004. (b) Scatterplot of the returns against logarithm of the index ( price level). (c) Interest rates of two-year U.S. Treasury notes from June 4, 1976 to March 7, 2003 sampled at weekly frequency. (d) Scatterplot of the difference of yields versus the yields.

main features, Cox, Ingersoll and Ross [37] derived the following model for the interest rate dynamic: (2)

1/2

drt = κ(α − rt ) dt + σ rt

dWt .

For simplicity, we will refer it to as the CIR model. It is an amelioration of the Vasicek model [106], (3)

drt = κ(α − rt ) dt + σ dWt ,

which ignores the heteroscedasticity and is also referred to as the Ornstein–Uhlenbeck process. While this is an unrealistic model for interest rates, the process is Gaussian with explicit transition density. It fact, the time series sampled from (3) follows the autoregressive model of order 1, (4)

Yt = (1 − ρ)α + ρYt−1 + εt ,

where Yt = rt , ε ∼ N(0, σ 2 (1 − ρ 2 )/(2κ)) and ρ = exp(−κ). Hence, the process is well understood

and usually serves as a test case for proposed statistical methods. There are many stochastic models that have been introduced to model the dynamics of stocks and bonds. Let Xt be an observed economic variable at time t. This can be the price of a stock or a stock index, or the yield of a bond. A simple and frequently used stochastic model is dXt = µ(Xt ) dt + σ (Xt ) dWt .

(5)

The function µ(·) is often called a drift or instantaneous return function and σ (·) is referred to as a diffusion or volatility function, since µ(Xt ) = lim −1 E(Xt+ − Xt |Xt ), →0

σ (Xt ) = lim −1 var(Xt+ |Xt ). 2

→0

320

J. FAN

The time-homogeneous model (5) contains many famous one-factor models in financial econometrics. In an effort to improve the flexibility of modeling interest dynamics, Chan et al. [29] extends the CIR model (2) to the CKLS model, (6)

γ

dXt = κ(α − Xt ) dt + σ Xt dWt .

Aït-Sahalia [3] introduces a nonlinear mean reversion: while interest rates remain in the middle part of their domain, there is little mean reversion, and at the end of the domain, a strong nonlinear mean reversion emerges. He imposes the nonlinear drift of the form (α0 Xt−1 + α1 + α2 Xt + α2 Xt2 ). See also Ahn and Gao [1], which models the interest rates by Yt = Xt−1 , in which the Xt follows the CIR model. Economic conditions vary over time. Thus, it is reasonable to expect that the instantaneous return and volatility depend on both time and price level for a given state variable such as stock prices and bond yields. This leads to a further generalization of model (5) to allow the coefficients to depend on time t: (7)

dXt = µ(Xt , t) dt + σ (Xt , t) dWt .

Since only a trajectory of the process is observed [see Figure 1(c)], there is not sufficient information to estimate the bivariate functions in (7) without further restrictions. [To consistently estimate the bivariate volatility function σ (x, t), we need to have data that eventually fill up a neighborhood of the point (t, x).] A useful specification of model (7) is β (t)

(8) dXt = {α0 (t) + α1 (t)Xt } dt + β0 (t)Xt 1

dWt .

This is an extension of the CKLS model (6) by allowing the coefficients to depend on time and was introduced and studied by Fan et al. [48]. Model (8) includes many commonly used time-varying models for the yields of bonds, introduced by Ho and Lee [75], Hull and White [79], Black, Derman and Toy [21] and Black and Karasinski [22], among others. The experience in [48] and other studies of the varying coefficient models [26, 31, 74, 76] shows that coefficient functions in (8) cannot be estimated reliably due to the collinearity effect in local estimation: localizing in the time domain, the process {Xt } is nearly constant and hence α0 (t) and α1 (t) and β0 (t) and β1 (t) cannot easily be differentiated. This leads Fan et al. [48] to introduce the semiparametric model (9)

β

dXt = {α0 (t) + α1 Xt } dt + β0 (t)Xt dWt

to avoid the collinearity.

2.2 Some Probabilistic Aspects

The question when there exists a solution to the stochastic differential equation (SDE) (7) arises naturally. Such a program was first carried out by Itô [80, 81]. For SDE (7), there are two different meanings of solution: strong solution and weak solution. See Sections 5.2 and 5.3 of [84]. Basically, for a given initial condition ξ , a strong solution requires that Xt is determined completely by the information up to time t. Under Lipschitz and linear growth conditions on the drift and diffusion functions, for every ξ that is independent of {Ws }, there exists a strong solution of equation (7). Such a solution is unique. See Theorem 2.9 of [84]. For the one-dimensional time-homogeneous diffusion process (5), weaker conditions can be obtained for the so-called weak solution. By an application of the Itô formula to an appropriate transform of the process, one can make the transformed process have zero drift. Thus, we can consider without loss of generality that the drift in (5) is zero. For such a model, Engelbert and Schmidt [45] give a necessary and sufficient condition for the existence of the solution. The continuity of σ suffices for the existence of the weak solution. See Theorem 5.5.4 of [84], page 333, and Theorem 23.1 of [83]. We will use several times the Itô formula. For the process Xt in (7), for a sufficiently regular function f ([84], page 153), 

df (Xt , t) =

∂f (Xt , t) ∂t



1 ∂ 2 f (Xt , t) 2 + (10) σ (Xt , t) dt 2 ∂x 2 ∂f (Xt , t) dXt . + ∂x The formula can be understood as the second-order Taylor expansion of f (Xt+ , t + ) − f (Xt , t) by noticing that (Xt+ − Xt )2 is approximately σ 2 (Xt , t). The Markovian property plays an important role in statistical inference. According to Theorem 5.4.20 of [84], the solution Xt to equation (5) is Markovian, provided that the coefficient functions µ and σ are bounded on compact subsets. Let p (y|x) be the transition density, the conditional density of Xt+ = y given Xt = x. The transition density must satisfy the forward and backward Kolmogorov equations ([84], page 282). Under the linear growth and Lipschitz conditions, and additional conditions on the boundary behavior of

321

A SELECTIVE OVERVIEW

the functions µ and σ , the solution to equation (1) is positive and ergodic. The invariant density is given by f (x) = 2C0 σ −2 (x) (11)



· exp −2

 x



µ(y)σ −2 (y) dy ,

.

where C0 is a normalizing constant and the lower limit of the integral does not matter. If the initial distribution is taken from the invariant density, then the process {Xt } is stationary with the marginal density f and transition density p . Stationarity plays an important role in time series analysis and forecasting [50]. The structural invariability allows us to forecast the future based on the historical data. For example, the structural relation (e.g., the conditional distribution, conditional moments) between Xt and Xt+ remains the same over time t. This makes it possible to use historical data to estimate the invariant quantities. Associated with stationarity is the concept of mixing, which says that the data that are far apart in time are nearly independent. We now describe the conditions under which the solution to the SDE (1) is geometrically mixing. Let Ht be the operator defined by (12)





(Ht g)(x) = E g(Xt )|X0 = x ,

x ∈ R,

where f is a Borel measurable bounded function on R. A stationary process Xt is said to satisfy the condition G2 (s, α) of Rosenblatt [95] if there exists an s such that Hs 22 =

E(Hs f )2 (X) ≤ α 2 < 1, Ef 2 (X) {f : Ef (X)=0} sup

namely, the operator is contractive. As a consequence of the semigroup (Hs+t = Hs Ht ) and contraction properties, the condition G2 implies [16, 17] that for any t ∈ [0, ∞), Ht 2 ≤ α t/s−1 . The latter implies, by the Cauchy–Schwarz inequality, that 



(13) ρ(t) = sup corr g1 (X0 ), g2 (Xt ) ≤ α t/s−1 , g1 ,g2

that is, the ρ-mixing coefficient decays exponentially fast. Banon and Nguyen [18] show further that for a stationary Markov process, ρ(t) → 0 is equivalent to (13), namely, ρ-mixing and geometric ρ-mixing are equivalent.

2.3 Valuation of Contingent Claims

An important application of SDE is the pricing of financial derivatives such as options and bonds. It forms a beautiful modern asset pricing theory and provides useful guidance in practice. Steele [105], Duffie [42] and Hull [78] offer very nice introductions to the field. The simplest financial derivative is the European call option. A call option is the right to buy an asset at a certain price K (strike price) before or at expiration time T . A put option gives the right to sell an asset at a certain price K (strike price) before or at expiration. European options allow option holders to exercise only at maturity while American options can be exercised at any time before expiration. Most stock options are American, while options on stock indices are European. The payoff for a European call option is (XT − K)+ , where XT is the price of the stock at expiration T . When the stock rises above the strike price K, one can exercise the right and make a profit of XT − K. However, when the stock falls below K, one renders one’s right and makes no profit. Similarly, a European put option has payoff (K − XT )+ . See Figure 2. By creating a portfolio with different maturities and different strike prices, one can obtain all kinds of payoff functions. As an example, suppose that a portfolio of options consists of contracts of the S&P 500 index maturing in six months: one call option with strike price $1,200, one put option with strike price $1,050 and $40 cash, but with short position (borrowing or −1 contract) on a call option with strike price $1,150 and on a put option with strike price $1,100. Figure 2(c) shows the payoff function of such a portfolio of options at the expiration T . Clearly, such an investor bets the S&P 500 index will be around $1,125 in six months and limits the risk exposure on the investment (losing at most $10 if his/her bet is wrong). Thus, the European call and put options are fundamental options as far as the payoff function at time T is concerned. There are many other exotic options such as Asian options, look-back options and barrier options, which have different payoff functions, and the payoffs can be path dependent. See Chapter 18 of [78]. Suppose that the asset price follows the SDE (7) and there is a riskless investment alternative such as a bond which earns compounding rate of interest rt . Suppose that the underlying asset pays no dividend. Let βt be the value of the riskless bond at time t. Then, with an initial investment β0 , βt = β0 exp

 t 0



rs ds ,

322

J. FAN

F IG . 2. (a) Payoff of a call option. (b) Payoff of a put option. (c) Payoff of a portfolio of four options with different strike prices and different (long and short) positions.

thanks to the compounding of interest. Suppose that a probability measure Q is equivalent to the original probability measure P , namely P (A) = 0 if and only if Q(A) = 0. The measure Q is called an equivalent martingale measure for deflated price processes of given securities if these processes are martingales with respect to Q. An equivalent martingale measure is also referred to as a “risk-neutral” measure if the deflater is the bond price process. See Chapter 6 of [42]. When the markets are dynamically complete, the price of the European option with payoff (XT ) with initial price X0 = x0 is 

(14) P0 = exp −

 T 0







rs ds E Q (XT )|X0 = x0 ,

where Q is the equivalent martingale measure for the deflated price process Xt /βt . Namely, it is the discounted value of the expected payoff in the risk neutral world. The formula is derived by using the so-called relative pricing approach, which values the price of the option from given prices of a portfolio consisting of a risk-free bond and a stock with the identical payoff as the option at the expiration. As an illustrative example, suppose that the price of a stock follows the geometric Brownian motion dXt = µXt dt + σ Xt dWt and that the risk-free rate r is constant. Then the deflated price process Yt = exp(−rt)Xt follows the SDE dYt = (µ − r)Yt dt + σ Yt dWt . The deflated price process is not a martingale as the drift is not zero. The risk-neutral measure is the one

that makes the drift zero. To achieve this, we appeal to the Girsanov theorem, which changes the drift of a diffusion process without altering the diffusion via a change of probability measure. Under the “riskneutral” probability measure Q, the process Yt satisfies dYt = σ Yt dWt , a martingale. Hence, the price process Xt = exp(rt)Yt under Q follows (15)

dXt = rXt dt + σ Xt dWt .

Using exactly the same derivation, one can easily generalize the result to the price process (5). Under the risk-neutral measure, the price process (5) follows (16)

dXt = rXt dt + σ (Xt ) dWt .

The intuitive explanation of this is clear: all stocks under the “risk-neutral” world are expected to earn the same rate as the risk-free bond. For the geometric Brownian motion, by an application of the Itô formula (10) to (15), we have under the “risk-neutral” measure (17)

log Xt − log X0 = (r − σ 2 /2)t + σ 2 Wt .

Note that given the initial price X0 , the price follows a log-normal distribution. Evaluating the expectation of (14) for the European call option with payoff (XT ) = (XT − K)+ , one obtains the Black–Scholes [23] option pricing formula P0 = x0 (d1 ) − K exp(−rT ) (d2 ), √ 2 /2)T }{σ T }−1 and /K) + (r + σ where d1 = {log(x 0 √ d2 = d1 − σ T . (18)

A SELECTIVE OVERVIEW

2.4 Simulation of Stochastic Models

Simulation methods provide useful tools for the valuation of financial derivatives and other financial instruments when the analytical formula (14) is hard to obtain. For example, if the price under the “riskneutral” measure is (16), the analytical formula for pricing derivatives is usually not analytically tractable and simulation methods offer viable alternatives (together with variance reduction techniques) to evaluate it. They also provide useful tools for assessing performance of statistical methods and statistical inferences. The simplest method is perhaps the Euler scheme. The SDE (7) is approximated as (19) Xt+ = Xt + µ(t, Xt ) + σ (t, Xt )1/2 εt , where {εt } is a sequence of independent random variables with the standard normal distribution. The time unit is usually a year. Thus, the monthly, weekly and daily data correspond, respectively, to  = 1/12, 1/52 and 1/252 (there are approximately 252 trading days per year). Given an initial value, one can recursively apply (19) to obtain a sequence of simulated data {Xj  , j = 1, 2, . . .}. The approximation error can be reduced if one uses a smaller step size /M for a given integer M to first obtain a more detailed sequence {Xj /M , j = 1, 2, . . .} and then one takes the subsequence {Xj  , j = 1, 2, . . .}. For example, to simulate daily prices of a stock, one can simulate hourly data first and then take the daily closing prices. Since the step size /M is smaller, the approximation (19) is more accurate. However, the computational cost is about a factor of M higher. The Euler scheme has convergence rate 1/2 , which is called strong order 0.5 approximation by Kloeden et al. [87]. The higher-order approximations can be obtained by the Itô–Taylor expansion (see [100], page 242). In particular, a strong order-one approximation is given by (20)

Xt+ = Xt + µ(t, Xt ) + σ (t, Xt )1/2 εt + 12 σ (t, Xt )σx (t, Xt ){εt2 − 1},

where σx (t, x) is the partial derivative function with respect to x. This method can be combined with a smaller step size method in the last paragraph. For the timehomogeneous model (1), an alternative form, without evaluating the derivative function, is given in (3.14) of [87]. The exact simulation method is available if one can simulate the data from the transition density. Given the

323

current value Xt = x0 , one draws Xt+ from the transition density p (·|x0 ). The initial condition can either be fixed at a given value or be generated from the invariant density (11). In the latter case, the generated sequence is stationary. There are only a few processes where exact simulation is possible. For GBM, one can generate the sequence from the explicit solution (17), where the Brownian motion can be simulated from independent Gaussian increments. The conditional density of Vasicek’s model (3) is Gaussian with mean α + (x0 − α)ρ and variance σ2 = σ 2 (1 − ρ 2 )/(2κ) as indicated by (4). Generate X0 from the invariant density N(α, σ 2 /(2κ)). With X0 , generate X from the normal distribution with mean α + (X0 − α) exp(−κ) and variance σ2 . With X , we generate X2 from mean α + (X − α) exp(−κ) and variance σ2 . Repeat this process until we obtain the desired length of the process. For the CIR model (2), provided that q = 2κα/σ 2 − 1 ≥ 0 (a sufficient condition for Xt ≥ 0), the transition density is determined by the fact that given Xt = x0 , 2cXt+ has a noncentral χ 2 distribution with degrees of freedom 2q + 2 and noncentrality parameter 2u, where c = 2κ/{σ 2 (1 − exp(−κ))}, u = cx0 exp(k). The invariant density is the Gamma distribution with shape parameter q + 1 and scale parameter σ 2 /(2κ). As an illustration, we consider the CIR model (7) with parameters κ = 0.21459, α = 0.08571, σ = 0.07830 and  = 1/12. The model parameters are taken from [30]. We simulated 1000 monthly data values using both the Euler scheme (19) and the strong order-one approximation (20) with the same random shocks. Figure 3 depicts one of their trajectories. The difference is negligible. This is in line with the observations made by Stanton [104] that as long as data are sampled monthly or more frequently, the errors introduced by using the Euler approximation are very small for stochastic dynamics that are similar to the CIR model. 3. ESTIMATION OF RETURN AND VOLATILITY FUNCTIONS

There is a large literature on the estimation of the return and volatility functions. Early references include [93] and [94]. Some studies are based on continuously observed data while others are based on discretely observed data. For the latter, some regard  tending to zero while others regard  fixed. We briefly introduce some of the ideas.

324

J. FAN

F IG . 3. Simulated trajectories (multiplied by 100) using the Euler approximation and the strong order-one approximation for a CIR model. Top panel: solid curve corresponds to the Euler approximation and the dashed curve is based on the order-one approximation. Bottom panel: the difference between the order-one scheme and the Euler scheme.

3.1 Methods of Estimation

We first outline several methods of estimation for parametric models. The idea can be extended to nonparametric models. Suppose that we have a sample {Xi , i = 0, . . . , n} from model (5). Then, the likelihood function, under the stationary condition, is (21)

log f (X0 ) +

n 





log p Xi |X(i−1) .

i=1

If the functions µ and σ are parameterized and the explicit form of the transition density is available, one can apply the maximum likelihood method. However, the explicit form of the transition density is not available for many simple models such as the CLKS model (6). Even for the CIR model (2), its maximum likelihood

estimator is very difficult to find, as the transition density involves the modified Bessel function of the first kind. One simple technique is to rely on the Euler approximation scheme (19). Then proceed as if the data come from the Gaussian location and scale model. This method works well when  is small, but can create some biases when  is large. However, the bias can be reduced by the following calibration idea, called indirect inference by Gouriéroux et al. [61]. The idea works as follows. Suppose that the functions µ and σ have been parameterized with unknown parameters θ . Use the Euler approximation (19) and the maximum likelihood method to obtain an estimate θˆ0 . For each given parameter θ around θˆ0 , simulate data from (5) and apply the crude method to obtain an estimate θˆ1 (θ ) which depends on θ . Since we simulated the data with the true

A SELECTIVE OVERVIEW

parameter θ , the function θˆ1 (θ ) tells us how to calibrate the estimate. See Figure 4. Calibrate the estimate via θˆ1−1 (θˆ0 ), which improves the bias of the estimate. One drawback of this method is that it is intensive in computation and the calibration cannot easily be done when the dimensionality of parameters θ is high. Another method for bias reduction is to approximate the transition density in (21) by a higher order approximation, and to then maximize the approximated likelihood function. Such a scheme has been introduced by Aït-Sahalia [4, 5], who derives the expansion of the transition density around a normal density function using Hermite polynomials. The intuition behind such an expansion is that the diffusion process Xt+ − Xt in (5) can be regarded as sum of many independent increments with a very small step size and hence the Edgeworth expansion can be obtained for the distribution of Xt+ − Xt given Xt . See also [43]. An “exact” approach is to use the method of moments. If the process Xt is stationary as in the interest-rate models, the moment conditions can easily be derived by observing 

−1

E lim  →0



E[g(Xt+ ) − g(Xt )|Xt ]

= lim −1 E[g(Xt+ ) − g(Xt )] = 0 →0

for any function g satisfying the regularity condition that the limit and the expectation are exchangeable.

325

The right-hand side is the expectation of dg(Xt ). By Itô’s formula (10), the above equation reduces to (22)

E[g (Xt )µ(Xt ) + g (Xt )σ 2 (Xt )/2] = 0.

For example, if g(x) = exp(−ax) for some given a > 0, then E exp(−aXt ){µ(Xt ) − aσ 2 (Xt )/2} = 0. This can produce an arbitrary number of equations by choosing different a’s. If the functions µ and σ are parameterized, the number of moment conditions can be more than the number of equations. One way to efficiently use this is the generalized method of moments introduced by Hansen [65], minimizing a quadratic form of the discrepancies between the empirical and the theoretical moments, a generalization of the classical method of moments which solves the moment equations. The weighting matrix in the quadratic form can be chosen to optimize the performance of the resulting estimator. To improve the efficiency of the estimate, a large system of moments is needed. Thus, the generalized method of moments needs a large system of nonlinear equations which can be expensive in computation. Further, the moment equations (22) use only the marginal information of the process. Hence, the procedure is not efficient. For example, in the CKLS model (6), σ and κ are estimable via (22) only through σ 2 /κ.

F IG . 4. The idea of indirect inference. For each given true θ , one obtains an estimate using the Euler approximation and the simulated data. This gives a calibration curve as shown. Now for a given estimate θˆ0 = 3 based on the Euler approximation and real data, one finds the calibrated estimate θˆ1−1 (3) = 2.080.

326

J. FAN

3.2 Time-Homogeneous Model

The Euler approximation can easily be used to estimate the drift and diffusion nonparametrically. Let Yi = −1 (X(i+1) − Xi ) and Zi = −1 (X(i+1) − Xi )2 . Then E(Yi |Xi ) = µ(Xi ) + O() and E(Zi |Xi ) = σ 2 (Xi ) + O(). Thus, µ(·) and σ 2 (·) can be approximately regarded as the regression functions of Yi and Zi on Xi , respectively. Stanton [104] applies kernel regression [102, 107] to estimate the return and volatility functions. Let K(·) be a kernel function and h be a bandwidth. Stanton’s estimators are given by n−1

µ(x) ˆ =

i=0

i=0



Kh (Xi − x)

rather than Zi to estimate the volatility function. This will further reduce the approximation errors in the volatility estimation. They show further that the conditional variance function can be estimated as well as if the conditional mean function is known in advance. Stanton [104] derives a higher-order approximation scheme up to order three in an effort to reduce biases. He suggests that higher-order approximations must outperform lower-order approximations. To verify such a claim, Fan and Zhang [53] derived the following order k approximation scheme: ∗ |Xi ) = µ(Xi ) + O(k ), E(Yi ∗ |Xi ) = σ 2 (Xi ) + O(k ), E(Zi

where

and n−1

σˆ 2 (x) =

i=0

Zi Kh (Xi − x)

n−1 i=0

Kh (Xi − x)

where Kh (u) = is a rescaled kernel. The consistency and asymptotic normality of the estimator are studied in [15]. Fan and Yao [49] apply the local linear technique (Section 6.3 in [50]) to estimate the return and volatility functions, under a slightly different setup. The local linear estimator [46] is given by µ(x) ˆ =

n−1 

Kn (Xi − x, x)Yi ,

i=0

where (24) Kn (u, x) = Kh (u)

Sn,2 (x) − uSn,1 (x) , Sn,2 (x)Sn,0 (x) − Sn,1 (x)2



j with Sn,j (x) = n−1 i=0 Kh (Xi − x)(Xi − x) , is the equivalent kernel induced by the local linear fit. In contrast to the kernel method, the local linear weights depend on both Xi and x. In particular, they satisfy n−1 

Kn (Xi − x, x) = 1

i=1

and n−1  i=1

∗ Yi = −1

,

h−1 K(u/ h)

(23)

2

ˆ i ) −1 X(i+1) − Xi − µ(X

(25)

Yi Kh (Xi − x)

n−1

These are the key properties for the bias reduction of the local linear method as demonstrated in [46]. Further, Fan and Yao [49] use the squared residuals

Kn (Xi − x, x)(Xi − x) = 0.

k 







2

ak,j X(i+j ) − Xi

j =1

and ∗ = −1 Zi

k 

ak,j X(i+j ) − Xi

j =1

 

and the coefficients ak,j = (−1)j +1 jk j are chosen to make the approximation error in (25) of order k . For example, the second approximation is 1.5(Xt+ − Xt ) − 0.5(Xt+2 − Xt+ ). By using the independent increments of Brownian motion, its variance is 1.52 + 0.52 = 2.5 times as large as that of the first-order difference. Indeed, Fan and Zhang [53] show that while higher-order approximations give better approximation errors, we have to pay a huge premium for variance inflation, ∗ |Xi ) = σ 2 (Xi )V1 (k)−1 {1 + O()}, var(Yi ∗ |Xi ) = 2σ 4 (Xi )V2 (k){1 + O()}, var(Zi

where the variance inflation factors V1 (k) and V2 (k) are explicitly given by Fan and Zhang [53]. Table 1 shows some of the numerical results for the variance inflation factor. The above theoretical results have also been verified via empirical simulations in [53]. The problem is no monopoly for nonparametric fitting—it is shared by

A SELECTIVE OVERVIEW TABLE 1 Variance inflation factors by using higher-order differences Order k

V1 (k) V2 (k)

1

2

3

1.00 1.00

2.50 3.00

4.83 8.00

4 9.25 21.66

5 18.95 61.50

the parametric methods. Therefore, the methods based on higher-order differences should seldomly be used unless the sampling interval is very wide (e.g., quarterly data). It remains open whether it is possible to estimate nonparametrically the return and the volatility functions without seriously inflating the variance with other higher-order approximation schemes.

327

As an illustration, we take the yields of the two-year Treasury notes depicted in Figure 1. Figure 5 presents nonparametrically estimated volatility functions, based on order k = 1 and k = 2 approximations. The local linear fit is employed with the Epanechnikov kernel and bandwidth h = 0.35. It is evident that the order two approximation has higher variance than the order one approximation. In fact, the magnitude of variance inflation is in line with the theoretical result: the increase √ of the standard deviation is 3 from order one to order two approximation. Various discretization schemes and estimation methods have been proposed for the case with high frequency data over a long time horizon. More precisely, the studies are under the assumptions that n → 0 and nn → ∞. See, for example, [12, 27, 39, 58, 59, 85, 109] and references therein. Arapis

F IG . 5. Nonparametric estimates of volatility based on order one and two differences. The bars represent two standard deviations above and below the estimated volatility. Top panel: order one fit. Bottom panel: order two fit.

328

J. FAN

and Gao [11] investigate the mean integrated square error of several methods for estimating the drift and diffusion and compare their performances. Aït-Sahalia and Mykland [9, 10] study the effects of random and discrete sampling when estimating continuous-time diffusions. Bandi and Nguyen [14] investigate small sample behavior of nonparametric diffusion estimators. Thorough study of nonparametric estimation of conditional variance functions can be found in [62, 69, 91, 99]. In particular, Section 8.7 of [50] gives various methods for estimating the conditional variance function. Wang [108] studies the relationship between diffusion and GARCH models. 3.3 Model Validation

Stanton [104] applies his kernel estimator to a Treasury bill data set and observes a nonlinear return function in his nonparametric estimate, particularly in the region where the interest rate is high (over 14%, say). This leads him to postulate the hypothesis that the return functions of short-term rates are nonlinear. Chapman and Pearson [30] study the finite sample properties of Stanton’s estimator. By applying his procedure to the CIR model, they find that Stanton’s procedure produces spurious nonlinearity, due to the boundary effect and the mean reversion. Can we apply a formal statistics test to Stanton’s hypothesis? The null hypothesis can simply be formulated: the drift is of a linear form as in model (6). What is the alternative hypothesis? For such a problem our alternative model is usually vague. Hence, it is natural to assume that the drift is a nonlinear smooth function. This becomes a testing problem with a parametric null hypothesis versus a nonparametric alternative hypothesis. There is a large body of literature on this. The basic idea is to compute a discrepancy measure between the parametric estimates and nonparametric estimates and to reject the parametric hypothesis when the discrepancy is large. See, for example, the book by Hart [73]. In an effort to derive a generally applicable principle, Fan et al. [54] propose the generalized likelihood ratio (GLR) tests for parametric-versus-nonparametric or nonparametric-versus-parametric hypotheses. The basic idea is to replace the maximum likelihood under a nonparametric hypothesis (which usually does not exist) by the likelihood under good nonparametric estimates. Section 9.3 of [50] gives details on the implementation of the GLR tests, including estimating P -values, bias reduction and bandwidth selection. The method has been successfully employed by Fan and

Zhang [53] for checking whether the return and volatility functions possess certain parametric forms. Another viable approach of model validation is to base it on the transition density. One can check whether the nonparametrically estimated transition density is significantly different from the parametrically estimated one. Section 4.3 provides some additional details. Another approach, proposed by Hong and Li [77], uses the fact that under the null hypothesis the random variables {Zi } are a sequence of i.i.d. uniform random variables where Zi = P (Xi |X(i−1) , θ ) and P (y|x, θ ) is the transition distribution function. They propose to detect the departure from the null hypothesis by comparing the kernel-estimated bivariate density of {(Zi , Zi+1 )} with that of the uniform distribution on the unit square. The transition-densitybased approaches appear more elegant as they check simultaneously the forms of drift and diffusion. However, the transition density does often not admit an analytic form and the tests can be computationally intensive. 3.4 Fixed Sampling Interval

For practical analysis of financial data, it is hard to determine whether the sampling interval tends to zero. The key determination is whether the approximation errors for small “” are negligible. It is ideal when a method is applicable whether or not “” is small. This kind of method is possible, as demonstrated below. The simplest problem to illustrate the idea is the kernel density estimation of the invariant density of the stationary process {Xt }. For the given sample {Xt }, the kernel density estimate for the invariant density is (26)

fˆ(x) = n−1

n 

Kh (Xi − x),

i=1

based on the discrete data {Xi , i = 1, . . . , n}. This method is valid for all . It gives a consistent estimate of f as long as the time horizon is long: n → ∞. We will refer to this kind of nonparametric method as state-domain smoothing, as the procedure localizes in the state variable Xt . Various properties, including consistency and asymptotic normality, of the kernel estimator (26) are studied by Bandi [13] and Bandi and Phillips [15]. Bandi [13] also uses the estimator (26), which is the same as the local time of the process spending at a point x except for a scaling constant, as a descriptive tool for potentially nonstationary diffusion processes. Why can the state-domain smoothing methods be employed as if the data were independent? This is due

A SELECTIVE OVERVIEW

to the fact that localizing in the state domain weakens the correlation structure and that nonparametric estimates use essentially only local data. Hence many results on nonparametric estimators for independent data continue to hold for dependent data as long as their mixing coefficients decay sufficiently fast. As mentioned at the end of Section 2.2, geometric mixing and mixing are equivalent for time-homogeneous diffusion processes. Hence, the mixing coefficients decay usually sufficiently fast for theoretical investigation. The localizing and whitening can be understood graphically in Figure 6. Figure 6(a) shows that there is very strong serial correlation of the yields of the twoyear Treasury notes. However, this correlation is significantly weakened for the local data in the neighborhood 8% ± 0.2%. In fact, as detailed in Figure 6(b), the in-

329

dices of the data that fall in the local window are quite far apart. This in turn implies the weak dependence for the data in the local window, that is, “whitening by windowing.” See Section 5.4 of [50] and Hart [72] for further details. The effect of dependence structure on kernel density estimation was thoroughly studied by Claeskens and Hall [35]. The diffusion function can also be consistently estimated when  is fixed. In pricing the derivatives of interest rates, Aït-Sahalia [2] assumes µ(x) = k(α − x). Using the kernel density estimator fˆ and estimated κ and α from a least-squares method, he applied (11) to estimate σ (·) : σˆ 2 (x) = 2 0x µ(u) ˆ fˆ(u) du/fˆ(x). He further established the asymptotic normality of such an estimator. Gao and King [56] propose tests of diffusion models based on the discrepancy between the paramet-

F IG . 6. (a) Lag 1 scatterplot of the two-year Treasury note data. (b) Lag 1 scatterplot of those data falling in the neighborhood 8% ± 0.2%—the points are represented by the times of the observed data. The numbers in the scatterplot show the indices of the data falling in the neighborhood. (c) Kernel density estimate of the invariant density.

330

J. FAN

ric and nonparametric estimates of the invariant density. The Aït-Sahalia method [2] easily illustrates that the volatility function can be consistently estimated for fixed . However, we do not expect that it is efficient. Indeed, we use only the marginal information of the data. As shown in (21), almost all information is contained in the transition density p (·|·). The transition density can be estimated as in Section 4.2 below whether  is small or large. Since the transition density and drift and volatility are in one-to-one correspondence for the diffusion process (5), the drift and diffusion functions can be consistently estimated via inverting the relationship between the transition density and the drift and diffusion functions. There is no simple formula for expressing the drift and diffusion in terms of the transition density. The inversion is frequently carried out via a spectral analysis of the operator H = exp(L), where the infinitesimal operator L is defined as Lg(x) =

σ 2 (x) g (x) + µ(x)g (x). 2

It has the property Lg(x) = lim −1 [E{g(Xt+ )|Xt = x} − g(x)] →0

by Itô’s formula (10). The operator H is the transition operator in that [see also (12)] H g(x) = E{g(X )|X0 = x}. The works of Hansen and Scheinkman [66], Hansen, Scheinkman and Touzi [67] and Kessler and Sørensen [86] consist of the following idea. The first step is to estimate the transition operator H from the data. From the transition operator, one can identify the infinitesimal operator L and hence the functions µ(·) and σ (·). More precisely, let λ1 be the largest negative eigenvalue of the operator L with eigenfunction ξ1 (x). Then Lξ1 = λ1 ξ1 , or equivalently, σ 2 ξ1 + 2µξ1 = 2λ1 ξ1 . This gives one equation of µ and σ . Another equation can be obtained via (11): (σ 2 f ) − 2µf = 0. Solving these two equations we obtain σ 2 (x) = 2λ1

 x 0

ξ1 (y)f (y) dy/[f (x)ξ1 (x)]

and another explicit expression for µ(x). Using semigroup theory ([44], Theorem IV.3.7), ξ1 is also an eigenfunction of H with eigenvalue exp(λ1 ). Hence, the proposal is to estimate the invariant density f and the transition density p (y|x), which implies the values of λ1 and ξ1 . Gobet [58] derives the optimal rate

of convergence for such a scheme, using a wavelet basis. In particular, [58] shows that for fixed , the optimal rates of convergence for µ and σ are of orders O(n−s/(2s+5) ) and O(n−s/(2s+3) ), respectively, where s is the degree of smoothness of µ and σ . 3.5 Time-Dependent Model

The time-dependent model (8) was introduced to accommodate the possibility of economic changes over time. The coefficient functions in (8) are assumed to be slowly time-varying and smooth. Nonparametric techniques can be applied to estimate these coefficient functions. The basic idea is to localizing in time, resulting in a time-domain smoothing. We first estimate the coefficient functions α0 (t) and α1 (t). For each given time t0 , approximate the coefficient functions locally by constants, α(t) ≈ a and β(t) = b for t in a neighborhood of t0 . Using the Euler approximation (19), we run a local regression: Minimize (27)

n−1 

(Yi − a − bXi )2 Kh (i − t0 )

i=0

with respect to a and b. This results in an estimate ˆ where aˆ and bˆ are the αˆ 0 (t0 ) = aˆ and αˆ 1 (t0 ) = b, minimizers of the local regression (27). Fan et al. [48] suggest using a one-sided kernel such as K(u) = (1 − u2 )I (−1 < u < 0) so that only the historical data in the time interval (t0 − h, t0 ) are used in the above local regression. This facilitates forecasting and bandwidth selection. Our experience shows that there are no significant differences between nonparametric fitting with one-sided and two-sided kernels. We opt for local constant approximations instead of local linear approximations in (27), since the local linear fit can create artificial albeit insignificant linear trends when the underlying functions α0 (t) and α1 (t) are indeed timeindependent. To appreciate this, for constant functions α1 and α2 a large bandwidth will be chosen to reduce the variance in the estimation. This is in essence fitting a global linear regression by (27). If the local linear approximations are used, since no variable selection procedures have been incorporated in the local fitting (27), the slopes of the local linear approximations will not be estimated as zero and hence artificial linear trends will be created for the estimated coefficients. The coefficient functions in the volatility can be estimated by the local approximated likelihood method. Let





Eˆ t = −1/2 Xt+ − Xt − αˆ 0 (t) + αˆ 1 (t)Xt 

A SELECTIVE OVERVIEW

be the normalized residuals. Then (28)

β (t) Eˆ t ≈ β0 (t)Xt 1 εt .

The conditional log-likelihood of Eˆ t given Xt can easily be obtained by the approximation (28). Using local constant approximations and incorporating the kernel weight, we obtain the local approximated likelihood at each time point and estimates of the functions β0 (·) and β1 (·) at that time point. This type of local approximated-likelihood method is related to the generalized method of moments of Hansen [65] and the ideas of Florens-Zmirou [55] and Genon-Catalot and Jacod [57]. Since the coefficient functions in both return and volatility functions are estimated using only historical data, their bandwidths can be selected based on a form of the average prediction error. See Fan et al. [48] for details. The local least-squares regression can also be applied to estimate the coefficient functions β0 (t) and β1 (t) via the transformed model [see (28)] log(Eˆ t2 ) ≈ 2 log β0 (t) + β1 (t) log(Xt2 ) + log(εt2 ), but we do not continue in this direction since the local least-squares estimate is known to be inefficient in the likelihood context and the exponentiation of an estimated coefficient function of log β0 (t) is unstable. The question arises naturally if the coefficients in the model (8) are really time-varying. This amounts, for example, to testing H0 : β0 (t) = β0 and β1 (t) = β1 .

331

Based on the GLR technique, Fan et al. [48] proposed a formal test for this kind of problem. The coefficient functions in the semiparametric model (9) can also be estimated by using the profile approximated-likelihood method. For each given β1 , one can easily estimate β0 (·) via the approximation (28), resulting in an estimate βˆ0 (·; β1 ). Regarding the nonparametric function β0 (·) as being parameterized by βˆ0 (·; β1 ), model (28) with β1 (t) ≡ β1 becomes a “synthesized” parametric model with unknown β1 . The parameter β1 can be estimated by the maximum (approximated) likelihood method. Note that β1 is estimated by using all the data points, while βˆ0 (t) = βˆ0 (t; βˆ1 ) is obtained by using only the local data points. See [48] for details. For other nonparametric methods of estimating volatility in time inhomogeneous models, see Härdle, Herwartz and Spokoiny [68] and Mercurio and Spokoiny [89]. Their methods are based on model (8) with α1 (t) = β1 (t) = 0. 3.6 State-Domain Versus Time-Domain Smoothing

So far, we have introduced both state- and timedomain smoothing. The former relies on the structural invariability implied by the stationarity assumption and depends predominantly on the (remote) historical data. The latter uses the continuity of underlying parameters and concentrates basically on the recent data. This is illustrated in Figure 7 using the yields of the threemonth Treasury bills from January 8, 1954 to July 16,

F IG . 7. Illustration of time- and state-domain smoothing using the yields of three-month Treasury bills. The state-domain smoothing is localized in the horizontal bars, while the time-domain smoothing is concentrated in the vertical bars.

332

J. FAN

2004 sampled at weekly frequency. On December 28, 1990, the interest rate was about 6.48%. To estimate the drift and diffusion around x = 6.48, the statedomain smoothing focuses on the dynamics where interest rates are around 6.48%, the horizontal bar with interest rates falling in 6.48% ± 0.25%. The estimated volatility is basically the sample standard deviation of the differences {Xi − X(i−1) } within this horizontal bar. On the other hand, the time-domain smoothing focuses predominantly on the recent history, say one year, as illustrated in the figure. The time-domain estimate of volatility is basically a sample standard deviation within the vertical bar. For a given time series, it is hard to say which estimate is better. This depends on the underlying stochastic processes and also on the time when the forecast is made. If the underlying process is continuous and stationary, such as model (5), both methods are applicable. For example, standing at December 28, 1990, one can forecast the volatility by using the sample standard deviation in either the horizontal bar or the vertical bar. However, the estimated precision depends on the local data. Since the sample variance is basically linear 2 }, the standard errors of in the squared differences {Zi both estimates can be assessed and used to guide the forecasting. For stationary diffusion processes, it is possible to integrate both the time-domain and state-domain estimates. Note that the historical data (with interest rates in 6.48% ± 0.25%) are far apart in time from the data used in the time-domain smoothing (vertical bar), except the last segment, which can be ignored in the statedomain fitting. The next-to-last segment with interest rates in 6.48% ± 0.25% is May 11 to July 20, 1988, 123 weeks prior to the last segment. Hence, these two estimates are nearly independent. The integrated estimate is a linear combination of these two nearly independent estimates. The weights can easily be chosen to minimize the variance of the integrated estimator, by using the assessed standard errors of the stateand time-domain estimators. The optimal weights are proportional to the variances of the two estimators, which depend on time t. This forms a dynamically integrated predictor for volatility estimation, as the optimal weights change over time. 3.7 Continuously Observed Data

At the theoretical level, one may also examine the problem of estimating the drift and diffusion functions assuming the whole process is observable up to time T .

Let us assume again that the observed process {Xt } follows the SDE (5). In this case σ 2 (Xt ) is the derivative of the quadratic variation process of Xt and hence is known up to time T . By (11), estimating the drift function µ(x) is equivalent to estimating the invariant density f . In fact, (29)

µ(x) = [σ 2 (x)f (x)] /[2f (x)].

The invariant density f can easily be estimated by kernel density estimation. When  → 0, the summation in (26) converges to (30)

fˆ(x) = T −1

 T 0

Kh (Xt − x) dt.

This forms a kernel density estimate of the invariant density based on the continuously observed data. Thus, an estimator for µ(x) can be obtained by substituting fˆ(x) into (29). Such an approach has been employed by Kutoyants [88] and Dalalyan and Kutoyants [40, 41]. They established the sharp asymptotic minimax risk for estimating the invariant density f and its derivative as well as the drift function µ. In particular, the functions f , f and µ can be estimated with rates T −1/2 , T −2s/(2s+1) and T −2s/(2s+1) , respectively, where s is the degree of smoothness of µ. These are the optimal rates of convergence. An alternative approach is to estimate the drift function directly from (23). By letting  → 0, one can easily obtain a local linear regression estimator for continuously observed data, which admits a similar form to (23) and (30). This is the approach that Spokoiny [103] used. He showed that this estimator attains the optimal rate of convergence and established further a data-driven bandwidth such that the local linear estimator attains adaptive minimax rates. 4. ESTIMATION OF STATE PRICE DENSITIES AND TRANSITION DENSITIES

The state price density (SPD) is the probability density of the value of an asset under the risk-neutral world (14) (see [38]) or equivalent martingale measure [71]. It is directly related to the pricing of financial derivatives. It is the transition density of XT given X0 under the equivalent martingale Q. The SPD does not depend on the payoff function and hence it can be used to evaluate other illiquid derivatives, once it is estimated from more liquid derivatives. On the other hand, the transition density characterizes the probability law of a Markovian process and hence is useful for validating Markovian properties and parametric models.


4.1 Estimation of the State Price Density

For some specific models, the state price density can be formed explicitly. For example, for the GBM (1) with a constant risk-free rate $r$, according to (17), the SPD is log-normal with mean $\log x_0 + (r - \sigma^2/2)T$ and variance $\sigma^2 T$. Assume that the SPD $f^*$ exists. Then the European call option can be expressed as

$$C = \exp\Bigl(-\int_0^T r_s\, ds\Bigr) \int_K^{\infty} (x - K) f^*(x)\, dx.$$

See (14) (we have changed the notation from $P_0$ to $C$ to emphasize the price of the European call option). Hence,

(31)  $f^*(K) = \exp\Bigl(\int_0^T r_s\, ds\Bigr)\, \frac{\partial^2 C}{\partial K^2}.$

This was observed by Breeden and Litzenberger [25]. Thus, the state price density can be estimated from the European call options with different strike prices. With the estimated state price density, one can price new or less liquid securities such as over-the-counter derivatives or nontraded options using formula (14). In general, the price of a European call option depends on the current stock price S, the strike price K, the time to maturity T , the risk-free interest rate r and dividend yield rate δ. It can be written as C(S, K, T , r, δ). The exact form of C, in general, is hard to determine unless we assume the Black–Scholes model. Based on historical data {(Ci , Si , Ki , Ti , ri , δi ), i = 1, . . . , n}, where Ci is the ith traded-option price with associated characteristics (Si , Ki , Ti , ri , δi ), AïtSahalia and Lo [7] fit the nonparametric regression Ci = C(Si , Ki , Ti , ri , δi ) + εi to obtain an estimate of the function C and hence the SPD f ∗ . Due to the curse of dimensionality, the five-dimensional nonparametric function cannot be estimated well with practical range of sample sizes. Aït-Sahalia and Lo [7] realized that and proposed a few dimensionality reduction methods. First, by assuming that the option price depends only on the futures price F = S exp((r − δ)T ), namely, C(S, K, T , r, δ) = C(F, K, T , r) (the Black–Scholes formula satisfies such an assumption), they reduced the dimensionality from five to four. By assuming further that the option-pricing function is homogeneous of degree one in F and K, namely, C(S, K, T , r, δ) = KC(F /K, T , r),

they reduced the dimensionality to three. Aït-Sahalia and Lo [7] imposed a semiparametric form on the pricing formula,

$$C(S, K, T, r, \delta) = C_{\mathrm{BS}}\bigl(F, K, T, r, \sigma(F, K, T)\bigr),$$

where $C_{\mathrm{BS}}(F, K, T, r, \sigma)$ is the Black–Scholes pricing formula given in (18) and $\sigma(F, K, T)$ is the implied volatility, computed by inverting the Black–Scholes formula. Thus, the problem becomes one of nonparametrically estimating the implied volatility function $\sigma(F, K, T)$. This is estimated by using a nonparametric regression technique from historical data, namely, $\sigma_i = \sigma(F_i, K_i, T_i) + \varepsilon_i$, where $\sigma_i$ is the implied volatility of $C_i$, obtained by inverting the Black–Scholes formula. By assuming further that $\sigma(F, K, T) = \sigma(F/K, T)$, the dimensionality is reduced to two. This is one of the options in [4]. The state price density $f^*$ is nonnegative and hence the function $C$ should be convex in the strike price $K$. Aït-Sahalia and Duarte [6] proposed to estimate the option price under the convexity constraint using a local linear estimator. See also [70] for a related approach.
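As a loose numerical illustration of how (31) can be used once a smooth call-price curve in the strike has been fitted (by any of the regression methods above), the second derivative can be approximated by central differences on a strike grid. The inputs below (an equally spaced strike grid, fitted call prices, a constant short rate) are illustrative assumptions, not part of the original presentation.

```python
import numpy as np

def spd_from_calls(strikes, call_prices, r, T):
    """Breeden-Litzenberger sketch: f*(K) ~ exp(r*T) * d^2 C / dK^2, as in (31).

    strikes     : equally spaced grid of strike prices K
    call_prices : fitted (smoothed) call prices C(K) on that grid
    r, T        : constant risk-free rate and time to maturity
    """
    dK = strikes[1] - strikes[0]
    # central second difference approximates the second derivative in (31)
    d2C = (call_prices[2:] - 2 * call_prices[1:-1] + call_prices[:-2]) / dK**2
    return strikes[1:-1], np.exp(r * T) * d2C
```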

4.2 Estimation of Transition Densities

The transition density of a Markov process characterizes the law of the process, except the initial distribution. It provides useful tools for checking whether or not such a process follows a certain SDE and for statistical estimation and inference. It is the state price density of the price process under the risk-neutral world. If such a process were observable, the state price density would be estimated using the methods to be introduced. Assume that we have a sample $\{X_{i\Delta}, i = 0, \ldots, n\}$ from model (5). The "double-kernel" method of Fan, Yao and Tong [51] is to observe that

(32)  $E\bigl\{W_{h_2}(X_{i\Delta} - y) \mid X_{(i-1)\Delta} = x\bigr\} \approx p_\Delta(y|x) \quad \text{as } h_2 \to 0,$

for a kernel function $W$. Thus, the transition density $p_\Delta(y|x)$ can be regarded approximately as the nonparametric regression function of the response variable $W_{h_2}(X_{i\Delta} - y)$ on $X_{(i-1)\Delta}$. An application of the local linear estimator (23) yields

(33)  $\hat p_\Delta(y|x) = \sum_{i=1}^n K_n\bigl(X_{(i-1)\Delta} - x, x\bigr)\, W_{h_2}(X_{i\Delta} - y),$


where the equivalent kernel $K_n(u, x)$ was defined in (24). Fan, Yao and Tong [51] establish the asymptotic normality of such an estimator under stationarity and $\rho$-mixing conditions [necessarily decaying at geometric rate for SDE (5)], which gives explicitly the asymptotic bias and variance of the estimator. See also Section 6.5 of [50]. The cross-validation idea of Rudemo [98] and Bowman [24] can be extended to select bandwidths for estimating conditional densities. See [52, 63]. The transition distribution can be estimated by integrating the estimator (33) over $y$. By letting $h_2 \to 0$, the estimator is the regression of the indicator $I(X_{i\Delta} < y)$ on $X_{(i-1)\Delta}$. Alternative estimators can be obtained by an application of the local logistic regression and adjusted Nadaraya–Watson method of Hall et al. [64]. Early references on the estimation of the transition distributions and densities include [96, 97] and [95].
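The following is a simplified sketch of the double-kernel idea in (32)-(33); it uses a local constant (Nadaraya–Watson) weight in place of the local linear equivalent kernel $K_n$ of (24), and all names are illustrative.

```python
import numpy as np

def gauss_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def transition_density(x_path, y, x, h1, h2):
    """Double-kernel estimate of p_Delta(y | x), local constant variant.

    x_path : observed series X_0, X_Delta, ..., X_{n Delta}
    y, x   : evaluation points of the conditional density
    h1, h2 : bandwidths in the conditioning and response directions
    """
    x_lag, x_now = x_path[:-1], x_path[1:]
    w_resp = gauss_kernel((x_now - y) / h2) / h2   # responses W_{h2}(X_{i Delta} - y)
    w_cond = gauss_kernel((x_lag - x) / h1)        # weights in X_{(i-1) Delta}
    return np.sum(w_cond * w_resp) / np.sum(w_cond)
```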

4.3 Inferences Based on Transition Densities

With the estimated transition density, one can now verify whether parametric models such as (1)–(3), (6) are consistent with the observed data. Let $p_{\Delta,\theta}(y|x)$ be the transition density under a parametric diffusion model. For example, for the CIR model (2), the parameter $\theta = (\kappa, \alpha, \sigma)$. As in (21), ignoring the initial value $X_0$, the parameter $\theta$ can be estimated by maximizing

$$\ell(p_{\Delta,\theta}) = \sum_{i=1}^n \log p_{\Delta,\theta}\bigl(X_{i\Delta} \mid X_{(i-1)\Delta}\bigr).$$

Let $\hat\theta$ be the maximum likelihood estimator. By the spirit of the GLR of Fan et al. [54], the GLR test for the null hypothesis $H_0\colon p_\Delta(y|x) = p_{\Delta,\theta}(y|x)$ is $\mathrm{GLR} = \ell(\hat p_\Delta) - \ell(p_{\Delta,\hat\theta})$, where $\hat p_\Delta$ is a nonparametric estimate of the transition density. Since the transition density cannot be estimated well over the region where data are sparse (usually at the boundaries of the process), we need to truncate the nonparametric (and simultaneously parametric) evaluation of the likelihood to appropriate intervals. In addition to employing the GLR test, one can also compare directly the difference between the parametric and nonparametric fits, resulting in test statistics such as $\|\hat p_\Delta - p_{\Delta,\hat\theta}\|^2$ and $\|\hat P_\Delta - P_{\Delta,\hat\theta}\|^2$ for an appropriate norm $\|\cdot\|$, where $\hat P_\Delta$ and $P_{\Delta,\hat\theta}$ are the estimates of the cumulative transition distributions under the nonparametric and parametric models, respectively.
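A schematic sketch of the truncated GLR comparison just described, assuming a nonparametric estimate (for example, the double-kernel estimator above) and a fitted parametric transition density are already available as functions; the interface is illustrative.

```python
import numpy as np

def truncated_glr(x_path, p_hat, p_par, lower, upper):
    """GLR = sum log p_hat - sum log p_par over a truncated state region.

    p_hat, p_par : callables (y, x) -> estimated transition density
    lower, upper : truncation interval excluding sparse boundary regions
    """
    x_lag, x_now = x_path[:-1], x_path[1:]
    keep = (x_lag > lower) & (x_lag < upper) & (x_now > lower) & (x_now < upper)
    pairs = list(zip(x_now[keep], x_lag[keep]))
    ll_np = np.sum([np.log(p_hat(y, x)) for y, x in pairs])
    ll_par = np.sum([np.log(p_par(y, x)) for y, x in pairs])
    return ll_np - ll_par
```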

The transition density-based methods depend on two bandwidths and are harder to implement. Indeed, their null distributions are harder to determine than those based on the transition distribution methods. In comparison with the invariant density-based approach of Arapis and Gao [11], however, the transition density-based approach is consistent against a much larger family of alternatives. One can also use the transition density to test whether an observed series is Markovian (from personal communication with Yacine Aït-Sahalia). For example, if a process $\{X_{i\Delta}\}$ is Markovian, then

$$p_{2\Delta}(y|x) = \int_{-\infty}^{+\infty} p_\Delta(y|z)\, p_\Delta(z|x)\, dz.$$

Thus, one can use the distance between $\hat p_{2\Delta}(y|x)$ and $\int_{-\infty}^{+\infty} \hat p_\Delta(y|z)\, \hat p_\Delta(z|x)\, dz$ as a test statistic. The transition density can also be used for parameter estimation. One possible approach is to find the parameter which minimizes the distance $\|\hat P_\Delta - P_{\Delta,\theta}\|$. In this case, the bandwidth should be chosen to optimize the performance for estimating $\theta$. The approach is applicable whether or not $\Delta \to 0$.
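A crude grid-based sketch of the Markov check just mentioned, comparing the estimated two-step density with its Chapman–Kolmogorov reconstruction; the callables and grid are assumed available and everything here is illustrative.

```python
import numpy as np

def chapman_kolmogorov_distance(p1, p2, grid):
    """L2 distance between p_{2 Delta}(y|x) and its reconstruction from p_Delta.

    p1, p2 : callables (y, x) -> estimated densities at lags Delta and 2*Delta
    grid   : 1-d array of evaluation points used for x, y and z
    """
    d2 = 0.0
    for x in grid:
        # inner integral of p_Delta(y|z) p_Delta(z|x) dz, for each y on the grid
        ck = np.array([np.trapz([p1(y, z) * p1(z, x) for z in grid], grid)
                       for y in grid])
        direct = np.array([p2(y, x) for y in grid])
        d2 += np.trapz((direct - ck) ** 2, grid)
    return d2
```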

5. CONCLUDING REMARKS

Enormous efforts in financial econometrics have been made in modeling the dynamics of stock prices and bond yields. These are directly related to pricing derivative securities, proprietary trading and portfolio management. Various parametric models have been proposed to facilitate mathematical derivations. They carry the risk that model misspecification leads to erroneous pricing and hedging strategies. Nonparametric models provide a powerful and flexible treatment. They aim at reducing modeling biases by increasing somewhat the variances of the resulting estimators. They provide an elegant method for validating or suggesting a family of parametric models.

The versatility of nonparametric techniques in financial econometrics has been demonstrated in this paper. They are applicable to various aspects of diffusion models: drift, diffusion, transition densities and even state price densities. They allow us to examine whether the stochastic dynamics for stocks and bonds are time varying and whether famous parametric models are consistent with empirical financial data. They permit us to price illiquid or nontraded derivatives from liquid derivatives.

The applications of nonparametric techniques in financial econometrics are far wider than what has been presented. There are several areas where nonparametric methods have played a pivotal role. One example


is to test various versions of capital asset pricing models (CAPM) and their related stochastic discount models [36]. See, for example, the research manuscript by Chen and Ludvigson [34] in this direction. Another important class of models are stochastic volatility models [19, 101], where nonparametric methods can also be applied. Nonparametric techniques have also been prominently featured in the RiskMetrics of J. P. Morgan and can be employed to forecast the risks of portfolios. See, for example, [8, 32, 33, 47, 82] for related nonparametric techniques on risk management.

ACKNOWLEDGMENTS

The author gratefully acknowledges various discussions with Professors Yacine Aït-Sahalia and Jia-an Yan and helpful comments of the editors and reviewers that led to significant improvement of the presentation of this paper. This research was supported in part by NSF Grant DMS-03-55179 and a direct allocation RGC grant of the Chinese University of Hong Kong. REFERENCES [1] A HN , D. H. and G AO , B. (1999). A parametric nonlinear model of term structure dynamics. Review of Financial Studies 12 721–762. [2] A ÏT-S AHALIA , Y. (1996). Nonparametric pricing of interest rate derivative securities. Econometrica 64 527–560. [3] A ÏT-S AHALIA , Y. (1996). Testing continuous-time models of the spot interest rate. Review of Financial Studies 9 385– 426. [4] A ÏT-S AHALIA , Y. (1999). Transition densities for interest rate and other nonlinear diffusions. J. Finance 54 1361–1395. [5] A ÏT-S AHALIA , Y. (2002). Maximum likelihood estimation of discretely sampled diffusions: A closed-form approximation approach. Econometrica 70 223–262. [6] A ÏT-S AHALIA , Y. and D UARTE , J. (2003). Nonparametric option pricing under shape restrictions. J. Econometrics 116 9–47. [7] A ÏT-S AHALIA , Y. and L O , A. W. (1998). Nonparametric estimation of state-price densities implicit in financial asset prices. J. Finance 53 499–547. [8] A ÏT-S AHALIA , Y. and L O , A. W. (2000). Nonparametric risk management and implied risk aversion. J. Econometrics 94 9–51. [9] A ÏT-S AHALIA , Y. and M YKLAND , P. (2003). The effects of random and discrete sampling when estimating continuoustime diffusions. Econometrica 71 483–549. [10] A ÏT-S AHALIA , Y. and M YKLAND , P. (2004). Estimators of diffusions with randomly spaced discrete observations: A general theory. Ann. Statist. 32 2186–2222. [11] A RAPIS , M. and G AO , J. (2004). Nonparametric kernel estimation and testing in continuous-time financial econometrics. Unpublished manuscript.


[12] A RFI , M. (1998). Non-parametric variance estimation from ergodic samples. Scand. J. Statist. 25 225–234. [13] BANDI , F. (2002). Short-term interest rate dynamics: A spatial approach. J. Financial Economics 65 73–110. [14] BANDI , F. and N GUYEN , T. (1999). Fully nonparametric estimators for diffusions: A small sample analysis. Unpublished manuscript. [15] BANDI , F. and P HILLIPS , P. C. B. (2003). Fully nonparametric estimation of scalar diffusion models. Econometrica 71 241–283. [16] BANON , G. (1977). Estimation non paramétrique de densité de probabilité pour les processus de Markov. Thése, Univ. Paul Sabatier de Toulouse, France. [17] BANON , G. (1978). Nonparametric identification for diffusion processes. SIAM J. Control Optim. 16 380–395. [18] BANON , G. and N GUYEN , H. T. (1981). Recursive estimation in diffusion models. SIAM J. Control Optim. 19 676– 685. [19] BARNDOFF -N IELSEN , O. E. and S HEPHARD , N. (2001). Non-Gaussian Ornstein–Uhlenbeck-based models and some of their uses in financial economics (with discussion). J. R. Stat. Soc. Ser. B Stat. Methodol. 63 167–241. [20] B INGHAM , N. H. and K IESEL , R. (1998). Risk-Neutral Valuation: Pricing and Hedging of Financial Derivatives. Springer, New York. [21] B LACK , F., D ERMAN , E. and T OY, W. (1990). A one-factor model of interest rates and its application to Treasury bond options. Financial Analysts Journal 46(1) 33–39. [22] B LACK , F. and K ARASINSKI , P. (1991). Bond and option pricing when short rates are lognormal. Financial Analysts Journal 47(4) 52–59. [23] B LACK , F. and S CHOLES , M. (1973). The pricing of options and corporate liabilities. J. Political Economy 81 637– 654. [24] B OWMAN , A. W. (1984). An alternative method of crossvalidation for the smoothing of density estimates. Biometrika 71 353–360. [25] B REEDEN , D. and L ITZENBERGER , R. H. (1978). Prices of state-contingent claims implicit in option prices. J. Business 51 621–651. [26] C AI , Z., FAN , J. and YAO , Q. (2000). Functionalcoefficient regression models for nonlinear time series. J. Amer. Statist. Assoc. 95 941–956. [27] C AI , Z. and H ONG , Y. (2003). Nonparametric methods in continuous-time finance: A selective review. In Recent Advances and Trends in Nonparametric Statistics (M. G. Akritas and D. N. Politis, eds.) 283–302. North-Holland, Amsterdam. [28] C AMPBELL , J. Y., L O , A. W. and M AC K INLAY, A. C. (1997). The Econometrics of Financial Markets. Princeton Univ. Press. [29] C HAN , K. C., K AROLYI , G. A., L ONGSTAFF , F. A. and S ANDERS , A. B. (1992). An empirical comparison of alternative models of the short-term interest rate. J. Finance 47 1209–1227. [30] C HAPMAN , D. A. and P EARSON , N. D. (2000). Is the short rate drift actually nonlinear? J. Finance 55 355–388. [31] C HEN , R. and T SAY, R. S. (1993). Functional-coefficient autoregressive models. J. Amer. Statist. Assoc. 88 298–308.


[32] C HEN , S. X. (2005). Nonparametric estimation of expected shortfall. Econometric Theory. To appear. [33] C HEN , S. X. and TANG , C. Y. (2005). Nonparametric inference of value-at-risk for dependent financial returns. J. Financial Econometrics 3 227–255. [34] C HEN , X. and L UDVIGSON , S. (2003). Land of Addicts? An empirical investigation of habit-based asset pricing model. Unpublished manuscript. [35] C LAESKENS , G. and H ALL , P. (2002). Effect of dependence on stochastic measures of accuracy of density estimators. Ann. Statist. 30 431–454. [36] C OCHRANE , J. H. (2001). Asset Pricing. Princeton Univ. Press. [37] C OX , J. C., I NGERSOLL , J. E. and ROSS , S. A. (1985). A theory of the term structure of interest rates. Econometrica 53 385–407. [38] C OX , J. C. and ROSS , S. (1976). The valuation of options for alternative stochastic processes. J. Financial Economics 3 145–166. [39] DACUNHA -C ASTELLE , D. and F LORENS , D. (1986). Estimation of the coefficients of a diffusion from discrete observations. Stochastics 19 263–284. [40] DALALYAN , A. S. and K UTOYANTS , Y. A. (2002). Asymptotically efficient trend coefficient estimation for ergodic diffusion. Math. Methods Statist. 11 402–427. [41] DALALYAN , A. S. and K UTOYANTS , Y. A. (2003). Asymptotically efficient estimation of the derivative of the invariant density. Stat. Inference Stoch. Process. 6 89–107. [42] D UFFIE , D. (2001). Dynamic Asset Pricing Theory, 3rd ed. Princeton Univ. Press. [43] E GOROV, A. V., L I , H. and X U , Y. (2003). Maximum likelihood estimation of time-inhomogeneous diffusions. J. Econometrics 114 107–139. [44] E NGEL , K.-J. and NAGEL , R. (2000). One-Parameter Semigroups for Linear Evolution Equations. Springer, Berlin. [45] E NGELBERT, H. J. and S CHMIDT, W. (1984). On onedimensional stochastic differential equations with generalized drift. Stochastic Differential Systems. Lecture Notes in Control and Inform. Sci. 69 143–155. Springer, Berlin. [46] FAN , J. (1992). Design-adaptive nonparametric regression. J. Amer. Statist. Assoc. 87 998–1004. [47] FAN , J. and G U , J. (2003). Semiparametric estimation of value-at-risk. Econom. J. 6 261–290. [48] FAN , J., J IANG , J., Z HANG , C. and Z HOU , Z. (2003). Time-dependent diffusion models for term structure dynamics. Statist. Sinica 13 965–992. [49] FAN , J. and YAO , Q. (1998). Efficient estimation of conditional variance functions in stochastic regression. Biometrika 85 645–660. [50] FAN , J. and YAO , Q. (2003). Nonlinear Time Series: Nonparametric and Parametric Methods. Springer, New York. [51] FAN , J., YAO , Q. and T ONG , H. (1996). Estimation of conditional densities and sensitivity measures in nonlinear dynamical systems. Biometrika 83 189–206. [52] FAN , J. and Y IM , T. H. (2004). A crossvalidation method for estimating conditional densities. Biometrika 91 819–834. [53] FAN , J. and Z HANG , C. (2003). A re-examination of diffusion estimators with applications to financial model validation. J. Amer. Statist. Assoc. 98 118–134.

[54] FAN , J., Z HANG , C. and Z HANG , J. (2001). Generalized likelihood ratio statistics and Wilks phenomenon. Ann. Statist. 29 153–193. [55] F LORENS -Z MIROU , D. (1993). On estimating the diffusion coefficient from discrete observations. J. Appl. Probab. 30 790–804. [56] G AO , J. and K ING , M. (2004). Adaptive testing in continuous-time diffusion models. Econometric Theory 20 844–882. [57] G ENON -C ATALOT, V. and JACOD , J. (1993). On the estimation of the diffusion coefficient for multi-dimensional diffusion processes. Ann. Inst. H. Poincaré Probab. Statist. 29 119–151. [58] G OBET, E. (2002). LAN property for ergodic diffusions with discrete observations. Ann. Inst. H. Poincaré Probab. Statist. 38 711–737. [59] G OBET, E., H OFFMANN , M. and R EISS , M. (2004). Nonparametric estimation of scalar diffusions based on low frequency data. Ann. Statist. 32 2223–2253. [60] G OURIÉROUX , C. and JASIAK , J. (2001). Financial Econometrics: Problems, Models, and Methods. Princeton Univ. Press. [61] G OURIÉROUX , C., M ONFORT, A. and R ENAULT, E. (1993). Indirect inference. J. Appl. Econometrics 8 suppl. S85–S118. [62] H ALL , P. and C ARROLL , R. J. (1989). Variance function estimation in regression: The effect of estimating the mean. J. Roy. Statist. Soc. Ser. B 51 3–14. [63] H ALL , P., R ACINE , J. and L I , Q. (2004). Cross-validation and the estimation of conditional probability densities. J. Amer. Statist. Assoc. 99 1015–1026. [64] H ALL , P., W OLFF , R. C. L. and YAO , Q. (1999). Methods for estimating a conditional distribution function. J. Amer. Statist. Assoc. 94 154–163. [65] H ANSEN , L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica 50 1029–1054. [66] H ANSEN , L. P. and S CHEINKMAN , J. A. (1995). Back to the future: Generating moment implications for continuoustime Markov processes. Econometrica 63 767–804. [67] H ANSEN , L. P., S CHEINKMAN , J. A. and T OUZI , N. (1998). Spectral methods for identifying scalar diffusions. J. Econometrics 86 1–32. [68] H ÄRDLE , W., H ERWARTZ , H. and S POKOINY, V. (2003). Time inhomogeneous multiple volatility modelling. J. Financial Econometrics 1 55–95. [69] H ÄRDLE , W. and T SYBAKOV, A. B. (1997). Local polynomial estimators of the volatility function in nonparametric autoregression. J. Econometrics 81 223–242. [70] H ÄRDLE , W. and YATCHEW, A. (2002). Dynamic nonparametric state price density estimation using constrained least-squares and the bootstrap. Discussion paper 16, Quantification and Simulation of Economics Processes, Humboldt-Universität zu Berlin. [71] H ARRISON , J. M. and K REPS , D. (1979). Martingales and arbitrage in multiperiod securities markets. J. Econom. Theory 2 381–408. [72] H ART, J. D. (1996). Some automated methods of smoothing time-dependent data. Nonparametr. Statist. 6 115–142.

[73] Hart, J. D. (1997). Nonparametric Smoothing and Lack-of-Fit Tests. Springer, New York. [74] Hastie, T. J. and Tibshirani, R. J. (1993). Varying-coefficient models (with discussion). J. Roy. Statist. Soc. Ser. B 55 757–796. [75] Ho, T. S. Y. and Lee, S.-B. (1986). Term structure movements and pricing interest rate contingent claims. J. Finance 41 1011–1029. [76] Hong, Y. and Lee, T.-H. (2003). Inference on predictability of foreign exchange rates via generalized spectrum and nonlinear time series models. Review of Economics and Statistics 85 1048–1062. [77] Hong, Y. and Li, H. (2005). Nonparametric specification testing for continuous-time models with applications to term structure of interest rates. Review of Financial Studies 18 37–84. [78] Hull, J. (2003). Options, Futures, and Other Derivatives, 5th ed. Prentice Hall, Upper Saddle River, NJ. [79] Hull, J. and White, A. (1990). Pricing interest-rate-derivative securities. Review of Financial Studies 3 573–592. [80] Itô, K. (1942). Differential equations determining Markov processes. Zenkoku Shijo Sugaku Danwakai 244 1352–1400. (In Japanese.) [81] Itô, K. (1946). On a stochastic integral equation. Proc. Japan Acad. 22 32–35. [82] Jorion, P. (2000). Value at Risk: The New Benchmark for Managing Financial Risk, 2nd ed. McGraw–Hill, New York. [83] Kallenberg, O. (2002). Foundations of Modern Probability, 2nd ed. Springer, New York. [84] Karatzas, I. and Shreve, S. E. (1991). Brownian Motion and Stochastic Calculus, 2nd ed. Springer, New York. [85] Kessler, M. (1997). Estimation of an ergodic diffusion from discrete observations. Scand. J. Statist. 24 211–229. [86] Kessler, M. and Sørensen, M. (1999). Estimating equations based on eigenfunctions for a discretely observed diffusion process. Bernoulli 5 299–314. [87] Kloeden, P. E., Platen, E., Schurz, H. and Sørensen, M. (1996). On effects of discretization on estimators of drift parameters for diffusion processes. J. Appl. Probab. 33 1061–1076. [88] Kutoyants, Y. A. (1998). Efficient density estimation for ergodic diffusion processes. Stat. Inference Stoch. Process. 1 131–155. [89] Mercurio, D. and Spokoiny, V. (2004). Statistical inference for time-inhomogeneous volatility models. Ann. Statist. 32 577–602. [90] Merton, R. (1973). Theory of rational option pricing. Bell J. Econom. Management Sci. 4 141–183.


[91] M ÜLLER , H.-G. and S TADTMÜLLER , U. (1987). Estimation of heteroscedasticity in regression analysis. Ann. Statist. 15 610–625. [92] O SBORNE , M. F. M. (1959). Brownian motion in the stock market. Operations Res. 7 145–173. [93] P HAM , D. T. (1981). Nonparametric estimation of the drift coefficient in the diffusion equation. Math. Operationsforsch. Statist. Ser. Statist. 12 61–73. [94] P RAKASA R AO , B. L. S. (1985). Estimation of the drift for diffusion process. Statistics 16 263–275. [95] ROSENBLATT, M. (1970). Density estimates and Markov sequences. In Nonparametric Techniques in Statistical Inference (M. L. Puri, ed.) 199–213. Cambridge Univ. Press. [96] ROUSSAS , G. G. (1969). Nonparametric estimation in Markov processes. Ann. Inst. Statist. Math. 21 73–87. [97] ROUSSAS , G. G. (1969). Nonparametric estimation of the transition distribution function of a Markov process. Ann. Math. Statist. 40 1386–1400. [98] RUDEMO , M. (1982). Empirical choice of histograms and kernel density estimators. Scand. J. Statist. 9 65–78. [99] RUPPERT, D., WAND , M. P., H OLST, U. and H ÖSSJER , O. (1997). Local polynomial variance function estimation. Technometrics 39 262–273. [100] S CHURZ , H. (2000). Numerical analysis of stochastic differential equations without tears. In Handbook of Stochastic Analysis and Applications (D. Kannan and V. Lakshmikantham, eds.) 237–359. Dekker, New York. [101] S HEPHARD , N., ed. (2005). Stochastic Volatility: Selected Readings. Oxford Univ. Press. [102] S IMONOFF , J. S. (1996). Smoothing Methods in Statistics. Springer, New York. [103] S POKOINY, V. (2000). Adaptive drift estimation for nonparametric diffusion model. Ann. Statist. 28 815–836. [104] S TANTON , R. (1997). A nonparametric model of term structure dynamics and the market price of interest rate risk. J. Finance 52 1973–2002. [105] S TEELE , J. M. (2001). Stochastic Calculus and Financial Applications. Springer, New York. [106] VASICEK , O. A. (1977). An equilibrium characterization of the term structure. J. Financial Economics 5 177–188. [107] WAND , M. P. and J ONES , M. C. (1995). Kernel Smoothing. Chapman and Hall, London. [108] WANG , Y. (2002). Asymptotic nonequivalence of GARCH models and diffusions. Ann. Statist. 30 754–783. [109] YOSHIDA , N. (1992). Estimation for diffusion processes from discrete observations. J. Multivariate Anal. 41 220–242.

Statistical Science 2005, Vol. 20, No. 4, 338–343 DOI 10.1214/088342305000000430 © Institute of Mathematical Statistics, 2005

Comment: A Selective Overview of Nonparametric Methods in Financial Econometrics

Peter C. B. Phillips and Jun Yu

Peter C. B. Phillips is Professor, Cowles Foundation for Research in Economics, Yale University, Box 208281, New Haven, Connecticut 06520-8281, USA (e-mail: [email protected]). Jun Yu is Associate Professor, School of Economics and Social Science, Singapore Management University, 90 Stamford Road, Singapore 178903 (e-mail: [email protected]).

Abstract. These comments concentrate on two issues arising from Fan's overview. The first concerns the importance of finite sample estimation bias relative to the specification and discretization biases that are emphasized in Fan's discussion. Past research and simulations given here both reveal that finite sample effects can be more important than the other two effects when judged from either statistical or economic viewpoints. Second, we draw attention to a very different nonparametric technique that is based on computing an empirical version of the quadratic variation process. This technique is not mentioned by Fan but has many advantages and has accordingly attracted much recent attention in financial econometrics and empirical applications.

Key words and phrases: Nonparametric method, continuous time models, financial time series, jackknife, realized volatility.

1. INTRODUCTION

In recent years there has been increased interest in using nonparametric methods to deal with various aspects of financial data. The paper by Fan gives an overview of some nonparametric techniques that have been used in the financial econometric literature, focusing on estimation and inference for diffusion models in continuous time and estimation of state price and transition density functions.

Continuous time specifications have been heavily used in recent work, partly because of the analytic convenience of stochastic calculus in mathematical finance and partly because of the availability of high-frequency data sets for many financial series. While the early work in continuous-time finance began in the 1970s with the work of Merton [29] and Black and Scholes [16], economists have been looking at the econometric problems of fitting continuous time systems for much longer. The idea of statistically fitting diffusion models and continuously distributed lagged dependencies with discretely observed data has a long history dating back to some original work in econometrics by Koopmans [27] and subsequent work by Phillips [31], Bergstrom [14], Sims [35], Phillips [32] and Sargan [34]. Bartlett and Rajalakshman [13] and Bartlett [12] are two references in the early statistical literature on fitting linear diffusions. Bergstrom [15] provides a short history of some of this early work. Also, the history of mathematical finance and stochastic integration prior to 1970 has recently been overviewed in an interesting historical review by Jarrow and Protter [24].

Our comments on Fan's paper will concentrate on two issues that relate in important ways to the paper's focus on misspecification and discretization bias and the role of nonparametric methods in empirical finance. The first issue deals with the finite sample effects of various estimation methods and their implications for asset pricing. A good deal of recent attention in the econometric literature has focused on the benefits of full maximum likelihood (ML) estimation of diffusions and mechanisms for avoiding discretization bias in the construction of the likelihood. However, many of the


problems of estimating dynamic models that are well known in discrete time series, such as the bias in ML estimation, also manifest in the estimation of continuous time systems and affect subsequent use of these estimates, for instance in derivative pricing. In consequence, a relevant concern is the relative importance of the estimation and discretization biases. As we will show below, the former often dominates the latter even when the sample size is large (at least 500 monthly observations, say). Moreover, it turns out that correction for the finite sample estimation bias continues to be more important when the diffusion component of the model is itself misspecified. Such corrections appear to be particularly important in models that are nonstationary or nearly nonstationary. The second issue we discuss deals with a very different nonparametric technique, which is not discussed by Fan, but which has recently attracted much attention in financial econometrics and empirical applications. This method involves the use of quadratic variation measures of realized volatility using ultra high frequency financial data. Like other nonparametric methods, empirical quadratic variation techniques also have to deal with statistical bias, which in the present case arises from the presence of microstructure noise. The field of research on this topic in econometrics is now very active. 2. FINITE SAMPLE EFFECTS

In his overview of diffusion equation estimation, Fan discusses two sources of bias, one arising from the discretization process and the second from misspecification. We review these two bias effects and then discuss the bias that comes from finite sample estimation effects. The attractions of Itô calculus have made it particularly easy to work with stochastic differential equations driven by Brownian motion. Diffusion processes in particular have been used widely in finance to model asset prices, including stock prices, interest rates and exchange rates. Despite their mathematical attractability, diffusion processes present some formidable challenges for econometric estimation. The primary reason for the difficulty is that sample data, even very highfrequency data, are always discrete and for many popular nonlinear diffusion models the transition density of the discrete sample does not have a closed form expression, as noted by Fan. The problem is specific to nonlinear diffusions, as consistent methods for estimating exact discrete models corresponding to linear systems


of diffusions have been available since Phillips [32]. A simple approach discussed in the paper is to use the Euler approximation scheme to discretize the model, a process which naturally creates some discretization bias. This discretization bias can lead to erroneous financial pricing and investment decisions. In consequence, the issue of discretization has attracted a lot of attention in the literature and many methods have been proposed to reduce the bias that it causes. Examples are Pedersen [30], Kessler [26], Durham and Gallant [18], Aït-Sahalia [2, 3] and Elerian, Chib and Shephard [19], among many others. Next, many diffusion models in practical use are specified in a way that makes them mathematically convenient. These specifications are typically not derived from any underlying economic theory and are therefore likely to be misspecified. Potential misspecifications, like discretization, can lead to erroneous financial decisions. Accordingly, specification bias has attracted a great deal of attention in the literature and has helped to motivate the use of functional estimation techniques that treat the drift and diffusion coefficients nonparametrically. Important contributions include Aït-Sahalia [1], Stanton [36], Bandi and Phillips [5] and Hong and Li [21]. While we agree that both discretization and specification bias are important issues, finite sample estimation bias can be of equal or even greater importance for financial decision making, as noted by Phillips and Yu [33] in the context of pricing bonds and bond options. The strong effect of the finite sample estimation bias in this context can be explained as follows. In continuous time specifications, the prices of bonds and bond options depend crucially on the mean reversion parameter in the associated interest rate diffusion equation. This parameter is well known to be subject to estimation bias when standard methods like ML are used. The bias is comparable to, but generally has larger magnitude than, the usual bias that appears in time series autoregression. As the parameter is often very close to zero in empirical applications (corresponding to near martingale behavior and an autoregressive root near unity in discrete time), the estimation bias can be substantial even in very large samples. To reduce the finite sample estimation bias in parameter estimation as well as the consequential bias that arises in asset pricing, Phillips and Yu [33] proposed the use of jackknife techniques. Suppose a sample of n observations is available and that this sample is decomposed into m consecutive sub-samples each with $\ell$ observations ($n = m \times \ell$). The jackknife estimator of a


parameter θ in the model is defined by

(2.1)  $\hat\theta_{\mathrm{jack}} = \frac{m}{m-1}\,\hat\theta_n - \frac{\sum_{i=1}^m \hat\theta_{\ell i}}{m^2 - m},$

where $\hat\theta_n$ and $\hat\theta_{\ell i}$ are the extreme estimates of θ based on the entire sample and the i'th sub-sample, respectively. The parameter θ can be a coefficient in the diffusion process, such as the mean reversion parameter, or a much more complex function of the parameters of the diffusion process and the data, such as an asset price or derivative price. Typically, the full sample extreme estimator has bias of order $O(n^{-1})$, whereas under mild conditions the bias in the jackknife estimate is of order $O(n^{-2})$.
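A minimal sketch of (2.1), assuming only a generic estimator routine and a sample whose length is divisible by m; the interface is illustrative and not from the comment itself.

```python
import numpy as np

def jackknife_estimate(data, m, estimator):
    """Jackknife bias correction as in (2.1).

    data      : full sample of n observations (n divisible by m here)
    m         : number of consecutive, non-overlapping sub-samples
    estimator : callable mapping a sample to an estimate of theta
                (a scalar parameter, a bond price, an option price, ...)
    """
    n = len(data)
    ell = n // m
    theta_full = estimator(data)
    theta_sub = [estimator(data[i * ell:(i + 1) * ell]) for i in range(m)]
    return m / (m - 1) * theta_full - np.sum(theta_sub) / (m**2 - m)
```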







The following simulation illustrates these various bias effects and compares their magnitudes. In the experiment, the true generating process is assumed to be the following commonly used model (CIR hereafter) of short term interest rates due to Cox, Ingersoll and Ross [17]:

(2.2)  $dr(t) = \kappa\bigl(\mu - r(t)\bigr)\, dt + \sigma r^{1/2}(t)\, dB(t).$

The transition density of the CIR model is known to be $c e^{-u-v}(v/u)^{q/2} I_q(2(uv)^{1/2})$ and the marginal density is $w_1^{w_2} r^{w_2-1} e^{-w_1 r}/\Gamma(w_2)$, where $c = 2\kappa/(\sigma^2(1 - e^{-\kappa\Delta}))$, $u = c\, r(t) e^{-\kappa\Delta}$, $v = c\, r(t + \Delta)$, $q = 2\kappa\mu/\sigma^2 - 1$, $w_1 = 2\kappa/\sigma^2$, $w_2 = 2\kappa\mu/\sigma^2$, $\Delta$ is the sampling frequency, and $I_q(\cdot)$ is the modified Bessel function of the first kind of order $q$. The transition density together with the marginal density can be used for simulation purposes as well as to obtain the exact ML estimator of $\theta\ (= (\kappa, \mu, \sigma)')$. In the simulation, we use this model to price a discount bond, which is a three-year bond with a face value of $1 and initial interest rate of 5%, and a one-year European call option on a three-year discount bond which has a face value of $100 and a strike price of $87. The reader is referred to [33] for further details.

In addition to exact ML estimation, we may discretize the CIR model via the Euler method and estimate the discretized model using (quasi-) ML. The Euler scheme leads to the discretization

(2.3)  $r(t + \Delta) = \kappa\mu\Delta + (1 - \kappa\Delta)\, r(t) + \sigma N\bigl(0, \Delta\, r(t)\bigr).$

One thousand samples, each with 600 monthly observations (i.e., $\Delta = 1/12$), are simulated from the true model (2.2) with $(\kappa, \mu, \sigma)$ being set at $(0.1, 0.08, 0.02)$, which are settings that are realistic in many financial applications. To investigate the effects of discretization bias, we estimate model (2.3) by the (quasi-) ML approach. To investigate the finite sample estimation bias effects, we estimate model (2.2) based on the true transition density. To examine the effects of bias reduction in estimation, we apply the jackknife method (with m = 3) to the mean reversion parameter κ, the bond price and the bond option price. To examine the effects of specification bias, we fit each simulated sequence from the true model to the misspecified Vasicek model [37] to obtain the exact ML estimates of κ, the bond price and the option price from this misspecified model. The Vasicek model is given by the simple linear diffusion

(2.4)  $dr(t) = \kappa\bigl(\mu - r(t)\bigr)\, dt + \sigma\, dB(t).$

We use this model to price the same bond and bond option. Vasicek [37] derived the expression for bond prices and Jamshidian [23] gave the corresponding formula for bond option prices. The transition density for the Vasicek model is

(2.5)  $r(t + \Delta) \mid r(t) \sim N\bigl(\mu(1 - e^{-\kappa\Delta}) + e^{-\kappa\Delta} r(t),\ \sigma^2(1 - e^{-2\kappa\Delta})/(2\kappa)\bigr).$

This transition density is utilized to obtain the exact ML estimates of κ, the bond price and the bond option price, all under the mistaken presumption that the misspecified model (2.4) is correctly specified.

Table 1 reports the means and root mean square errors (RMSEs) for all these cases.

TABLE 1
Finite sample properties of ML and jackknife estimates of κ, bond price and option price for the (true) CIR model using a (correctly specified) fitted CIR model and a (misspecified) fitted Vasicek model (sample size n = 600)

                                                    κ        Bond price   Option price
True value                                          0.1      0.8503       2.3920
Exact ML of CIR                          Mean       0.1845   0.8438       1.8085
                                         RMSE       0.1319   0.0103       0.9052
Euler ML of CIR                          Mean       0.1905   0.8433       1.7693
                                         RMSE       0.1397   0.0111       0.9668
Jackknife (m = 3) of CIR                 Mean       0.0911   0.8488       2.1473
                                         RMSE       0.1205   0.0094       0.8704
ML of Vasicek (misspecified)             Mean       0.1746   0.8444       1.8837
                                         RMSE       0.1175   0.0088       0.7637
Jackknife (m = 2) of Vasicek             Mean       0.0977   0.8488       2.2483
  (misspecified)                         RMSE       0.1628   0.0120       1.0289

It is clear that the finite sample estimation bias is more substantial than


the discretization bias and the specification bias for all three quantities, at least in this experiment. In particular, κ is estimated by the exact ML method with 84.5% upward bias, which contributes toward the −0.76% bias in the bond price and the −24.39% bias in the option price. Relative to the finite sample bias, the bias in κ due to the discretization is almost negligible since the total bias in κ changes from 84.5% to 90.5%. (The increase in the total bias indicates that the discretization bias effect is in the same direction as that of the estimation bias.) The total bias changes from −0.76% to −0.82% in the bond price and from −24.39% to −26.03% in the option price. These changes are marginal. Similarly, relative to the finite sample bias, the bias in κ due to misspecification of the drift function is almost negligible since the total bias changes from 84.5% to 74.6%. (The decrease in the total bias indicates that the misspecification bias effect is in the opposite direction to that of the estimation bias.) The total bias changes from −0.76% to −0.69% in the bond price and from −24.39% to −21.25% in the option price. Once again, these changes are marginal. When the jackknife method is applied to the correctly specified model, the estimation bias is greatly reduced in all cases (from 84.5% to −8.9% for κ; from −0.76% to −0.18% for the bond price; and from −24.39% to −10.23% for the option price). Even more remarkably, when the jackknife method is applied to the incorrectly specified model (see the final row of Table 1), the estimation bias is also greatly reduced in all cases (from 84.5% to −2.3% for κ; from −0.76% to −0.18% for the bond price; and from −24.39% to −6.01% for the option price). These figures reveal that dealing with estimation bias can be much more important than ensuring correct specification in diffusion equation estimation, suggesting that general econometric treatment of the diffusion through nonparametric methods may not address the major source of bias effects on financial decision making. Although the estimation bias is not completely removed by the jackknife method, the bias reduction is clearly substantial and the RMSE of the jackknife estimate is smaller in all cases than that of exact ML. In sum, it is apparent from Table 1 that the finite sample estimation bias is larger in magnitude than either of the biases due to discretization and misspecification and correcting this bias is therefore a matter of importance in empirical work on which financial decisions depend. Although this demonstration of the relative importance of finite sample estimation bias in relation to discretization bias and specification bias is conducted in

a parametric context, similar results can be expected for some nonparametric models. For example, in the semiparametric model examined in [1], the diffusion function is nonparametrically specified and the drift function is linear, so that the mean reversion parameter is estimated parametrically as in the above example. In such cases, we can expect substantial finite sample estimation bias to persist and to have important practical implications in financial pricing applications. 3. REALIZED VOLATILITY

As noted in Fan's overview, many models used in financial econometrics for modeling asset prices and interest rates have the fully functional scalar differential form

(3.1)  $dX_t = \mu(X_t)\, dt + \sigma(X_t)\, dB_t,$

where both drift and diffusion functions are nonparametric and where the equation is driven by Brownian motion increments $dB_t$. For models such as (3.1), we have $(dX_t)^2 = \sigma^2(X_t)\, dt$ a.s. and hence the quadratic variation of $X_t$ is

(3.2)  $[X]_T = \int_0^T (dX_t)^2 = \int_0^T \sigma^2(X_t)\, dt,$

where $\int_0^T \sigma^2(X_t)\, dt$ is the accumulated or integrated volatility of $X$. Were $X_t$ observed continuously, $[X]_T$ and, hence, integrated volatility, would also be observed. For discretely recorded data, estimation of (3.2) is an important practical problem. This can be accomplished by direct nonparametric methods using an empirical estimate of the quadratic variation that is called realized volatility. The idea has been discussed for some time, an early reference being Maheswaran and Sims [28], and it has recently attracted a good deal of attention in the econometric literature now that very high frequency data has become available for empirical use. Recent contributions to the subject are reviewed in [4] and [8]. Suppose $X_t$ is recorded discretely at equispaced points $(\Delta, 2\Delta, \ldots, n\Delta\ (\equiv T))$ over the time interval $[0, T]$. Then, $[X]_T$ can be consistently estimated by the realized volatility of $X_t$ defined by

(3.3)  $[X_\Delta]_T = \sum_{i=2}^n \bigl(X_{i\Delta} - X_{(i-1)\Delta}\bigr)^2,$

as $\Delta \to 0$, as is well known. In fact, any construction of realized volatility based on an empirical grid of observations where the maximum grid size tends to zero will produce a consistent estimate. It follows that the


integrated volatility can be consistently estimated by this nonparametric approach, regardless of the form of µ(Xt) and σ(Xt). The approach has received a great deal of attention in the recent volatility literature and serves as a powerful alternative to the methods discussed by Fan, especially when ultra-high frequency data are available. While this approach is seemingly straightforward, it is not without difficulties. First, in order for the approach to be useful in empirical research, it is necessary to estimate the precision of the realized volatility estimates. Important contributions on the central limit theory of these empirical quadratic variation estimates by Jacod [22] and Barndorff-Nielsen and Shephard [10, 11] have facilitated the construction of suitable methods of inference. Second, in practical applications, realized volatility measures such as (3.3) are usually contaminated by microstructure noise bias, especially at ultra-high frequencies and with tick-by-tick data. Noise sources arise from various market frictions and discontinuities in trading behavior that prevent the full operation of efficient financial markets. Recent work on this subject (e.g., [8, 9, 21, 38]) has developed various methods, including nonparametric kernel techniques, for reducing the effects of microstructure noise bias.
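A minimal sketch of the realized volatility estimator in (3.3), computed from an equispaced record of the process. The optional sparse subsampling shown here is one crude way practitioners reduce exposure to microstructure noise; the kernel and two-time-scale methods cited above are more refined. The interface is illustrative.

```python
import numpy as np

def realized_volatility(x, step=1):
    """Realized volatility (3.3): sum of squared increments of X.

    x    : array of equispaced observations X_Delta, X_{2 Delta}, ..., X_{n Delta}
    step : subsample every `step`-th observation; step > 1 uses a sparser grid,
           a crude guard against microstructure noise at ultra-high frequencies
    """
    increments = np.diff(x[::step])
    return np.sum(increments ** 2)
```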

4. ADDITIONAL ISSUES

Given efficient market theory, there is good reason to expect that diffusion models like (3.1) may have nonstationary characteristics. Similar comments apply to term structure models and yield curves. In such cases, nonparametric estimation methods lead to the estimation of the local time (or sojourn time) of the corresponding stochastic process and functionals of this quantity, rather than a stationary probability density. Moreover, rates of convergence in such cases become path dependent and the limit theory for nonparametric estimates of the drift and diffusion functions in (3.1) is mixed normal. Asymptotics of this type require an enlarging time span of data as well as increasing in-fill within each discrete interval as n → ∞. An overview of this literature and its implications for financial data applications is given in [6]. Nonparametric estimates of yield curves in multifactor term structure models are studied in [25].

Not all models in finance are driven by Brownian motion. In some cases, one can expect noise to have some memory and, accordingly, models

such as (3.1) have now been extended to accommodate fractional Brownian motion increments. The stochastic calculus of fractional Brownian motion, which is not a semi-martingale, is not as friendly as that of Brownian motion and requires new constructs, involving Wick products and versions of the Stratonovich integral. Moreover, certain quantities, such as quadratic variation, that have proved useful in the recent empirical literature may no longer exist and must be replaced by different forms of variation, although the idea of volatility is still present. Developing a statistical theory of inference to address these issues in financial econometric models is presenting new challenges. ACKNOWLEDGMENTS

Peter C. B. Phillips gratefully acknowledges visiting support from the School of Economics and Social Science at Singapore Management University. Support was provided by NSF Grant SES-04-142254. Jun Yu gratefully acknowledges financial support from the Wharton-SMU Research Center at Singapore Management University. REFERENCES [1] A ÏT-S AHALIA , Y. (1996). Nonparametric pricing of interest rate derivative securities. Econometrica 64 527–560. [2] A ÏT-S AHALIA , Y. (1999). Transition densities for interest rate and other nonlinear diffusions. J. Finance 54 1361–1395. [3] A ÏT-S AHALIA , Y. (2002). Maximum likelihood estimation of discretely sampled diffusions: A closed-form approximation approach. Econometrica 70 223–262. [4] A NDERSEN , T. G., B OLLERSLEV, T. and D IEBOLD , F. X. (2005). Parametric and nonparametric volatility measurement. In Handbook of Financial Econometrics (Y. Aït-Sahalia and L. P. Hansen, eds.). North-Holland, Amsterdam. To appear. [5] BANDI , F. M. and P HILLIPS , P. C. B. (2003). Fully nonparametric estimation of scalar diffusion models. Econometrica 71 241–283. [6] BANDI , F. M. and P HILLIPS , P. C. B. (2005). Nonstationary continuous-time processes. In Handbook of Financial Econometrics (Y. Aït-Sahalia and L. P. Hansen, eds.). NorthHolland, Amsterdam. To appear. [7] BANDI , F. M. and RUSSELL , J. (2005). Volatility. In Handbook of Financial Engineering (J. R. Birge and V. Linetsky, eds.). To appear. [8] BANDI , F. M. and RUSSELL , J. (2005). Microstructure noise, realized volatility and optimal sampling. Working paper, Graduate School of Business, Univ. Chicago. [9] BARNDORFF -N IELSEN , O., H ANSEN , P., L UNDE , A. and S HEPHARD , N. (2005). Regular and modified kernel-based estimators of integrated variance: The case with independent noise. Working paper, Nuffield College.

[10] Barndorff-Nielsen, O. and Shephard, N. (2002). Econometric analysis of realized volatility and its use in estimating stochastic volatility models. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 253–280. [11] Barndorff-Nielsen, O. and Shephard, N. (2004). Econometric analysis of realized covariation: High frequency based covariance, regression, and correlation in financial economics. Econometrica 72 885–925. [12] Bartlett, M. S. (1955). An Introduction to Stochastic Processes. Cambridge Univ. Press. [13] Bartlett, M. S. and Rajalakshman, D. V. (1953). Goodness of fit tests for simultaneous autoregressive series. J. Roy. Statist. Soc. Ser. B 15 107–124. [14] Bergstrom, A. (1966). Nonrecursive models as discrete approximations to systems of stochastic differential equations. Econometrica 34 173–182. [15] Bergstrom, A. (1988). The history of continuous-time econometric models. Econometric Theory 4 365–383. [16] Black, F. and Scholes, M. (1973). The pricing of options and corporate liabilities. J. Political Economy 81 637–654. [17] Cox, J., Ingersoll, J. and Ross, S. (1985). A theory of the term structure of interest rates. Econometrica 53 385–407. [18] Durham, G. and Gallant, A. R. (2002). Numerical techniques for maximum likelihood estimation of continuous-time diffusion processes (with discussion). J. Bus. Econom. Statist. 20 297–338. [19] Elerian, O., Chib, S. and Shephard, N. (2001). Likelihood inference for discretely observed non-linear diffusions. Econometrica 69 959–993. [20] Hansen, P. and Lunde, A. (2006). An unbiased measure of realized volatility. J. Bus. Econom. Statist. To appear. [21] Hong, Y. and Li, H. (2005). Nonparametric specification testing for continuous-time models with applications to term structure of interest rates. Review of Financial Studies 18 37–84. [22] Jacod, J. (1994). Limit of random measures associated with the increments of a Brownian semimartingale. Working paper, Univ. P. and M. Curie, Paris. [23] Jamshidian, F. (1989). An exact bond option formula. J. Finance 44 205–209. [24] Jarrow, R. and Protter, P. (2004). A short history of stochastic integration and mathematical finance: The early years, 1880–1970. In A Festschrift for Herman Rubin (A. DasGupta, ed.) 75–91. IMS, Beachwood, OH. [25] Jeffrey, A., Kristensen, D., Linton, O., Nguyen, T. and Phillips, P. C. B. (2004). Nonparametric estimation of a multifactor Heath–Jarrow–Morton model: An integrated approach. J. Financial Econometrics 2 251–289. [26] Kessler, M. (1997). Estimation of an ergodic diffusion from discrete observations. Scand. J. Statist. 24 211–229. [27] Koopmans, T., ed. (1950). Statistical Inference in Dynamic Economic Models. Wiley, New York. [28] Maheswaran, S. and Sims, C. A. (1993). Empirical implications of arbitrage-free asset markets. In Models, Methods and Applications of Econometrics (P. C. B. Phillips, ed.) 301–316. Blackwell, Cambridge, MA. [29] Merton, R. (1973). Theory of rational option pricing. Bell J. Econom. and Management Sci. 4 141–183. [30] Pedersen, A. (1995). A new approach to maximum likelihood estimation for stochastic differential equations based on discrete observations. Scand. J. Statist. 22 55–71. [31] Phillips, A. W. (1959). The estimation of parameters in systems of stochastic differential equations. Biometrika 46 67–76. [32] Phillips, P. C. B. (1972). The structural estimation of a stochastic differential equation system. Econometrica 40 1021–1041. [33] Phillips, P. C. B. and Yu, J. (2005). Jackknifing bond option prices. Review of Financial Studies 18 707–742. [34] Sargan, J. D. (1974). Some discrete approximations to continuous time stochastic models. J. Roy. Statist. Soc. Ser. B 36 74–90. [35] Sims, C. (1971). Discrete approximations to continuous time distributed lags in econometrics. Econometrica 39 545–563. [36] Stanton, R. (1997). A nonparametric model of term structure dynamics and the market price of interest rate risk. J. Finance 52 1973–2002. [37] Vasicek, O. (1977). An equilibrium characterization of the term structure. J. Financial Economics 5 177–188. [38] Zhang, L., Mykland, P. and Aït-Sahalia, Y. (2005). A tale of two time scales: Determining integrated volatility with noisy high-frequency data. J. Amer. Statist. Assoc. 100 1394–1411.

Statistical Science 2005, Vol. 20, No. 4, 344–346 DOI 10.1214/088342305000000449 © Institute of Mathematical Statistics, 2005

Comment: A Selective Overview of Nonparametric Methods in Financial Econometrics

Michael Sørensen

Michael Sørensen is Professor and Head, Department of Applied Mathematics and Statistics, University of Copenhagen, Universitetsparken 5, DK-2100 Copenhagen Ø, Denmark (e-mail: [email protected]).

1. INTRODUCTION

Professor Fan should be congratulated for his review that convincingly demonstrates the usefulness of nonparametric techniques to financial econometric problems. He is mainly concerned with financial models given by stochastic differential equations, that is, diffusion processes. I will therefore complement his selective review by discussing some important problems and useful methods for diffusion models that he has not covered. My concern will mainly, but not solely, be with parametric techniques. A recent comprehensive survey of parametric inference for discretely sampled diffusion models can be found in [19]. 2. GAUSSIAN LIKELIHOOD FUNCTIONS

In his brief review of parametric methods, Professor Fan mentions the Gaussian approximate likelihood function based on the Euler scheme and states that this method has some bias when the time between observations $\Delta$ is large. This is actually a very serious problem. As an example, consider a model with a linear drift of the form $\mu(x) = -\beta(x - \alpha)$ ($\beta > 0$). The estimator $\hat\beta_n$ of $\beta$ obtained from the Gaussian approximate likelihood based on the Euler scheme converges to $(1 - e^{-\beta_0\Delta})\Delta^{-1}$ as the number of observations $n$ tends to infinity. Here $\beta_0$ denotes the true parameter value. The limiting value of $\hat\beta_n\Delta$ is always smaller than one, and the limit of $\hat\beta_n$ is always smaller than $\Delta^{-1}$. Thus the asymptotic bias can be huge if $\Delta$ is large. A simulation study in [3] demonstrates that also for finite sample sizes an enormous bias can occur. When $\beta_0\Delta$ is small so that $(1 - e^{-\beta_0\Delta})\Delta^{-1} \approx \beta_0$, the asymptotic bias is negligible. The problem is, however, that if we use the approximate likelihood function based on the Euler scheme, there is no way we can know whether $\beta_0\Delta$ is small or large because $\hat\beta_n\Delta$ will always tend to be small. I suspect that the nonparametric methods outlined in Sections 3.2 and 3.5 might suffer from a similar shortcoming as they are based on the same type of approximation as the Euler scheme.

A simple solution to this problem is to use an approximate likelihood function where the transition density is replaced by a normal distribution with mean equal to the exact conditional expectation $F(x, \theta) = E_\theta(X_\Delta \mid X_0 = x)$ and with the variance equal to the exact conditional variance $\phi(x; \theta) = \mathrm{Var}_\theta(X_\Delta \mid X_0 = x)$. Here $\theta$ is the (typically multivariate) parameter to be estimated. This approach is exactly the same as using quadratic martingale estimating functions; see [3] and [20]. The estimators obtained from quadratic martingale estimating functions have the same nice properties for high frequency observations (small $\Delta$) as the estimators based on the Euler likelihood, but they are consistent for any value of $\Delta$ and can thus be used whether or not $\Delta$ is small. In most cases there is no explicit expression for the functions $F(x, \theta)$ and $\phi(x; \theta)$, so often they must be determined by simulation. This requires, however, only a modest amount of computation and is not a problem in practice. If a completely explicit likelihood is preferred, one can approximate $F(x, \theta)$ and $\phi(x; \theta)$ by expansions of a higher order than those used in the Euler scheme; see [16]. The nonparametric method in Section 3.5 could probably be improved in a similar way by using in (27) and (28) the functions $F(x, \theta)$ and $\phi(x; \theta)$ (or the higher-order expansions in [16]) instead of the first-order approximation used in the Euler scheme.
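To make the size of this asymptotic bias concrete, here is a small numerical illustration (not from the comment itself) of the probability limit $(1 - e^{-\beta_0\Delta})\Delta^{-1}$ of the Euler-based estimator for a few sampling intervals; the chosen values of $\beta_0$ and $\Delta$ are purely illustrative.

```python
import numpy as np

beta0 = 0.5                       # true mean-reversion parameter (illustrative)
for delta in (0.1, 0.5, 1.0, 2.0):
    # probability limit of the Euler-based Gaussian (quasi-)ML estimator
    beta_limit = (1.0 - np.exp(-beta0 * delta)) / delta
    bias_pct = 100.0 * (beta_limit - beta0) / beta0
    print(f"Delta = {delta:3.1f}: plim = {beta_limit:.4f}  ({bias_pct:+.1f}% bias)")
```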

3. MARTINGALE ESTIMATING FUNCTIONS

More generally, martingale estimating functions provide a simple and versatile technique for estimation in discretely sampled parametric stochastic differential equation models that works whether or not $\Delta$ is small.



An estimator is obtained by solving the equation

$$\sum_{i=1}^n \sum_{j=1}^N a_j\bigl(X_{(i-1)\Delta}; \theta\bigr)\bigl[f_j(X_{i\Delta}) - H_\Delta^{\theta} f_j\bigl(X_{(i-1)\Delta}\bigr)\bigr] = 0,$$

where $H_\Delta^{\theta}$ is the transition operator, and where the function $a$ has the same dimension as the parameter $\theta$. First suitable functions $f_j$ are chosen, and then the weight functions $a_j$ are determined so that an optimal estimating function in the sense of Godambe and Heyde [9] is obtained; see also [10]. Optimal estimating functions are approximations to the non-explicit score function. Usually $H_\Delta^{\theta} f_j$ must be determined by a modest amount of simulation, but Kessler and Sørensen [17] demonstrated how completely explicit estimating functions can be obtained if the functions $f_j$ are eigenfunctions of the operator $L_\theta$ (called the generator); see also [18] for details on how to explicitly find the optimal weight functions. A review of the theory of estimating functions for discretely sampled diffusion-type models can be found in [1]. For martingale estimating functions large sample results concerning estimators can be obtained via martingale limit theory. Under weak conditions, estimators are consistent, and optimal estimating functions tend to work well when the functions $f_j$ are chosen reasonably, that is, such that a good approximation to the score function can be obtained. At low sampling frequencies the estimators are, however, usually not efficient. The behavior of the estimators at high sampling frequencies can be investigated by considering an asymptotic scenario where the time between observations $\Delta_n$ is assumed to go to zero, as the sample size $n$ tends to infinity, sufficiently slowly that the time horizon over which observations are made, $n\Delta_n$, tends to infinity. It is well known that in this situation estimators of parameters appearing in the diffusion coefficient may converge at a suboptimal rate, $1/\sqrt{n\Delta_n}$. The reason is that there is a lot of information about the diffusion coefficient in the fine structure of diffusion trajectories, which some estimators do not capture. Recently Sørensen [22] has given conditions ensuring that a martingale estimating function provides estimators that are rate-optimal (rate $1/\sqrt{n}$) and efficient in the high-frequency asymptotic scenario. Optimal martingale estimating functions satisfy these conditions. Quadratic martingale estimating functions are always rate-optimal, and if they are obtained from Gaussian approximate likelihood functions they are efficient too. These results are closely related to the theory of small $\Delta$-optimality developed in [13] and [14].
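As a concrete, hedged illustration (not from the comment), the sketch below sets up a quadratic martingale estimating function for the linear-drift model $dX = -\beta(X - \alpha)\,dt + \sigma\,dB$ using its exact conditional mean and variance, with simple non-optimal weights, and solves it numerically. All names are illustrative.

```python
import numpy as np
from scipy.optimize import fsolve

def ou_cond_moments(x, theta, delta):
    """Exact conditional mean and variance of dX = -beta (X - alpha) dt + sigma dB
    over one step of length delta."""
    alpha, beta, sigma = theta
    mean = alpha + (x - alpha) * np.exp(-beta * delta)
    var = sigma**2 * (1.0 - np.exp(-2.0 * beta * delta)) / (2.0 * beta)
    return mean, var

def estimating_equations(theta, x, delta):
    """Quadratic martingale estimating function with simple (non-optimal) weights:
    three equations built from the martingale differences
    X_i - F(X_{i-1}) and (X_i - F(X_{i-1}))^2 - phi(X_{i-1})."""
    x_lag, x_now = x[:-1], x[1:]
    mean, var = ou_cond_moments(x_lag, theta, delta)
    m1 = x_now - mean          # first-moment martingale difference
    m2 = m1**2 - var           # second-moment martingale difference
    return [np.sum(m1), np.sum(x_lag * m1), np.sum(m2)]

# usage sketch:
# theta_hat = fsolve(estimating_equations, x0=[0.05, 0.5, 0.1], args=(data, 1.0 / 12))
```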

4. NON-MARKOVIAN OBSERVATIONS

There are several situations in which observations from a diffusion process are non-Markovian. Most prominently this happens if a function of lower dimension of a multivariate diffusion is observed. An example is the stochastic volatility model that plays an important role as a model of financial time series since it is well known that a simple one-dimensional diffusion often cannot capture all the salient features of such data. Another example is given by the sums of diffusions proposed by Bibby, Skovgaard and Sørensen [2] as models of phenomena with more than one time scale. Other situations where diffusion data are non-Markovian are in the presence of measurement error, or when only integrals of the diffusion over time-intervals are observed; see [4]. The latter is, for instance, the case when climate data from ice cores are analyzed by means of a diffusion model. When the data are non-Markovian, it is usually not possible to find a tractable martingale estimating function, but an alternative is provided by the prediction-based estimating functions proposed in [21], which can be interpreted as approximations to martingale estimating functions. Asymptotic results for estimators based on non-Markovian data are usually based on the assumption that the underlying diffusion process is strongly mixing. The condition ensuring exponential ρ-mixing cited in Section 2.2 is not easy to check for concrete diffusion models. A condition on the drift and diffusion coefficient that is easy to verify and that implies exponential ρ-mixing and α-mixing was given by Genon-Catalot, Jeantheau and Larédo [6].

5. NONPARAMETRIC METHODS

Let me conclude by drawing attention to some relatively early work on nonparametric methods for discretely sampled diffusion models. Wavelet methods for estimating the diffusion coefficient of a time-dependent model were proposed by Genon-Catalot, Larédo and Picard [7]. The first estimator of the diffusion coefficient mentioned in Section 3.2 was first proposed by Florens-Zmirou [5]. She considered a high frequency asymptotic scenario with fixed time span, that is, with nΔ_n constant, and proved that the asymptotic distribution of her estimator is a mixture of normal distributions where the mixing distribution is the distribution of the local time of the diffusion. If a data-dependent normalization of the estimator is used, an asymptotic normal distribution is obtained. In a series


of important papers, Marc Hoffmann has studied optimal rates of convergence of nonparametric estimators of the drift and diffusion coefficient under the three asymptotic scenarios usually considered for diffusion models including optimal estimators; see [8, 11, 12]. Other estimators of the diffusion coefficient were proposed by Soulier [23] and Jacod [15]. REFERENCES [1] B IBBY, B. M., JACOBSEN , M. and S ØRENSEN , M. (2005). Estimating functions for discretely sampled diffusion-type models. In Handbook of Financial Econometrics (Y. AïtSahalia and L. P. Hansen, eds.). North-Holland, Amsterdam. To appear. [2] B IBBY, B. M., S KOVGAARD , I. M. and S ØRENSEN , M. (2005). Diffusion-type models with given marginal distribution and autocorrelation function. Bernoulli 11 191–220. [3] B IBBY, B. M. and S ØRENSEN , M. (1995). Martingale estimation functions for discretely observed diffusion processes. Bernoulli 1 17–39. [4] D ITLEVSEN , S. and S ØRENSEN , M. (2004). Inference for observations of integrated diffusion processes. Scand. J. Statist. 31 417–429. [5] F LORENS -Z MIROU , D. (1993). On estimating the diffusion coefficient from discrete observations. J. Appl. Probab. 30 790–804. [6] G ENON -C ATALOT, V., J EANTHEAU , T. and L ARÉDO , C. (2000). Stochastic volatility models as hidden Markov models and statistical applications. Bernoulli 6 1051–1079. [7] G ENON -C ATALOT, V., L ARÉDO , C. and P ICARD , D. (1992). Nonparametric estimation of the diffusion coefficient by wavelet methods. Scand. J. Statist. 19 317–335. [8] G OBET, E., H OFFMANN , M. and R EISS , M. (2004). Nonparametric estimation of scalar diffusions based on low frequency data. Ann. Statist. 32 2223–2253. [9] G ODAMBE , V. P. and H EYDE , C. C. (1987). Quasilikelihood and optimal estimation. Internat. Statist. Rev. 55 231–244.

[10] H EYDE , C. C. (1997). Quasi-Likelihood and Its Application. Springer, New York. [11] H OFFMANN , M. (1999). Adaptive estimation in diffusion processes. Stochastic Process. Appl. 79 135–163. [12] H OFFMANN , M. (1999). Lp estimation of the diffusion coefficient. Bernoulli 5 447–481. [13] JACOBSEN , M. (2001). Discretely observed diffusions: Classes of estimating functions and small -optimality. Scand. J. Statist. 28 123–149. [14] JACOBSEN , M. (2002). Optimality and small -optimality of martingale estimating functions. Bernoulli 8 643–668. [15] JACOD , J. (2000). Nonparametric kernel estimation of the coefficient of a diffusion. Scand. J. Statist. 27 83–96. [16] K ESSLER , M. (1997). Estimation of an ergodic diffusion from discrete observations. Scand. J. Statist. 24 211–229. [17] K ESSLER , M. and S ØRENSEN , M. (1999). Estimating equations based on eigenfunctions for a discretely observed diffusion process. Bernoulli 5 299–314. [18] L ARSEN , K. S. and S ØRENSEN , M. (2005). A diffusion model for exchange rates in a target zone. Math. Finance. To appear. [19] S ØRENSEN , H. (2004). Parametric inference for diffusion processes observed at discrete points in time: A survey. Internat. Statist. Rev. 72 337–354. [20] S ØRENSEN , M. (1997). Estimating functions for discretely observed diffusions: A review. In Selected Proceedings of the Symposium on Estimating Functions (I. V. Basawa, V. P. Godambe and R. L. Taylor, eds.) 305–325. IMS, Hayward, CA. [21] S ØRENSEN , M. (2000). Prediction-based estimating functions. Econom. J. 3 123–147. [22] S ØRENSEN , M. (2005). Efficient martingale estimating functions for discretely sampled ergodic diffusions. Preprint, Dept. Appl. Math. and Statistics, Univ. Copenhagen. [23] S OULIER , P. (1998). Nonparametric estimation of the diffusion coefficient of a diffusion process. Stochastic Anal. Appl. 16 185–200.

Statistical Science 2005, Vol. 20, No. 4, 347–350 DOI 10.1214/088342305000000458 © Institute of Mathematical Statistics, 2005

Comment: A Selective Overview of Nonparametric Methods in Financial Econometrics

Per A. Mykland and Lan Zhang

Per A. Mykland is Professor, Department of Statistics, The University of Chicago, Chicago, Illinois 60637, USA (e-mail: [email protected]). Lan Zhang is Assistant Professor, Department of Finance, University of Illinois at Chicago, Chicago, Illinois 60607, and Department of Statistics, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA (e-mail: [email protected]).

We would like to congratulate Jianqing Fan for an excellent and well-written survey of some of the literature in this area. We will here focus on some of the issues which are at the research frontiers in financial econometrics but are not covered in the survey. Most importantly, we consider the estimation of actual volatility. Related to this is the realization that financial data is actually observed with error (typically called market microstructure), and that one needs to consider a hidden semimartingale model. This has implications for the Markov models discussed above. For reasons of space, we have not included references to all the relevant work by the authors that are cited, but we have tried to include at least one reference to each of the main contributors to the realized volatility area.

1. THE ESTIMATION OF ACTUAL VOLATILITY: THE IDEAL CASE

The paper discusses the estimation of Markovian systems, models where the drift and volatility coefficients are functions of time t or state x. There is, however, scope for considering more complicated systems. An important tool in this respect is the direct estimation of volatility based on high-frequency data. One considers a system of, say, log securities prices, which follows:

(1) $dX_t = \mu_t\,dt + \sigma_t\,dB_t$,

where B_t is a standard Brownian motion. Typically, µ_t, the drift coefficient, and σ_t², the instantaneous variance (or volatility) of the returns process X_t, will be stochastic processes, but these processes can depend on the past in ways that need not be specified, and can be substantially more complex than a Markov model. This is known as an Itô process. A main quantity of econometric interest is to obtain time series of the form

$$\Theta_i = \int_{T_i^-}^{T_i^+} \sigma_t^2\,dt, \qquad i = 1, 2, \ldots.$$

Here T_i^− and T_i^+ can, for example, be the beginning and the end of day number i. Θ_i is variously known as the integrated variance (or volatility) or quadratic variation of the process X. The reason why one can hope to obtain this series is as follows. If T_i^− = t_0 < t_1 < · · · < t_n = T_i^+ spans day number i, define the realized volatility by

(2) $\hat\Theta_i = \sum_{j=0}^{n-1}\bigl(X_{t_{j+1}} - X_{t_j}\bigr)^2$.

Then stochastic calculus tells us that

(3) $\Theta_i = \lim_{\max|t_{j+1}-t_j|\to 0} \hat\Theta_i$.

In the presence of high frequency financial data, in many cases with transactions as often as every few seconds, one can, therefore, hope to almost observe Θ_i. One can then either fit a model to the series of Θ̂_i, or one can use it directly for portfolio management (as in [12]), options hedging (as in [29]), or to test goodness of fit [31]. There are too many references to the relationship (3) to name them all, but some excellent treatments can be found in [27], Section 1.5; [26], Theorem I.4.47 on page 52; and [33], Theorem II-22 on page 66. An early econometric discussion of this relationship can be found in [2]. To make it even more intriguing, recent work both from the probabilistic and econometric sides gives the mixed normal distribution of the error in the approximation in (3). References include [6, 25, 31]. The random variance of the normal error is $2\frac{T_i^+ - T_i^-}{n}\cdot\int_{T_i^-}^{T_i^+}\sigma_t^4\,dH(t)$, where H is the quadratic variation of

time. H(t) = t in the case where the t_i are equidistant. Further econometric literature includes, in particular, [3, 4, 8, 9, 14, 18, 32]. Problems that are attached to the estimation of covariations between two processes are discussed in [22]. Estimating σ_t² at each point t goes back to [13]; see also [30], but this has not caught on quite as much in econometric applications.

2. THE PRESENCE OF MEASUREMENT ERROR

The theory described above runs into a problem with real data. For illustration, consider how the realized volatility depends on sampling frequency for the stock (and day) considered in Figure 1. The estimator does not converge as the observation points ti become dense in the interval of this one day, but rather seems to take off to infinity. This phenomenon was originally documented in [2]. For transaction data, this picture is repeated for most liquid securities [19, 37]. In other words, the model (1) is wrong. What can one do about this? A lot of people immediately think that

the problem is due to jumps, but that is not the case. The limit in (3) exists even when there are jumps. The requirement for (3) to exist is that the process X be a semimartingale (we again cite Theorem I.4.47 of [26]), which includes both Itô processes and jumps. The inconsistency between the empirical results where the realized volatility diverges with finer sampling, and the semimartingale theory which dictates the convergence of the realized volatility, poses a problem, since financial processes are usually assumed to be semimartingales. Otherwise, somewhat loosely speaking, there would be arbitrage opportunities in the financial markets. For rigorous statements, see, in particular, [11]. The semimartingaleness of financial processes, therefore, is almost a matter of theology in most of finance, and yet, because of Figure 1 and similar graphs for other stocks, we have to abandon it. Our alternative model is that there is measurement error in the observation. At transaction number i, instead of seeing Xti from model (1) or, more generally, from a semimartingale, one observes (4)

$Y_{t_i} = X_{t_i} + \varepsilon_i$.

F IG . 1. Plot of realized volatility for Alcoa Aluminum for January 4, 2001. The data is from the TAQ database. There are 2011 transactions on that day, on average one every 13.365 seconds. The most frequently sampled volatility uses all the data, and this is denoted as “frequency = 1.” “Frequency = 2” corresponds to taking every second sampling point. Because this gives rise to two estimators of volatility, we have averaged the two. And so on for “frequency = k” up to 20. The plot corresponds to the average realized volatility discussed in [37]. Volatilities are given on an annualized and square root scale.
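The averaging construction described in the caption can be written in a few lines. The sketch below is my own and is run on simulated noisy observations rather than the TAQ transactions behind Figure 1, with purely illustrative parameter values; it computes the average realized volatility at sampling frequencies k = 1, 2, . . . and shows the same qualitative pattern, with the estimate increasing sharply as all the data are used.

```python
import numpy as np

# A sketch (mine, not the code behind Figure 1): average subsampled realized
# volatility as a function of the sampling "frequency" k, computed on simulated
# noisy observations Y = X + noise.  All parameter values are illustrative.
rng = np.random.default_rng(2)
n = 2011                                   # roughly one trading day of transactions
sigma = 0.2 / np.sqrt(252)                 # daily volatility of the efficient log price
x = np.cumsum(sigma / np.sqrt(n) * rng.standard_normal(n))   # efficient log price
y = x + 0.0005 * rng.standard_normal(n)    # add market microstructure noise

def avg_subsampled_rv(y, k):
    """Average of the k realized volatilities obtained from every k-th observation."""
    rvs = [np.sum(np.diff(y[j::k]) ** 2) for j in range(k)]
    return np.mean(rvs)

for k in (1, 2, 5, 10, 20):
    print(f"frequency = {k:2d}: realized volatility = {avg_subsampled_rv(y, k):.6f}")
# With noise present the k = 1 value is inflated far above the true integrated
# variance sigma**2; sparser sampling reduces, but does not remove, the bias.
```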


We call this the hidden semimartingale model. The rationale is (depending on your subject matter) either that a transaction is a measurement of the underlying price X_{t_i}, and of course there is error, or that it is due to market microstructure, as documented by, among others, Roll [34], Glosten [15], Glosten and Harris [16], Brown [7], Harris [20] and Hasbrouck [21]. See [1] for a discussion of this. A natural model for the error is that it is either i.i.d. or a stationary process, as considered by Zhou [38], Gloter and Jacod [17], Zhang, Mykland and Aït-Sahalia [37], Bandi and Russell [5], Zhang [36], Aït-Sahalia, Mykland and Zhang [1] and Hansen and Lunde [19]. Under quite loose conditions, this alternative model is consistent with the plot in Figure 1. Instead of (3), one gets that the realized volatility becomes nE(ε₁ − ε₀)² + O_p(n^{−1/2}). In the early literature (as cited in the previous section), the problem is usually taken care of by (sic) reducing n. A variety of approaches that improve on this are documented in [37], to which we refer for an in-depth discussion. As demonstrated by Zhang [36], the true volatility Θ_i can be consistently estimated at rate O_p(n^{−1/4}), as opposed to O_p(n^{−1/2}) when there is no error. This is not as slow as it seems, since n is quite large for liquid securities. An alternative description of the error is that it arises due to rounding (financial instruments are, after all, traded on a price grid). Research in this direction has been done by Delattre and Jacod [10] and by Zeng [35]. To first order, the rounding and additive error models are similar, as documented by Delattre and Jacod [10]; see also [28]. It is awkward that these models imply the existence of arbitrage. The size of the error, however, is so small that it is hard to take economic advantage of them, and this, presumably, is why such deviations can persist.
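A hedged sketch of the two-scale idea of [36, 37] is given below (my own simplified implementation with illustrative parameter values; see those papers for the exact estimator, its finite-sample adjustments and the choice of the number of subgrids): the average of sparse-scale realized volatilities is bias-corrected by the full-sample realized volatility, which essentially estimates the noise variance.

```python
import numpy as np

# A hedged sketch of the two-scale idea along the lines of [36, 37] (my own
# simplified implementation, with illustrative parameter values).
rng = np.random.default_rng(3)
n = 23400                                  # one observation per second over 6.5 hours
sigma2 = 0.0001                            # true integrated variance over the day
x = np.cumsum(np.sqrt(sigma2 / n) * rng.standard_normal(n))
y = x + 0.0005 * rng.standard_normal(n)    # observed = efficient price + noise

def rv(z):
    return np.sum(np.diff(z) ** 2)

K = 300                                    # slow time scale (a tuning parameter)
rv_all = rv(y)                             # fast scale: dominated by noise, ~ 2*n*E[eps^2]
rv_avg = np.mean([rv(y[j::K]) for j in range(K)])   # average of the K sparse estimators
nbar = (n - K + 1) / K
tsrv = rv_avg - (nbar / n) * rv_all        # two-scale combination removes the noise bias

print("fast-scale RV :", rv_all)
print("sparse avg RV :", rv_avg)
print("two-scale est :", tsrv, "   (true integrated variance:", sigma2, ")")
```

In this simulation the fast-scale realized volatility is about two orders of magnitude too large, while the two-scale combination comes back close to the true integrated variance, in line with the consistency result of Zhang [36].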

3. IMPLICATIONS FOR MARKOV MODELS

We now return to the subject of Jianqing Fan's overview, namely the Markov case. It is clear that the model without observation error is not consistent with the data. This may not be a problem when working with, say, daily data, but would pose problems when using high-frequency (intraday) observations. It is presumably quite straightforward to extend the methods discussed in the paper to the case of observation error, and it would be interesting to see the results. The same applies to similar studies on Markov models by the "French school," such as Hoffmann [23] and Jacod [24].

ACKNOWLEDGMENT

We gratefully acknowledge the support of the National Science Foundation under Grant DMS-02-04639. REFERENCES [1] A ÏT-S AHALIA , Y., M YKLAND , P. A. and Z HANG , L. (2005). How often to sample a continuous-time process in the presence of market microstructure noise. Review of Financial Studies 18 351–416. [2] A NDERSEN , T. G., B OLLERSLEV, T., D IEBOLD , F. X. and L ABYS , P. (2000). Great realizations. Risk 13 105–108. [3] A NDERSEN , T. G., B OLLERSLEV, T., D IEBOLD , F. X. and L ABYS , P. (2001). The distribution of realized exchange rate volatility. J. Amer. Statist. Assoc. 96 42–55. [4] A NDERSEN , T. G., B OLLERSLEV, T., D IEBOLD , F. X. and L ABYS , P. (2003). Modeling and forecasting realized volatility. Econometrica 71 579–625. [5] BANDI , F. M. and RUSSELL , J. R. (2005). Microstructure noise, realized volatility and optimal sampling. Technical report, Graduate School of Business, Univ. Chicago. [6] BARNDORFF -N IELSEN , O. E. and S HEPHARD , N. (2002). Econometric analysis of realized volatility and its use in estimating stochastic volatility models. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 253–280. [7] B ROWN , S. J. (1990). Estimating volatility. In Financial Options: From Theory to Practice (S. Figlewski, W. Silber and M. Subrahmanyam, eds.) 516–537. Business One-Irwin, Homewood, IL. [8] C HERNOV, M. and G HYSELS , E. (2000). A study towards a unified approach to the joint estimation of objective and risk neutral measures for the purpose of options valuation. J. Financial Economics 56 407–458. [9] DACOROGNA , M. M., G ENÇAY, R., M ÜLLER , U., O LSEN , R. B. and P ICTET, O. V. (2001). An Introduction to HighFrequency Finance. Academic Press, San Diego. [10] D ELATTRE , S. and JACOD , J. (1997). A central limit theorem for normalized functions of the increments of a diffusion process, in the presence of round-off errors. Bernoulli 3 1–28. [11] D ELBAEN , F. and S CHACHERMAYER , W. (1995). The existence of absolutely continuous local martingale measures. Ann. Appl. Probab. 5 926–945. [12] F LEMING , J., K IRBY, C. and O STDIEK , B. (2001). The economic value of volatility timing. J. Finance 56 329–352. [13] F OSTER , D. and N ELSON , D. (1996). Continuous record asymptotics for rolling sample variance estimators. Econometrica 64 139–174. [14] G ALLANT, A. R., H SU , C.-T. and TAUCHEN , G. (1999). Using daily range data to calibrate volatility diffusions and extract the forward integrated variance. Review of Economics and Statistics 81 617–631. [15] G LOSTEN , L. R. (1987). Components of the bid-ask spread and the statistical properties of transaction prices. J. Finance 42 1293–1307. [16] G LOSTEN , L. R. and H ARRIS , L. E. (1988). Estimating the components of the bid/ask spread. J. Financial Economics 21 123–142.


[17] G LOTER , A. and JACOD , J. (2000). Diffusions with measurement errors: I—Local asymptotic normality and II—Optimal estimators. Technical report, Univ. de Paris VI. [18] G ONCALVES , S. and M EDDAHI , N. (2005). Bootstrapping realized volatility. Technical report, Univ. de Montréal. [19] H ANSEN , P. R. and L UNDE , A. (2006). Realized variance and market microstructure noise. J. Bus. Econom. Statist. To appear. [20] H ARRIS , L. (1990). Statistical properties of the Roll serial covariance bid/ask spread estimator. J. Finance 45 579–590. [21] H ASBROUCK , J. (1993). Assessing the quality of a security market: A new approach to transaction-cost measurement. Review of Financial Studies 6 191–212. [22] H AYASHI , T. and YOSHIDA , N. (2005). On covariance estimation of non-synchronously observed diffusion processes. Bernoulli 11 359–379. [23] H OFFMANN , M. (1999). Lp estimation of the diffusion coefficient. Bernoulli 5 447–481. [24] JACOD , J. (2000). Nonparametric kernel estimation of the coefficient of a diffusion. Scand. J. Statist. 27 83–96. [25] JACOD , J. and P ROTTER , P. (1998). Asymptotic error distributions for the Euler method for stochastic differential equations. Ann. Probab. 26 267–307. [26] JACOD , J. and S HIRYAEV, A. N. (2003). Limit Theorems for Stochastic Processes, 2nd ed. Springer, New York. [27] K ARATZAS , I. and S HREVE , S. E. (1991). Brownian Motion and Stochastic Calculus, 2nd ed. Springer, New York. [28] KOLASSA , J. and M C C ULLAGH , P. (1990). Edgeworth series for lattice distributions. Ann. Statist. 18 981–985.

[29] M YKLAND , P. A. (2003). Financial options and statistical prediction initervals. Ann. Statist. 31 1413–1438. [30] M YKLAND , P. A. and Z HANG , L. (2001). Inference for volatility type objects and implications for hedging. Technical report, Dept. Statistics, Carnegie Mellon Univ. [31] M YKLAND , P. A. and Z HANG , L. (2002). ANOVA for diffusions. Technical report, Dept. Statistics, Univ. Chicago. [32] O OMEN , R. (2004). Properties of realized variance for a pure jump process: Calendar time sampling versus business time sampling. Technical report, Warwick Business School, Univ. Warwick. [33] P ROTTER , P. (2004). Stochastic Integration and Differential Equations: A New Approach, 2nd ed. Springer, New York. [34] ROLL , R. (1984). A simple implicit measure of the effective bid-ask spread in an efficient market. J. Finance 39 1127–1139. [35] Z ENG , Y. (2003). A partially-observed model for micromovement of asset process with Bayes estimation via filtering. Math. Finance 13 411–444. [36] Z HANG , L. (2004). Efficient estimation of stochastic volatility using noisy observations: A multi-scale approach. Technical report, Dept. Statistics, Carnegie Mellon Univ. [37] Z HANG , L., M YKLAND , P. A. and A ÏT-S AHALIA , Y. (2005). A tale of two time scales: Determining integrated volatility with noisy high-frequency data. J. Amer. Statist. Assoc. 100 1394–1411. [38] Z HOU , B. (1996). High-frequency data and volatility in foreign-exchange rates. J. Bus. Econom. Statist. 14 45–52.

Statistical Science 2005, Vol. 20, No. 4, 351–357 DOI 10.1214/088342305000000421 © Institute of Mathematical Statistics, 2005

Rejoinder: A Selective Overview of Nonparametric Methods in Financial Econometrics Jianqing Fan I am very grateful to the Executive Editor, Edward George, for organizing this stimulating discussion. I would like to take this opportunity to thank Professors Peter Phillips, Jun Yu, Michael Sørensen, Per Mykland and Lan Zhang for their insightful and stimulating comments, touching both practical, methodological and theoretical aspects of financial econometrics and their applications in asset pricing, portfolio allocation and risk management. They have made valuable contributions to the understanding of various financial econometric problems. The last two decades have witnessed an explosion of developments of data-analytic techniques in statistical modeling and analysis of complex systems. At the same time, statistical techniques have been widely employed to confront various complex problems arising from financial and economic activities. While the discipline has grown rapidly over the last two decades and has rich and challenging statistical problems, the number of statisticians involved in studying financial econometric problems is still limited. In comparison with statisticians working on problems in biological sciences and medicine, the group working on financial and econometric problems is dismally small. It is my hope that this article will provide statisticians with quick access to some important and interesting problems in financial econometrics and to catalyze the romance between statistics and finance. A similar effort was made by Cai and Hong [12], where various aspects of nonparametric methods in continuous-time finance are reviewed. It is my intention to connect financial econometric problems as closely to statistical problems as possible so that familiar statistical tools can be employed. With this in mind, I sometimes oversimplify the problems and techniques so that key features can be highlighted.

I am fully aware that financial econometrics has grown into a vast discipline itself and that it is impossible for me to provide an overview within a reasonable length. Therefore, I greatly appreciate what all discussants have done to expand the scope of discussion and provide additional references. They have also posed open statistical problems for handling nonstationary and/or non-Markovian data with or without market noise. In addition, statistical issues on various versions of capital asset pricing models and their related stochastic discount models [15, 19], the efficient market hypothesis [44] and risk management [17, 45] have barely been discussed. These reflect the vibrant interface between statistics and finance. After a brief response to the issues raised by the discussants, I will make some further efforts in outlining econometric problems where statistics plays an important role.

1. BIASES IN STATISTICAL ESTIMATION

The contributions by Professors Phillips, Yu and Sørensen address the bias issues on the estimation of parameters in diffusion processes. Professors Phillips and Yu further translate the bias of diffusion parameter estimation into those of pricing errors of bonds and bond derivatives. Their results are very illuminating and illustrate the importance of estimation bias in financial asset pricing. Their results can be understood as follows. Suppose that the price of a financial asset depends on certain parameters θ (the speed of the reversion κ in their illustrative example). Let us denote it by p(θ ), which can be in one case the price of a bond and in another case the prices of derivatives of a bond. The value of the asset is now estimated by p(θˆ ) with θˆ being estimated from empirical data. When θˆ is overestimated (say), which shifts the whole distribution of θˆ to the left, the distribution of p(θˆ ) will also be shifted, depending on the sensitivity of p to θ . The sensitivity is much larger for bond derivatives when κ is close to zero (see Figure 2 of [46]), and hence the pricing errors are much larger. On the other hand, as the distribution

Jianqing Fan is Professor, Benheim Center of Finance and Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey 08544, USA (e-mail: [email protected]).


of κ is shifted to the left, from Figure 2 of [46], both prices of bonds and their derivatives get smaller and so does the variance of pricing errors. Simulation studies in [46] suggest that these two effects cancel each other out in terms of mean square error. I agree with Phillips and Yu's observation that discretization is not the main source of biases for many reasonable financial applications. Finite-sample estimation bias can be more severe. This partially answers the question raised by Professor Sørensen. On the other hand, his comments give theoretical insights into the bias due to discretization. For financial applications (such as modeling short-term rates) when the data are collected at monthly frequency, the bias {1 − exp(−κΔ)}/Δ − κ is −0.0019 and −0.00042, respectively, for κ = 0.21459 used in Figure 3 of [34] and for κ = 0.1 used in the discussion by Phillips and Yu. For weekly data, using the parameter κ = 0.0446 cited in [14], the discretization bias is merely 9.2 × 10⁻⁵. For other types of applications, such as climatology, Professor Sørensen is right that the bias due to discretization can sometimes be substantial. It is both theoretically elegant and practically viable to have methods that work well for all situations. The quasi-maximum likelihood methods and their modifications discussed by Professor Sørensen are attractive alternatives. As he pointed out, analytical solutions are rare and computational algorithms are required. This increases the chance of numerical instability in practical implementations. The problem can be attenuated by using the estimates based on the Euler approximation as initial values. The martingale method is a generalization of his quasi-maximum likelihood estimator, which aims at improving efficiency by a suitable choice of weighting functions a_j. However, unless the conditional density has multiplicative score functions, the estimation equations will not be efficient. This explains the observation made by Professor Sørensen that the methods based on martingale estimating functions are usually not efficient for low frequency data. The above discussion tends to suggest that when the Euler approximation is reasonable, the resulting estimates tend to have smaller variances. In addition to the discretization bias and finite sample estimation bias, there is model specification bias. This can be serious in many applications. In the example given by Professors Phillips and Yu, the modeling errors do not have any serious adverse effects on pricing bonds and their derivatives. However, we should be wary of generalizing this statement. Indeed, for the model parameters given in the discussion by

Phillips and Yu, the transition density of the CIR model has a noncentral χ²-distribution with 80 degrees of freedom, which is close to the normal transition density given by the Vasicek model. Therefore, the model is not very seriously misspecified. Nonparametric methods reduce model specification errors by either global modeling such as spline methods or local approximations. This reduces significantly the possibility of specification errors. Since nonparametric methods are somewhat crude and often used as model diagnostic and exploration tools, simple and quick methods serve many practical purposes. For example, in time domain smoothing, the bandwidth h is always an order of magnitude larger than the sampling frequency Δ. Therefore, the approximation errors due to discretization are really negligible. Similarly, for many realistic problems, the function approximation errors can be an order of magnitude larger than discretization errors. Hence, discretization errors are often not a main source of errors in nonparametric inference.
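As a quick arithmetic check of the discretization-bias figures quoted above (my own computation, which assumes that monthly sampling corresponds to Δ = 1/12 in annualized time):

```python
import numpy as np

# Evaluating the discretization bias {1 - exp(-kappa*Delta)}/Delta - kappa.
# A small check of my own; Delta = 1/12 is assumed to correspond to monthly data.
def euler_bias(kappa, Delta):
    return (1 - np.exp(-kappa * Delta)) / Delta - kappa

for kappa in (0.21459, 0.1):
    print(f"kappa = {kappa}: bias = {euler_bias(kappa, 1/12):.5f}")
# This reproduces the monthly values -0.0019 and -0.00042 cited in the text.
```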

2. HIGH-FREQUENCY DATA

Professors Mykland, Zhang, Phillips and Yu address statistical issues for high-frequency data. I greatly appreciate their insightful comments and their elaborations on the importance and applications of the subject. Thanks to the advances in modern trading technology, the availability of high-frequency data over the last decade has significantly increased. Research in this area has advanced very rapidly lately. I would like to thank Professors Mykland and Zhang for their comprehensive overview on this active research area. With high-frequency data, discretization errors have been significantly reduced. Nonparametric methods become even more important for this type of large sample problem. The connections between the realized volatility and the time-inhomogeneous model can simply be made as follows. Consider a subfamily of models of (8) in [34], dX_t = α_t dt + σ_t dW_t. For high-frequency data the sampling interval is very small. For the sampling frequency of a minute, Δ = 1/(252 ∗ 24 ∗ 60) ≈ 2.756 × 10⁻⁶. Hence, standardized residuals in Section 2.5 of [34] become E_t = Δ^{−1/2}(X_{t+Δ} − X_t) and the local constant estimate of the spot volatility reduces to

$$\hat\sigma_{j\Delta}^2 = \sum_{i=-\infty}^{j-1} w_{j-i} E_{i\Delta}^2,$$

where {w_i} are the weights induced by a kernel function satisfying $\sum_{i=1}^{\infty} w_i = 1$. Now, for the weights with a bounded support, the quadratic variation of the process or integrated volatility $\int_t^T \sigma_t^2\,dt$ is naturally estimated by $\Delta\sum_{i=t/\Delta}^{T/\Delta-1}\hat\sigma_{i\Delta}^2$, which is simply

$$\sum_{i=t/\Delta}^{T/\Delta-1}\bigl(X_{i\Delta} - X_{(i-1)\Delta}\bigr)^2.$$

This shows that our nonparametric estimation of the integrated volatility for high-frequency data is indeed the same as the realized volatility. As suggested by Professors Mykland, Zhang, Phillips and Yu, the applications of realized volatilities are not without difficulties. Market microstructure noises emerge at such a fine frequency of observation and market prices can contain multiple jumps due to the flux of information during a trading session. Figure 1 in the discussion by Mykland and Zhang demonstrates convincingly the existence of the market microstructure noise. Aït-Sahalia, Mykland and Zhang [1] and Zhang, Mykland and Aït-Sahalia [50] give comprehensive accounts of this under the assumption that the observed prices are the true ones contaminated with random noise of market microstructure: Y_t = X_t + ε_t. However, they do not take into account that the price processes {X_t} may contain jumps in addition to random noises. An effort in this direction has been made recently by Fan and Wang [38] using wavelet techniques.

3. ESTIMATING COVARIANCE MATRICES

Covariance matrices play an important role in risk management and asset allocation. They are featured prominently in many financial econometrics problems. For example, the smallest and largest eigenvalues are related to the minimum and the maximum of the volatility of portfolios and their corresponding eigenvectors are related to portfolio allocation. See [40] for applications of covariance matrices to portfolio selection and [43] for their applications to other scientific problems. There are a couple of approaches to these kinds of problems, depending on the size of the covariance matrices. I hereby give a brief overview and address some of the open challenges. The simplest estimate of a covariance matrix is probably the sample covariance matrix of the log-returns of p assets over a period of n days prior to the current time t. This is indeed a nonparametric estimation of the covariance matrix localizing in time and has been


studied in multivariate analysis when p is finite and the underlying model is correct, that is, the covariance matrix remains the same in the n days prior to time t. See, for example, [26, 27, 47]. However, the impact of the biases in nonparametric methods on the estimation of eigenvalues and eigenvectors has not yet been thoroughly investigated. The sample covariance matrices can be augmented by using the information from the state domain, which is an extension of the method discussed in Section 3.6 of [34] and allows us to use the historical information. This is particularly useful for estimating the covariance matrices of bonds with different maturities. Useful parametric models such as affine models have been popularly used in interest rate modeling. See, for example, [20, 24, 23]. Nonparametric methods provide useful alternatives to estimating the covariance matrices and to validating parametric models. A naive extension involves high-dimensional smoothing in the state domain. But this can be avoided by localizing only on the yields of a few bonds with intermediate length of maturity. Another class of techniques is to use a form of GARCH model [28] to estimate covariance matrices. As noted in [30], the number of parameters grows rapidly with the dimensionality p. Various efforts have been made to reduce the complexity of the models. These include constant conditional correlation multivariate GARCH models [10], vectorized multivariate GARCH models [11], dynamic conditional correlation models [29, 31], orthogonal GARCH models [2], generalized orthogonal GARCH models [48] and conditionally uncorrelated component models [37]. For a survey, see [8]. In portfolio allocation and risk management, the number of stocks p can be well in the order of hundreds, which is typically in the same order as the sample size n. The sample covariance matrix may not be a good estimator of the population one. The estimated variance of a portfolio based on the sample covariance may far exceed the true one. The estimation errors can accumulate quickly when p grows with n. Indeed, Johnstone [43] shows that the largest eigenvalue of the sample covariance matrix is far larger than the population one. There are many studies on the behavior of random matrices when the dimensionality p grows with n. See, for example, [5, 22, 21, 49]. For a survey, see [4]. Estimating covariance matrices for large p is intrinsically challenging. For example, when p = 200, there are more than 20,000 free parameters. Yet, the


available sample size is usually in the order of hundreds or a few thousand. Longer time series (larger n) will increase modeling biases. Without imposing structures on the covariance matrices, they are hard to estimate. Thanks to the multi-factor models (see Chapter 6 of [13]), if a few factors can capture completely the cross-sectional risks, the number of parameters can be significantly reduced. For example, using the Fama–French three-factor models [32, 33], there are 4p instead of p(p + 1)/2 parameters. A natural question arises with this structured estimate of the covariance matrix: how large can p be such that the estimation error in the covariance matrix is negligible for asset allocation and risk management? Problems of this kind are interesting and remain open. Another possible approach to the estimation of covariance matrices is to use a model selection approach. First of all, according to Chapter 3 of [39], the Cholesky decomposition admits a nice autoregressive interpretation. We may reasonably assume that the elements in the Cholesky decomposition of the covariance matrix are sparse. Hence, the penalized likelihood method [3, 35, 42] can be employed to select and estimate the nonzero elements. The sampling properties of such a method remain unknown. Its impact on portfolio allocation and risk management needs to be studied.
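To make the dimension reduction concrete, the sketch below (my own, using simulated returns and generic simulated factors in place of the actual Fama–French factors) assembles the structured covariance estimate Σ = BΣ_f B′ + diag(ψ) from asset-by-asset regressions on the common factors; everything here, including the parameter values, is illustrative.

```python
import numpy as np

# A sketch (mine) of covariance estimation through a three-factor model:
# R_t = B f_t + eps_t, so Sigma = B Sigma_f B' + diag(psi).  The factors and
# loadings below are simulated stand-ins, not the actual Fama-French factors.
rng = np.random.default_rng(4)
n, p, k = 252, 200, 3                      # one year of daily returns, 200 assets, 3 factors

f = rng.standard_normal((n, k)) * 0.01     # factor returns
B_true = rng.standard_normal((p, k))       # factor loadings
eps = rng.standard_normal((n, p)) * 0.02   # idiosyncratic returns
R = f @ B_true.T + eps                     # asset returns, n x p

# Estimate loadings by least squares of each asset on the factors,
# then assemble the structured covariance estimate.
B_hat, *_ = np.linalg.lstsq(f, R, rcond=None)          # k x p matrix of loadings
resid = R - f @ B_hat
Sigma_f = np.cov(f, rowvar=False)
psi = resid.var(axis=0, ddof=k)
Sigma_hat = B_hat.T @ Sigma_f @ B_hat + np.diag(psi)

print("free parameters, factor model :", 4 * p)          # 3 loadings + 1 variance per asset
print("free parameters, unrestricted :", p * (p + 1) // 2)
print("largest eigenvalue of Sigma_hat:", np.linalg.eigvalsh(Sigma_hat)[-1])
```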

4. STATISTICS IN DERIVATIVE PRICING

Over the last three decades, option pricing has witnessed an explosion of new models that extend the original work of Black and Scholes [9]. Empirically pricing financial derivatives is innately related to statistical regression problems. This is well documented in papers such as [6, 7, 15, 16, 25, 41]. See also a brief review given by Cai and Hong [12]. For a given stochastic model with given structural parameters under the risk-neutral measure, the prices of European options can be determined, which are simply the discounted expected payoffs under the risk-neutral measure. Bakshi, Cao and Chen [6] give the analytic formulas of option prices for five commonly used stochastic models, including the stochastic-volatility random-jump model. They then estimate the risk-neutral parameters by minimizing the discrepancies between the observed prices and the theoretical ones. With estimated risk-neutral parameters, option prices with different characteristics can be evaluated. They conduct a comprehensive study of the relative merits of competing option pricing models by computing pricing errors for new options. Dumas, Fleming and Whaley [25] model the implied volatility function by a quadratic function of the strike price and time to maturity and determine these parameters by minimizing pricing errors. Based on the analytic formula of Bakshi, Cao and Chen [6] for option price under the stochastic volatility models, Chernov and Ghysels [16] estimate the risk-neutral parameters by integrating information from both historical data and risk-neutral data implied by observed option prices. Instead of using continuous-time diffusion models, Heston and Nandi [41] assume that the stock prices under the risk-neutral world follow a GARCH model and derive a closed form for European options. They determine the structural parameters by minimizing the discrepancy between the empirical and theoretical option prices. Barone-Adesi, Engle and Mancini [7] estimate risk-neutral parameters by integrating the information from both historical data and option prices. Christoffersen and Jakobs [18] expand the flexibility of the model by introducing long- and short-run volatility components. The above approaches can be summarized as follows. Using the notation in Section 4.1 of [34], the theoretical option price with option characteristics (S_i, K_i, T_i, r_i, δ_i) is governed by a parametric form C(S_i, K_i, T_i, r_i, δ_i, θ), where θ is a vector of structural parameters of the stock price dynamics under the risk-neutral measure. The form depends on the underlying parameters of the stochastic model. This can be in one case a stochastic volatility model and in another case a GARCH model. The parameters are then determined by minimizing

$$\sum_{i=1}^{n}\bigl\{C_i - C(S_i, K_i, T_i, r_i, \delta_i, \theta)\bigr\}^2$$

or similar discrepancy measures. The success of a method depends critically on the correctness of model assumptions under the risk-neutral measure. Since these assumptions are not on the physical measure, they are hard to verify. This is why so many parametric models have been introduced. Their efforts can be regarded as searching for an appropriate parametric form C(·; θ) to better fit the option data. Nonparametric methods in Section 4.1 provide a viable alternative for this purpose. They can be combined with parametric approaches to improve the accuracy of pricing. As an illustration, let us consider the options with fixed (S_i, T_i, r_i, δ_i) so that their prices are only a function of K or, equivalently, a function of the moneyness m = K/S,

$$C = \exp(-rT)\int_K^{\infty} (x - K) f^*(x)\,dx.$$


F IG . 1. (a) Scatterplot of the response variable computed based on option prices with consecutive strike price against the moneyness. (b) The implied volatilities of the options during the period July 7–11, 2003.

Denoting D = exp(rT)C/S and letting $\bar F^*(x) = 1 - F^*(x) = \int_x^{\infty} f^*(y)\,dy$ be the survival function, then by integration by parts,

$$D = -S^{-1}\int_K^{\infty}(x - K)\,d\bar F^*(x) = S^{-1}\int_K^{\infty}\bar F^*(x)\,dx.$$

By a change of variable, we have

$$D = \int_m^{\infty} \bar F(u)\,du,$$

where F(u) = F*(Su) is the state price distribution in the normalized scale (the stock price is normalized to $1). Let us write explicitly D(m) to stress the dependence of the discounted option price on the moneyness m. Then

$$\frac{D(m_1) - D(m_2)}{m_2 - m_1} = (m_2 - m_1)^{-1}\int_{m_1}^{m_2}\bar F(u)\,du = \bar F\Bigl(\frac{m_1 + m_2}{2}\Bigr) + O\bigl((m_2 - m_1)^2\bigr).$$

Assume that the moneyness m_i = K_i/S_t has already been ordered for the N_t options with strike prices {K_i, i = 1, . . . , N_t} traded at time t. Let x_i = (m_i + m_{i+1})/2 = (K_i + K_{i+1})/(2S) and let y_i be the observed value of {D(m_i) − D(m_{i+1})}/(m_{i+1} − m_i), namely,

$$y_i = \exp(r_t T_t)\{C_i - C_{i+1}\}/\{K_{i+1} - K_i\}, \qquad i = 1, \ldots, N_t - 1,$$

where r_t, T_t and S_t are, respectively, the risk-free interest rate, time to maturity and spot stock price at time t, and C_{i+1} and C_i are the option prices at time t associated with strike prices K_{i+1} and K_i. Then, estimating

the state price distribution becomes a familiar nonparametric regression problem, y_i ≈ F̄(x_i) + ε_i. In the above equation, the dependence on t is suppressed. Figure 1(a) shows the scatterplot of the pairs (x_i, y_i) based on the closing call option prices (average of bid-ask prices) of the Standard and Poor's 500 index with maturity of T_t = 75 − t days on the week of July 7 to July 11, 2003 (t = 0, . . . , 4). The implied volatility curve is given in Figure 1(b). It is not a constant and provides stark evidence against the Black–Scholes formula. The waterfall shape of the regression curve is very clear. Naive application of nonparametric techniques will incur large approximation biases, resulting in systematic pricing errors. One possible improvement is to use a parametric method such as the ad-hoc Black–Scholes model of Dumas, Fleming and Whaley [25] to estimate the main shape of the regression function and then use a nonparametric method to estimate the difference. This kind of idea has been investigated by Fan and Mancini [36]. When we aggregate the data in the week of July 7 to July 11, 2003, the times to maturity T_t vary slightly. Semiparametric techniques can be used to adjust for this effect. As in many practical problems, we always have side information available that can be incorporated into modeling and analysis of the data. This reinforces the claim that pricing financial derivatives is fundamentally a statistical problem where statisticians can play an important role.
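The regression just described can be set up in a few lines. The sketch below is my own and uses artificial Black–Scholes call prices on a strike grid in place of the S&P 500 options behind Figure 1; the function names and all parameter values are hypothetical. It forms the pairs (x_i, y_i) and smooths them with a simple local constant (Nadaraya–Watson) estimator of the survival function F̄.

```python
import numpy as np
from math import erf, exp, log, sqrt

# A sketch of my own: recovering the state price survival function from call
# prices on a strike grid, using artificial Black-Scholes prices instead of
# the S&P 500 options behind Figure 1.  All parameter values are illustrative.
def ncdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def bs_call(S, K, T, r, sig):
    d1 = (log(S / K) + (r + 0.5 * sig**2) * T) / (sig * sqrt(T))
    d2 = d1 - sig * sqrt(T)
    return S * ncdf(d1) - K * exp(-r * T) * ncdf(d2)

S, T, r, sig = 1000.0, 75 / 365, 0.01, 0.2
K = np.arange(850.0, 1155.0, 5.0)                      # ordered strike grid
C = np.array([bs_call(S, k, T, r, sig) for k in K])

# Build the regression data: y_i estimates the survival function at x_i.
x = (K[:-1] + K[1:]) / (2 * S)                         # mid-moneyness
y = np.exp(r * T) * (C[:-1] - C[1:]) / (K[1:] - K[:-1])

# Local constant (Nadaraya-Watson) smoother with a Gaussian kernel.
def smooth(x0, x, y, h=0.02):
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)
    return np.sum(w * y) / np.sum(w)

for m in (0.9, 1.0, 1.1):
    print(f"moneyness {m:.1f}: smoothed F-bar = {smooth(m, x, y):.3f}")
# Under these Black-Scholes inputs the smoothed curve decreases steeply as the
# moneyness increases through 1, the "waterfall" shape seen in Figure 1(a).
```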


ACKNOWLEDGMENTS

This research was supported in part by NSF Grant DMS-03-55179. The author acknowledges gratefully various discussions with Dr. Loriano Mancini and would like to thank him for providing Figure 1 presented in this discussion. REFERENCES [1] A ÏT-S AHALIA , Y., M YKLAND , P. and Z HANG , L. (2005). How often to sample a continuous-time process in the presence of maket microstructure noise. Review of Financial Studies 18 351–416. [2] A LEXANDER , C. (2001). Orthogonal GARCH. In Mastering Risk 2 (C. Alexander, ed.) 21–38. Financial Times Prentice Hall, London. [3] A NTONIADIS, A. and FAN , J. (2001). Regularization of wavelet approximations (with discussion). J. Amer. Statist. Assoc. 96 939–967. [4] BAI , Z. D. (1999). Methodologies in spectral analysis of large-dimensional random matrices. A review (with discussion). Statist. Sinica 9 611–677. [5] BAI , Z. D. and S ILVERSTEIN , J. W. (1999). Exact separation of eigenvalues of large-dimensional sample covariance matrices. Ann. Probab. 27 1536–1555. [6] BAKSHI , G., C AO , C. and C HEN , Z. (1997). Empirical performance of alternative option pricing models. J. Finance 52 2003–2049. [7] BARONE -A DESI , G., E NGLE , R. F. and M ANCINI , L. (2004). GARCH options in incomplete markets. NCCRFinRisk Working Paper 155, Univ. Zurich. [8] BAUWENS , L., L AURENT, S. and ROMBOUTS , J. V. K. (2003). Multivariate GARCH models: A survey. CORE Discussion Paper 2003/31. [9] B LACK , F. and S CHOLES , M. (1973). The pricing of options and corporate liabilities. J. Political Economy 81 637–654. [10] B OLLERSLEV, T. (1990). Modelling the coherence in shortrun nominal exchange rates: A multivariate generalized ARCH model. Review of Economics and Statistics 72 498–505. [11] B OLLERSLEV, T., E NGLE , R. and W OOLDRIDGE , J. (1988). A capital asset pricing model with time-varying covariances. J. Political Economy 96 116–131. [12] C AI , Z. and H ONG , Y. (2003). Nonparametric methods in continuous-time finance: A selective review. In Recent Advances and Trends in Nonparametric Statistics (M. G. Akritas and D. N. Politis, eds.) 283–302. North-Holland, Amsterdam. [13] C AMPBELL , J. Y., L O , A. W. and M AC K INLAY, A. C. (1997). The Econometrics of Financial Markets. Princeton Univ. Press. [14] C HAPMAN , D. A. and P EARSON , N. D. (2000). Is the short rate drift actually nonlinear? J. Finance 55 355–388. [15] C HEN , X. and L UDVIGSON , S. (2005). Land of Addicts? An empirical investigation of habit-based asset pricing model. Unpublished manuscript. [16] C HERNOV, M. and G HYSELS , E. (2000). A study towards a unified approach to the joint estimation of objective and risk neutral measures for the purpose of options valuation. J. Financial Economics 56 407–458.

[17] C HRISTOFFERSEN , P. F. (2003). Elements of Financial Risk Management. Academic Press, San Diego. [18] C HRISTOFFERSEN , P. F. and JAKOBS , K. (2005). Option valuation with long-run and short-run volatility components. Unpublished manuscript. [19] C OCHRANE , J. H. (2001). Asset Pricing. Princeton Univ. Press. [20] DAI , Q. and S INGLETON , K. (2000). Specification analysis of affine term structure models. J. Finance 55 1943–1978. [21] D ETTE , H. (2002). Strong approximation of eigenvalues of large dimensional Wishart matrices by roots of generalized Laguerre polynomials. J. Approx. Theory 118 290–304. [22] D IACONIS , P. and S HAHSHAHANI , M. (1994). On the eigenvalues of random matrices. J. Appl. Probab. 31A 49–61. [23] D UFFEE , G. R. (2002). Term premia and interest rate forecasts in affine models. J. Finance 57 405–443. [24] D UFFIE , D. and K AN , R. (1996). A yield-factor model of interest rates. Math. Finance 6 379–406. [25] D UMAS , B., F LEMING , J. and W HALEY, R. E. (1998). Implied volatility functions: Empirical tests. J. Finance 53 2059–2106. [26] E ATON , M. L. and T YLER , D. E. (1991). On Wielandt’s inequality and its application to the asymptotic distribution of the eigenvalues of a random symmetric matrix. Ann. Statist. 19 260–271. [27] E ATON , M. L. and T YLER , D. (1994). The asymptotic distribution of singular values with applications to canonical correlations and correspondence analysis. J. Multivariate Anal. 50 238–264. [28] E NGLE , R. F., ed. (1995). ARCH, Selected Readings. Oxford Univ. Press. [29] E NGLE , R. F. (2002). Dynamic conditional correlation— a simple class of multivariate generalized autoregressive conditional heteroskedasticity models. J. Bus. Econom. Statist. 20 339–350. [30] E NGLE , R. F. and K RONER , K. F. (1995). Multivariate simultaneous generalized ARCH. Econometric Theory 11 122–150. [31] E NGLE , R. F. and S HEPPARD , K. (2001). Theoretical and empirical properties of dynamic conditional correlation multivariate GARCH. Preprint. [32] FAMA , E. and F RENCH, K. (1992). The cross-section of expected stock returns. J. Finance 47 427–465. [33] FAMA , E. and F RENCH , K. (1993). Common risk factors in the returns on stocks and bonds. J. Financial Economics 33 3–56. [34] FAN , J. (2005). A selective overview of nonparametric methods in financial econometrics. Statist. Sci. 20 317–337. [35] FAN , J. and L I , R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360. [36] FAN , J. and M ANCINI , L. (2005). Derivative pricing using semiparametric techniques. Unpublished manuscript. [37] FAN , J., WANG , M. and YAO , Q. (2005). Modelling multivariate volatilities via conditionally uncorrelated components. Unpublished manuscript. [38] FAN , J. and WANG , Y. (2005). Multiscale jump and volatility analysis for high-frequency financial data. Unpublished manuscript.

[39] FAN , J. and YAO , Q. (2003). Nonlinear Time Series: Nonparametric and Parametric Methods. Springer, New York. [40] G OLDFARB , D. and I YENGAR , G. (2003). Robust portfolio selection problems. Math. Oper. Res. 28 1–37. [41] H ESTON , S. L. and NANDI , S. (2000). A closed-form GARCH option valuation model. Review of Financial Studies 13 585–625. [42] H UANG , J. Z., L IU , N. and P OURAHMADI , M. (2005). Covariance selection and estimation via penalized normal likelihood. Unpublished manuscript. [43] J OHNSTONE , I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist. 29 295–327. [44] L O , A. (2000). Finance: A selective survey. J. Amer. Statist. Assoc. 95 629–635.


[45] M OFFATT, H. K. (2002). Risk Management: Value at Risk and Beyond. Cambridge Univ. Press. [46] P HILLIPS , P. C. B. and Y U , J. (2005). Jackknifing bond option prices. Review of Financial Studies 18 707–742. [47] T YLER , D. E. (1981). Asymptotic inference for eigenvectors. Ann. Statist. 9 725–736. [48] VAN DER W EIDE , R. (2002). GO-GARCH: A multivariate generalized orthogonal GARCH model. J. Appl. Econometrics 17 549–564. [49] W IEAND , K. (2002). Eigenvalue distributions of random unitary matrices. Probab. Theory Related Fields 123 202–224. [50] Z HANG , L., M YKLAND , P. and A ÏT-S AHALIA , Y. (2005). A tale of two time scales: Determining integrated volatility with noisy high-frequency data. J. Amer. Statist. Assoc. 100 1394–1411.

Statistical Science 2005, Vol. 20, No. 3, 205–209 DOI 10.1214/088342305000000205 © Institute of Mathematical Statistics, 2005

Darrell Huff and Fifty Years of How to Lie with Statistics J. Michael Steele

Abstract. Over the last fifty years, How to Lie with Statistics has sold more copies than any other statistical text. This note explores the factors that contributed to its success and provides biographical sketches of its creators: author Darrell Huff and illustrator Irving Geis.

Key words and phrases: Darrell Huff, Irving Geis, How to Lie with Statistics, numeracy, graphs, crescent cow.

J. Michael Steele is C.F. Koo Professor of Statistics and Operations and Information Management, Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104-6340, USA (e-mail: [email protected]).

1. TOUCHING A MILLION LIVES

In 1954 former Better Homes and Gardens editor and active freelance writer Darrell Huff published a slim (142 page) volume which over time would become the most widely read statistics book in the history of the world. In its English language edition, more than one-half million copies of How to Lie with Statistics have been sold. Editions in other languages have been available for many years, and new translations continue to appear. In 2003 the first Chinese edition was published by the Department of Economics of Shanghai University. There is some irony to the world's most famous statistics book having been written by a person with no formal training in statistics, but there is also some logic to how this came to be. Huff had a thorough training for excellence in communication, and he had an exceptional commitment to doing things for himself.

2. DARRELL HUFF AND THE PATH TO A FREELANCER'S LIFE

Darrell Huff was born on July 15, 1913 in Gowrie, Iowa, a small farming community fifty miles from Ames, Iowa. Huff received his early education in Gowrie, had a lively curiosity about almost everything and eventually evolved an interest in journalism. He studied both sociology and journalism at the University of Iowa, but even before completing his bachelor's degree in 1938, Huff had worked as a reporter for The Herald of Clinton, Iowa and as a feature writer for The Gazette of Cedar Rapids. In 1937 he had also married Frances Marie Nelson, who would become his co-author, mother of his four daughters and wife of sixty-four years. In 1939, when Huff finished his work at the University of Iowa with a master's degree, he made the move from newspapers to magazines. This was a golden age for the magazine industry and in some ways Iowa was at the heart of it. Through 1940, Huff served as an associate editor at Look magazine in Des Moines, and he then took a brief detour as editor-in-chief of D. C. Cook publishing in Elgin, Illinois. After two years in Elgin, he returned to Des Moines to become managing editor of the very influential Better Homes and Gardens [6]. Huff's position at Better Homes and Gardens put him at the top of his profession, but these were turbulent times. In 1944 Huff was offered the executive editorship of Liberty magazine, and he and his family made the hard decision to move to New York. Almost forgotten now, Liberty was at the time a magazine of great national importance. It competed vigorously with the famous The Saturday Evening Post with which it shared many contributors, readers and advertisers. Even today, Liberty competes with The Saturday Evening Post for the attention and affection of collectors of cover illustrations and nostalgic advertising art. With the end of the Second World War, both New York and the editorship of Liberty lost some of their appeal to Huff. What had begun as an exciting adventure started to feel like a "rat race." As Huff would write sixteen years later in an article [14], "I suppose the whole


thing got started because I developed a habit of grinding my teeth in my sleep.” Huff then went on to explain that one night when the teeth grinding was particularly ominous, his wife woke him saying, “This has got to stop.” Huff thought she meant the teeth grinding and he agreed, but it turned out that Fran had other things in mind. “Let’s get away from all this,” she said. “You can go back to freelancing.” 3. A DO-IT-YOURSELFER’S CALIFORNIA IDYLL

Freelance writing and photography had supported Huff through college, but he had no illusions about the challenge he would face trying to support a family without a steady job. Nevertheless, his wife Fran was undaunted, and they decided to move to California, build their own home and make a living through freelance writing. In 1946 they headed west with their life savings of $4000 and a secondhand trailer towed behind a 1941 sedan. Today, such an action would be as sensible as looking for gold in the rivers, but those were different times. For $1500 the Huffs picked up ten acres in Valley of the Moon, five miles from Sonoma. Post-war rationing and a limited budget forced the Huffs to be creative, but after a few months of working on their house while living in their trailer, the Huffs moved into their first home of their own design. It was not finished, but it had a huge fireplace and an inspiring wall of glass facing the California foothills. The Sonoma home was the first of several homes that Huff would build with his own hands, including one in Carmel on the Monterey peninsula. It has 3000 square feet, four bedrooms, a study, three baths and a hexagonal living room that looks directly over the Pacific. It has been featured in numerous articles and has been honored by the National Building Museum. Darrell Huff lived in his Carmel home until he died on June 27, 2001. Frances Marie Huff lives there today, where she is often visited by her daughters. 4. OF HUFF’S MANY BOOKS

At his core, Darrell Huff was a magazine man, and almost all of his writing was directed at the daily concerns of the common person. Popular writing often goes undocumented into history, and today it is not completely certain how many books Huff wrote. Cataloguers of the Library of Congress list fourteen volumes to his credit, but they missed at least two ([12] and [20]). Of Huff’s several hundred feature articles, only one [14] is available on the Internet.

Huff's first book, Pictures by Pete [7], appeared in 1944 with the subtitle A Career Story of a Young Commercial Photographer; in 1945 Huff broadened his theme with Twenty Careers of Tomorrow [8]. Perhaps the young Huff was thinking out loud as he contemplated his own future. In any event, Huff took an entirely new direction in 1946 with his third book, The Dog that Came True [9]. Doing a dog book seemed to put Huff off book writing for a while, and for the next nine years he devoted himself to article writing and to do-it-yourself projects, such as his remarkable California homes. Still, when Huff took up his next book project, he bought himself a piece of posterity. The book was How to Lie with Statistics [10], and it is the reason why we celebrate Darrell Huff in this volume. While the longevity of How to Lie with Statistics must have come as a surprise to Huff, the seeds for its initial success were carefully planted. Remarkably, it was reviewed twice in The New York Times, a feat that would be almost impossible today. First, on January 4, 1954, Burton Crane gave it a column-and-a-half [4] in "The Business Bookshelf"; then on January 16, it got another half column [30] in Charles Poore's "Books of The Times." Both reviews were highly positive, though not especially insightful. In August of 1954, Huff got to follow up in the Times with a more informative feature article [11] of his own, "How to Spot Statistical Jokers." At the bottom of the first column, Huff added a footnote that says "Darrell Huff is a specialist in reading between statistical lines. He is the author of the book How to Lie with Statistics." This does indeed seem to be how the world most remembers him today. Darrell Huff was not one to argue with success, and the thoughts and skills that led to How to Lie with Statistics were to provide the basis for six further books that would deal with what we now call quantitative literacy. The first of these was the natural How to Take a Chance [12], where he again partnered with the brilliant illustrator Irving Geis. Even though the sales of this book were small in comparison with How to Lie with Statistics, it was still a solid success. Huff's other efforts in quantitative literacy were also commercially successful, but less dramatically so. They include Score: The Strategy of Taking Tests [13] and Cycles in Your Life [15], two volumes which set a pattern that others would carry to even greater commercial success. In How to Figure the Odds on Everything [17], Huff polished and modernized his treatment of the themes that he first addressed in How to Take a Chance [12].


Darrell Huff’s last book, The Complete How to Figure It [19], was written with the design and illustration assistance of two of his daughters, Carolyn and Kristy. It was published in 1996 when Huff was eighty-three years old, and it is an unusual book that is in many ways strikingly original. Its modular structure suggests that it may have been brewing for many years. The remainder of Huff’s books ([16, 18, 20–23]) deal more directly with household projects than with quantitative literacy, but in each of these one finds considerable cutting and measuring—practical challenges that put to test even the most quantitatively literate.

5. FOUR SOURCES OF SUCCESS

When one asks what may have led to the remarkable success of How to Lie with Statistics, it is natural to consider four likely sources: the title, the illustrations, the style and the intellectual content. In each case, from the perspective of fifty years, one finds more than might have originally met the eye.

The Title

Many statisticians are uncomfortable with Huff’s title. We spend much of our lives trying to persuade others of the importance and integrity of statistical analysis, and we are naturally uncomfortable with the suggestion that statistics can be used to craft an intentional lie. Nevertheless, the suggestion is valid. People do lie with statistics every day, and it is to Huff’s credit that he takes the media (and others) to task for having stretched, torn or mutilated the truth. Irving Geis, the illustrator of How to Lie with Statistics, also said [5] what many have thought: “Huff could have well titled it An Introduction to Statistics and it would have sold a few hundred copies for a year or two. But with that title, it’s been selling steadily since 1954.” Indeed, the title has done more than simply stand the test of time; it has been honored through the years by other authors who pursued their own variations, including How to Lie with . . . charts [24], maps [26], methodology [31] and even “Your Mac” [2].

The Illustrations—and the Illustrator

Although it is now commonplace to see cartoons in serious books, it was not always so. Huff and his illustrator, Irving Geis, helped to pave the way. As an experienced magazine editor, Huff may—or may not— have foreseen this development, but what Huff surely saw was the brilliance of his partner. Irving Geis was born in New York in 1908 as Irving Geisberg [1], but he attended high school in Anderson,

South Carolina and studied architecture at Georgia Tech before going to the University of Pennsylvania, where he obtained a Bachelor of Fine Arts degree in 1929. He found success as a magazine illustrator, and the first pictures that many of us ever saw of Sputniks orbiting, continents drifting or double helixes dividing, were the work of Irving Geis. Nevertheless, it was through his illustrations of complex molecules that Geis assured his own place in history. His involvement began with a 1961 Scientific American article by John Kendrew where Geis illustrated the first protein crystal structure to be discovered, that of sperm whale myoglobin. Large molecules provided Geis with a perfect venue for his talents, and he continued to illustrate them for the rest of his life. He died on July 22, 1997, and, in his funeral oration, Richard Dickerson [5] called Geis the “Leonardo da Vinci of protein structure.”

The Style

In a word, Huff’s style was—breezy. A statistically trained reader may even find it to be breezy to a fault, but such a person never was part of Huff’s intended audience. My copy of The Complete Idiot’s Guide to the Roman Empire [27] is also breezy, but, since I am not a historian, that is just the way I like it. We all know now (history has taught us!) that many subjects can be made more accessible when they are lightened up with cartoons and an over-the-top casualness. Fifteen years of Idiot, Dummy and Cartoon guides have shown that this formula works—not every time—but often enough to guarantee that such guides will be bookstore staples for many years to come. It would require careful bibliographical research to determine how much credit Huff deserves for this formula, but even a quick look will show that many of the elements of the modern formula are already present in How to Lie with Statistics. In the publishing field, this is what one means by pioneering, original work.

The Content

A great title, great illustrations and chatty quips will quickly run out of steam unless they are used in support of genuine intellectual content. The first four chapters (The Sample with the Built-in Bias, The Well-Chosen Average, The Little Figures That Are Not There and Much Ado about Practically Nothing) deal with material that is covered in any introductory statistics class, so to statisticians there is not much that is original here—or is there? These chapters take only forty-nine (breezy, cartoon-filled) pages, yet I suspect that many

of us would be content to know that our students could be certain to have a mastery of these chapters a year after having completed one of our introductory classes. The next three chapters (The Gee-Whiz Graph, The One-Dimensional Picture and The Semiattached Figure) deal with graphics, and to me these are the most original in the book, which, incidentally, I first read as a high school student in Lubbock, Texas. What struck me then—and still strikes me now—was the utter devilishness of the “crescive cow” illustration, which uses a 1936 cow that is 25/8 times as tall as an 1860 cow to demonstrate the growth of the US cow population to 25 from 8 million between 1860 and 1936. Since we intuitively judge a cow more by her volume than by her height, this illustration is massively, yet slyly, deceptive. Today the graduate text for such insights would be Tufte’s beautiful book The Visual Display of Quantitative Information [32].

Huff’s next two chapters are more heterogeneous. Chapter 8, Post Hoc Rides Again, gives just a so-so discussion of cause and effect, but Chapter 9 makes up for the lull. In How to Statisticulate, Huff takes on an issue that statisticians seldom discuss, even though they should. When we find that someone seems to be “lying with statistics,” is he really lying or is he just exhibiting an unfortunate incompetence? Huff argues that it is often simple, rock-bottom, conniving dishonesty, and I believe that Huff is right.

Huff’s last chapter is How to Talk Back to a Statistic, and it gets directly to what these days we cover in courses on critical thinking. He boils his technique down to just five questions: Who says so? How does he know? What’s missing? Did someone change the subject? Does it make sense? Today anyone can test the effectiveness of these questions simply by checking how well they deal with problems such as those collected by Best [3], Levitt and Dubner [25] and Paulos [28, 29].

6. ABOUT THE SPECIAL ISSUE

This special section of Statistical Science collects seven further articles that address issues with which Huff and Geis would surely have had a natural rapport:

• Joel Best: Lies, Calculations, and Constructions: Beyond How to Lie with Statistics
• Mark Monmonier: Lying with Maps
• Walter Krämer and Gerd Gigerenzer: How to Confuse with Statistics or: The Use and Misuse of Conditional Probabilities
• Richard De Veaux and David Hand: How to Lie with Bad Data

• Charles Murray: How to Accuse the Other Guy of Lying with Statistics
• Sally Morton: Ephedra
• Stephen E. Fienberg and Paul C. Stern: In Search of the Magic Lasso: The Truth About the Polygraph.

The first four of these articles explore a remarkable variety of newly discovered pathways by which statistical lies continue to flow into our collective consciousness. The fifth piece, by Charles Murray, then amusingly explores the Swiftian proposal that young social scientists may be somehow secretly coached in subtle techniques for suggesting that someone else might be lying. Finally, in the last two pieces, we see how statistics is used (or not used) for better or worse in the public policy domain. Morton reprises the role of statistics in the regulatory background of the controversial herb ephedra, and Fienberg and Stern take on the technology of lie detection, where great expectations and empirical evidence face irreconcilable differences.

ACKNOWLEDGMENTS

I am especially pleased to thank Frances Marie Huff for her willingness to discuss the life and career of Darrell Huff. I also wish to thank Ed George, Dean Foster and Milo Schield for their advice and encouragement at the beginning of this project, Paul Shaman and Adi Wyner for their comments on earlier drafts, and George Casella for giving me the original green light.

REFERENCES

[1] ANONYMOUS (1998). Obituaries, ’20s. The Pennsylvania Gazette, University of Pennsylvania Alumni Magazine, April.
[2] BEAMER, S. (1994). How to Lie With Your Mac. Hayden, Indianapolis.
[3] BEST, J. (2001). Damned Lies and Statistics. Untangling Numbers from the Media, Politicians, and Activists. Univ. California Press, Berkeley.
[4] CRANE, B. (1954). The Business Bookshelf. The New York Times, January 4, p. 41.
[5] DICKERSON, R. E. (1997). Irving Geis, molecular artist, 1908–1997. Protein Science 6 2483–2484.
[6] ETHRIDGE, J. M. and KOPALA, B., eds. (1967). Contemporary Authors. A Bio-Bibliographical Guide to Current Authors and their Works. Gale Research Company, Detroit.
[7] HUFF, D. (1944). Pictures by Pete. A Career Story of a Young Commercial Photographer. Dodd, Mead, New York.
[8] HUFF, D. (1945). Twenty Careers of Tomorrow. Whittlesey House, McGraw–Hill, New York.
[9] HUFF, D. (1946). The Dog that Came True (illust. C. Moran and D. Thorne). Whittlesey House, McGraw–Hill, New York.
[10] HUFF, D. (1954). How to Lie with Statistics (illust. I. Geis). Norton, New York.
[11] HUFF, D. (1954). How to Spot Statistical Jokers. The New York Times, August 22, p. SM13.
[12] HUFF, D. (1959). How to Take a Chance (illust. I. Geis). Norton, New York.
[13] HUFF, D. (1961). Score. The Strategy of Taking Tests (illust. C. Huff). Appleton–Century Crofts, New York.
[14] HUFF, D. (1962). Living high on $6500 a year. The Saturday Evening Post 235 60–62. [Reprinted in Mother Earth News, January 1970.] Available online at www.motherearthnews.com/mothers_library/1970.
[15] HUFF, D. (1964). Cycles in Your Life—The Rhythms of War, Wealth, Nature, and Human Behavior. Or Patterns in War, Wealth, Weather, Women, Men, and Nature (illust. A. Kovarsky). Norton, New York.
[16] HUFF, D. (1968). How to Work With Concrete and Masonry (illust. C. and G. Kinsey). Popular Science Publishing, New York.
[17] HUFF, D. (1972). How to Figure the Odds on Everything (illust. J. Huehnergarth). Dreyfus, New York.
[18] HUFF, D. (1972). How to Save on the Home You Want (with F. Huff and the editors of Dreyfus Publications; illust. R. Doty). Dreyfus, New York.
[19] HUFF, D. (1996). The Complete How to Figure It. Using Math in Everyday Life (illust. C. Kinsey; design K. M. Huff). Norton, New York.
[20] HUFF, D. and COREY, P. (1957). Home Workshop Furniture Projects. Fawcett, New York.
[21] HUFF, D. and HUFF, F. (1963). How to Lower Your Food Bills. Your Guide to the Battle of the Supermarket. Macfadden–Bartell, New York.
[22] HUFF, D. and HUFF, F. (1970). Complete Book of Home Improvement (illust. G. and C. Kinsey and Bray–Schaible Design, Inc.). Popular Science Publishing, New York.
[23] HUFF, F. M. (1973). Family Vacations. More Fun For Less Money (with D. Huff and the editors of Dreyfus Publications; illust. J. Huehnergarth). Dreyfus, New York.

Statistical Science 2005, Vol. 20, No. 3, 210–214 DOI 10.1214/088342305000000232 © Institute of Mathematical Statistics, 2005

Lies, Calculations and Constructions: Beyond How to Lie with Statistics

Joel Best

Abstract. Darrell Huff’s How to Lie with Statistics remains the best-known, nontechnical call for critical thinking about statistics. However, drawing a distinction between statistics and lying ignores the process by which statistics are socially constructed. For instance, bad statistics often are disseminated by sincere, albeit innumerate advocates (e.g., inflated estimates for the number of anorexia deaths) or through research findings selectively highlighted to attract media coverage (e.g., a recent study on the extent of bullying). Further, the spread of computers has made the production and dissemination of dubious statistics easier. While critics may agree on the desirability of increasing statistical literacy, it is unclear who might accept this responsibility.

Key words and phrases: Darrell Huff, social construction, statistical literacy.

Joel Best is Professor and Chair, Department of Sociology & Criminal Justice, University of Delaware, Newark, Delaware 19716-2580, USA (e-mail: joelbest@udel.edu).

In the spring of 1965, I was a freshman taking Sociology 25, the introductory course in social statistics at the University of Minnesota. One day the TA in charge of our lab mentioned that this stuff could actually be interesting. There was, he said, a pretty good book called How to Lie with Statistics. I perked up; any book with that title promised to be fun. As a high-school debater, I’d had a favorite opening for rebuttals: “Disraeli1 said, ‘There are lies, damned lies, and statistics.’ While I certainly don’t want to accuse our opponents of lying, they have presented a lot of statistics. . . .” I checked Darrell Huff’s little book out of the library and I’d have to say it made as big an impression on me as anything else I read during my freshman year.

1 This aphorism also gets attributed to Mark Twain. So far as I know, no one has been able to locate it in Disraeli’s writings, but it does appear in Twain’s autobiography, where Twain ascribes it to Disraeli. Given that Twain was not unwilling to take credit for a funny line, I had come to assume that he at least believed that it originated with Disraeli. However, Peter M. Lee of the University of York’s Department of Mathematics has traced the aphorism to Courtney’s (1895) reference to “. . . the words of the Wise Statesman, ‘Lies–damned lies–and statistics’ . . .” (for a full discussion, see Lee’s Web page: www.york.ac.uk/depts/maths/histstat/lies.htm).

I recommended the book to friends and, once I began teaching sociology myself, to countless students. I don’t think I read it again until the early 1990s. By that time, I’d encountered other, more sophisticated books on related topics, such as John Allen Paulos’ Innumeracy (1988), Edward Tufte’s The Visual Display of Quantitative Information (1983) and Mark Monmonier’s How to Lie with Maps (1996). How to Lie with Statistics remained a wonderful primer but, as a sociologist, I now realized that there was much more to say. In particular, I’d become interested in the role statistics play in drawing attention to social problems.

During the early 1980s, the campaign to call attention to the problem of missing children used a simple, familiar recipe to mobilize public concern: (1) present terrifying examples (e.g., the most notorious case involved a six-year-old boy who wandered away from his mother in the local mall and disappeared until, weeks later, the authorities recovered part of his body); (2) explain that this example is but one instance of a larger problem and name that problem (e.g., that boy was a missing child); and (3) give shocking statistics about the problem’s extent (e.g., each year, activists claimed, there are nearly two million cases of missing children, including 50,000 abducted by strangers). It was years
before reporters began to challenge these widely circulated numbers, in spite of their obvious implausibility. (At that time, there were roughly 60 million Americans under age 18. Was it really possible that one in thirty—think of a child from every schoolroom in the nation—went missing each year?)

Once I’d noticed the three-part (atrocity tale/problem name/inflated statistic) recipe for problem building, I began to appreciate just how often it was used. To be sure, the bad guys—that is, those with whom I disagreed—regularly adopted this combination of claims to arouse public opinion. But then, so did advocates for positions I personally supported. And, while undoubtedly some claims featuring bad statistics were disingenuous—Huffian lies, as it were—others seemed to be sincere—albeit innumerate—claims. People trying to draw attention to some social problem tend to be convinced that they’ve identified a big, serious problem. When they come upon a big numeric estimate for the problem’s size, they figure it must be about right, so they decide to repeat it. Since everybody in this process—the advocates making the claims, the reporters covering the story, and the audience for this media coverage—is likely to be more-or-less innumerate, it is easy for bad numbers—especially bad big numbers—to spread. And, of course, in today’s world the Internet guarantees a figure’s continued circulation. Ridiculous statistics live on, long after they’ve been thoroughly debunked; they are harder to kill than vampires.

THE TROUBLE WITH LYING

In at least one way, Huff’s book may have made things worse. His title, while clever and—at least to former debaters—appealing, suggests that the problem is lying. Presumably lying with statistics involves knowingly spreading false numbers, or at least deceptive figures. Others have followed Huff’s lead. A surprisingly large number of book titles draw a distinction between statistics and lies. In addition to How to Lie with Statistics [also, How to Lie with Charts (Jones, 1995), How to Lie with Maps (Monmonier, 1996), etc.], we have How to Tell the Liars from the Statisticians (Hooke, 1983), The Honest Truth about Lying with Statistics (Homes, 1990), How Numbers Lie (Runyon, 1981), Thicker than Blood: How Racial Statistics Lie (Zuberi, 2001), and (ahem) my own Damned Lies and Statistics (Best, 2001) and More Damned Lies and Statistics (Best, 2004). Other books have chapters on the theme: “Statistics and

Damned Lies” (Dewdney, 1993), “Lying with Statistics” (Gelman and Nolan, 2002), and so on. Folk wisdom draws on the same theme: “Figures may not lie, but liars figure”; “You can prove anything with statistics.” You get the idea: there are good statistics, and then there are bad lies. Let’s call this the statistic-or-lie distinction. Of course, this is an appealing interpretation, particularly when the numbers bear on some controversy. I have statistical evidence. My opponent (the weasel) has lies. It has been my experience that almost everyone enjoys criticizing the other guy’s bad statistics. I have appeared on conservative radio talk shows where the hosts focused on dubious figures promoted by liberals, and I have been on shows with liberal hosts (they do exist!) who pointed to the bad numbers favored by conservatives. Our critical faculties come into play when we confront a statistic that challenges what we believe; we become analytic tigers pouncing on problems of improper sampling, dubious measurements, and so on. On the other hand, we tend to be more forgiving when we encounter numbers that seem to support what we’d like to think. Oh, maybe our figures aren’t perfect, but they’re certainly suggestive, so let’s avoid quibbling over minor matters. . . . It is my impression that the statistic-or-lie distinction is often implicitly endorsed in statistics instruction. Statistics courses naturally gravitate toward matters of calculation; after mastering each statistic, the class moves on to the next, more complicated one. If “lies” are mentioned, it tends to be in terms of “bias.” That is, students are warned that there are biased people who may deliberately choose to calculate statistics that will lend support to the position they favor. This reduces lying to a variant of the larger problem of bias—simply another flaw to be avoided in producing sound calculations. As a sociologist, I am not sure that the statistic-orlie distinction is all that helpful. It makes an implicit claim that, if statistics are not lies, they must be true— that is, really true in some objective sense. The image is that statistics are real, much as rocks are real, and that people can gather statistics in the way that rock collectors pick up stones. After all, we think, a statistic is a number, and numbers seem solid, factual, proof that somebody must have actually counted something. But that’s the point: people count. For every number we encounter, somebody had to do the counting. Instead of imagining that statistics are like rocks, we’d do better to think of them as being like jewels. Gemstones

may be found in nature, but people have to create jewels. Jewels must be selected, cut, polished and placed in settings so that they can be viewed from particular angles. In much the same way, people create statistics: They choose what to count, how to go about counting, which of the resulting numbers they will share with others, and which words they will use to describe and interpret those figures. Numbers do not exist independently of people; understanding numbers requires knowing who counted what, why they bothered counting and how they went about it.

SOCIAL CONSTRUCTION AND STATISTICS

This is what sociologists mean when they speak of social construction. I know this term has gotten a bad rap. After being introduced by the sociologists Peter Berger and Thomas Luckmann in their 1966 book, The Social Construction of Reality, the notion of social construction was hijacked and put to all sorts of uses—some of them rather silly—by an assortment of literary critics and postmodernist thinkers. Ignore all that. Berger and Luckmann’s key observation is very simple: Without doubting that the real world (rocks and such) exists, it remains true that we understand that world through language, and we learn words and their meanings from other people, so our every thought is shaped by our culture’s system for categorizing the world. This means that everything we know is necessarily a social construction. Calling something a social construction doesn’t mean that it is false or arbitrary or wrong. When I think, “This rock is hard,” my notions of rockness and hardness derive from my culture, they are social constructions. But this does not imply that the thought is false or illusionary, that other members of my culture won’t agree that it’s a hard rock, or that if I whack my head with the rock, it won’t hurt. Much of what we know—of our social constructions—provides essential help in getting through the world. In my view, it helps to think about statistics in terms of construction, as well as calculation. Understand: I am not suggesting we replace the statistic-or-lie distinction with a calculation-or-construction distinction. Rather, my point is that every number is inevitably both calculated and constructed, because counting is one of the ways we construct the world. Anyone who has done any sort of research is painfully aware that this is true. All research involves choosing what to study and how to study it. This is why scientists include methods sections in their papers. When we say that science is a social construction, this does not mean that science is

fanciful or arbitrary; instead, it means that scientific knowledge is the result of people’s work. So, what do we gain when we think about statistics as socially constructed? For one thing, we can get past the statistic-or-lie distinction. Talking about lies leads us to concentrate on whether people knowingly, deliberately say things they know aren’t true. Thus: “Those tobacco executives knew full well that smoking was harmful; we can prove this because we have uncovered internal memoranda that make it clear they knew this; therefore they were lying when they said smoking was harmless.” Well, yes. But few bad statistics involve this sort of egregious bad faith. Misplaced enthusiasm is probably responsible for more dubious figures than conscious lying. Consider the case of anorexia deaths. Someone active in the struggle against eating disorders estimated that perhaps 150,000 Americans suffer from anorexia nervosa, and noted that this disease can be fatal (Sommers, 1994). Someone else—probably inadvertently—garbled this claim and announced that anorexia kills 150,000 each year. This dramatic number was repeated in best-selling books, in news stories and—here I speak from experience—countless term papers. It was a patently ridiculous number: most anorexics are young women; the annual death toll from all causes for women aged 15–44 was about 55,000; so what were the odds that 150,000 of those 55,000 were anorexic? Yet, were the various advocates, authors and journalists who repeated this very dubious number lying? I presume most of them thought it was true. After all, they believed that anorexia is a big problem, and 150,000 is a big number; moreover, other people said that was the right number, so why not repeat it? Does it help to call the sincere, albeit credulous, dissemination of a bad number a lie? Or what about a recent, widely publicized report that 30% of students in sixth through tenth grades have moderate or frequent involvement in bullying? This was the highlighted finding from an article in the Journal of the American Medical Association (Nansel et al., 2001), mentioned in the article’s abstract, in JAMA’s news release about the article, and in the extensive media coverage that resulted (Best, 2004). This article survived the peer review process in one of medicine’s premier journals; the study, conducted by researchers in the federal government, surveyed more than 10,000 students. But of course the researchers had to make choices when analyzing their data. Respondents were asked whether they had been bullied or had themselves bullied others and, if so, how often. Bullying that was

reported occurring “sometimes” was designated “moderate,” while bullying at least once a week was labeled “frequent.” This produced a pie of data that could be sliced in various ways. The researchers carved the data to show that 30% of the students reported moderate or frequent involvement in bullying. But consider other possible slices: “involvement” meant either as a bully or a bullying victim; only 17% reported being victims of moderate or frequent bullying; and only 8% reported being victims of frequent bullying. All of this information is included in the text of the article. In other words, the claim that the study found 30% of students with moderate or frequent involvement in bullying was no lie. But it would have been equally true to state that 8% were frequent victims of bullying. The former statement was featured in the abstract and the press release; the latter was buried in the article.

We can imagine that everyone involved in disseminating the results of this study—the newspaper editors trying to decide whether to run a story about this research, the wire-service reporter trying to write a story that would seem newsworthy, JAMA’s editors preparing news releases about that week’s articles, the authors hoping that their paper would be accepted by a top-tier journal and that their research would attract attention, even the funders who wanted to feel that their money had been well spent—found a statistic that implicated 30% of students in bullying more compelling than one that suggested 8% were frequent targets of bullies. If there is publication bias against studies with negative findings, so, too, is there a publicity bias favoring studies with dramatic results. But drawing a distinction between statistics and lies ignores this pattern in disseminating research results.

TOWARD STATISTICAL LITERACY?

While many of Huff’s critiques remain perfectly applicable to contemporary statistics, there have been important developments during the intervening 50 years. In particular, personal computers have transformed the production and dissemination of statistics. The PC’s effects—including inexpensive software for generating sophisticated statistical analyses, bundled spreadsheet programs that allow users to create an extraordinary array of graphs and charts and professional designers able to create eye-catching graphics—have democratized the means of statistical production. Philosophers speak of the Law of the Instrument (originally stated, in an era less concerned with sexism, as: “If you give a small boy a hammer, he’ll find things to pound.”).


Tens of millions of people have been given statistical and spreadsheet software. We can hardly be surprised that we find ourselves surrounded by statistical presentations. Interpreting these numbers, however, requires two distinct sets of statistical skills. The first set concerns matters of calculation—the sort of lessons taught in statistics classes. But in order to assess, to criticize those numbers, we also need to appreciate issues of construction. That is, we need to worry about how statistics were brought into being. Who did the counting? What did they decide to count, and why? How did they go about it? There is a great deal of discussion these days about the desirability of increasing numeracy, quantitative literacy and particularly statistical literacy. Who can disagree? Certainly, part of the problem is that many people aren’t particularly adept at calculation. But, I would argue, genuine statistical literacy requires that people also become more alert to what I’ve called matters of construction. Anyone who reads the newspaper can find examples of stat wars, debates over social issues in which opponents lob competing numbers at each other. Statistical literacy ought to help people assess such competing claims, but that requires more than teaching them how to calculate and warning them to watch out for liars. It would help to also understand something about the place of statistics in contemporary policy rhetoric, about the processes by which numbers get produced and circulated and so on. But who’s going to teach these lessons? Here I think we might pause to consider the lessons from the “critical thinking” movement that became fashionable in academia in the late 1980s and early 1990s. It is no wonder that the cause of critical thinking gained widespread support. After all, virtually all academics consider themselves critical thinkers, and they would agree that their students need to become better critical thinkers. Yet, if you track the numbers of articles about critical thinking in the education literature, you discover a steep rise in the late 1980s, but then a peak, followed by a decline. This is a pattern familiar to sociologists: these dynamics characterize—in fact define—the fad. The celebration of critical thinking turned out to be just one more academic fad, a shortlived enthusiasm. Why, if everyone agreed that critical thinking was very important, did interest in the topic fade? I think the answer is that no one assumed ownership of the critical thinking problem. Sociologists interested in how

particular social issues gain and then lose public attention argue that an issue’s survival depends on someone assuming ownership of the problem, so that there are continuing, active efforts to keep it in the public eye. In the case of critical thinking, no discipline stepped up and took responsibility for teaching critical thinking. Rather, teaching critical thinking was seen as everybody’s responsibility, and that meant, in effect, that nobody was especially responsible for it. Without an intellectual owner to promote it, critical thinking slipped quietly from view, to be replaced by the next next thing. So—even if we agree that statistical literacy is important, and that we need to teach these skills, we still need to figure out who is going to do that teaching. I speak as an outsider, but I doubt that it will be statisticians. The statistics curriculum is based on mastering ever more complex matters of calculation. It may be desirable for students to learn, say, the principles for making good pie charts, but few Ph.D.s in statistics will be eager to teach those lessons. Statisticians are likely to consider teaching courses in statistical literacy beneath their talents, just as professors of English literature tend to avoid teaching freshman composition. Even though I am a sociologist who believes that the idea of social construction has much to contribute to the cause of statistical literacy, I also doubt that sociologists will claim ownership of statistical literacy. After all, statistical literacy is only tangentially related to sociologists’ core concerns. Similar reactions can be expected from psychologists, political scientists, and people in other disciplines. In other words, its advocates are likely to wind up agreeing that statistical literacy is important, so important that it needs to be taught throughout the curriculum. Once we reach that agreement, we will be well along the faddish trajectory taken by critical thinking.

We all know statistical literacy is an important problem, but we’re not going to be able to agree on its place in the curriculum. Which means that How to Lie with Statistics is going to continue to be needed in the years ahead.

REFERENCES

BERGER, P. L. and LUCKMANN, T. (1966). The Social Construction of Reality. Doubleday, Garden City, NY.
BEST, J. (2001). Damned Lies and Statistics. Untangling Numbers from the Media, Politicians, and Activists. Univ. California Press, Berkeley.
BEST, J. (2004). More Damned Lies and Statistics. How Numbers Confuse Public Issues. Univ. California Press, Berkeley.
COURTNEY, L. H. (1895). To my fellow-disciples at Saratoga Springs. The National Review (London) 26 21–26.
DEWDNEY, A. K. (1993). 200% of Nothing. Wiley, New York.
GELMAN, A. and NOLAN, D. (2002). Teaching Statistics. A Bag of Tricks. Oxford Univ. Press.
HOMES, C. B. (1990). The Honest Truth about Lying with Statistics. Charles C. Thomas, Springfield, IL.
HOOKE, R. (1983). How to Tell the Liars from the Statisticians. Dekker, New York.
HUFF, D. (1954). How to Lie with Statistics. Norton, New York.
JONES, G. E. (1995). How to Lie with Charts. Sybex, San Francisco.
MONMONIER, M. (1996). How to Lie with Maps, 2nd ed. Univ. Chicago Press, Chicago.
NANSEL, T. R. et al. (2001). Bullying behaviors among U.S. youth: Prevalence and association with psychosocial adjustment. J. American Medical Assoc. 285 2094–2100.
PAULOS, J. A. (1988). Innumeracy. Mathematical Illiteracy and Its Consequences. Hill and Wang, New York.
RUNYON, R. P. (1981). How Numbers Lie. Lewis, Lexington, MA.
SOMMERS, C. H. (1994). Who Stole Feminism? Simon and Schuster, New York.
TUFTE, E. (1983). The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT.
ZUBERI, T. (2001). Thicker than Blood. How Racial Statistics Lie. Univ. Minnesota Press, Minneapolis.

Statistical Science 2005, Vol. 20, No. 3, 215–222 DOI 10.1214/088342305000000241 © Institute of Mathematical Statistics, 2005

Lying with Maps

Mark Monmonier

Abstract. Darrell Huff’s How to Lie with Statistics was the inspiration for How to Lie with Maps, in which the author showed that geometric distortion and graphic generalization of data are unavoidable elements of cartographic representation. New examples of how ill-conceived or deliberately contrived statistical maps can greatly distort geographic reality demonstrate that lying with maps is a special case of lying with statistics. Issues addressed include the effects of map scale on geometry and feature selection, the importance of using a symbolization metaphor appropriate to the data and the power of data classification to either reveal meaningful spatial trends or promote misleading interpretations.

Key words and phrases: Classification, deception, generalization, maps, statistical graphics.

Mark Monmonier is Distinguished Professor of Geography, Maxwell School of Citizenship and Public Affairs, Syracuse University, Syracuse, New York 13244-1020, USA (e-mail: [email protected]. edu).

1. INTRODUCTION

I never met Darrell Huff, but his insightful little book How to Lie with Statistics was a favorite long before I appropriated the first four words of its title for How to Lie with Maps, published in 1991. I don’t recall when I first became aware of Huff’s book—the oldest of two copies in my library is the 25th printing—but its title was irresistible. Equally intriguing were Huff’s straightforward examples, all served up in good humor, of how an unscrupulous or naive statistician could manipulate numbers and graphs to spin a questionable if not downright misleading interpretation of a correlation or time series. In the mid 1980s, when I taught a course titled Information Graphics, How to Lie with Statistics provided an engaging supplemental reading.

Huff’s approach was as much an inspiration as his title. I already had the kernel of How to Lie with Maps in my comparatively obscure Maps, Distortion, and Meaning, published in 1977 by the Association of American Geographers as a “Resource Paper” for the Commission on College Geography. Information theory and communication models provided a conceptual framework for an illustrated excursion into the roles of map scale, projection, symbolization, and classification in cartographic generalizations of geographic data—hardly light material. Written with upper-division college students in mind, Maps, Distortion, and Meaning supplemented its 51 letter-size pages of academic prose and real-world examples with a bibliography listing 92 books and articles. By contrast, the first edition of How to Lie with Maps gleefully indulged in contrived Huffian examples and blithely ignored the scholarly record—a deficiency rectified five years later when the University of Chicago Press commissioned an expanded edition that added 72 relevant references, chapters on multimedia and national mapping programs, and four pages of color illustrations.

Huff’s footsteps offered an easy trek through the forest of popular academic publishing. In addition to providing the conceptual model for an exposé of representational sleight of hand, How to Lie with Statistics attracted the benevolent eye of reviewers like John Swan (1992), who situated my book “in the fine tradition of Darrell Huff’s How to Lie with Statistics,” and Scott Kruse (1992), who opined that “what Huff did for statistics, Monmonier has done for cartography.” Quoting favorable reviews might sound boorishly vain, but these excerpts demonstrate that Huff’s book was not only well-known but an exemplar worth imitating.

Lying with maps is, of course, a lot different from lying with statistics. Most maps are massive reductions of the reality they represent, and clarity demands that much of that reality be suppressed. The mapmaker who tries to tell the whole truth in a single map typically produces a confusing display, especially if the area is large and the phenomenon at least moderately complex. Map users understand this and trust the mapmaker to select relevant facts and highlight what’s important, even if the map must grossly distort the earth’s geometry as well as lump together dissimilar features. When combined with the public’s naive acceptance of maps as objective representations, cartographic generalization becomes an open invitation to both deliberate and unintentional prevarication.

At the risk of stretching the notion of lying, I’m convinced that inadvertent fabrication is far more common these days than intentional deceit. Moreover, because most maps now are customized, one-of-a-kind graphics that never make it into print or onto the Internet, prevaricating mapmakers often lie more to themselves than to an audience. Blame technology—a conspiracy between user-friendly mapping software (or not-so-user-friendly geographic information systems) and high-resolution laser printers that can render crisp type and convincing symbols with little effort or thought. There’s a warning here I’m sure Darrell Huff would applaud: watch out for the well-intended mapmaker who doesn’t understand cartographic principles yet blindly trusts the equally naive software developer determined to give the buyer an immediate success experience—default settings are some of the worst offenders. Because lying with maps is so easy in our information-rich world, infrequent mapmakers need to understand the pitfalls of map generalization and map readers need to become informed skeptics.

As this essay suggests, maps can lie in diverse ways. Among the topics discussed here are the effects of map scale on geometry and feature selection, the importance of using a symbolization metaphor appropriate to the data and the power of data classification to reveal meaningful spatial trends or promote misleading interpretations.

2. SELECTIVE TRUTH

An understanding of how maps distort reality requires an appreciation of scale, defined simply as the ratio of map distance to ground distance. For example, a map at 1:24,000, the scale of the U.S. Geological Survey’s most detailed topographic maps, uses a one-inch line to represent a road or stream 24,000 inches (or 2,000 feet) long. Ratio scales are often reported as fractions, which account for distinctions between “large-scale” and “small-scale.” Thus a quadrangle map showing a small portion of a county at 1/24,000 is very much a large-scale map when compared, for instance, to an atlas map showing the whole

world at 1/75,000,000—a markedly smaller fraction. (Planners and engineers sometimes confuse scale and geographic scope, the size of the area represented. It might seem counterintuitive that small-scale maps can cover vast regions while large-scale maps are much more narrowly focused, but when the issue is scale, not scope, “large” means comparatively detailed whereas “small” means highly generalized.) Mapmakers can report a map’s scale as a ratio or fraction, state it verbally using specific distance units— “one inch represents two miles” is more user friendly than 1:126,720—or provide a scale bar illustrating one or more representative distances. Bar scales, also called graphic scales, are ideal for large-scale maps because they promote direct estimates of distance, without requiring the user to locate or envision a ruler. What’s more, a graphic scale remains true when you use a photocopier to compress a larger map onto letter-size paper. Not so with ratio or verbal scales. However helpful they might be on large-scale maps, bar scales should never appear on maps of the world, a continent, or a large country, all of which are drastically distorted in some fashion when coastlines and other features are transferred from a spherical earth to a flat map. Because of the stretching and compression involved in flattening the globe, the distance represented by a one-inch line can vary enormously across a world map, and scale can fluctuate significantly along, say, a six-inch line. Because map scale varies not only from point to point but also with direction, a bar scale on a small-scale map invites grossly inaccurate estimates. Fortunately for hikers and city planners, earth curvature is not problematic for the small areas shown on large-scale maps; use an appropriate map projection, and scale distortion is negligible. What’s not negligible on most large-scale maps is the generalization required when map symbols with a finite width represent political boundaries, streams, streets and railroads. Legibility requires line symbols not much thinner than 0.02 inch. At 1:24,000, for instance, a 1/50-inch line represents a corridor 40 feet wide, appreciably broader than the average residential street, rural road or single-track railway but usually not troublesome if the mapmaker foregoes a detailed treatment of driveways, property lines, rivulets and rail yards. At 1:100,000 and 1:250,000, which cartographers typically consider “intermediate” scales, symbolic corridors 166.7 and 416.7 feet wide, respectively, make graphic congestion ever more likely unless the mapmaker weeds out less significant features, simplifies complex curves and displaces otherwise overlapping symbols.
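The arithmetic behind these corridor widths is easy to check directly: the ground distance covered by a map symbol is just the symbol's width multiplied by the scale denominator. The short Python sketch below is not part of Monmonier's article; it merely reproduces the figures quoted above, taking the 1/50-inch line as the legibility threshold mentioned in the text.

```python
# Sketch of the scale arithmetic discussed above: ground distance covered by a
# map symbol equals the symbol's width times the scale denominator.

def ground_feet(map_inches, scale_denominator):
    """Ground distance (feet) represented by a map distance (inches) at 1:scale_denominator."""
    return map_inches * scale_denominator / 12.0  # 12 inches per foot

LINE_WIDTH_IN = 1 / 50  # a legible 0.02-inch line symbol

for denom in (24_000, 100_000, 250_000):
    width_ft = ground_feet(LINE_WIDTH_IN, denom)
    print(f"1:{denom:,}: a 1/50-inch line spans a corridor {width_ft:.1f} feet wide")

# A verbal scale is just another statement of the same ratio:
# "one inch represents two miles" -> 2 * 5280 * 12 = 126,720, i.e., 1:126,720.
print("one inch represents two miles =", 2 * 5280 * 12)
```

Running the sketch prints corridor widths of 40, 166.7 and 416.7 feet, matching the values given in the paragraph above.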

LYING WITH MAPS

217

FIG. 1. Juxtaposition of map excerpts at 1:24,000 (above) and 1:250,000, enlarged to 1:24,000 (below), illustrate some of the effects of scale on cartographic generalization. Both images show the same area, in and around Spring Mills, Maryland.

FIG. 2. Crude birth rates, 2000, by state, based on equal-intervals cut-points and plotted on a visibility base map.

Figure 1 illustrates the effect of cartographic generalization on the U.S. Geological Survey’s treatment of Spring Mills, Maryland (south of Westminster) at scales of 1:24,000 and 1:250,000. Both excerpts cover the same area, but the upper panel is a same-size black-and-white excerpt from the larger-scale, 1:24,000 map, whereas the lower panel shows the corresponding portion of the 1:250,000 map enlarged to 1:24,000 to reveal the impact of noticeably wider symbolic corridors. At the smaller scale the hamlet of Spring Mills becomes an open circle, rather than a cluster of buildings, and the railroad and main highway are moved apart for clarity. Mapmakers compiling intermediate-scale maps typically select features from existing large-scale maps. When the difference between scales is substantial, as it is here, few features survive the cut, and those that do are usually smoothed or displaced. “White lies” like these are unavoidable if maps are to tell the truth without burying it in meaningless details. In a similar vein mapmakers use tiny picnic-bench symbols to locate public parks and small, highly simplified single-engine airplanes to represent airports. These icons work because they’re readily decoded, even without a map key. Legends and labels also help, especially for small-scale reference maps, on which mere points or circles substitute for complex city boundaries.

A geometric distortion especially useful in portraying statistical data for the United States is the “visibility base map” (Figure 2), which replaces the contorted outlines of Maine and Massachusetts with simplified five- and thirteen-point polygons, instantly recognizable because of their relative location and characteristic shape. Although simplified polygons can lighten the computational burden of real-time cartographic animation, the prime goal is to help viewers of small, column-width choropleth maps see and decode the otherwise obscure area symbols representing rates or other statistics for small states like Delaware and Rhode Island. (Choropleth map is the cartographic term for a map based on established areal units, like states or census tracts, grouped into categories, each represented by a specific color or graytone.) While purists might object to the visibility map’s caricatured shapes and grossly generalized coastlines, this type of simplification is no more outrageous than summarizing a spatially complex entity like California or New York with a statewide average.

3. CUT-POINTS AND FRIVOLOUS FILLS

Statistical data like the spatial series of birth rates in Figures 2 and 3 are easily distorted when mapmakers succumb to a software vendor’s sense of what works without probing the data to discover what’s meaningful. Whenever mapping software serves up an instant, no-thought, default classification for a choropleth map, the usual result is five categories based on either equal-intervals or quantile classing. The method of grouping is almost always more problematic than the number of


FIG. 3. Crude birth rates, 2000, by state, based on quantile cut-points and plotted on a visibility base map.

groups: unless the data contain fewer or only slightly more highly distinct clusters, five categories seems a reasonable compromise between a less informative two-, three- or four-category map and a comparatively busy map on which six or more symbols are less easily differentiated. Equal-intervals cut-points, computed by dividing the full range of data values into intervals of equal length, are computationally simpler than quantiles, which requires sorting the data and apportioning an equal number of places to each category. One must also make adjustments to avoid placing identical values in different categories. Because of these adjustments, Figure 3’s categories vary in size from 9 to 11.

Figures 2 and 3 offer distinctly different portraits of crude birth rates in the United States for the millennial year. My hunch is that the equal-intervals display (Figure 2), which recognizes well-above-average birth rates in Utah (21.9) and Texas (17.8), comes closer to getting it right than the quantile map (Figure 3), which lumps together states with rates between 15.8 and 21.9. Even so, viewers of the latter display might appreciate categories based on commonsense notions like lowest fifth and highest fifth. A small sketch contrasting the two default schemes appears below.
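The following Python sketch computes both default classings for a small, made-up set of rates; only the three values quoted in the text (17.5, 17.8 and 21.9) are real, and the code is an illustration of the two schemes described above, not the procedure used to draw Figures 2 and 3.

```python
# Equal-intervals versus quantile cut-points for a five-class choropleth map,
# applied to illustrative data (not the actual 2000 state birth rates).

import numpy as np

rates = np.array([10.9, 11.4, 12.1, 12.8, 13.1, 13.5, 13.9, 14.2,
                  14.8, 15.3, 15.8, 16.2, 16.9, 17.5, 17.8, 21.9])
k = 5  # five categories, the common software default

# Equal intervals: divide the full range of data values into k bins of equal length.
equal_interval_cuts = np.linspace(rates.min(), rates.max(), k + 1)

# Quantiles: apportion (roughly) the same number of places to each category;
# ties may still need manual adjustment, as the text notes.
quantile_cuts = np.quantile(rates, np.linspace(0, 1, k + 1))

print("equal intervals:", np.round(equal_interval_cuts, 2))
print("quantiles:      ", np.round(quantile_cuts, 2))

# On this toy data the two schemes tell different stories: equal intervals puts
# the 21.9 outlier in a class of its own, while quantiles lumps it together with
# merely above-average values.
```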

FIG. 4. Number line describes variation in the data for Figures 2 and 3.

If the map author is at all concerned with full disclosure, a number line (univariate scatterplot-histogram) like Figure 4 is a must. This simple graphic quickly reveals pitfalls like the possible assignment of Arizona and Texas (17.5 and 17.8, resp.) to separate categories. Mapmakers who plot a number line are less likely to miss potentially significant groupings of data values, but there’s no guarantee that the data will form distinct categories neatly separated by readily apparent “natural breaks.” Although algorithmic strategies for finding natural breaks have been around for over three decades (Jenks and Caspall, 1971), classifications that minimize within-group variance are not necessarily revealing. Even so, a programmed natural-breaks solution is arguably better than a quantile scheme certain to ignore Utah’s exceptionally high birth rate or an equal-interval solution that might separate close outliers like Texas and Arizona.

Optimization algorithms and standardized schemes like equal-intervals and quantiles are prone to miss cut-points like the national average, which helps viewers compare individual states to the country as a whole. And for maps describing rates of change, programmed solutions readily overlook the intuitively obvious cut-point at zero, which separates gains from losses. Although the large number of potentially meaningful cut-points precludes their use in a printed article or in an atlas intended for a national audience, a dynamic map included with exploratory data analysis software or available over the Internet could let users manipulate cut-points interactively. A software vendor interested in informed analysis as well as openness would, I hope, supplement moveable cut-points with a number line so that viewers could readily recognize outliers and clumpiness in the data as well as appreciate the value of looking at and presenting more than one map.

The ability to explore data interactively can be an invitation to buttress specious arguments with biased maps. For example, a polemicist out to demonstrate that American fertility is dangerously low might devise a map like Figure 5, which assigns nearly three-quarters of the states to its lowest category. Similarly, a demagogue arguing that birth rates are too high would no doubt prefer Figure 6, which paints much of the country an ominous black. Extreme views like these are useful reminders that maps are readily manipulated.

Another hazard of mapping software is the ease with which naive users can create convincing choropleth maps with “count” variables like resident population or number of births. Although Figure 7 might look convincing, close inspection reveals nothing more than a


FIG. 5. Crude birth rates, 2000, by state, categorized to suggest dangerously low rates overall.

FIG. 7. The darker-is-more-intense metaphor of choropleth maps offers a potentially misleading view of numbers of births.

pale shadow of population—states with more people, not surprisingly, register more births, whereas those with the smallest populations are in the lowest category. If you want to explore geographic differences in fertility, it’s far more sensible to look at birth rates as well as the total fertility index and other more sensitive fertility measures used in demography (Srinivasan, 1998). A map focusing on number of births, rather than a rate, has little meaning outside an education or marketing campaign pitched at obstetricians, new mothers or toy manufacturers. Whenever a map of count data makes sense, perhaps to place a map of rates in perspective, graphic

theory condemns using a choropleth map because its ink (or toner) metaphor is misleading. Graytone area symbols, whereby darker suggests “denser” or “more intense” while lighter implies “more dispersed” or “less intense,” are wholly inappropriate for count data, which are much better served by symbols that vary in size to portray differences in magnitude (Bertin, 1983). In other words, while rate data mesh nicely with the choropleth map’s darker-means-more rule, count data require bigger-means-more coding. Although college courses on map design emphasize this fundamental distinction between intensity data and count (magnitude) data, developers of geographic information systems and other mapping software show little interest in preventing misuse of their products. No warning pops up when a user asks for a choropleth map of count data, training manuals invoke choropleth maps of count data to illustrate commands and settings, and alternative symbols like squares or circles that vary with magnitude are either absent or awkwardly implemented. One developer—I won’t name names—not only requires users to digitize center points of states but also scales the graduated symbols by height rather than area, a fallacious strategy famously ridiculed by Huff’s pair of caricatured blast furnaces, scaled by height to compare steel capacity added during the 1930s and 1940s (Huff, 1954, page 71). Map viewers see these differences in height, but differences in area are more prominent if not overwhelming. Several remedies are indicated: improved software manuals, more savvy users, metadata (data about data) that can alert the software to incompatible symbols

FIG. 6. Crude birth rates, 2000, by state, categorized to suggest dangerously high rates overall.


FIG. 8. The bigger-means-more metaphor of this dot-array map affords a more appropriate treatment of the count data in Figure 7.

and sophisticated display algorithms that automate dot-array symbols like those in Figure 8. I like the dot array because a state’s dots are not only countable but collectively constitute a magnitude symbol that visually sorts out promising and poor locations for a diaper factory. Although dot arrays are easily constructed with illustration software like Adobe Illustrator and Macromedia Freehand, describing the process in C++ would be a daunting undertaking if the programmer had to include quirky local solutions like rotating the dot array to fit South Carolina or extending it into the ocean to accommodate New Jersey.

Equally reckless is the software industry’s insistence on promoting choropleth maps with widely varied hues. Although a spectral sequence from blue up through green, yellow, orange and red might make sense to fans of the USA Today weather chart, color maps that lack a temperature chart’s emotive hues and conveniently nested bands can be difficult to decode. While software developers and map authors might argue that all the information needed to read a multi-hue map is right there, in the legend, forcing the conscientious user to look back and forth between map and key is hardly as helpful as relying on the straightforward darker-is-more metaphor. Color can be a thicket for map authors, and because color artwork is not an option for this essay, I won’t go into it here aside from noting that color is convenient for maps on which a second visual variable portrays reliability (MacEachren, Brewer and Pickle, 1998)—the cartographic equivalent of error bars.

4. MAPS AND BIVARIATE CORRELATION

Just as cut-points can be manipulated to suggest that birth rates are dangerously low or high overall, pairs

of choropleth maps can purposely heighten or suppress perceptions of bivariate association. Figure 9 offers a telling example. The map at the top describes state-level rates of population change between the 1960 and 1970 census enumerations, and the two lower maps show rates of net-migration over the same period. I call the upper map the referent because the data for the two lower maps were categorized to maximize and minimize visual similarity with this particular five-class categorization (Monmonier, 1977, pages 32–33). This three-map display originated with a comparatively innocent attempt to find cut-points that enhance the visual similarity of two maps. An iterative algorithm generated a large number of trial maps for the classed variable, evaluated each map's assumed visual similarity to the referent and saved the cut-points if the new trial was more similar than the previous best match (Monmonier, 1976). My assumption that area alone, rather than shape or location, affects a state's contribution to visual similarity is admittedly simplistic, but it seems reasonable that a pair of maps with matching graytones for Texas will look more similar on average than a pair of maps with matching graytones for Rhode Island. Although trial-and-error optimization might unreasonably inflate the visual similarity of two weakly or mildly associated variables, I chose as my classed variable the net-migration rate for the 1960s, which has a logical, highly positive (r = 0.93) relationship with population change, insofar as states with net losses or low rates of increase were plagued by net out-migration, while those that surged forward did so largely because many more people moved in than moved out. The result was the map at the lower left, which looks a great deal like the referent at the top. Since I developed Maps, Distortion, and Meaning shortly after describing the process in an article titled "Modifying objective functions and constraints for maximizing visual correspondence of choroplethic maps," it's not surprising that this coincidence inspired a wicked thought: Why not minimize correspondence visually by saving the cut-points with the worst assumed similarity? Altering a few lines of computer code yielded the map at the lower right, which looks most unlike the referent, largely because three of its five categories have only one member while a vast category ranging from −11.82 to 50.48 captures the lion's share of the states. Word to the wary: if you see a choropleth map with one huge category and several very small ones, be suspicious.
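The iterative search described above is easy to prototype. The following Python sketch is only a rough illustration of the idea, not Monmonier's published algorithm: the state areas, referent classes and data values are invented, the search is a simple random one, and, as in the text, a state's contribution to visual similarity is assumed to depend on its area alone.

```python
import random

# Invented inputs: one record per state (area, referent map class, unclassed variable).
areas = [268.6, 163.7, 147.0, 121.6, 110.6, 98.4, 84.9, 69.9, 65.8, 59.4]
referent_class = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]       # five-class referent map
values = [3.1, 7.4, 12.9, 18.2, 24.6, 5.0, 9.8, 14.1, 20.3, 27.5]

def classify(vals, cuts):
    """Assign each value to one of five classes defined by four ascending cut-points."""
    return [sum(v > c for c in cuts) for v in vals]

def similarity(cuts):
    """Area-weighted share of states whose class matches the referent map."""
    classes = classify(values, cuts)
    matched = sum(a for a, c, r in zip(areas, classes, referent_class) if c == r)
    return matched / sum(areas)

def search(maximize=True, trials=20000, seed=1):
    """Random search over cut-points; keep the most (or least) similar trial map."""
    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    best_cuts, best_score = None, float("-inf") if maximize else float("inf")
    for _ in range(trials):
        cuts = sorted(rng.uniform(lo, hi) for _ in range(4))
        s = similarity(cuts)
        if (s > best_score) if maximize else (s < best_score):
            best_cuts, best_score = cuts, s
    return best_cuts, best_score

print("most similar: ", search(maximize=True))
print("least similar:", search(maximize=False))
```

Switching from the most to the least similar map is, as the article notes, a matter of changing a comparison sign.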


FIG. 9. The two lower maps are different representations of the same data. An optimization algorithm found cut-points intended to yield displays that look very similar (lower left) and very dissimilar (lower right) to the map at the top. Cut-points for the upper map include 0.0, which separates gains from losses, and 13.3, the national rate.

5. CONCLUDING COMMENT

As Darrell Huff eloquently demonstrated a half century ago, consumers of statistical analyses and data graphics must be informed skeptics. This plea is equally relevant to map users, who need to appreciate the perils and limitations of cartographic simplification as well as its power and utility. Because abstract representations of data can distort almost as readily as they can reveal, analytical tools are also rhetorical instruments fully capable of "lying" in the hands of malevolent,

naive, or sloppily expedient authors. Huff's engaging little book performed a vital public service by calling attention to the power of analytical tools for self-deception as well as mass trickery.

REFERENCES

BERTIN, J. (1983). Semiology of Graphics. Univ. Wisconsin Press, Madison.
HUFF, D. (1954). How to Lie with Statistics. Norton, New York.


JENKS, G. F. and CASPALL, F. C. (1971). Error on choroplethic maps: Definition, measurement, reduction. Annals of the Association of American Geographers 61 217–244.
KRUSE, S. (1992). Review of How to Lie with Maps. Whole Earth Review 74 109.
MACEACHREN, A. M., BREWER, C. A. and PICKLE, L. W. (1998). Visualizing georeferenced data: Representing reliability of health statistics. Environment and Planning A 30 1547–1561.
MONMONIER, M. (1976). Modifying objective functions and constraints for maximizing visual correspondence of choroplethic maps. Canadian Cartographer 13 21–34.

MONMONIER, M. (1977). Maps, Distortion, and Meaning. Commission on College Geography Resource Paper 75-4. Association of American Geographers, Washington.
MONMONIER, M. (1991). How to Lie with Maps. Univ. Chicago Press, Chicago.
MONMONIER, M. (1996). How to Lie with Maps, 2nd ed. Univ. Chicago Press, Chicago.
SWAN, J. (1992). Review of How to Lie with Maps. J. Information Ethics 1 86–89.
SRINIVASAN, K. (1998). Basic Demographic Techniques and Applications. Sage, Thousand Oaks, CA.

Statistical Science 2005, Vol. 20, No. 3, 223–230 DOI 10.1214/088342305000000296 © Institute of Mathematical Statistics, 2005

How to Confuse with Statistics or: The Use and Misuse of Conditional Probabilities

Walter Krämer and Gerd Gigerenzer

Abstract. This article shows by various examples how consumers of statistical information may be confused when this information is presented in terms of conditional probabilities. It also shows how this confusion helps others to lie with statistics, and it suggests that both confusion and lies can be exposed by using alternative modes of conveying statistical information.

Key words and phrases: Conditional probabilities, natural frequencies, heuristical reasoning.

1. INTRODUCTION

"The notion of conditional probability is a basic tool of probability theory, and it is unfortunate that its great simplicity is somehow obscured by a singularly clumsy terminology" (Feller, 1968, page 114). Below we argue that what Feller has rightly called a "singularly clumsy terminology," in addition to obscuring the basic simplicity of concepts and ideas, easily lends itself to intentional and unintentional misinterpretation of statistical information of many sorts. Examples in Darrell Huff's book are mainly in the chapter on semiattached figures, for instance, when discussing fatalities on highways on page 78: "Four times more fatalities occur on the highways at 7 p.m. than at 7 a.m." Huff points out that this of course does not imply, as some newspaper had suggested, that it is more dangerous to drive in the evening than in the morning. Recast in the language of conditional probabilities, what Huff observes is that P (accident | 7 p.m.) should not be confused with P (7 p.m. | accident). Unfortunately, it was.

Although the term conditional probability does not appear once in Huff's remarkable book, it is clear that many other examples of statistical abuse that he discovered can be rephrased in terms of conditional probabilities. Below we survey various ways in which such reasoning can be misleading, and we provide some fresh examples. We also show that the potential for confusion is easily reduced by abandoning the conventional, "singularly clumsy terminology" of conditional probabilities in favor of presentation of information in terms of natural frequencies.

2. FALLACIES IN ENUMERATION

One class of errors involving conditional probabilities comprises outright mistakes in computing them in the first place. One instance of consciously exploiting such computational errors in order to cheat the public is a game of cards called “Three Cards in a Hat,” which used to be offered to innocent passers-by at country fairs in Germany and elsewhere. One card is red on both sides, one is white on both sides, and the other is red on one side and white on the other. The cheat draws one card blindly, and shows, for example, a red face up. The cheat then offers a wager of 10 Deutschmarks that the hidden side is also red. The passer-by is assumed to argue like this: “The card is not the white-white one. Therefore, its hidden side is either red or white. As both cases are equally likely, the probability that the hidden side of the card on the table is red is 1/2, so the wager is fair and can be accepted.” In fact, of course, the red-red card has probability 2/3, since it can be drawn in two equally probable ways (one face up or the other face up, each of

Walter Krämer is Professor of Business and Economic Statistics, Fachbereich Statistik, Universität Dortmund, 44221 Dortmund, Germany (e-mail: [email protected]). Gerd Gigerenzer is Director of the Max Planck Institute for Human Development, Lentzeallee 94, 14195 Berlin, Germany (e-mail: [email protected]).


which will display red). The example therefore boils down to an incorrect enumeration of simple events in a Laplace experiment in the subpopulation composed of the remaining possibilities. As such, it has famous antecedents: The erroneous assignment by d'Alembert (1779, entry "Croix ou pile") of a probability of 1/3 for heads-heads when throwing a coin twice, or the equally erroneous assertion by Leibniz (in a letter to L. Bourguet from March 2, 1714, reprinted in Leibniz, 1887, pages 569–570) that, when throwing two dice, a sum of 11 is as likely as a sum of 12. A sum of 11, so he argued, can be obtained by adding 5 and 6, and a sum of 12 by adding 6 and 6. It did not occur to him that there are two equally probable ways of adding 5 and 6, but only one way to obtain 6 and 6. Given illustrious precedents such as these, it comes as no surprise that wrongly inferred conditional and unconditional probabilities are lurking everywhere. Prominent textbook examples are the paradox of the second ace or the problem of the second boy (see, e.g., Bar-Hillel and Falk, 1982), not to mention the famous car-and-goat puzzle, also called the Monty Hall problem, which has engendered an enormous literature of its own. These puzzles are mainly of interest as mathematical curiosities and they are rarely used for statistical manipulation. We shall not dwell on them in detail here, but they serve to point out what many consumers of statistical information are ill-prepared to master.
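The incorrect enumeration is easy to expose by simulation. The following Python sketch (an illustration added here, not part of the original text) draws a card and a visible face uniformly at random and estimates the probability that the hidden face matches the shown one; the estimate is close to 2/3, not 1/2.

```python
import random

cards = [("red", "red"), ("white", "white"), ("red", "white")]

def draw(rng):
    """Pick a card and a face uniformly at random; return (shown face, hidden face)."""
    card = rng.choice(cards)
    side = rng.randrange(2)
    return card[side], card[1 - side]

rng = random.Random(0)
shown_red = hidden_red = 0
for _ in range(100_000):
    shown, hidden = draw(rng)
    if shown == "red":            # the situation the cheat offers to bet on
        shown_red += 1
        hidden_red += hidden == "red"

print("P(hidden red | shown red) is approximately", hidden_red / shown_red)  # about 0.67
```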

3. CONFUSING CONDITIONAL AND CONDITIONING EVENTS

German medical doctors with an average of 14 years of professional experience were asked to imagine using a certain test to screen for colorectal cancer. The prevalence of this type of cancer was 0.3%, the sensitivity of the test (the conditional probability of detecting the cancer when it is present) was 50% and the false positive rate was 3% (Gigerenzer, 2002; Gigerenzer and Edwards, 2003). The doctors were asked: "What is the probability that someone who tests positive actually has colorectal cancer?" The correct answer is about 5%. However, the doctors' answers ranged from 1% to 99%, with about half of them estimating this probability as 50% (the sensitivity) or 47% (the sensitivity minus the false positive rate). The most common fault was to confuse the conditional probability of cancer, given that the test is positive, with the conditional probability that the test is positive, given that the individual has cancer. An analogous error also occurs when people are asked to interpret the result of a statistical test of significance,

and sometimes there are disastrous consequences. In the fall of 1973 in the German city of Wuppertal, a local workman was accused of having murdered another local workman's wife. A forensic expert (correctly) computed a probability of only 0.027 that blood found on the defendant's clothes and on the scene of the crime by chance matched the victim's and defendant's blood groups, respectively. From this figure the expert then derived a probability of 97.3% for the defendant's guilt, and later, this probability came close to 100% by adding evidence from textile fibers. Only a perfect alibi saved the workman from an otherwise certain conviction (see the account in Ziegler, 1974). Episodes such as this have undoubtedly happened in many courtrooms all over the world (Gigerenzer, 2002). On a formal level, a probability of 2.7% for the observed data, given innocence, was confused with a probability of 2.7% for innocence, given the observed data. Even in a Bayesian setting with certain a priori probabilities for guilt and innocence, one finds that a probability of 2.7% for the observed data given innocence does not necessarily translate into a probability of 97.3% that the defendant is guilty. And from the frequentist perspective, which is more common in forensic science, it is nonsense to assign a probability to either the null or to the alternative hypothesis.

Still, students and, remarkably, teachers of statistics often misread the meaning of a statistical test of significance. Haller and Krauss (2002) asked 30 statistics instructors, 44 statistics students and 39 scientific psychologists from six psychology departments in Germany about the meaning of a significant two-sample t-test (significance level = 1%). The test was supposed to detect a possible treatment effect based on a control group and a treatment group. The subjects were asked to comment upon the following six statements (all of which are false). They were told in advance that several or perhaps none of the statements were correct.

(1) You have absolutely disproved the null hypothesis that there is no difference between the population means. [true/false]
(2) You have found the probability of the null hypothesis being true. [true/false]
(3) You have absolutely proved your experimental hypothesis that there is a difference between the population means. [true/false]
(4) You can deduce the probability of the experimental hypothesis being true. [true/false]


(5) You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision. [true/false]
(6) You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions. [true/false]

All of the statistics students marked at least one of the above faulty statements as correct. And, quite disconcertingly, 90% of the academic psychologists and 80% of the methodology instructors did as well! In particular, one third of both the instructors and the academic psychologists and 59% of the statistics students marked item 4 as correct; that is, they believe that, given a rejection of the null at level 1%, they can deduce a probability of 99% that the alternative is correct. Ironically, one finds that this misconception is perpetuated in many textbooks. Examples from the American market include Guilford (1942, and later editions), which was probably the most widely read textbook in the 1940s and 1950s, Miller and Buckhout (1973, statistical appendix by Brown, page 523) or Nunnally (1975, pages 194–196). Additional examples are collected in Gigerenzer (2000, Chapter 13) and Nickerson (2000). On the German market, there is Wyss (1991, page 547) or Schuchard-Fischer et al. (1982), who on page 83 of their best-selling textbook explicitly advise their readers that a rejection of the null at 5% implies a probability of 95% that the alternative is correct.

In one sense, this error can be seen as a probabilistic variant of a classic rule of logic (modus tollens): (1) "All human beings will eventually die" and (2) "Socrates is a human being" implies (3) "Socrates will die." Now, what if (1) is not necessarily true, only highly probable [in the sense that the statement "If A (= human being) then B (= eventual death)" holds not always, only most of the time]? Does this imply that its logical equivalent "If not B then not A" has the same large probability attached to it? This question has led to a lively exchange of letters in Nature (see Beck-Bornholdt and Dubben, 1996, 1997; or Edwards, 1996), which culminated in the scientific proof that the Pope is an alien: (1) A randomly selected human being is most probably not the Pope (the probability of selecting the Pope is 1 : 6 billion = 0.000 000 000 17). (2) John Paul II is the Pope. (3) Therefore, John Paul II is most probably not a human being. Setting aside the fact that John Paul II has not been randomly selected from among all human beings, one finds that this argument again reflects the


confusion that results from interchanging conditioning and conditional events. It is based on taking as equal the conditional probabilities P (not Pope | human) and P (not human | Pope). Since

P (Ā | B) = P (B̄ | A)  ⇐⇒  P (A | B) = P (B | A),

this is equivalent to taking as equal, in a universe comprised of humans and aliens, the conditional probabilities P (Pope | human) and P (human | Pope), which is nonsense. Or in terms of rules of logic: If the statement "If human then not Pope" holds most of the time, one cannot infer, but sometimes does, that its logical equivalent "If Pope then not human" likewise holds most of the time. Strange as it may seem, this form of reasoning has even made its way into the pages of respectable journals. For instance, it was used by Leslie (1992) to prove that doom is near (the "doomsday argument"; see also Schrage, 1993). In this case the argument went: (1) If mankind is going to survive for a long time, then all human beings born so far, including myself, are only a small proportion of all human beings that will ever be born (i.e., the probability that I observe myself is negligible). (2) I observe myself. (3) Therefore, the end is near. This argument is likewise based on interchanging conditioning and conditional events. While it is perfectly true that the conditional probability that a randomly selected human being (from among all human beings that have ever been and will ever be born) happens to be me, given that doom is near, is much larger than the conditional probability of the same event, given that doom is far away, one cannot infer from this inequality that the conditional probability that doom is near, given my existence, is likewise much larger than the conditional probability that doom is far away, given my existence. More formally: while the inequality in the following expression is correct, the equality signs are not:

P (doom is near | me) = P (me | doom is near) ≫ P (me | doom far away) = P (doom far away | me).

4. CONDITIONAL PROBABILITIES AND FAVORABLE EVENTS

The tendency to confuse conditioning and conditional events can also lead to other incorrect conclusions. The most popular one is to infer from a conditional probability P (A | B) that is seen as “large”


that the conditional event A is "favorable" to the conditioning event B. This term was suggested by Chung (1942) and means that P (B | A) > P (B). This confusion occurs in various contexts and is possibly the most frequent logical error that is found in the interpretation of statistical information. Here are some examples from the German press (with the headlines translated into English):

• "Beware of German tourists" (according to Der Spiegel magazine, most foreign skiers involved in accidents in a Swiss skiing resort came from Germany).
• "Boys more at risk on bicycles" (the newspaper Hannoversche Allgemeine Zeitung reported that among children involved in bicycle accidents the majority were boys).
• "Soccer most dangerous sport" (the weekly magazine Stern commenting on a survey of accidents in sports).
• "Private homes as danger spots" (the newspaper Die Welt musing about the fact that a third of all fatal accidents in Germany occur in private homes).
• "German shepherd most dangerous dog around" (the newspaper Ruhr-Nachrichten on a statistic according to which German shepherds account for a record 31% of all reported attacks by dogs).
• "Women more disoriented drivers" (the newspaper Bild commenting on the fact that among cars that were found entering a one-way street in the wrong direction, most were driven by women).

These examples can easily be extended. Most of them result from unintentionally misreading the statistical evidence. When there are cherished stereotypes to conserve, such as the German tourist bullying his fellow vacationers, or women somehow lost in space, perhaps some intentional neglect of logic may have played a role as well. Also, not all of the above statements are necessarily false. It might, for instance, well be true that when 1000 men and 1000 women drivers are given a chance to enter a one-way street the wrong way, more women than men will actually do so, but the survey by Bild simply counted wrongly entering cars and this is certainly no proof of their claim. For example, what if there were no men on the street at that time of the day? And in the case of the Swiss skiing resort, where almost all foreign tourists came from Germany, the attribution of abnormally dangerous behavior to this class of visitors is clearly wrong.

In terms of favorable events, Der Spiegel, on observing that, among foreigners, P (German tourist | skiing accident) was "large," concluded that the reverse conditional probability was also large, in particular, that being a German tourist increases the chances of being involved in a skiing accident: P (skiing accident | German tourist) > P (skiing accident). Similarly, Hannoversche Allgemeine Zeitung concluded from P (boy | bicycle accident) = large that P (bicycle accident | boy) > P (bicycle accident) and so on. In all these examples, the point of departure always was a large value of P (A | B), which then led to the—possibly unwarranted—conclusion that P (B | A) > P (B). From the symmetry

P (B | A) > P (B)  ⇐⇒  P (A | B) > P (A)

it is, however, clear that one cannot infer anything about A's favorableness for B from P (A | B) alone, and that one needs information on P (A) as well. The British Home Office nevertheless once did so in its call for more attention to domestic violence (Cowdry, 1990). Among 1221 female murder victims between 1984 and 1988, 44% were killed by their husbands or lovers, 18% by other relatives, and another 18% by friends or acquaintances. Only 14% were killed by strangers. Does this prove that P (murder | encounter with husband) > P (murder | encounter with a stranger), that is, that marriage is favorable to murder? Evidently not. While it is perfectly fine to investigate the causes and mechanics of domestic violence, there is no evidence that the private home is a particularly dangerous environment (even though, as The Times mourns, "assaults . . . often happen when families are together").
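A toy calculation makes the point. The numbers in the following Python sketch are invented (they are not Der Spiegel's figures): if nearly all foreign visitors to the resort are German, then P (German tourist | skiing accident) is large even when German tourists are no more accident-prone than anyone else.

```python
# Hypothetical visitor and accident counts; the accident rate is 1% in both groups.
visitors = {"German": 9000, "other": 1000}
accidents = {"German": 90, "other": 10}

total_visitors = sum(visitors.values())
total_accidents = sum(accidents.values())

p_german_given_accident = accidents["German"] / total_accidents      # 0.90, looks alarming
p_accident_given_german = accidents["German"] / visitors["German"]   # 0.010
p_accident = total_accidents / total_visitors                        # 0.010

print("P(German | accident) =", p_german_given_accident)
print("P(accident | German) =", p_accident_given_german)
print("P(accident)          =", p_accident)
# Being a German tourist is not "favorable" to having an accident:
# the last two probabilities are identical.
```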

5. FAVORABLENESS AND SIMPSON'S PARADOX

Another avenue through which the attribute of favorableness can be incorrectly attached to certain events is Simpson's paradox (Blyth, 1973), which in our context asserts that it is possible that B is favorable to A when C holds, B is favorable to A when C does not hold, yet overall, B is unfavorable to A. Formally, one has

P (A | B ∩ C) > P (A)  and  P (A | B ∩ C̄) > P (A),


yet

P (A | B) < P (A).

This paradox also extends to situations where C1 ∪ · · · ∪ Cn = Ω, Ci ∩ Cj = ∅ (i ≠ j). For real-life examples see, e.g., Wagner (1982) or Krämer (2002, 2004).

One instance where Simpson's paradox (to be precise: the refusal to take account of Simpson's paradox) has been deliberately used to mislead the public is the debate on the causes of cancer in Germany. The official and fiercely defended credo of the Green movement has it that the increase in cancer deaths from well below 20% of all deaths after the war to almost 25% nowadays is mostly due to industrial pollution and chemical waste of all sorts. However, as Table 1 shows, among women, the probability of dying from cancer has actually decreased for young and old alike! Similar results hold for men. These data refer only to mortality from cancer, not to the incidence of cancer, and therefore have to be interpreted with care. Still, the willful disregard of the most important explanatory variable "age" has turned the overall increase in cancer deaths into a potent propaganda tool.

TABLE 1
Probability of dying from cancer. Number of women (among 100,000 in the respective age groups) who died from cancer in Germany

Age      1970    2001
0–4         7       3
5–9         6       2
10–14       4       2
15–19       6       2
20–24       8       4
25–29      12       6
30–34      21      13
35–39      45      25
40–44      84      51
45–49     144      98
50–54     214     161
55–59     305     240
60–64     415     321
65–69     601     468
70–74     850     656
75–79    1183     924
80–84    1644    1587

Source: Statistisches Jahrbuch für die Bundesrepublik Deutschland.
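The mechanism behind Table 1 can be reduced to a two-age-group caricature. The counts in the following Python sketch are invented for illustration (they are not taken from Table 1): every age-specific death rate falls between the two years, yet the crude rate rises because the later population is much older.

```python
# Entirely hypothetical counts illustrating Simpson's paradox with age as the
# lurking variable.
population = {            # (year, age group) -> population size
    (1970, "young"): 900, (1970, "old"): 100,
    (2001, "young"): 500, (2001, "old"): 500,
}
deaths = {                # (year, age group) -> cancer deaths
    (1970, "young"): 9,   (1970, "old"): 30,   # rates 1.0% and 30%
    (2001, "young"): 4,   (2001, "old"): 125,  # rates 0.8% and 25%, both lower
}

for year in (1970, 2001):
    for age in ("young", "old"):
        rate = deaths[(year, age)] / population[(year, age)]
        print(f"{year} {age:5s}: {rate:.1%}")
    total_d = sum(deaths[(year, a)] for a in ("young", "old"))
    total_p = sum(population[(year, a)] for a in ("young", "old"))
    print(f"{year} crude : {total_d / total_p:.1%}")
# Age-specific rates drop (1.0% -> 0.8% and 30% -> 25%) while the crude rate
# rises from 3.9% to 12.9%.
```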


If B is favorable to A, then by a simple calculation B is unfavorable to Ā. However, B can still be favorable to subsets of Ā. This is also known as Kaigh's (1989) paradox. In words: If knowing that B has occurred makes some other event A more probable, it makes the complementary event Ā less probable. However, we cannot infer that subsets of Ā have likewise become less probable. Schucany (1989, Table 1) gives a hypothetical example where Kaigh's paradox is used to misrepresent the facts. Suppose a firm hires 158 out of 1000 applicants (among which 200 are black, 200 are Hispanic and 600 white). Of these, 38 non-whites and 120 whites are hired, amounting to 6.3% and 20% of the respective applicants. Being white is therefore favorable to being hired. But this does not imply that being in some non-white population is necessarily unfavorable to being hired. Assume, for instance, that 36 of the 38 non-whites who are hired are Hispanics. This implies that being Hispanic is likewise favorable for being hired. Although the selection rate for Hispanics is less than that for whites, we still have

P (being hired | Hispanic) = 36/200 = 18% > P (being hired) = 158/1000 = 15.8%.

Schucany (1989, page 94) notes: "Regardless of whether we call it a paradox, that such situations will be misconstrued by the statistically naive is a fairly safe bet."

A final and formally trivial example for faulty inferences from conditional probabilities concerns the inequality P (A | B ∩ D) > P (A | C ∩ D). Plainly, this does not imply P (A | B) > P (A | C), but this conclusion is still sometimes drawn by some authors. A German newspaper (quoted in Swoboda, 1971, page 215) once claimed that people get happier as they grow older. The paper's "proof" runs as follows: Among people who die at age 20–25, about 25% commit suicide. This percentage then decreases with advancing age; thus, for instance, among people who die aged over 70, only 2% commit suicide. Formally, one can put these observations as

P (suicide | age 20–25 and death) > P (suicide | age > 70 and death),

and while this is true, it certainly does not imply

P (suicide | age 20–25) > P (suicide | age > 70).


In fact, a glance at any statistical almanac shows that quite the opposite is true. Here is a more recent example from the U.S., where likewise P (A | B) is confused with P (A | B ∩ D). This time the confusion is spread by Alan Dershowitz, a renowned Harvard Law professor who advised the O. J. Simpson defense team. The prosecution had argued that Simpson's history of spousal abuse reflected a motive to kill, advancing the premise that "a slap is a prelude to homicide" (see Gigerenzer, 2002, pages 142–145). Dershowitz, however, called this argument "a show of weakness" and said: "We knew that we could prove, if we had to, that an infinitesimal percentage—certainly fewer than 1 of 2,500—of men who slap or beat their domestic partners go on to murder them." Thus, he argued that the probability of the event K that a husband killed his wife if he battered her was small, P (K | battered) = 1/2,500. The relevant probability, however, is not this one, as Dershowitz would have us believe. Instead, the relevant probability is that of a man murdering his partner given that he battered her and that she was murdered, P (K | battered and murdered). This probability is about 8/9 (Good, 1996). It must, of course, not be confused with the probability that O. J. Simpson is guilty; a jury must take into account much more evidence than battering. But it shows that battering is a fairly good predictor of guilt for murder, contrary to Dershowitz's assertions.

6. HOW TO MAKE THE SOURCES OF CONFUSION DISAPPEAR

Fallacies can sometimes be attributed to the unwarranted application of what we have elsewhere called “fast and frugal heuristics” (Gigerenzer, 2004). Heuristics are simple rules that exploit evolved mental capacities, as well as structures of environments. When applied in an environment for which they were designed, heuristics often work well, often outperforming more complicated optimizing models. Nevertheless, when applied in an unsuitable environment, they can easily mislead. When a heuristic misleads, it is not always the heuristic that is to blame. More often than not, it is the structure of the environment that does not fit (Hoffrage et al., 2000). The examples we have seen here amount to what has elsewhere been called a shift of base or the

base-rate fallacy (Borgida and Brekke, 1981). In fact, this environmental change underlies most of the misleading arguments with conditional probabilities. Consider, for instance, the question: "What is the probability that a woman with a positive mammography result actually has breast cancer?" There are two ways to represent the relevant statistical information: in terms of conditional probabilities, or in terms of natural frequencies.

Conditional probabilities: The probability that a woman has breast cancer is 0.8%. If she has breast cancer the probability that a mammogram will show a positive result is 90%. If a woman does not have breast cancer the probability of a positive result is 7%. Take, for example, a woman who has a positive result. What is the probability that she actually has breast cancer?

Natural frequencies: Our data tell us that eight out of every 1000 women have breast cancer. Of these eight women with breast cancer seven will have a positive result on mammography. Of the 992 women who do not have breast cancer some 70 will still have a positive mammogram. Take, for example, a sample of women who have positive mammograms. How many of these women actually have breast cancer?

Apart from rounding, the information is the same in both of these summaries, but with natural frequencies the message comes through much more clearly. We see quickly that only seven of the 77 women who test positive actually have breast cancer, which is one in 11 (9%). Natural frequencies correspond to the way humans have encountered statistical information during most of their history. They are called "natural" because, unlike conditional probabilities or relative frequencies, on each occurrence the numerical quantities in our summary refer to the same class of observations. For instance, the natural frequencies "seven women" (with a positive mammogram and cancer) and "70 women" (with a positive mammogram and no breast cancer) both refer to the same class of 1000 women. In contrast, the conditional probability 90% (the sensitivity) refers to the class of eight women with breast cancer, but the conditional probability 7% (the false positive rate) refers to a different class of 992 women without breast cancer. This switch of reference class easily confuses the minds of both doctors and patients.

To judge the extent of the confusion consider Figure 1, which shows the responses of 48 experienced doctors who were given the information cited above, except that in this case the statistics were a base rate


FIG. 1. Doctors' estimates of the probability of breast cancer in women with a positive result on mammography (Gigerenzer, 2002).

of cancer of 1%, a sensitivity of 80%, and a false positive rate of 10%. Half the doctors received the information in conditional probabilities and half received the data as expressed by natural frequencies. When asked to estimate the probability that a woman with a positive screening mammogram actually has breast cancer, doctors who received conditional probabilities gave answers that ranged from 1% to 90%; very few of them gave the correct answer of about 8%. In contrast, most of the doctors who were given natural frequencies gave the correct answer or were close to it. Simply converting the information into natural frequencies was enough to turn much of the doctors' innumeracy into insight. Presenting information in natural frequencies is therefore a simple and effective mind tool to reduce the confusion resulting from conditional probabilities.
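The arithmetic behind both presentations is easy to check. The short Python sketch below (an illustration added here, not part of the original text) computes the positive predictive value once with Bayes' rule applied to the conditional probabilities and once by counting natural frequencies in a cohort of 1000 women; the two routes agree, giving roughly 8% for the 1%/80%/10% version of the problem quoted above.

```python
def ppv_from_probabilities(prevalence, sensitivity, false_positive_rate):
    """Bayes' rule: P(cancer | positive test)."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)

def ppv_from_natural_frequencies(prevalence, sensitivity, false_positive_rate, cohort=1000):
    """The same calculation phrased as counts in a cohort of women."""
    with_cancer = cohort * prevalence                                 # e.g., 10 of 1000
    positive_and_cancer = with_cancer * sensitivity                   # e.g., 8 of those 10
    positive_no_cancer = (cohort - with_cancer) * false_positive_rate # e.g., 99 of the rest
    return positive_and_cancer / (positive_and_cancer + positive_no_cancer)

# The version given to the doctors: base rate 1%, sensitivity 80%, false positives 10%.
print(ppv_from_probabilities(0.01, 0.80, 0.10))        # about 0.075
print(ppv_from_natural_frequencies(0.01, 0.80, 0.10))  # identical, roughly 1 in 13

# The earlier version of the problem: base rate 0.8%, sensitivity 90%, false positives 7%.
print(ppv_from_probabilities(0.008, 0.90, 0.07))       # about 0.094, roughly 1 in 11
```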

ACKNOWLEDGMENTS

Research supported by Deutsche Forschungsgemeinschaft. We are grateful to Robert Ineichen for helping us to track down a reference to Leibniz, and to George Casella, Gregor Schöner, Michael Steele, Andy Tremayne and Rona Unrau for helpful criticism and comments.

REFERENCES

BAR-HILLEL, M. and FALK, R. (1982). Some teasers concerning conditional probabilities. Cognition 11 109–122.

BECK-BORNHOLDT, H.-P. and DUBBEN, H.-H. (1996). Is the Pope an alien? Nature 381 730.
BECK-BORNHOLDT, H.-P. and DUBBEN, H.-H. (1997). Der Hund, der Eier legt—Erkennen von Fehlinformationen durch Querdenken. Rowohlt, Hamburg.
BLYTH, C. R. (1973). Simpson's paradox and mutually favorable events. J. Amer. Statist. Assoc. 68 746.
BORGIDA, E. and BREKKE, N. (1981). The base rate fallacy in attribution and prediction. In New Directions in Attribution Research (J. H. Harvey, W. J. Ickes and R. F. Kidd, eds.) 3 63–95. Erlbaum, Hillsdale, NJ.
CHUNG, K.-L. (1942). On mutually favorable events. Ann. Math. Statist. 13 338–349.
COWDRY, Q. (1990). Husbands or lovers kill half of women murder victims. The Times, April 14, p. 11.
D'ALEMBERT, J. ET AL. (1779). Encyclopédie, ou dictionnaire raisonné des sciences, des arts et des métiers 10. J. L. Pellet, Geneva.
EDWARDS, A. W. F. (1996). Is the Pope an alien? Nature 382 202.
FELLER, W. (1968). An Introduction to Probability Theory and Its Applications 1, 3rd ed. Wiley, New York.
GIGERENZER, G. (2000). Adaptive Thinking—Rationality in the Real World. Oxford Univ. Press, New York.
GIGERENZER, G. (2002). Calculated Risks: How to Know When Numbers Deceive You. Simon and Schuster, New York. [British edition (2002). Reckoning with Risk. Penguin, London.]
GIGERENZER, G. (2004). Fast and frugal heuristics: The tools of bounded rationality. In Handbook of Judgement and Decision Making (D. Koehler and N. Harvey, eds.) 62–88. Blackwell, Oxford.
GIGERENZER, G. and EDWARDS, A. (2003). Simple tools for understanding risks: From innumeracy to insight. British Medical J. 327 741–744.
GOOD, I. J. (1996). When batterer becomes murderer. Nature 381 481.


GUILFORD, J. P. (1942). Fundamental Statistics in Psychology and Education. McGraw–Hill, New York.
HALLER, H. and KRAUSS, S. (2002). Misinterpretations of significance: A problem students share with their teachers? Methods of Psychological Research 7 1–20.
HOFFRAGE, U., LINDSAY, S., HERTWIG, R. and GIGERENZER, G. (2000). Communicating statistical information. Science 290 2261–2262.
KAIGH, W. D. (1989). A category representation paradox. Amer. Statist. 43 92–97.
KRÄMER, W. (2002). Denkste—Trugschlüsse aus der Welt des Zufalls und der Zahlen, 3rd paperback ed. Piper, München.
KRÄMER, W. (2004). So lügt man mit Statistik, 5th paperback ed. Piper, München.
LESLIE, J. (1992). The doomsday argument. Math. Intelligencer 14 48–51.
VON LEIBNIZ, G. W. (1887). Die philosophischen Schriften (C. I. Gerhardt, ed.) 3. Weidmann, Berlin.
MILLER, G. A. and BUCKHOUT, R. (1973). Psychology: The Science of Mental Life, 2nd ed. Harper and Row, New York.

NICKERSON, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods 5 241–301.
NUNNALLY, J. C. (1975). Introduction to Statistics for Psychology and Education. McGraw–Hill, New York.
SCHRAGE, G. (1993). Letter to the editor. Math. Intelligencer 15 3–4.
SCHUCANY, W. R. (1989). Comment on "A category representation paradox," by W. D. Kaigh. Amer. Statist. 43 94–95.
SCHUCHARD-FISCHER, C., BACKHAUS, K., HUMMEL, H., LOHRBERG, W., PLINKE, W. and SCHREINER, W. (1982). Multivariate Analysemethoden—Eine anwendungsorientierte Einführung, 2nd ed. Springer, Berlin.
SWOBODA, H. (1971). Knaurs Buch der modernen Statistik. Droemer Knaur, München.
WAGNER, C. H. (1982). Simpson's paradox in real life. Amer. Statist. 36 46–48.
WYSS, W. (1991). Marktforschung von A – Z. Demoscope, Luzern.
ZIEGLER, H. (1974). Das Alibi des Schornsteinfegers—Unwahrscheinliche Wahrscheinlichkeitsrechnung in einem Mordprozeß. Rheinischer Merkur 39.

Statistical Science 2005, Vol. 20, No. 3, 231–238 DOI 10.1214/088342305000000269 © Institute of Mathematical Statistics, 2005

How to Lie with Bad Data

Richard D. De Veaux and David J. Hand

Abstract. As Huff’s landmark book made clear, lying with statistics can be accomplished in many ways. Distorting graphics, manipulating data or using biased samples are just a few of the tried and true methods. Failing to use the correct statistical procedure or failing to check the conditions for when the selected method is appropriate can distort results as well, whether the motives of the analyst are honorable or not. Even when the statistical procedure and motives are correct, bad data can produce results that have no validity at all. This article provides some examples of how bad data can arise, what kinds of bad data exist, how to detect and measure bad data, and how to improve the quality of data that have already been collected. Key words and phrases: Data quality, data profiling, data rectification, data consistency, accuracy, distortion, missing values, record linkage, data warehousing, data mining. consumer products supply chain concluded that “bad data costs the electro industry $1.2B annually.” While the accuracy of these claims is hard to verify, it is clear that data quality is a concern to business worldwide. An informal survey of topics in management seminars shows the prevalence of data quality as an important topic and concern for top-level executives and managers. Anyone who has analyzed real data knows that the majority of their time on a data analysis project will be spent “cleaning” the data before doing any analysis. Common wisdom puts the extent of this at 60–95% of the total project effort, and some studies (Klein, 1998) suggest that “between one and ten percent of data items in critical organizational databases are estimated to be inaccurate” (Laudon, 1986; Madnick and Wang, 1992; Morey, 1982; Redman, 1992). Somewhat paradoxically, most statistical training assumes that the data arrive “precleaned.” Students, whether in Ph.D. programs or in an undergraduate introductory course, are not routinely taught how to check data for accuracy or even to worry about it. Exacerbating the problem further are claims by software vendors that their techniques can produce valid results no matter what the quality of the incoming data. How pervasive are bad data? Not only is industry concerned with bad data, but examples are ubiquitous in the scientific literature of the past 50 years as well. In the 1978 Fisher Lecture “Statistics in Society: Problems Unsolved and Unformulated (Kruskal,

1. INTRODUCTION

Bad data can ruin any analysis. “Garbage in, garbage out” is as true today in this era of terabytes of data and distributed computing as it was in 1954 when How to Lie with Statistics was published (Huff, 1954). Distortions in the data are likely to produce distortions in the conclusions, to the extent that these may be wildly inaccurate, completely invalid or useless. The cost of bad data can be enormous. Estimates of how much bad data cost U.S. industry permeate industry publications and the Internet. PricewaterhouseCoopers (2004), in a recent survey of “Top 500” corporations, found that most corporations are experiencing major impacts to their business as a result of poor data quality. In their survey, 75% of respondents reported significant problems as the result of defective data. David Loshin, author of Enterprise Knowledge Management: The Data Quality Approach, states that “scrap and rework attributable to poor data quality accounts for 20–25% of an organization’s budget” (Loshin, 2001). An A. T. Kearney study of the retail Richard D. De Veaux is Professor, Department of Mathematics and Statistics, Bronfman Science Center, Williams College, Williamstown, Massachusetts 01267, USA (e-mail: [email protected]). David J. Hand is Professor, Department of Mathematics, Imperial College, London SW7 2AZ, UK (e-mail: d.j.hand@ imperial.ac.uk). 231


1981), Kruskal devoted much of his time to “inconsistent or clearly wrong data, especially in large data sets.” As just one example, he cited a 1960 census study that showed 62 women, aged 15 to 19 with 12 or more children. Coale and Stephan (1962) pointed out similar anomalies when they found a large number of 14-year-old widows. In a classic study by Wolins (1962), a researcher attempted to obtain raw data from 37 authors of articles appearing in American Psychological Association journals. Of the seven data sets that were actually obtained, three contained gross data errors. A 1986 study by the U.S. Census estimated that between 3 and 5% of all census enumerators engaged in some form of fabrication of questionnaire responses without actually visiting the residence. This practice was widespread enough to warrant its own term: curbstoning, which is the “enumerator jargon for sitting on the curbstone filling out the forms with made-up information” (Wainer, 2004). While curbstoning does not imply bad data per se, at the very least, such practices imply that the data set you are analyzing does not describe the underlying mechanism you think you are describing. What exactly are bad data? The quality of data is relative both to the context and to the question one is trying to answer. If data are wrong, then they are obviously bad, but context can make the distinction more subtle. In a regression analysis, errors in the predictor variables may bias the estimates of the regression coefficients and this will matter if the aim hinges on interpreting these values, but it will not matter if the aim is predicting response values for new cases drawn from the same distribution. Likewise, whether data are “good” also depends on the aims: precise, accurate measurements are useless if one is measuring the wrong thing. Increasingly in the modern world, especially in data mining, we are confronted with secondary data analysis: the analysis of data that have been collected for some other purpose (e.g., analyzing billing data for transaction patterns). The data may have been perfect for the original aim, but could have serious deficiencies for the new analysis. For this paper, we will take a rather narrow view of data quality. In particular, we are concerned with data accuracy, so that, for us, “poor quality data are defined as erroneous values assigned to attributes of some entity,” as in Pierce (1997). A broader perspective might also take account of relevance, timeliness, existence, coherence, completeness, accessibility, security

and other data attributes. For many problems, for example, data gradually become less and less relevant—a phenomenon sometimes termed data decay or population drift (Hand, 2004a). Thus the characteristics collected on mortgage applicants 25 years ago would probably not be of much use for developing a predictive risk model for new applicants, no matter how accurately they were measured at the time. In some environments, the time scale that renders a model useless can become frighteningly short. A model of customer behavior on a web site may quickly become out of date. Sometimes different aspects of this broader interpretation of data quality work in opposition. Timeliness and accuracy provide an obvious example (and, indeed, one which is often seen when economic time series are revised as more accurate information becomes available).

From the perspective of the statistical analyst, there are three phases in data evolution: collection, preliminary analysis and modeling. Of course, the easiest way to deal with bad data is to prevent poor data from being collected in the first place. Much of sample survey methodology and experimental design is devoted to this subject, and many famous stories of analysis gone wrong are based on faulty survey designs or experiments. The Literary Digest poll proclaiming Landon's win over Roosevelt in 1936 that starred in Chapter 1 of Huff (1954) is just one of the more famous examples. At the other end of the process, we have resistant and robust statistical procedures explicitly designed to perform adequately even when a percentage of the data do not conform or are inaccurate, or when the assumptions of the underlying model are violated. In this article we will concentrate on the "middle" phase of bad data evolution—that is, on its discovery and correction. Of course, no analysis proceeds linearly from initial collection to final report. The discoveries in one phase can impact the entire analysis. Our purpose will be to discuss how to recognize and discover these bad data using a variety of examples, and to discuss their impact on subsequent statistical analysis.

In the next section we discuss the causes of bad data. Section 3 discusses the ways in which data can be bad. In Section 4 we turn to the problem of detecting bad data and in Section 5 we provide some guidelines for improving data quality. We summarize and present our conclusions in Section 6.

2. WHAT ARE THE CAUSES OF BAD DATA?

There is an infinite variety to the ways in which data can go bad, and the specifics depend on the underlying


process that generates the data. Data may be distorted from the outset during the initial collection phase or they may be distorted when the data are transcribed, transferred, merged or copied. Finally, they may deteriorate, change definition or otherwise go through transformations that render them less representative of the original underlying process they were designed to measure. The breakdown in the collection phase can occur whether the data are collected by instrument or directly recorded by human beings. Examples of breakdowns at the instrument level include instrument drift, initial miscalibration, or a large random or otherwise unpredictable variation in measurement. As an example of instrument level data collection, consider the measurement of the concentration of a particular chemical compound by gas chromatography, as used in routine drug testing. When reading the results of such a test, it is easy to think that a machine measures the amount of the compound in an automatic and straightforward way, and thus that the resulting data are measuring some quantity directly. It turns out to be a bit more complicated. At the outset, a sample of the material of interest is injected into a stream of carrier gas where it travels down a silica column heated by an oven. The column then separates the mixture of compounds according to their relative attraction to a material called the adsorbent. This stream of different compounds travels "far enough" (via choices of column length and gas flow rates) so that by the time they pass by the detector, they are well separated (at least in theory). At this point, both the arrival time and the concentration of the compound are recorded by an electromechanical device (depending on the type of detector used). The drifts inherent in the oven temperature, gas flow, detector sensitivity and a myriad of other environmental conditions can affect the recorded numbers. To determine actual amounts of material present, a known quantity must be tested at about the same time and the machine must be calibrated. Thus the number reported as a simple percentage of compound present has not only been subjected to many potential sources of error in its raw form, but is actually the output of a calibration model. Examples of data distortion at the human level include misreading of a scale, incorrect copying of values from an instrument, transposition of digits and misplaced decimal points. Of course, such mistakes are not always easy to detect. Even if every data value is checked for plausibility, it often takes expert knowledge to know if a data value is reasonable or absurd. Consider the report in The Times of London


that some surviving examples of the greater mouse-eared bat, previously thought to be extinct, had been discovered hibernating in West Sussex. It went on to assert that "they can weigh up to 30 kg" (see Hand, 2004b, Chapter 4). A considerable amount of entertaining correspondence resulted from the fact that they had misstated the weight by three decimal places. Sometimes data are distorted from the source itself, either knowingly or not. Examples occur in survey work and tax returns, just to name two. It is well known to researchers of sexual behavior that men tend to report more lifetime sexual partners than women, a situation that is highly unlikely sociologically (National Statistics website: www.statistics.gov.uk). Some data are deliberately distorted to prevent disclosure of confidential information collected by governments in, for example, censuses (e.g., Willenborg and de Waal, 2001) and health care data. Even if the data are initially recorded accurately, data can be compromised by data integration, data warehousing and record linkage. Often a wide range of sources of different types are involved (e.g., in the pharmaceutical sector, data from clinical trials, animal trials, manufacturers, marketing, insurance claims and postmarketing surveillance might be merged). At a more mundane level, records that describe different individuals might be inappropriately merged because they are described by the same key. When going through his medical records for insurance purposes, one of the authors discovered that he was recorded as having had his tonsils removed as a child. A subsequent search revealed the fact that the records of someone else with the same name (but a different address) had been mixed in with his. More generally, what is good quality for (the limited demands made of) an operational data base may not be good quality for (potentially unlimited demands made of) a data warehouse. In a data warehouse, the definitions, sources and other information for the variables are contained in a dictionary, often referred to as metadata. In a large corporation it is often the IT (information technology) group that has responsibility for maintaining both the data warehouse and metadata. Merging sources and checking for consistent definitions form a large part of their duties. A recent example in bioinformatics shows that data problems are not limited to business and economics. In a recent issue of The Lancet, Petricoin et al. (2002) reported an ability to distinguish between serum samples from healthy women, those with ovarian cancers and women with a benign ovarian disease. It was so


exciting that it prompted the "U.S. Congress to pass a resolution urging continued funding to drive a new diagnostic test toward the clinic" (Check, 2004). The researchers trained an algorithm on 50 cancer spectra and 50 normals, and then predicted 116 new spectra. The results were impressive, with the algorithm correctly identifying all 50 of the cancers, 47 out of 50 normals, and classifying the 16 benign disease spectra as "other." Statisticians Baggerly, Morris and Coombes (2004) attempted to reproduce the Petricoin et al. results, but were unable to do so. Finally, they concluded that the three types of spectra had been preprocessed differently, so that the algorithm correctly identified differences in the data, much of which had nothing to do with the underlying biology of cancer. A more subtle source of data distortion is a change in the measurement or collection procedure. When the cause of the change is explicit and recognized, this can be adjusted for, at least to some extent. Common examples include a change in the structure of the Dow Jones Industrial Average or the recent U.K. change from the Retail Price Index to the European Union standard Harmonized Index of Consumer Prices. In other cases, one might not be aware of the change. Some of the changes can be subtle. In looking at historical records to assess long-term temperature changes, Jones and Wigley (1990) noted "changing landscapes affect temperature readings in ways that may produce spurious temperature trends." In particular, the location of the weather station assigned to a city may have changed. During the 19th century, most cities and towns were too small to impact temperature readings. As urbanization increased, urban heat islands directly affected temperature readings, creating bias in the regional trends. While global warming may be a contributor, the dominant factor is the placement of the weather station, which moved several times. As it became more and more surrounded by the city, the temperature increased, mainly because the environment itself had changed. A problem related to changes in the collection procedure is not knowing the true source of the data. In scientific analysis, data are often preprocessed by technicians and scientists before being analyzed. The statistician may be unaware of (or uninterested in) the details of the processing. To create accurate models, however, it can be important to know the source and therefore the accuracy of the measurements. Consider a study of the effect of ocean bottom topography on sea ice formation in the southern oceans (De Veaux, Gordon, Comiso and Bacherer, 1993). After learning that wind can have a strong effect on sea ice formation,

the statistician, wanting to incorporate this predictor into a model, asked one of the physicists whether any wind data existed. It was difficult to imagine very many Antarctic stations with anemometers and so he was very surprised when the physicist replied, "Sure, there's plenty of it." Excitedly he asked what spatial resolution he could provide. When the physicist countered with "what resolution do you want?" the statistician became suspicious. He probed further and asked if they really had anemometers set up on a 5 km grid on the sea ice. The physicist said, "Of course not. The wind data come from a global weather model—I can generate them at any resolution you want!" It turned out that all the other satellite data had gone through some sort of preprocessing before they were given to the statistician. Some were processed versions of actual direct measurements, some were processed through models and some, like the wind, were produced solely from models. Of course, this (as with curbstoning) does not necessarily imply that the resulting data are bad, but it should at least serve to warn the analyst that the data may not be what they were thought to be. Each of these different mechanisms for data distortions has its own set of detection and correction challenges. Ensuring good data collection through survey and/or experimental design is certainly an important first step. A bad design that results in data that are not representative of the phenomenon being studied can render even the best analysis worthless. At the next step, detecting errors can be attempted in a variety of ways, a topic to which we will return in Section 4.

3. IN HOW MANY WAYS?

Data can be bad in an infinite variety of ways, and some authors have attempted to construct taxonomies of data distortion (e.g., Kim et al., 2003). An important simple categorization is into missing data and distorted values.

3.1 Missing Data

Data can be missing at two levels: entire records might be absent, or one or more individual fields may be missing. If entire records are missing, any analysis may well be describing or making inferences about a population different from that intended. The possibility that entire records may be missing is particularly problematic, since there will often be no way of knowing this. Individual fields can be missing for a huge variety of reasons, and the mechanism by which they are missing is likely to influence their distribution over the


data, but at least when individual fields are missing one can see that this is the case. If the missingness of a particular value is unrelated either to the response or predictor variables (missing completely at random—Little and Rubin, 1987, give technical definitions), then case deletion can be employed. However, even ignoring the potential bias problems, complete case deletion can severely reduce the effective sample size. In many data mining situations with a large number of variables, even though each field has only a relatively small proportion of missing values, all of the records may have some values missing, so that the case deletion strategy leaves one with no data at all. Complications arise when the pattern of missing data does depend on the values that would have been recorded. If, for example, there are no records for patients who experience severe pain, inferences about the entire pain distribution will be impossible (at least, without making some pretty strong distributional assumptions). Likewise, poor choice of a missing value code (e.g., 0 or 99 for age) or accidental inclusion of a missing value code in the analysis (e.g., 99,999 for age) has been known to lead to mistaken conclusions. Sometimes missingness arises because of the nature of the problem, and presents real theoretical and practical issues. For example, in personal banking, banks accept those loan applicants whom they expect to repay the loans. For such people, the bank eventually discovers the true outcome (repay, do not repay), but for those rejected for a loan, the true outcome is unknown: it is a missing value. This poses difficulties when the bank wants to construct new predictive models (Hand and Henley, 1993; Hand, 2001). If a loan application asks for household income, replacing a missing value by a mean or even by a model-based imputation may lead to a highly optimistic assessment of risk. When the missingness in a predictor is related directly to the response, it may be useful for exploratory and prediction purposes to create indicator variables for each predictor, where the variable is a binary indicator of whether the variable is missing or not. For categorical predictor variables, missing values can be treated simply as a new category. In a study of dropout rates from a clinical trial for a depression drug, it was found that the single most important indicator of ultimately dropping out from the study was not the depression score on the second week's test as indicated from complete case analysis, but simply the indicator of whether the patient showed up to take it (De Veaux, Donahue and Small, 2002).
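The two devices mentioned above (a missingness indicator for a numeric predictor, and treating missingness as its own category for a categorical predictor) take only a few lines with pandas. The data frame in the following sketch is entirely made up; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical loan-application data; the column names and values are invented.
df = pd.DataFrame({
    "income": [52000, np.nan, 41000, np.nan, 77000],
    "employment": ["salaried", "self-employed", None, "salaried", None],
    "defaulted": [0, 1, 0, 1, 0],
})

# Complete-case deletion: with many columns this can discard most of the sample.
complete_cases = df.dropna()
print(f"{len(complete_cases)} of {len(df)} records survive case deletion")

# Binary indicator recording whether a numeric predictor was missing.
df["income_missing"] = df["income"].isna().astype(int)

# For a categorical predictor, treat missingness as a category of its own.
df["employment"] = df["employment"].fillna("missing")

print(df)
```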


3.2 Distorted Data

Although there are an unlimited number of possible causes of distortion, a first split can be made into those attributable to instrumentation and those attributable to human agency. Floor and ceiling effects are examples of the first kind (instruments here can be mechanical or electronic, but also questionnaires), although in this case it is sometimes possible to foresee that such things might occur and take account of this in the statistical modeling.

Human distortions can arise from misreading instruments or misrecording values at any level. Brunskill (1990) gave an illustration from public records of birth weights, where ounces are commonly confused with pounds, the number 1 is confused with 11 and errors in decimal placements produce order-of-magnitude errors. In such cases, using ancillary information such as gestation times or newborn heights can help to spot gross errors.

Some data collection procedures, in an attempt to avoid missing data, actually introduce distortions. A data set we analyzed had a striking number of doctors born on November 11, 1911. It turned out that most doctors (or their secretaries) wanted to avoid typing in age information, but because the program insisted on a value and the choice of 00/00/00 was invalid, the easiest way to bypass the system was simply to type 11/11/11.

Such errors might not seem of much consequence, but they can be crucial. Confusion between English and metric units was responsible for the loss of the $125 million Mars Climate Orbiter space probe (The New York Times, October 1, 1999). Jet Propulsion Laboratory engineers mistook acceleration readings measured in English units of pound-seconds for the metric measure of force in newton-seconds. In 1985, in a precedent-setting case, the Supreme Court ruled that Dun & Bradstreet had to pay $350,000 in libel damages to a small Vermont construction company. A part-time student worker had apparently entered the wrong data into the Dun & Bradstreet data base. As a result, Dun & Bradstreet issued a credit report that mistakenly identified the construction company as bankrupt (Percy, 1986).
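This kind of cross-checking against sentinel codes and ancillary variables can be automated in a few lines. The sketch below is purely illustrative: the column names, the 11/11/11 sentinel and the plausibility thresholds are assumptions chosen to echo the examples in the text, not rules from the cited studies.

```python
import pandas as pd

# Hypothetical records; column names, sentinel codes and thresholds are
# illustrative assumptions, not values from the studies cited above.
records = pd.DataFrame({
    "birth_weight_oz": [112, 7, 1120, 118, 99],   # 7 oz and 1120 oz look like unit slips
    "gestation_weeks": [39, 40, 38, 41, 36],
    "dob": ["1962-03-14", "1911-11-11", "1975-06-02", "1911-11-11", "1980-01-01"],
})

# Flag sentinel dates of the 11/11/11 variety.
sentinels = {"1911-11-11"}
records["suspicious_dob"] = records["dob"].isin(sentinels)

# Cross-check the measurement against ancillary information: a full-term
# baby recorded far outside a plausible weight range suggests a pounds/ounces
# or decimal-placement slip rather than a real observation.
full_term = records["gestation_weeks"] >= 37
implausible = (records["birth_weight_oz"] < 35) | (records["birth_weight_oz"] > 250)
records["suspicious_weight"] = full_term & implausible

print(records[["suspicious_dob", "suspicious_weight"]])
```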

4. HOW TO DETECT DATA ERRORS

While it may be obvious that a value is missing from a record, it is often less obvious that a value is in error. The presence of errors can (sometimes) be proven, but the absence of errors cannot. There is no guarantee that a data set that looks perfect will not contain mistakes. Some of these mistakes may be intrinsically undetectable: they might be values that are well within the range of the data and could easily have occurred.

Moreover, since errors can occur in an unlimited number of ways, there is no end to the list of possible tests for detecting errors. On the other hand, strategic choice of tests can help to pinpoint the root causes that lead to errors and, hence, to the identification of changes in the data collection process that will lead to the greatest improvement in data quality.

When the data collection can be repeated, the results of the duplicate measurements, recordings or transcriptions (e.g., the double entry system used in clinical trials) can be compared by automatic methods. In this “duplicate performance method,” a machine checks for any differences in the two data records. All discrepancies are noted, and the only remaining errors are those where both collectors made the same mistake. Strayhorn (1990) and West and Winkler (1991) provided statistical methods for estimating that proportion.

In another quality control method, known errors are added to a data set whose integrity is then assessed by an external observer. This “known errors” method uses the observer’s success in discovering the seeded errors to estimate how many genuine errors remain (Strayhorn, 1990; West and Winkler, 1991). Taking this further, one can build models (similar to those developed for software reliability) that estimate how many errors are likely to remain in a data set, based on extrapolation from the rate of discovery of errors. At some point one decides that the impact of remaining errors on the conclusions is likely to be sufficiently small that one can ignore them.

Automatic methods of data collection use metadata to check for consistency across multiple records or variables, integrity (e.g., correct data type), plausibility (values within the possible range of the data) and coherence between related variables (e.g., number of sons plus number of daughters equals number of children). Sometimes redundant data can be collected with such checks in mind. However, one cannot rely on software to protect one from mistakes. Even when such automatic methods are in place, the analyst should spend some time looking for errors in the data prior to any modeling effort.
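A rough illustration of such metadata-driven checks, assuming a toy survey extract with invented field names (the allowed age range and the sons-plus-daughters rule are stand-ins for whatever the real metadata would specify):

```python
import pandas as pd

# Invented survey extract; the fields and the allowed ranges are assumptions
# used only to illustrate integrity, plausibility and coherence checks.
survey = pd.DataFrame({
    "age": [34, 299, 41, 27],
    "n_sons": [1, 0, 2, 1],
    "n_daughters": [0, 2, 1, 1],
    "n_children": [1, 2, 4, 2],
})

checks = {
    # Integrity: the field should be stored as an integer count, not free text.
    "age_is_integer": pd.api.types.is_integer_dtype(survey["age"]),
    # Plausibility: values should fall inside a stated range.
    "age_in_range": survey["age"].between(0, 120).all(),
    # Coherence: related fields should agree with one another.
    "children_add_up": (survey["n_sons"] + survey["n_daughters"]
                        == survey["n_children"]).all(),
}

for name, passed in checks.items():
    print(f"{name}: {'OK' if passed else 'FAILED'}")

# Row-level flags make it easy to route offending records for review.
survey["coherence_flag"] = (survey["n_sons"] + survey["n_daughters"]
                            != survey["n_children"])
```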

Data profiling is the use of exploratory and data mining tools aimed at identifying errors, rather than at the substantive questions of interest. When the number of predictor variables is manageable, simple plots such as bar charts, histograms, scatterplots and time series plots can be invaluable. The human eye has evolved to detect anomalies, and the data should be presented in a form that takes advantage of this ability. Such plots have become prevalent in statistical packages for examining missing data patterns. Hand, Blunt, Kelly and Adams (2000) gave the illustration of a plot showing a point for each missing value in a rectangular array of 1012 potential sufferers from osteoporosis measured on 45 variables. It is immediately clear which cases and which variables account for most of the problems.

Unfortunately, as we face larger and larger data sets, so we are also faced with increasing difficulty in data profiling. The missing value plot described above works for a thousand cases, but would probably not be so effective for 10 million. Even in this case, however, a Pareto chart of the percent missing for each variable may be useful for deciding where to spend data preparation effort. Knowing that a variable is 96% missing makes one think pretty hard about including it in a model. On the other hand, separate manual examination of each of 30,000 gene expression variables is not to be recommended.

When even simple summaries of all the variables in a data base are not feasible, some methods for reducing the number of potential predictors in the models might be warranted. We see an important role for data mining tools here. It may be wise to reverse the usual paradigm of exploring the data first, then modeling. Instead, exploratory models of the data can be useful as a first step and can serve two purposes (De Veaux, 2002). First, models such as tree models and clustering can highlight groups of anomalous cases. Second, the models can be used to reduce the number of potential predictor variables and enable the analyst to examine the remaining predictors in more detail. The resulting process is a circular one, with more examination possible at each subsequent modeling phase.

Simply checking whether 500 numerical predictor variables are categorical or quantitative without the aid of metadata is a daunting (and tedious) task. In one analysis, we were asked to develop a fraud detection model for a large credit card bank. In the data set was one potential predictor variable that ranged from around 2000 to 9000, roughly symmetric and unimodal, which was selected as a highly significant predictor for fraud in a stepwise logistic regression model. It turned out that this predictor was a categorical variable (SIC code) used to specify the industry from which the product purchases in the transaction came. Useless as a predictor in a logistic regression model, it had escaped detection as a categorical variable among the several hundred potential candidates. Once the preliminary model whittled the candidate predictors down to a few dozen, it was easy to use standard data analysis techniques to determine which were appropriate for the final model.
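As a sketch of what a first data-profiling pass might look like (synthetic data; the column names, the 96%-missing variable and the fewer-than-20-distinct-values rule of thumb are illustrative assumptions echoing the anecdotes above, not prescriptions):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a wide data base; in practice df would be the real
# table, and the thresholds below are judgment calls rather than fixed rules.
df = pd.DataFrame({
    "balance": rng.normal(5000, 1500, 1000),
    "sic_code": rng.choice([2531, 4812, 5411, 7372], size=1000),  # numeric but really a code
    "income": np.where(rng.random(1000) < 0.96, np.nan,
                       rng.normal(60000, 15000, 1000)),           # about 96% missing
})

# Pareto-style summary: which variables account for most of the missingness?
pct_missing = df.isna().mean().sort_values(ascending=False) * 100
print(pct_missing.round(1))

# Variables stored as numbers but with very few distinct values are often
# codes (like an SIC code) rather than true quantitative measurements.
n_unique = df.nunique()
likely_codes = n_unique[(n_unique < 20) & (df.dtypes != "object")]
print(likely_codes)
```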

5. IMPROVING DATA QUALITY

The best way to improve the quality of data is to improve things in the data collection phase. The ideal would be to prevent errors from arising in the first place. Prevention and detection have a reciprocal role to play here. Once one has detected data errors, one can investigate why they occurred and prevent them from happening in the future. Once it has been recognized (detected) that the question “How many miles do you commute to work each day?” permits more than one interpretation, mistakes can be prevented by rewording. Progress toward direct keyboard or other electronic data entry systems means that error detection tools can be applied in real time at data entry—when there is still an opportunity to correct the data. At the data base phase, metadata can be used to ensure that the data conform to expected forms, and relationships between variables can be used to cross-check entries. If the data can be collected more than once, the rate of discovery of errors can be used as the basis for a statistical model to reveal how many undetected errors are likely to remain in the data base.

Various other principles also come into play when considering how to improve data quality. For example, a Pareto principle often applies: most of the errors are attributable to just a few variables. This may happen simply because some variables are intrinsically less reliable (and important) than others. Sometimes it is possible to improve the overall level of quality significantly by removing just a few of these low-quality variables. This has a complementary corollary: a law of diminishing returns applies, so that each successive attempt to improve the quality of the data is likely to yield less improvement than the one before.

If one has a particular analytic aim in mind, then one might reasonably assert that data errors that do not affect the conclusions do not matter. Moreover, for those that do matter, perhaps the ease with which they can be corrected should have some bearing on the effort that goes into detecting them—although the overriding criterion should be the loss consequent on the error being made. This is allied with the point that the base rate of errors should be taken into account: if one expects to find many errors, then it is worth attempting to find them, since the rewards, in terms of an improved data base, are likely to be large. In a well-understood environment, it might even be possible to devise useful error detection and correction resource allocation strategies.

Sometimes an entirely different approach to improving data quality can be used. This is simply to hide the poor quality by coarsening or aggregating the data. In fact, a simple example of this implicitly occurs all the time: rather than reporting uncertain and error-prone final digits of measured variables, researchers round to the nearest digit.

6. CONCLUSIONS AND FURTHER DISCUSSION

This article has been about data quality from the perspective of an analyst called upon to extract some meaning from the data. We have already remarked that there are also other aspects to data quality, and these are of equal importance when action is to be taken or decisions made on the basis of the data. These include such aspects as timeliness (the most sophisticated analysis applied to out-of-date data will be of limited value), completeness and, of central importance, fitness for purpose. Data quality, in the abstract, is all very well, but what may be perfectly fine for one use may be woefully inadequate for another. Thus ISO 8402 defines quality as “The totality of characteristics of an entity that bear on its ability to satisfy stated and implied needs.”

It is also important to maintain a sense of proportion in assessing and deciding how to cope with data distortions. In one large quality control problem in polymer viscosity, each 1% improvement was worth about $1,000,000 a year, but viscosity itself could be measured only to a standard deviation of around 8%. Before bothering about the accuracy of the predictor variables, it was first necessary to find improved ways to measure the response. In an entirely different context, much work in the personal banking sector concentrates on improved models for predicting risk—where, again, a slight improvement translates into millions of dollars of increased profit. In general, however, these models are based on retrospective data—data drawn from distributions that are unlikely still to apply. We need to be sure that the inaccuracies induced by this population drift do not swamp the apparent improvements we have made.

Data quality is a key issue throughout science, commerce, and industry, and entire disciplines have grown up to address particular aspects of the problem. In manufacturing and, to a lesser extent, the service industries, we have schools for quality control and total quality management (Six Sigma, Kaizen, etc.). In large part, these are concerned with reducing random variation.

In official statistics, strict data collection protocols are typically used. Of course, ensuring high quality data does not come without a cost. The bottom line is that one must weigh up the potential gains to be made from capturing and recording better quality data against the costs of ensuring that quality. No matter how much money one spends, and how much resource one consumes in attempting to detect and prevent bad data, the unfortunate fact is that bad data will always be with us.

REFERENCES

Baggerly, K. A., Morris, J. S. and Coombes, K. R. (2004). Reproducibility of SELDI-TOF protein patterns in serum: Comparing datasets from different experiments. Bioinformatics 20 777–785.
Brunskill, A. J. (1990). Some sources of error in the coding of birth weight. American J. Public Health 80 72–73.
Check, E. (2004). Proteomics and cancer: Running before we can walk? Nature 429 496–497.
Coale, A. J. and Stephan, F. F. (1962). The case of the Indians and the teen-age widows. J. Amer. Statist. Assoc. 57 338–347.
De Veaux, R. D. (2002). Data mining: A view from down in the pit. Stats (34) 3–9.
De Veaux, R. D., Donahue, R. and Small, R. D. (2002). Using data mining techniques to harvest information in clinical trials. Presentation at Joint Statistical Meetings, New York.
De Veaux, R. D., Gordon, A., Comiso, J. and Bacherer, N. E. (1993). Modeling of topographic effects on Antarctic sea-ice using multivariate adaptive regression splines. J. Geophysical Research—Oceans 98 20,307–20,320.
Hand, D. J. (2001). Reject inference in credit operations. In Handbook of Credit Scoring (E. Mays, ed.) 225–240. Glenlake Publishing, Chicago.
Hand, D. J. (2004a). Academic obsessions and classification realities: Ignoring practicalities in supervised classification. In Classification, Clustering and Data Mining Applications (D. Banks, L. House, F. R. McMorris, P. Arabie and W. Gaul, eds.) 209–232. Springer, Berlin.
Hand, D. J. (2004b). Measurement Theory and Practice: The World Through Quantification. Arnold, London.
Hand, D. J., Blunt, G., Kelly, M. G. and Adams, N. M. (2000). Data mining for fun and profit (with discussion). Statist. Sci. 15 111–131.
Hand, D. J. and Henley, W. E. (1993). Can reject inference ever work? IMA J. of Mathematics Applied in Business and Industry 5(4) 45–55.

Huff, D. (1954). How to Lie with Statistics. Norton, New York.
Jones, P. D. and Wigley, T. M. L. (1990). Global warming trends. Scientific American 263(2) 84–91.
Kim, W., Choi, B.-J., Hong, E.-K., Kim, S.-K. and Lee, D. (2003). A taxonomy of dirty data. Data Mining and Knowledge Discovery 7 81–99.
Klein, B. D. (1998). Data quality in the practice of consumer product management: Evidence from the field. Data Quality 4(1).
Kruskal, W. (1981). Statistics in society: Problems unsolved and unformulated. J. Amer. Statist. Assoc. 76 505–515.
Laudon, K. C. (1986). Data quality and due process in large interorganizational record systems. Communications of the ACM 29 4–11.
Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data. Wiley, New York.
Loshin, D. (2001). Enterprise Knowledge Management: The Data Quality Approach. Morgan Kaufmann, San Francisco.
Madnick, S. E. and Wang, R. Y. (1992). Introduction to the TDQM research program. Working Paper 92-01, Total Data Quality Management Research Program.
Morey, R. C. (1982). Estimating and improving the quality of information in a MIS. Communications of the ACM 25 337–342.
Percy, T. (1986). My data, right or wrong. Datamation 32(11) 123–124.
Petricoin, E. F., III, Ardekani, A. M., Hitt, B. A., Levine, P. J., Fusaro, V. A., Steinberg, S. M., Mills, G. B., Simone, C., Fishman, D. A., Kohn, E. C. and Liotta, L. A. (2002). Use of proteomic patterns in serum to identify ovarian cancer. The Lancet 359 572–577.
Pierce, E. (1997). Modeling database error rates. Data Quality 3(1). Available at www.dataquality.com/dqsep97.htm.
PricewaterhouseCoopers (2004). The Tech Spotlight 22. Available at www.pwc.com/extweb/manissue.nsf/docid/2D6E2F57E06E022F85256B8F006F389A.
Redman, T. C. (1992). Data Quality: Management and Technology. Bantam, New York.
Strayhorn, J. M. (1990). Estimating the errors remaining in a data set: Techniques for quality control. Amer. Statist. 44 14–18.
Wainer, H. (2004). Curbstoning IQ and the 2000 presidential election. Chance 17(4) 43–46.
West, M. and Winkler, R. L. (1991). Data base error trapping and prediction. J. Amer. Statist. Assoc. 86 987–996.
Willenborg, L. and de Waal, T. (2001). Elements of Statistical Disclosure Control. Springer, New York.
Wolins, L. (1962). Responsibility for raw data. American Psychologist 17 657–658.

Statistical Science 2005, Vol. 20, No. 3, 239–241 DOI 10.1214/088342305000000250 © Institute of Mathematical Statistics, 2005

How to Accuse the Other Guy of Lying with Statistics

Charles Murray

Abstract. We’ve known how to lie with statistics for 50 years now. What we really need are theory and praxis for accusing someone else of lying with statistics. The author’s experience with the response to The Bell Curve has led him to suspect that such a formulation already exists, probably imparted during a secret initiation for professors in the social sciences. This article represents his best attempt to reconstruct what must be in it.

Key words and phrases: Public policy, regression analysis, lying with statistics.

Charles Murray is the Bradley Fellow at the American Enterprise Institute for Public Policy Research, 1150 Seventeenth Street, N.W., Washington, D.C. 20036, USA (e-mail: [email protected]).

In 1994, the late Richard J. Herrnstein and I published The Bell Curve (Herrnstein and Murray, 1994) and set off an avalanche of editorials, news stories, articles and entire books in response. That avalanche included valuable technical contributions that have moved the debate forward. But much of the reaction that went under the cover of scholarly critique baffled me because it seemed transparently meretricious. These people were too smart and well trained to believe their own arguments, I said to myself, and I spent many hours imagining how they rationalized lying (in my not-disinterested view) about the book’s arguments and evidence.

But The Bell Curve wasn’t a unique case. For books on certain high-profile policy issues—Bjorn Lomborg’s The Skeptical Environmentalist (Lomborg, 1998) is another prominent example—the ordinary rules constraining scholarly debate seem to go out the window. In my more paranoid moments, I envision a secret initiation for newly-appointed assistant professors in the social sciences that goes something like this:

*

Over the last few decades, a number of books on public policy aimed at a lay readership have advanced conclusions that no socially responsible person can abide, written so cleverly that they have misled many gullible people. Unfortunately, the people who write such books often call upon data that have some validity, which confronts us with a dilemma. Such books must be discredited, but if we remain strictly within the rules of scholarly discourse, they won’t be. What to do? Recall Lawrence Kohlberg’s theory of moral development: At the sixth and highest level of morality, it is permissible to violate ordinary ethical conventions to serve a higher good (Kohlberg, 1981). Such is the situation forced upon us by these books.

Let me offer six strategies that you may adapt to the specific situation you face. As you consider these strategies, always keep in mind the cardinal principle when attacking the target book: Hardly anyone in your audience will have read it. If you can convince the great majority who never open the book, it doesn’t matter that the tiny minority who have read it will know what you are doing.

#1. THE WHOLE THING IS A MESS

This is a form of softening up, “preparing the battlefield” as the military would put it. The goal is to generate maximum smoke. The specific criticisms need not be central to the target book’s argument. They need not even be relevant. All you need to do is to create an impression of many errors, concluding with, “If a sophomore submitted this as a paper in my introductory [insert name of course], I would flunk it.”

Samples offer a rich source of smoke. Something is wrong with every sample. Start with that assumption, which has the advantage of being true, seek out that something, and then announce that the data are uninterpretable. If the sample is representative, argue that the data are outdated. If the sample is recent, argue that it is unrepresentative. If it is both recent and representative, you may be able to get some mileage out of missing data. If the author drops cases with missing data, argue that doing so biases the sample. If instead the author uses imputed values, accuse him of making up data.

Another excellent way to create smoke is to focus on the target book’s R²’s, which are almost always going to be smaller than 0.5 and often will be around 0.1 or 0.2. The general form of the accusation in this case is, “[The independent variable] that the author claims is so important explains only [x] percent of the variance in [the dependent variable]. That means that [100 − x] percent is the result of other causes. The role of [the author’s independent variable] is trivial.” Do not let slip that your own published work is based on similarly low R²’s.

A third generic way to create smoke is to accuse the author of choosing the wrong analytical model. The author chose a linear model when it should have been nonlinear. He chose a tobit model instead of a negative binomial model. He used a fixed-effects model instead of a random-effects model. Here the general form of your position is, “Even a first-year graduate student would know better than to use [the target’s choice of model] instead of [the preferred model].” Do not be deterred if the results are robust across alternative models. Remember the cardinal rule: Hardly anyone will have read the book, so hardly anyone will know.

#2. KEEP ADDING INDEPENDENT VARIABLES

Now you are ready to demonstrate that the author is not only incompetent, but wrong. If you have access to data for replicating the target book’s analysis, one statistical tool is so uniformly devastating that no critic should be without it: Keep adding independent variables. Don’t worry if the new variables are not causally antecedent to the author’s independent variables. You can achieve the same result by adding correlated independent variables that are causally posterior. The regression coefficients for the key variables in the target book’s analyses will be attenuated and sometimes become statistically insignificant. Technical note: Combine the old and new variables into a single-equation model, not into a multi-equation model. You don’t want to give your reader a chance to realize that you’re saying that the sun rises because it gets light.

So far, I have given you some tools for fighting statistics with statistics. But remember Frederick Mosteller’s dictum that while it is easy to lie with statistics, it is even easier to lie without them. Let me turn now to refutations of statistical evidence that exploit this profound truth.

#3. ANY ALTERNATIVE EXPLANATION THAT CAN BE IMAGINED IS TRUE

The first of these ways to fight evidence without evidence calls on the power of the alternative explanatory hypothesis. As every poker player knows, it is not necessary actually to have good cards if you play the hand as if you had good cards. Similarly, you can advance competing hypotheses as if they are known to be true, as in this form: “The author fails to acknowledge that [some other cause] can have [the observed effect], invalidating the author’s explanation.” Technical note: Don’t make the beginner’s mistake of using “could” instead of “can” in this formulation—a careful reader might notice the implication that the alternative has no evidence to back it up.

#4. NOTHING IS INNOCENT

If you can persuade your audience that the author of the target book is slanting the data, you cast a cloud of suspicion over everything the author says. Thus the rationale for strategy #4, again happily requiring no evidence: Treat any inconsistency or complication in the target book’s interpretation of the data as deliberately duplicitous. Some useful phrases are that the author “tries to obscure...” or “conspicuously fails to mention...” or “pretends not to be aware that....” Here, remember that the more detailed the book’s technical presentation, the more ammunition you have: any time the author introduces a caveat or an alternative interpretation in an endnote or appendix, it has been deliberately hidden.

#5. SOMEONE SOMEWHERE SOMETIME HAS SAID WHAT YOU PREFER TO BE TRUE

Sometimes the target book will use evidence based on a review of the extant technical literature. Such evidence is as easy to attack as the quantitative evidence if you remember “The Rule of One,” which is as follows: In a literature in which a large number of studies find X but even one study finds not-X, and the finding X is pernicious, you may ignore the many and focus exclusively on the one. Ideally, the target book will not have cited the anomalous study, allowing you to charge that the author deliberately ignored it (see strategy #4). But even if the target book includes the anomalous study in its literature review, you can still treat the one as definitive. Don’t mention the many.



A related principle is the “Preferential Option for the Most Favorable Finding,” applied to panel studies and/or disaggregated results for subsamples. If the author of the target book has mentioned the overall results of such a study, find the results for one of the panels or one of the subsamples that are inconsistent with the overall finding, and focus on them. As you gain experience, you will eventually be able to attack the target book using one subsample from an early panel and another subsample from a later panel without anyone noticing.

#6. THE JUDICIOUS USE OF THE BIG LIE

Finally, let us turn from strategies based on half-truths and misdirection to a more ambitious approach: to borrow from Goebbels, the Big Lie. The necessary and sufficient condition for a successful Big Lie is that the target book has at some point discussed a politically sensitive issue involving gender, race, class or the environment, and has treated this issue as a scientifically legitimate subject of investigation (note that the discussion need not be a long one, nor is it required that the target book take a strong position, nor need the topic be relevant to the book’s main argument). Once this condition is met, you can restate the book’s position on this topic in a way that most people will find repugnant (e.g., women are inferior to men, blacks are inferior to whites, we don’t need to worry about the environment), and then claim that this repugnant position is what the book is about.

What makes the Big Lie so powerful is the multiplier effect you can get from the media. A television news show or a syndicated columnist is unlikely to repeat a technical criticism of the book, but a nicely framed Big Lie can be newsworthy. And remember: It’s not just the public who won’t read the target book. Hardly anybody in the media will read it either. If you can get your accusation into one important outlet, you can start a chain reaction. Others will repeat your accusation, soon it will become the conventional wisdom, and no one will remember who started it. Done right, the Big Lie can forever after define the target book in the public mind.

*

So there you have it: six tough but effective strategies for making people think that the target book is an irredeemable mess, the findings are meaningless, the author is incompetent and devious and the book’s thesis is something it isn’t. Good luck and good hunting.

REFERENCES

Herrnstein, R. J. and Murray, C. (1994). The Bell Curve: Intelligence and Class Structure in American Life. Free Press, New York.
Kohlberg, L. (1981). The Philosophy of Moral Development: Moral Stages and the Idea of Justice. Harper and Row, San Francisco.
Lomborg, B. (1998). The Skeptical Environmentalist: Measuring the Real State of the World. Cambridge Univ. Press.
