Survey Study of The Requirement of A Teller Machine Within The University Premises For The Public Usage

Author: R.M.K.T. Rathnayaka

Supervisor: Dr.L.A.L.W. Jayasekara

Special Degree Part II (Research Project), Department of Mathematics, University of Ruhuna, Matara.

Declaration

This thesis is submitted as a partial fulfillment of the Special Degree in Mathematics, Part II (2002), by R.M.K.T. Rathnayaka under the supervision and guidance of Dr. L.A.L.W. Jayasekara, Senior Lecturer, Department of Mathematics, University of Ruhuna, Sri Lanka.

————————————– Dr. L.A.L.W. Jayasekara

————————————– R.M.K.T. Rathnayaka (Index No: 2002/S/5043)

————————————— Date


ACKNOWLEDGEMENT

My sincere thanks to my supervisor, Dr. L.A.L.W. Jayasekara, for his invaluable help and encouragement during the preparation of this thesis.

I would like to thank Mr. Rathnayaka, Head of the Department of Mathematics, University of Ruhuna, for giving me the opportunity to carry out my research work in the Department. Equally, I wish to express my gratitude to the senior lecturers of the Department of Mathematics, to Miss B.B.U.P. Perera, Assistant Lecturer in the Department of Mathematics, University of Ruhuna, and to all the members of the Department for their kind cooperation.

Finally, I would like to thank my parents, demonstrators, and friends who provided support to successfully complete this project.

Name : R.M.K.T. Rathnayaka (2002/S/5043)

Department of Mathematics, University of Ruhuna, Matara, Sri Lanka. Date: 04/04/2008


Abstract

For a long time, people have stored essential requirements such as food, clothing, and money after their daily use, with a view to using them in the future. Step by step, with the use of money, the concept of banking became popular all over the world and improved rapidly with the outcomes of the technological revolution, so that banks introduced teller machine and credit card facilities for the convenience of the public. It is clear that many people in our university also keep contact with the banking sector, and most of them prefer to keep their accounts in government banks rather than private banks. A considerable number of these people use teller machines and credit cards. However, most university people face many difficulties because these teller machines are installed in urban areas, so my main intention is to investigate this matter. About 3,500 people come into the university premises daily for their academic and official work. The goal of this study is to survey the requirement of a teller machine within the university premises for public usage. We prepared questionnaires to collect data and interviewed more than 500 people around the university area in the Wellamadama premises, comprising academic staff, non-academic staff, internal and external students, and security staff. Finally, in order to analyse these data we used the "MINITAB" and "R" statistical software packages.


Contents

Introduction

1 Simple Regression Analysis
1.1 Using Simple Regression To Describe A Linear Relationship
1.2 Least Squares Estimation
1.3 Inferences From A Simple Regression Analysis
1.4 Model And Parameter Estimation
1.5 Multiple Linear Regression Model
1.6 Least-Squares Procedures For Model Fitting
1.7 Polynomial Model Of Degree p

2 Univariate Logistic Regression Model
2.1 Why Use Logistic Regression Rather Than Ordinary Linear Regression
2.2 The Simple Logistic Model
2.3 The Importance Of The Logistic Transformation
2.4 Fitting The Logistic Regression Model
2.5 Fitting The Logistic Regression Model By Using The Maximum Likelihood Method
2.6 Testing For The Significance Of The Coefficients
2.7 Testing For The Significance Of The Coefficients For The Logistic Regression
2.8 Confidence Interval Estimation

3 Multiple Logistic Regression Model
3.1 The Multiple Logistic Regression Model
3.2 Fitting The Multiple Logistic Regression Model
3.3 Testing For The Significance Of The Model
3.4 Likelihood Ratio Test For Testing For The Significance Of The Model
3.5 Wald Test For Testing For The Significance
3.6 Confidence Interval Estimation

4 Interpretation Of The Fitted Logistic Regression Model
4.1 Dichotomous Independent Variable
4.2 Polychotomous Independent Variable
4.3 Deviation From Means Coding Method
4.4 Continuous Independent Variable
4.5 The Multivariable Model
4.6 Interaction And Confounding
4.7 Estimation Of Odds Ratios In The Presence Of Interaction

5 Model-Building Strategies And Methods For Logistic Regression
5.1 Variable Selection
5.2 Fractional Polynomial

6 Descriptive Data Analysis

7 Discussion

8 Conclusion
8.1 Results
8.2 Conclusion

9 Appendix

An Introduction to Regression Analysis

Introduction

Advances in technology, including computers, scanners, and telecommunications equipment, have buried present-day managers under a mountain of data. Although the purpose of these data is to assist managers in the decision-making process, corporate executives who face the task of juggling data on many variables may find themselves at a loss when attempting to make sense of such information. The decision-making process is further complicated by the dynamic elements in the business environment and the complex interrelationships among these elements. This text has been prepared to give managers (and future managers) tools for examining possible relationships between two or more variables.

For example, sales and advertising are two variables commonly thought to be related. When a soft drink company increases advertising expenditures by paying professional athletes millions of dollars to do its advertisements, it expects this outlay to increase sales. In general, when decisions on advertising expenditures of millions of dollars are involved, it is comforting to have some evidence that, in the past, increased advertising expenditures indeed led to increased sales. Another example is the relationship between the selling price of a house and its square footage. When a new house is listed for sale, how should the price be determined? Is a 4000-square-foot house worth twice as much as a 2000-square-foot house? What other factors might be involved in the pricing of houses, and how should these factors be included in the determination of the price? In a study of absenteeism at a large manufacturing plant, management may feel that several variables have an impact. These variables might include job complexity, base pay, the number of years the worker has been with the plant, and the age of that worker. If absenteeism can cost the company thousands of dollars, then the importance of identifying its associated factors becomes clear.
Perhaps the most important analytic tool for examining the relationships between two


or more variables is regression analysis. Regression analysis is a statistical technique for developing an equation describing the relationship between two or more variables. One variable is specified to be the dependent variable, or the variable to be explained. The other one or more variables are called the independent or explanatory variables. Using the previous examples, the soft drink firm would identify sales as the dependent variable and advertising expenditures as the explanatory variable. The real estate firm would choose selling price as the dependent variable and size as the explanatory variable to explain variation in selling price from house to house.

There are several reasons business researchers might want to know how certain variables are related. The retail firm may want to know how much advertising is necessary to achieve a certain level of sales. An equation expressing the relationship between sales and advertising is useful in answering this question. For the real estate firm, the relationship might be used in assigning prices to houses coming onto the market. To try to lower the absenteeism rate, the management of the manufacturing firm wants to know what variables are most highly related to absenteeism. Reasons for wanting to develop an equation relating two or more variables can be classified as follows:

(a) To describe the relationship.
(b) For control purposes (what value of the explanatory variable is needed to produce a certain level of the dependent variable?).
(c) For prediction.

Much statistical analysis is a multistage process of trial and error. A good deal of exploratory work must be done to select appropriate variables for study and to determine relationships between or among them. This requires that a variety of statistical tests and procedures be performed and sound judgments be made before one arrives at satisfactory choices of dependent and explanatory variables.
The emphasis in this text is on this multistage process rather than on the computations themselves or an in-depth study of the theory behind the techniques presented. In this sense, the text is directed at the applied researcher or the consumer of statistics. Except for a few preparatory examples, it is assumed that a computer is available to the reader to perform the actual computations. The use of statistical software frees the user to concentrate on the multistage "model-building" process. Most examples use illustrative computer output to present the results. The two software packages used are "MINITAB" and "Microsoft Excel 2000". MINITAB is included because it is widely used as a teaching tool in universities and is also used in industry.


Chapter 1
Simple Regression Analysis

1.1 Using Simple Regression To Describe A Linear Relationship

Regression analysis is a statistical technique used to describe relationships among variables. The simplest case to examine is one in which a variable y, referred to as the dependent variable, may be related to another variable x, called an independent or explanatory variable. If the relationship between y and x is believed to be linear, then the equation expressing this relationship is the equation for a line:

y = β0 + β1 x    (1.1)

If a graph of all the (x, y) pairs is constructed, then β0 represents the y intercept, the point where the line crosses the vertical (y) axis, and β1 represents the slope of the line. Consider the data shown in Table 1.1. A graph of the (x, y) pairs would appear as shown in Figure 1.1. Regression analysis is not needed to obtain the equation expressing the relationship between these two variables. In equation form:

y = 1 + 2x

This is an exact or deterministic linear relationship. Exact linear relationships are sometimes encountered in business environments. For example, from accounting,

Total Costs = Fixed Costs + Variable Costs

Other exact relationships may be encountered in various science courses (for example, physics or chemistry). In the social sciences (for example, psychology or sociology) and

X | 1  2  3  4  5  6
Y | 3  5  7  9  11 13

Table 1.1: Example Data set-I


Figure 1.1: Example Data set-I

X | 1  2  3  4  5  6
Y | 3  2  8  8  11 13

Table 1.2: Example Data set-II

in business and economics, exact linear relationships are the exception rather than the rule. Data encountered in a business environment are more likely to appear as in Table 1.2. These data graph as shown in Figure 1.2. It appears that x and y may be linearly related, but it is not an exact relationship. Still, it may be desirable to describe the relationship in equation form. This can be done by drawing what appears to be the "best-fitting" line through the points and estimating (guessing) what the values of β0 and β1 are for this line. This has been done in Figure 1.2. For the drawn line, a good guess might be the following equation:

ŷ = −1 + 2.5x

Figure 1.2: Example Data set-II
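Before any formal criterion is introduced, the quality of a guessed line can already be measured numerically. The following is a minimal sketch in Python (the thesis itself used MINITAB and R) that scores the eyeballed line ŷ = −1 + 2.5x on the Table 1.2 data by its sum of squared errors:

```python
# Sketch (Python stand-in for the MINITAB/R software used in this study):
# quantifying how well the guessed line y-hat = -1 + 2.5x fits Table 1.2.
x = [1, 2, 3, 4, 5, 6]
y = [3, 2, 8, 8, 11, 13]

def sse(b0, b1, xs, ys):
    """Sum of squared errors of the line b0 + b1*x over the data."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(xs, ys))

print(sse(-1.0, 2.5, x, y))   # error of the eyeballed line: 10.75
```

Any other candidate line can be scored the same way; the least-squares line derived in Section 1.2 achieves a smaller sum of squared errors, which is exactly the sense in which it is "best".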



Figure 1.3: Motivation for the Least-squares Regression line

The drawbacks to this method of fitting the line should be clear. For example, if the (x, y) pairs graphed in Figure 1.2 were given to two people, each would probably guess different values for the intercept and slope of the best-fitting line. Furthermore, there is no way to assess who would be more correct. To make line fitting more precise, a definition of what it means for a line to be the "best" is needed. The criterion for a best-fitting line that we will use might be called the "minimum sum of squared errors" criterion or, as it is more commonly known, the least-squares criterion. In Figure 1.3, the (x, y) pairs from Table 1.2 have been plotted and an arbitrary line drawn through the points. Consider the pair of values denoted (x*, y*). The actual y value is indicated as y*; the value predicted to be associated with x* if the line shown were used is indicated as ŷ*. The difference between the actual y value and the predicted y value at the point x* is called a residual and represents the "error" involved. This error is denoted y* − ŷ*. If the line is to fit the data points as accurately as possible, these errors should be minimized. This should be done not just for the single point (x*, y*), but for all the points on the graph. There are several ways to approach this task.

(1) Use the line that minimizes the sum of errors, Σ_{i=1}^{n} (yi − ŷi). The problem with this approach is that, for any line that passes through the point (x̄, ȳ),

Σ_{i=1}^{n} (yi − ŷi) = 0

so there are an infinite number of lines satisfying this criterion, some of which obviously do not fit the data well. For example, in Figure 1.4, lines A and B have both been constructed so that

Σ_{i=1}^{n} (yi − ŷi) = 0

But line A obviously fits the data better than line B; that is, it keeps the distances yi − ŷi small.


Figure 1.4: Lines A and B both satisfying the criterion Σ_{i=1}^{n} (yi − ŷi) = 0

(2) Use the line that minimizes the sum of the absolute values of the errors,

Σ_{i=1}^{n} |yi − ŷi|

This is called the minimum sum of absolute errors criterion. The resulting line is called the least absolute value (LAV) regression line. Although use of this criterion is gaining popularity in many situations, it is not the one that we use in this text. Finding the line that satisfies the minimum sum of absolute errors criterion requires solving a fairly complex problem by a technique called linear programming.

(3) Use the line that minimizes the sum of squared errors.

1.2 Least Squares Estimation

The parameters β0 and β1 are estimated by the method of least squares. The reasoning behind this method is quite simple. From the many straight lines that can be drawn through a scattergram, we wish to pick the one that "best fits" the data. The fit is "best" in the sense that the values of b0 and b1 chosen are those that minimize the sum of the squares of the residuals. In this way we are essentially picking the line that comes as close as it can to all data points simultaneously. For example, if we consider the sample of five data points shown in Figure 1.5, then the least-squares procedure selects the line which causes e1² + e2² + e3² + e4² + e5² to be as small as possible. The sum of squares of the errors about the estimated regression line is given by

SSE = Σ_{i=1}^{n} ei² = Σ_{i=1}^{n} (yi − b0 − b1 xi)²


Figure 1.5: The least-squares procedure minimizes the sum of the squares of the residuals ei

Differentiating SSE with respect to b0 and b1, we obtain

∂SSE/∂b0 = −2 Σ_{i=1}^{n} (yi − b0 − b1 xi)

∂SSE/∂b1 = −2 Σ_{i=1}^{n} (yi − b0 − b1 xi) xi

We now set these partial derivatives equal to 0 and use the rules of summation to obtain the equations

n b0 + b1 Σ_{i=1}^{n} xi = Σ_{i=1}^{n} yi

b0 Σ_{i=1}^{n} xi + b1 Σ_{i=1}^{n} xi² = Σ_{i=1}^{n} xi yi

These equations are called the normal equations. They can be solved easily to obtain the estimates β̂0 and β̂1 that minimize SSE:

β̂1 = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)²    (1.2)

β̂0 = ȳ − β̂1 x̄    (1.3)

i     | 1  2  3   4   5   6  | Sums
xi    | 1  2  3   4   5   6  | 21
yi    | 3  2  8   8   11  13 | 45
xi yi | 3  4  24  32  55  78 | 196
xi²   | 1  4  9   16  25  36 | 91

Table 1.3: Computations for finding β0 and β1

A computationally simpler form of equation 1.2 is

β̂1 = [Σ_{i=1}^{n} xi yi − (1/n)(Σ_{i=1}^{n} xi)(Σ_{i=1}^{n} yi)] / [Σ_{i=1}^{n} xi² − (1/n)(Σ_{i=1}^{n} xi)²]    (1.4)

As an example of the use of these formulas, consider again the data in Table 1.2. The intermediate computations necessary for finding β0 and β1 are shown in Table 1.3. The slope, β̂1, can be computed using the formula in equation 1.4:

β̂1 = (196 − (1/6)(21)(45)) / (91 − (1/6)(21)²) = 38.5 / 17.5 = 2.2

The intercept, β̂0, is computed as in equation 1.3:

β̂0 = 7.5 − 2.2(3.5) = −0.2

because

x̄ = 21/6 = 3.5 and ȳ = 45/6 = 7.5

The least-squares regression line for these data is

ŷ = −0.2 + 2.2x

Summary: There is no longer any guesswork associated with computing the best-fitting line once a criterion has been stated that defines "best". Using the criterion of minimum sum of squared errors, the regression line we computed provides the best description of the relationship between the variables x and y.
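The hand computations above can be checked mechanically. As a minimal sketch (Python here, standing in for the MINITAB/R software used elsewhere in this study), the deviation sums of equations 1.2 and 1.3 applied to the Table 1.2 data give the same slope and intercept:

```python
# Sketch: least-squares estimates for the Table 1.2 data via the deviation
# sums of equations 1.2 and 1.3 (Python stand-in for MINITAB/R).
x = [1, 2, 3, 4, 5, 6]
y = [3, 2, 8, 8, 11, 13]
n = len(x)
xbar = sum(x) / n                                              # 3.5
ybar = sum(y) / n                                              # 7.5
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))   # 38.5
sxx = sum((xi - xbar) ** 2 for xi in x)                        # 17.5
b1 = sxy / sxx                                                 # slope: 2.2
b0 = ybar - b1 * xbar                                          # intercept: -0.2
print(b1, b0)
```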

1.3 Inferences From A Simple Regression Analysis

Thus far, regression analysis has been viewed as a way to describe the relationship between two variables. The regression equation obtained can be viewed in this manner simply as


Figure 1.6: Examples of possible population regression lines (an inverse relationship; a direct relationship; a curvilinear relationship; no linear relationship)

a descriptive statistic. However, the power of the technique of least-squares regression is not in its use as a descriptive measure for one particular sample, but in its ability to draw inferences or generalizations about the relationship for the entire population of values of the variables x and y. To draw inferences from a simple regression, we must make some assumptions about how x and y are related in the population. These initial assumptions describe an "ideal" situation. Later, each of these assumptions is relaxed and we demonstrate modifications to the basic least-squares approach that provide a model still suitable for statistical inference. Assume that the relationship between the variables x and y is represented by a population regression line. The equation of this line is written as

μ_{y|x} = β0 + β1 x    (1.5)

where μ_{y|x} is the conditional mean of y given x, β0 is the y intercept of the population regression line, and β1 is the slope of the population regression line. Examples of possible relationships are shown in Figure 1.6. Suppose that we are developing a model to describe the temperature of the water off the continental shelf. Since the temperature depends in part on the depth of the water, two variables are involved. These are X, the water depth, and Y, the water temperature.

Figure 1.7: Depth of water vs. water temperature

We are not interested in making inferences on the depth of the water. Rather, we want to describe the behavior of the water temperature under the assumption that the depth of the water is known precisely in advance. Even if the depth of the water is fixed at some value x, the water temperature will still vary due to other random influences. For example, if several temperature measurements are taken at various places, each at a depth of x = 1000 feet, these measurements will vary in value. For this reason, we must admit that for a given x we are really dealing with a "conditional" random variable, which we denote by Y|x (Y given that X = x). This conditional random variable has a mean denoted by μ_{Y|x}. There is no reason to expect the average temperature at x = 1000 feet to be the same as that at x = 5000 feet. That is, it is reasonable to assume that μ_{Y|x} is a function of x. We call the graph of this function the curve of regression of Y on X. Since we assume that the value of X is known in advance and that the value assumed by Y depends in part on the particular value of X under consideration, Y is called the dependent or response variable. The variable X, whose value is used to help predict the behaviour of Y|x, is called the independent or predictor variable, or the regressor. The immediate problem is to estimate the form of μ_{Y|x} based on data obtained at some selected values x1, x2, x3, ..., xn of the predictor variable X. The actual values used to develop the model are not overly important. If a functional relationship exists, it should become apparent regardless of which X values are used to discover it. However, to be of practical use, these values should represent a fairly wide range of possible values of the independent variable X. Sometimes the values used can be preselected. For example, in


studying the relationship between water temperatures and depths, we might know that our model is to be used to predict water temperature for depths from 1000 to 5000 feet. We can choose to measure water temperatures at any water depth that we wish within this range. For example, we might take measurements at 1000-foot increments. In this way we preselect our X values as x1 = 1000, x2 = 2000, x3 = 3000, x4 = 4000, x5 = 5000 feet. When the X values used to develop the regression equation are preselected, the study is said to be controlled. Regardless of how the X values for the study are selected, the random sample is properly viewed as taking the form

{(x1, Y|x1), (x2, Y|x2), (x3, Y|x3), ..., (xn, Y|xn)}

The first member of each ordered pair denotes a value of the independent variable X; it is a real number. The second member of each pair is a random variable. Now we estimate the curve of regression of Y on X when the regression is considered to be linear. In this case the equation for μ_{Y|x} is given by

μ_{y|x} = β0 + β1 x    (1.6)

where β0 and β1 denote real numbers.
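The notion of a conditional random variable Y|x can be made concrete by simulation. The sketch below is purely illustrative: the coefficients and noise level are invented, not estimated from any data in this study. At each preselected depth, repeated draws of Y|x scatter about the conditional mean μ_{Y|x} = β0 + β1 x:

```python
import random

# Illustration only (beta0, beta1, and the noise level are invented, not from
# this study): simulate the conditional random variable Y|x of equation 1.6
# at the preselected depths x = 1000, ..., 5000 feet.
random.seed(1)
beta0, beta1 = 70.0, -0.005        # hypothetical population intercept/slope
depths = [1000, 2000, 3000, 4000, 5000]

def draw_y(x, n=2000):
    """Repeated temperature measurements at fixed depth x: mean mu_Y|x plus noise E."""
    return [beta0 + beta1 * x + random.gauss(0, 2.0) for _ in range(n)]

for x in depths:
    sample = draw_y(x)
    mean = sum(sample) / len(sample)
    print(x, round(mean, 2))       # close to mu_Y|x = beta0 + beta1 * x
```

As the number of draws at each depth grows, the sample mean at depth x settles near β0 + β1 x, which is exactly the curve of regression of Y on X.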

1.4 Model And Parameter Estimation

Description Of Model

Recall from elementary algebra that the equation for a straight line is y = b + mx, where b denotes the y intercept and m denotes the slope of the line. The simple linear regression model is

μ_{y|x} = β0 + β1 x

where β0 denotes the intercept and β1 the slope of the regression line. We must find a logical way to estimate the theoretical parameters β0 and β1. In conducting a regression study, we shall be observing the variable X at n points x1, x2, x3, ..., xn. These points are assumed to be measured without error. When they are preselected by the experimenter, we say that the study is a controlled study; when they are observed at random, the study is called an observational study. Both situations are handled in the same way mathematically. In either case we shall be concerned with the n random variables Y|x1, Y|x2, Y|x3, ..., Y|xn. Recall that a random variable varies about its mean value. Let Ei denote the random difference between Y|xi and its mean, μ_{Y|xi}. That is, let

Ei = Y|xi − μ_{y|xi}    (1.7)


Solving this equation for Y|xi, we conclude that Y|xi = μ_{y|xi} + Ei. In this expression it is assumed that the random difference Ei has mean 0. Since we are assuming that the regression is linear, we can conclude that μ_{y|xi} = β0 + β1 xi. Substituting, we see that Y|xi = β0 + β1 xi + Ei. It is customary to drop the conditional notation and to denote Y|xi by Yi. Thus an alternative way to express the simple linear regression model is

Yi = β0 + β1 xi + Ei    (1.8)

where Ei is assumed to be a random variable with mean 0. Our data consist of a collection of n pairs (xi, yi), where xi is an observed value of the variable X and yi is the corresponding observation of the random variable Y. The observed value of a random variable usually differs from its mean value by some random amount. This idea is expressed mathematically by writing

yi = β0 + β1 xi + εi    (1.9)

In this equation εi is a realization of the random variable Ei that appears in the alternative model 1.8 for simple linear regression. In a regression study it is useful to plot the data points in the xy plane. Such a plot is called a "scattergram". We do not expect these points to lie exactly on a straight line. However, if linear regression is applicable, then they should exhibit a linear trend. Once β0 and β1 have been approximated from the available data, we can replace these theoretical parameters by their estimated values in the regression model. Letting b0 and b1 denote the estimates of β0 and β1, respectively, the estimated line of regression takes the form

μ̂_{Y|x} = b0 + b1 x    (1.10)

Just as the data points do not all lie on the theoretical line of regression, they also do not all lie on this estimated regression line. If we let ei denote the vertical distance from a point (xi, yi) to the estimated regression line, then each data point satisfies the equation

yi = b0 + b1 xi + ei

The term ei is called the residual. Figure 1.5 illustrates this idea and points out the difference between εi and ei graphically.

1.5 Multiple Linear Regression Model

In the previous sections we studied the simple linear regression model. This model expresses the idea that the mean of a response variable Y depends on the value assumed by a single predictor variable X. In this section we extend the concepts studied earlier to cases in which the model becomes more complex. In particular, we distinguish between two basic models: the polynomial model, in which the single predictor variable can appear to a power greater than 1, and the multiple linear regression model, in which more than one distinct predictor variable can be used.

1.6 Least-Squares Procedures For Model Fitting

In this section we develop the least-squares estimators for the parameters in both the polynomial and multiple regression models. Before introducing these models specifically, let us note that each of them is a special case of what is called the general linear model. These are models in which the mean value of a response variable Y is assumed to depend on the values assumed by one or more predictor variables. As in the case of simple linear regression, the predictor variables X1, X2, ..., Xk are not treated as random variables. However, for a given set of numerical values x1, x2, ..., xk of these variables, the response variable, denoted by Y|x1, x2, ..., xk, is assumed to be a random variable. The general linear model expresses the mean value of this conditional random variable as a function of x1, x2, ..., xk. The model takes the following form:

General linear model:

μ_{Y|x1,x2,...,xk} = β0 + β1 x1 + β2 x2 + ... + βk xk    (1.11)

In this model x1, x2, ..., xk denote known real numbers; β0, β1, β2, ..., βk denote unknown parameters. Our main task is to estimate the values of these parameters from a data set.

Example 1.6.1. Suppose that we want to develop an equation with which we can predict the gasoline mileage of an automobile based on its weight and the temperature at the time of operation. We might pose the model

μ_{Y|x1,x2} = β0 + β1 x1 + β2 x2

Here the response variable is Y, the mileage obtained. There are two independent or predictor variables. These are X1, the weight of the car, and X2, the temperature. The values assumed by these variables are denoted by x1 and x2, respectively. For example, we might want to predict the gas mileage for a car that weighs 1.6 tons when it is being driven in 85°F weather; here x1 = 1.6 and x2 = 85. The unknown parameters in the model are β0, β1, and β2. Their values are to be estimated from the data gathered. It is possible to treat the polynomial and multiple regression models simultaneously from a mathematical standpoint. However, they differ enough in a practical sense to justify considering them separately. We begin with a description of the general polynomial model.


Figure 1.8: Quadratic model: μ = β0 + β1 x + β2 x²

1.7 Polynomial Model Of Degree p

The general polynomial regression model of degree p expresses the mean of the responce variable Y as polynomial function of one predictor value X. It is takes the form, μY |x = β0 + β1 x1 + β2 x2 + ...... + βp xp

(1.12)

Where the p is a positive integer .If we let x1 = x, x2 = x2 , x3 = x3 , ........., xp = xp , then the model can be rewriten in the general linear form as 1.12. Scatergrams are useful in determining when a polynomial model might be appropriate. The pattern show in Figer 1.10 suggests the quadratic model, μY |x = β0 + β1 x1 + β2 x2

(1.13)

that of Figer 1.10 poins to be the cubic model, μY |x = β0 + β1 x + β2 x2 + β3 x3 .

(1.14)

Once we decide that a polynomial is appropriate, we are faced with the problem of estimating the parameters β0, β1, β2, ..., βp. To apply the method of least squares, we first express the polynomial in the form

Y|x = β0 + β1 x + β2 x^2 + ... + βp x^p + E   (1.15)

where Y|x denotes the response variable when the predictor variable assumes the value x, and E denotes the random difference between Y|x and its mean value μ_{Y|x} = β0 + β1 x + β2 x^2 + ... + βp x^p. A random sample of size n takes the form {(x1, Y|x1), (x2, Y|x2), ..., (xn, Y|xn)}, where the first member of each ordered pair denotes a real number and the second a random variable. As in the case of simple linear regression, it is customary to drop the conditional notation. The sample itself becomes (x1, y1), (x2, y2), ..., (xn, yn), where for each i = 1, 2, ..., n,

Yi = β0 + β1 xi + β2 xi^2 + ... + βp xi^p + Ei   (1.16)


Figure 1.9: Cubic model: μ = β0 + β1 x + β2 x^2 + β3 x^3

Once again, we must assume that the random errors E1, E2, ..., En are independent random variables, each with mean 0 and variance σ^2. The estimated mean response, that is, the estimated value of Y for a given x, is given by

ŷ = μ̂_{Y|x} = b0 + b1 x + b2 x^2 + ... + bp x^p   (1.17)

where b0, b1, ..., bp are the least-squares estimates of β0, β1, ..., βp. To find these estimates, we minimize the sum of squares of the residuals.
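Minimizing the residual sum of squares leads to the normal equations (XᵀX)b = Xᵀy, where X is the design matrix with rows (1, x, x^2, ..., x^p). A minimal pure-Python sketch (the thesis's computations were done in Minitab and R; the function name and test data here are illustrative only):

```python
# Least-squares fit of a polynomial of degree p by solving the normal
# equations (X^T X) b = X^T y with Gaussian elimination.
def fit_polynomial(xs, ys, p):
    """Return least-squares coefficients [b0, b1, ..., bp]."""
    n, m = len(xs), p + 1
    # Design matrix X with rows (1, x, x^2, ..., x^p).
    X = [[x ** j for j in range(m)] for x in xs]
    # Normal equations: A b = c, with A = X^T X and c = X^T y.
    A = [[sum(X[i][r] * X[i][s] for i in range(n)) for s in range(m)] for r in range(m)]
    c = [sum(X[i][r] * ys[i] for i in range(n)) for r in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for s in range(col, m):
                A[r][s] -= f * A[col][s]
            c[r] -= f * c[col]
    # Back-substitution.
    b = [0.0] * m
    for r in range(m - 1, -1, -1):
        b[r] = (c[r] - sum(A[r][s] * b[s] for s in range(r + 1, m))) / A[r][r]
    return b

# Noise-free data generated from y = 1 + 2x + 3x^2; the quadratic fit
# should recover the coefficients (1, 2, 3).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]
b = fit_polynomial(xs, ys, 2)
```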


Chapter 2 Univariate Logistic Regression Model

Regression methods have become an integral component of any data analysis concerned with describing the relationship between a response variable and one or more explanatory variables. It is often the case that the outcome variable is discrete, taking on two or more possible values. Over the last decade the logistic regression model has become, in many fields, the standard method of analysis in this situation. Logistic regression is the most popular regression technique available for modeling dichotomous dependent variables. In this chapter I describe the univariate logistic regression model and several of its key features, particularly how an odds ratio can be estimated from it. We also demonstrate how logistic regression may be applied, using a real-life data set. Before beginning a study of logistic regression it is important to understand that the goal of an analysis using this method is the same as that of any model-building technique used in statistics: to find the best fitting and most parsimonious, yet biologically reasonable, model to describe the relationship between an outcome (dependent or response) variable and a set of independent (predictor or explanatory) variables. These variables are often called covariates. The most common example of modeling, and one assumed to be familiar to the readers of this text, is the usual linear regression model, where the outcome variable is assumed to be continuous.

2.1

Why Use Logistic Regression Rather Than Ordinary Linear Regression.

Early on, statisticians used ordinary linear regression for their data analyses; they did not use logistic regression with a binary outcome. As statistics developed, however, most statisticians and psychologists came to use logistic regression, for the following reasons.


(i) The outcome variable in logistic regression is binary or dichotomous.

(ii) If you use linear regression, the predicted values can become greater than one or less than zero; such values are theoretically inadmissible.

(iii) One of the assumptions of linear regression is that the variance of Y is constant. This cannot be the case with a binary variable, because the variance is PQ.

Example 2.1: When 50 percent of the people have the characteristic, the variance is 0.25, its maximum value. As we move to more extreme values the variance decreases: when p = 0.1, the variance (p × q) is 0.1 × 0.9 = 0.09, so as p approaches 1 or 0, the variance approaches zero.

This difference between logistic and linear regression is reflected both in the choice of parametric model and in the assumptions. Once this difference is accounted for, the methods employed in an analysis using logistic regression follow the same general principles used in linear regression. Thus, the techniques used in linear regression analysis will motivate our approach to logistic regression. We illustrate both the similarities and differences between logistic and linear regression with an example.

Example 2.2:

ID AGE AGRP CHD    ID AGE AGRP CHD
 1  20   1   0     21  34   2   0
 2  23   1   0     22  34   2   0
 3  24   1   0     23  34   2   1
 4  25   1   0     24  34   2   0
 5  25   1   1     25  34   2   0
 6  26   1   0     26  35   3   0
 7  26   1   0     27  35   3   0
 8  28   1   0     28  35   3   0
 9  28   1   0     29  36   3   1
10  29   1   0     30  36   3   0
11  30   2   0     31  37   3   0
12  30   2   0     32  37   3   1
13  30   2   0     33  37   3   0
14  30   2   0     34  38   3   0
15  30   2   0     35  38   3   0
16  30   2   1     36  39   3   0
17  32   2   0     37  39   3   1
18  30   2   0     38  40   4   0
19  33   2   0     39  40   4   1
20  33   2   0     40  41   4   0

ID AGE AGRP CHD    ID AGE AGRP CHD
41  41   4   0     71  53   6   1
42  42   4   0     72  54   6   1
43  42   4   0     73  55   6   1
44  42   4   0     74  55   7   1
45  42   4   1     75  55   7   0
46  43   4   0     76  56   7   1
47  43   4   0     77  56   7   1
48  43   4   1     78  56   7   1
49  44   4   0     80  57   7   1
50  44   4   0     81  57   7   0
52  44   4   1     82  57   7   0
53  45   5   1     83  57   7   1
54  45   5   0     84  57   7   1
55  46   5   1     85  57   7   1
56  46   5   0     86  58   7   1
57  47   5   1     87  58   7   0
58  47   5   0     88  58   7   1
59  47   5   0     89  59   7   1
60  48   5   1     90  59   7   1
61  48   5   1     91  60   8   0
62  48   5   1     92  60   8   1
63  49   5   1     93  61   8   1
64  49   5   0     94  62   8   0
65  49   5   0     95  62   8   1
66  50   6   1     96  63   8   1
67  51   6   0     97  64   8   1
68  52   6   1     98  64   8   1
69  52   6   0     99  65   8   1
70  53   6   0    100  69   8   1

Table 2.1: Age (AGE), age group (AGRP), and coronary heart disease status (CHD) of 100 subjects

Table 2.1 lists age in years (AGE) and the presence or absence of evidence of significant coronary heart disease (CHD) for 100 subjects selected to participate in a study. The table also contains an identifier variable (ID) and an age group variable (AGRP). The outcome variable is CHD, which is coded with a value of 0 to indicate that CHD is absent, or 1 to indicate that it is present in the individual. It is of interest to explore the relationship between age and the presence or absence of CHD in this study population. Had the outcome variable CHD been continuous rather than binary, we would probably begin by forming a scatterplot of the outcome versus the independent variable. Such a scatterplot of the data in Table 2.1 is given in Figure 2.1.


Figure 2.1: Scatterplot of CHD by AGE for 100 subjects.

In this scatterplot all points fall on one of two parallel lines, representing the absence of CHD (y = 0) and the presence of CHD (y = 1). There is some tendency for the individuals with no evidence of CHD to be younger than those with evidence of CHD. While this plot does depict the dichotomous nature of the outcome variable quite clearly, it does not provide a clear picture of the nature of the relationship between CHD and age. Another problem with Figure 2.1 is that the variability in CHD at all ages is large. This makes it difficult to describe the functional relationship between age and CHD. One common method of removing some variation, while still maintaining the structure of the relationship between the outcome and the independent variable, is to create intervals for the independent variable and compute the mean of the outcome variable within each group. In Table 2.2 this strategy is carried out by using the age group variable, AGRP, which categorizes the age data of Table 2.1. Table 2.2 contains, for each age group, the frequency of occurrence of each outcome as well as the mean (or proportion with CHD present) for each group. Now we can plot the proportion of individuals with CHD versus the midpoint of each age interval. By examining this plot, a clearer picture of the relationship begins to emerge: it appears that as age increases, the proportion of individuals with evidence of CHD increases. While this provides considerable insight into the relationship between CHD and age in this study, a functional form for this relationship needs to be described. The plot in Figure 2.2 is similar to what one might obtain if this same process of grouping and averaging were performed in a linear regression. We will note two important differences.


Age Group    n    Absent    Present    Mean (Proportion)
20-29       10        9          1         0.10
30-34       15       13          2         0.13
35-39       12        9          3         0.25
40-44       15       10          5         0.33
45-49       13        7          6         0.46
50-54        8        3          5         0.63
55-59       17        4         13         0.76
60-69       10        2          8         0.80
Total      100       57         43         0.43

Table 2.2: Frequency table of AGE group by CHD

Figure 2.2: Plot of the percentage of subjects with CHD in each age group.


The first difference concerns the nature of the relationship between the outcome and independent variables. In any regression problem the key quantity is the mean value of the outcome variable, given the value of the independent variable. This quantity is called the conditional mean and is expressed as E(Y|x), where Y denotes the outcome variable and x denotes a value of the independent variable. The quantity E(Y|x) is read "the expected value of Y, given the value x". In linear regression we assume that this mean may be expressed as an equation linear in x,

E(Y|x) = β0 + β1 x   (2.1)

This expression implies that it is possible for E(Y|x) to take on any value as x ranges between −∞ and ∞. The column labeled "Mean" in Table 2.2 provides an estimate of E(Y|x). We will assume, for purposes of exposition, that the estimated values plotted in Figure 2.2 are close enough to the true values of E(Y|x) to provide a reasonable assessment of the relationship between CHD and AGE. With dichotomous data, the conditional mean must be bounded between zero and one [0 ≤ E(Y|x) ≤ 1]. The change in E(Y|x) per unit change in x becomes progressively smaller as the conditional mean gets closer to 0 or 1. The curve is said to be S-shaped. It resembles a plot of the cumulative distribution of a random variable, so it should not seem surprising that some well-known cumulative distributions have been used to provide a model for E(Y|x) in the case when Y is dichotomous. The model we will use is that of the logistic distribution. Many distribution functions have been proposed for use in the analysis of a dichotomous outcome variable; Cox and Snell (1989) discuss some of these. There are two primary reasons for choosing the logistic distribution: first, from a mathematical point of view it is an extremely flexible and easily used function, and second, it lends itself to a clinically meaningful interpretation.
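The boundedness and S shape of the logistic curve can be checked numerically. A minimal Python sketch (the thesis's analyses used Minitab and R; the coefficients below are arbitrary illustration values, not estimates from the CHD data):

```python
import math

def logistic(x, b0, b1):
    """Conditional mean E(Y|x) under the logistic model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Arbitrary illustrative coefficients (not fitted values).
b0, b1 = -5.0, 0.1
values = [logistic(x, b0, b1) for x in range(0, 101, 10)]

# The curve stays strictly between 0 and 1 and, for b1 > 0, increases
# monotonically in x -- this is what produces the S shape.
assert all(0.0 < v < 1.0 for v in values)
assert all(a < b for a, b in zip(values, values[1:]))
```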

2.2

The Simple Logistic Model

Logistic regression is a mathematical modeling approach that can be used to describe the relationship of several variables X to a dichotomous dependent variable Y, where Y is typically coded as 1 or 0 for its two possible categories. The logistic model describes the expected value of Y (i.e., E(Y)) in terms of the following "logistic formula":

E(Y) = 1 / (1 + e^(−(β0 + β1 x)))   (2.2)

or, equivalently,

E(Y) = e^(β0 + β1 x) / (1 + e^(β0 + β1 x))   (2.3)

For a (0, 1) random variable such as Y, it follows from basic statistical principles about expected values that E(Y) is equivalent to the probability Pr(Y = 1):

E(Y|x) = [0 × Pr(Y = 0)] + [1 × Pr(Y = 1)] = Pr(Y = 1)   (2.4)

so the model can be written in a form that describes the probability of occurrence of one of the two possible outcomes of Y, as follows:

P(Y = 1) = 1 / (1 + e^(−(β0 + β1 x)))   (2.5)

In order to simplify notation, we use the quantity

π(x) = E(Y|x)   (2.6)

to represent the conditional mean of Y given x when the logistic distribution is used. The specific form of the logistic regression model we use is

π(x) = e^(β0 + β1 x) / (1 + e^(β0 + β1 x))   (2.7)

A transformation of π(x) that is central to my study of logistic regression is the logit transformation. This transformation is defined, in terms of π(x), as

g(x) = ln[ π(x) / (1 − π(x)) ]   (2.8)

Now applying equation 2.7, we have

g(x) = ln[ (e^(β0 + β1 x) / (1 + e^(β0 + β1 x))) / (1 / (1 + e^(β0 + β1 x))) ] = ln[e^(β0 + β1 x)] = β0 + β1 x   (2.9)

2.3

The Importance Of The Logistic Transformation

(1) The logit transformation g(x) has many of the desirable properties of a linear regression model.

(2) The logit, g(x), is linear in its parameters, may be continuous, and may range from −∞ to +∞, depending on the range of x.


(3) Logistic regression models concern the conditional distribution of the outcome variable. In the linear regression model we assume that an observed outcome variable may be expressed as Y = E(Y|x) + ε. The quantity ε is called the error and expresses an observation's deviation from the conditional mean. The most common assumption is that ε follows a normal distribution with mean zero and some variance that is constant across levels of the independent variable. It follows that the conditional distribution of the outcome variable given x will be normal with mean E(Y|x) and a variance that is constant. This is not the case with a dichotomous outcome variable. In this situation we may express the value of the outcome variable given x as y = π(x) + ε, where the quantity ε may assume one of two possible values: if y = 1 then ε = 1 − π(x) with probability π(x), and if y = 0 then ε = −π(x) with probability 1 − π(x). Thus ε has a distribution with mean zero and variance equal to π(x)[1 − π(x)]. That is, the conditional distribution of the outcome variable follows a binomial distribution with probability given by the conditional mean, π(x).

Summary: We have seen that in a regression analysis when the outcome variable is dichotomous:

(1) The conditional mean of the regression equation must be formulated to be bounded between zero and one. We have stated that the logistic regression model, π(x) given in equation 2.7, satisfies this constraint.

(2) The binomial, not the normal, distribution describes the distribution of the errors and will be the statistical distribution upon which the analysis is based.

(3) The principles that guide an analysis using linear regression will also guide us in logistic regression.
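The mean and variance of the error term ε for a dichotomous outcome follow from a two-line calculation, which a small sketch can verify (the value of π here is an arbitrary illustration, not tied to the CHD data):

```python
# For a dichotomous outcome, eps = 1 - pi with probability pi, and
# eps = -pi with probability 1 - pi. Its mean and variance follow directly.
pi = 0.3  # arbitrary illustrative value of pi(x)

mean_eps = pi * (1 - pi) + (1 - pi) * (-pi)
var_eps = pi * (1 - pi) ** 2 + (1 - pi) * (-pi) ** 2

assert abs(mean_eps) < 1e-12                     # E(eps) = 0
assert abs(var_eps - pi * (1 - pi)) < 1e-12      # Var(eps) = pi(1 - pi)
```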

2.4

Fitting The Logistic Regression Model

Suppose we have a sample of n independent observations of the pair (xi, yi), i = 1, 2, ..., n, where yi denotes the value of a dichotomous outcome variable and xi is the value of the independent variable for the ith subject. Furthermore, assume that the outcome variable has been coded as 0 or 1, representing the absence or the presence of the characteristic, respectively. This coding for a dichotomous outcome is used throughout our text. To fit the logistic regression model in equation 2.7 to a set of data requires that we estimate the values of β0 and β1, the unknown parameters. In linear regression, the method used most often for estimating unknown parameters is least squares. In that method we choose those values of β0 and β1 which minimize the


sum of squared deviations of the observed values of Y from the predicted values based upon the model, under the usual assumptions for linear regression. The method of least squares yields estimators with a number of desirable statistical properties. Unfortunately, when the method of least squares is applied to a model with a dichotomous outcome, the estimators no longer have these same properties.

2.5

Fitting The Logistic Regression Model By Using Maximum Likelihood Method

The general method of estimation that yields the least squares function under the linear regression model (when the error terms are normally distributed) is called maximum likelihood. This method will provide the foundation for our approach to estimation with the logistic regression model. In a very general sense, the method of maximum likelihood yields values for the unknown parameters which maximize the probability of obtaining the observed set of data. In order to apply this method we must first construct a function called the likelihood function. This function expresses the probability of the observed data as a function of the unknown parameters. The maximum likelihood estimators of these parameters are chosen to be those values that maximize this function. We now describe how to find these values for the logistic regression model. If Y is coded as 0 or 1, then the expression for π(x) given in equation 2.7 provides (for an arbitrary value of β = (β0, β1)) the conditional probability that Y is equal to 1 given x. This will be denoted as P(Y = 1|x). It follows that the quantity 1 − π(x) gives the conditional probability that Y is equal to zero given x, P(Y = 0|x). Thus, for those pairs (xi, yi) where yi = 1 the contribution to the likelihood function is π(xi), and for those pairs where yi = 0 the contribution to the likelihood function is 1 − π(xi), where the quantity π(xi) denotes the value of π(x) computed at xi.
A convenient way to express the contribution to the likelihood function for the pair (xi, yi) is through the expression

π(xi)^yi [1 − π(xi)]^(1 − yi)   (2.10)

Since the observations are assumed to be independent, the likelihood function is obtained as the product of the terms given in expression 2.10, as follows:

l(β) = Π_{i=1}^n π(xi)^yi [1 − π(xi)]^(1 − yi)   (2.11)

The principle of maximum likelihood states that we use as our estimate of β the value which maximizes the expression in 2.11. However, it is easier mathematically to work with the log of equation 2.11. This expression, the log likelihood, is defined as

L(β) = ln[l(β)] = Σ_{i=1}^n { yi ln[π(xi)] + (1 − yi) ln[1 − π(xi)] }   (2.12)

Equivalently,

L(β) = ln[l(β)] = Σ_{i=1}^n { yi ln[ π(xi) / (1 − π(xi)) ] + ln[1 − π(xi)] }   (2.13)

By equation 2.7, π(x) = e^(β0 + β1 x) / (1 + e^(β0 + β1 x)), so we obtain

1 − π(xi) = 1 / (1 + e^(β0 + β1 xi))   (2.14)

Now, dividing equation 2.7 by 2.14,

π(xi) / [1 − π(xi)] = e^(β0 + β1 xi)   (2.15)

and taking logs of both sides,

ln[ π(xi) / (1 − π(xi)) ] = β0 + β1 xi   (2.16)

Substituting equations 2.14 and 2.16 into equation 2.13 gives

L(β) = ln[l(β)] = Σ_{i=1}^n { yi (β0 + β1 xi) + ln[ 1 / (1 + e^(β0 + β1 xi)) ] }   (2.17)
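The log likelihood in this form is straightforward to implement directly. A small Python sketch (the thesis's computations were done in Minitab and R; the tiny data set here is hypothetical, used only to show that the function behaves sensibly):

```python
import math

def log_likelihood(b0, b1, xs, ys):
    """Log likelihood L(beta) for the univariate logistic model,
    in the form yi*(b0 + b1*xi) - ln(1 + exp(b0 + b1*xi))."""
    return sum(y * (b0 + b1 * x) - math.log(1.0 + math.exp(b0 + b1 * x))
               for x, y in zip(xs, ys))

# Tiny hypothetical data set (not the CHD data).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0, 0, 1, 1]

null_ll = log_likelihood(0.0, 0.0, xs, ys)    # beta0 = beta1 = 0
fitted_ll = log_likelihood(-2.5, 1.0, xs, ys)

# A slope that separates the two groups gives a larger likelihood
# than the zero model, whose log likelihood is -n*ln(2).
assert fitted_ll > null_ll
```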

Differentiating with respect to β0,

∂L(β)/∂β0 = Σ_{i=1}^n yi − Σ_{i=1}^n e^(β0 + β1 xi) / (1 + e^(β0 + β1 xi))

∂L(β)/∂β0 = Σ_{i=1}^n [yi − π(xi)]   (2.18)

Now differentiating with respect to β1,

∂L(β)/∂β1 = Σ_{i=1}^n yi xi − Σ_{i=1}^n xi e^(β0 + β1 xi) / (1 + e^(β0 + β1 xi)) = Σ_{i=1}^n xi [yi − π(xi)]

To find the values of β that maximize L(β), we set these derivatives equal to zero. The resulting equations, known as the likelihood equations, are:

Σ_{i=1}^n [yi − π(xi)] = 0   (2.19)

Σ_{i=1}^n xi [yi − π(xi)] = 0   (2.20)

The value of β given by the solution to equations 2.19 and 2.20 is called the maximum likelihood estimate and is denoted β̂. In general, the use of the symbol "^" denotes the maximum likelihood estimate of the respective quantity. For example, π̂(xi) is the maximum likelihood estimate of π(xi). This quantity provides an estimate of the conditional probability that Y is equal to 1, given that x is equal to xi. As such, it represents the fitted or predicted value for the logistic regression model. An interesting consequence of equation 2.19 is that

Σ_{i=1}^n yi = Σ_{i=1}^n π̂(xi)   (2.21)

That is, the sum of the observed values of y is equal to the sum of the predicted (expected) values. As an example, consider the data given in Table 2.1. Use of the statistical software package Minitab, with AGE as the independent variable, produced the output in Table 2.3. The maximum likelihood estimates of β0 and β1 are thus seen to be β̂0 = −5.309 and β̂1 = 0.111. The fitted values are given by the equation

π̂(x) = e^(−5.309 + 0.111 × AGE) / (1 + e^(−5.309 + 0.111 × AGE))   (2.22)

and the estimated logit, ĝ(x), is given by the equation

ĝ(x) = −5.309 + 0.111 × AGE   (2.23)

The log likelihood given in Table 2.3 is the value of equation 2.12 computed using β̂0 and β̂1.
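With the estimates β̂0 = −5.309 and β̂1 = 0.111 reported above, the fitted logit and probability can be evaluated directly; this sketch only checks the arithmetic of equations 2.22 and 2.23, it does not re-fit the model:

```python
import math

b0_hat, b1_hat = -5.309, 0.111  # maximum likelihood estimates from the text

def g_hat(age):
    """Estimated logit, equation 2.23."""
    return b0_hat + b1_hat * age

def pi_hat(age):
    """Fitted probability, equation 2.22."""
    return math.exp(g_hat(age)) / (1.0 + math.exp(g_hat(age)))

# For a 50 year old subject the fitted logit is about 0.241 and the
# fitted probability of CHD is about 0.56.
logit_50 = g_hat(50)
prob_50 = pi_hat(50)
```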


Variable    Coeff     Std.Err    Z       P > |Z|
AGE          0.111    0.0241     4.61    <0.001
Constant    -5.309    1.1337    -4.68    <0.001
Log likelihood = -53.677

Table 2.3: Results of fitting the logistic regression model to the data in Table 2.1

2.6

Testing For The Significance Of The Coefficients

The general method for assessing the significance of variables is easily illustrated in the linear regression model, and its use there will motivate the approach used for logistic regression. A comparison of the two approaches will highlight the differences between modeling continuous and dichotomous response variables. In linear regression, the assessment of the significance of the slope coefficient is approached by forming what is referred to as an analysis of variance table. This table partitions the total sum of squared deviations about the mean into two parts:

(1) The sum of squared deviations of observations about the regression line, SSE (or residual sum-of-squares).

(2) The sum of squares of predicted values, based on the regression model, about the mean of the dependent variable, SSR (or due-regression sum-of-squares).

In linear regression, the comparison of observed and predicted values is based on the square of the distance between the two. If yi denotes the observed value and ŷi the predicted value for the ith individual under the model, then the statistic used to evaluate this comparison is

SSE = Σ_{i=1}^n (yi − ŷi)^2   (2.24)

Under the model not containing the independent variable in question, the only parameter is β0, and β̂0 = ȳ, the mean of the response variable. In this case ŷi = ȳ and SSE is equal to the total sum of squares. When we include the independent variable in the model, any decrease in SSE will be due to the fact that the slope coefficient for the independent variable is not zero. The change in the value of SSE is due to the regression source of variability, denoted SSR. That is,

SSR = Σ_{i=1}^n (yi − ȳ)^2 − Σ_{i=1}^n (yi − ŷi)^2   (2.25)

2.7

Testing For The Significance Of The Coefficients For The Logistic Regression

The guiding principle with logistic regression is the same: in logistic regression, comparison of observed to predicted values is based on the log likelihood function defined in equation 2.12,

L(β) = ln[l(β)] = Σ_{i=1}^n { yi ln[π(xi)] + (1 − yi) ln[1 − π(xi)] }   (2.26)

To better understand this comparison, it is helpful conceptually to think of an observed value of the response variable as also being a predicted value resulting from a saturated model. A saturated model is one that contains as many parameters as there are data points. (A simple example of a saturated model is fitting a linear regression model when there are only two data points, n = 2.) The comparison of observed to predicted values using the likelihood function is based on the following expression:

D = −2 ln[ (likelihood of the fitted model) / (likelihood of the saturated model) ]   (2.27)

The quantity inside the large brackets in the expression above is called the likelihood ratio. Using minus twice its log is necessary to obtain a quantity whose distribution is known and can therefore be used for hypothesis testing purposes. Such a test is called the likelihood ratio test. Using equation 2.12, this becomes

D = −2 Σ_{i=1}^n [ yi ln(π̂i / yi) + (1 − yi) ln( (1 − π̂i) / (1 − yi) ) ]   (2.28)

where π̂i = π̂(xi).

The statistic D in equation 2.28 is called the deviance by some authors, and it plays a central role in some approaches to assessing goodness of fit. The deviance for logistic regression plays the same role that the residual sum of squares plays in linear regression. Furthermore, in a setting such as that shown in Table 2.1, where the values of the outcome variable are either 0 or 1, the likelihood of the saturated model is 1:

l(saturated model) = Π_{i=1}^n yi^yi (1 − yi)^(1 − yi) = 1   (2.29)

so that

D = −2 ln(likelihood of the fitted model)   (2.30)

For purposes of assessing the significance of an independent variable, we compare the value of D with and without the independent variable in the equation. The change in D due to the inclusion of the independent variable in the model is obtained as

G = D(model without the variable) − D(model with the variable)   (2.31)

This statistic plays the same role in logistic regression as the numerator of the partial F test does in linear regression. Because the likelihood of the saturated model is common to both values of D being differenced to compute G, it can be expressed as

G = −2 ln[ (likelihood without the variable) / (likelihood with the variable) ]   (2.32)

When the variable is not in the model, the likelihood is

l(without the variable) = Π_{i=1}^n [ e^β̂0 / (1 + e^β̂0) ]^yi [ 1 / (1 + e^β̂0) ]^(1 − yi)

so that

ln[l(without the variable)] = Σ_{i=1}^n [ yi β̂0 − ln(1 + e^β̂0) ]

Differentiating with respect to β̂0 gives

∂/∂β0 ln[l(without the variable)] = Σ_{i=1}^n yi − n e^β̂0 / (1 + e^β̂0)   (2.33)

Setting this derivative equal to zero yields

Σ_{i=1}^n yi = n e^β̂0 / (1 + e^β̂0)   (2.34)

Writing n1 = Σ_{i=1}^n yi, this gives

n1/n = e^β̂0 / (1 + e^β̂0)   (2.35)

Similarly, writing n0 = Σ_{i=1}^n (1 − yi),

n0/n = 1 / (1 + e^β̂0)   (2.36)

In this case the value of G is

G = −2 ln[ (n1/n)^n1 (n0/n)^n0 / Π_{i=1}^n π̂i^yi (1 − π̂i)^(1 − yi) ]   (2.37)

so we can take as the test statistic

G = 2 { Σ_{i=1}^n [ yi ln(π̂i) + (1 − yi) ln(1 − π̂i) ] − [ n1 ln(n1) + n0 ln(n0) − n ln(n) ] }   (2.38)

Hypothesis:
H0: β1 = 0
H1: β1 ≠ 0

Under the hypothesis that β1 is equal to zero, the statistic G follows a chi-square distribution with 1 degree of freedom. Additional mathematical assumptions are also needed; however, for the above case they are rather nonrestrictive and involve having a sufficiently large sample size n. We use the symbol χ²(ν) to denote a chi-square random variable with ν degrees of freedom. As an example, we consider the model fit to the data in Table 2.1, whose estimated coefficients and log likelihood are given in Table 2.3. For these data, n1 = 43 and n0 = 57; thus, evaluating G as shown in equation 2.38 yields

G = 2 {−53.677 − [43 ln(43) + 57 ln(57) − 100 ln(100)]} = 2 [−53.677 − (−68.331)] = 29.31

Using this notation, the p-value associated with the test is P[χ²(1) > 29.31] < 0.001. Thus, we have convincing evidence that AGE is a significant variable in predicting CHD. This is merely a statement of the statistical evidence for this variable. Other important factors to consider before concluding that the variable is clinically important would include the appropriateness of the fitted model, as well as inclusion of other potentially important variables.
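The likelihood ratio statistic can be reproduced numerically from the quantities reported above: the fitted-model log likelihood −53.677 from Table 2.3, and n1 = 43, n0 = 57 from Table 2.2 (a check of the arithmetic, not a re-fit):

```python
import math

log_lik_fitted = -53.677   # log likelihood of the fitted model (Table 2.3)
n1, n0 = 43, 57            # subjects with and without CHD (Table 2.2)
n = n1 + n0

# Log likelihood of the intercept-only model:
# n1*ln(n1) + n0*ln(n0) - n*ln(n).
log_lik_null = n1 * math.log(n1) + n0 * math.log(n0) - n * math.log(n)

G = 2 * (log_lik_fitted - log_lik_null)
```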

Wald test

The assumptions needed for these tests are the same as those of the likelihood ratio test. A more complete discussion of these tests and their assumptions may be found in Rao (1973). The Wald test statistic is obtained by comparing the maximum likelihood estimate of the slope parameter, β̂1, to an estimate of its standard error.

Hypothesis:
H0: β1 = 0
H1: β1 ≠ 0

The resulting ratio, under the hypothesis that β1 = 0, follows a standard normal distribution. The test statistic is

W = β̂1 / SE(β̂1)   (2.39)
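The Wald statistic and its two-tailed p-value are quick to compute from the tabled coefficient and standard error; a Python sketch using the standard normal CDF via the error function:

```python
import math

def norm_cdf(z):
    """Standard normal CDF computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

beta1_hat, se_beta1 = 0.111, 0.0241  # from Table 2.3

W = beta1_hat / se_beta1
p_value = 2.0 * (1.0 - norm_cdf(abs(W)))
```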

Score test

In the univariate case, this test is based on the conditional distribution of the derivative in equation 2.19. The test uses the value of equation 2.20 computed with β̂0 = ln(n1/n0) and β1 = 0. As noted earlier, under these parameter values π̂ = n1/n = ȳ. Thus the left-hand side of equation 2.20 becomes Σ_{i=1}^n xi (yi − ȳ). It may be shown that the estimated variance of this quantity is ȳ(1 − ȳ) Σ_{i=1}^n (xi − x̄)². The test statistic for the score test (ST) is

ST = Σ_{i=1}^n xi (yi − ȳ) / √[ ȳ(1 − ȳ) Σ_{i=1}^n (xi − x̄)² ]   (2.40)

Hypothesis:
H0: β1 = 0
H1: β1 ≠ 0

As an example of the score test, consider the model fit to the data in Table 2.1. The value of the test statistic for this example is

ST = 296.66 / √3333.7342 = 5.14

and the two-tailed p-value is P(|Z| > 5.14) < 0.001.
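The score statistic arithmetic can be checked directly from the numerator and estimated variance reported for the CHD data:

```python
import math

# Numerator and estimated variance reported in the text for the CHD data.
numerator = 296.66
variance = 3333.7342

ST = numerator / math.sqrt(variance)
```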

2.8

Confidence interval estimation

The basis for the construction of the interval estimators is the same statistical theory we used to formulate the tests for the significance of the model. Since

β̂1 ~ N(β1, var(β̂1))

we have

(β̂1 − β1) / √var(β̂1) ~ N(0, 1)

where SE(β̂1) denotes the positive square root of the variance estimator, and

Pr( −z_{1−α/2} ≤ (β̂1 − β1)/SE(β̂1) ≤ z_{1−α/2} ) = 1 − α

The endpoints of a 100(1 − α)% confidence interval for the slope coefficient are therefore

β̂1 ± z_{1−α/2} SE(β̂1)   (2.41)

and, for the intercept, they are

β̂0 ± z_{1−α/2} SE(β̂0)   (2.42)

where z_{1−α/2} is the upper 100(1 − α/2)% point of the standard normal distribution and SE denotes a model-based estimate of the standard error of the respective parameter estimate. As an example, consider the model fit to the data in Table 2.1, regressing the presence or absence of CHD on AGE. The results are presented in Table 2.3. The endpoints of a 95% confidence interval for the slope coefficient, from 2.41, are 0.111 ± 1.96 × 0.0241, giving the interval (0.064, 0.158). Briefly, the results suggest that the change in the log-odds of CHD per one year increase in age is 0.111, and the change could be as little as 0.064 or as much as


0.158, with 95 percent confidence. The logit is the linear part of the logistic regression model and is most like the fitted line in a linear regression model. The estimator of the logit is

ĝ(x) = β̂0 + β̂1 x   (2.43)

The estimator of the variance of the estimator of the logit requires obtaining the variance of a sum. In this case it is

var̂[ĝ(x)] = var̂(β̂0) + x² var̂(β̂1) + 2x cov̂(β̂0, β̂1)   (2.44)

In general, the variance of a sum is equal to the sum of the variances of each term plus twice the covariance of each possible pair of terms formed from the components of the sum. Since

ĝ(x) ~ N(g(x), var[ĝ(x)])

we have

(ĝ(x) − g(x)) / SE[ĝ(x)] ~ N(0, 1)

and

Pr( −z_{1−α/2} ≤ (ĝ(x) − g(x))/SE[ĝ(x)] ≤ z_{1−α/2} ) = 1 − α

The endpoints of a 100(1 − α)% Wald-based confidence interval for the logit are therefore

ĝ(x) ± z_{1−α/2} SE[ĝ(x)]   (2.45)

where SE[ĝ(x)] is the positive square root of the variance estimator in 2.44. The estimated logit for the fitted model in Table 2.3 is shown in 2.23. In order to evaluate 2.44 for a specific age we need the estimated covariance matrix; this matrix can be obtained from the output of any logistic regression software package. The estimated logit from 2.23 for a subject of age 50 is

ĝ(50) = −5.309 + 0.111 × 50 = 0.240

The estimated variance, using equation 2.44 and the results in Table 2.4, is

var̂[ĝ(50)] = 1.28517 + (50)² × (0.000579) + 2 × 50 × (−0.026677) = 0.0650

and the estimated standard error is SE[ĝ(50)] = 0.255. Thus the endpoints of a 95 percent confidence interval for the logit at age 50 are

0.240 ± 1.96 × 0.255 = (−0.260, 0.740)


            AGE         Constant
AGE         0.000579
Constant   -0.026677    1.28517

Table 2.4: Estimated covariance matrix of the estimated coefficients in Table 2.3

The estimate of the logit and its confidence interval provide the basis for the estimate of the fitted value, in this case the logistic probability, and its associated confidence interval. In particular, using equation 2.24 at age 50, the estimated logistic probability is

\hat{\pi}(50) = \frac{e^{\hat{g}(50)}}{1 + e^{\hat{g}(50)}} = \frac{e^{-5.31 + 0.111 \times 50}}{1 + e^{-5.31 + 0.111 \times 50}} = 0.560

and the endpoints of a 95 percent confidence interval are obtained from the respective endpoints of the confidence interval for the logit. The endpoints of the 100(1 − α)% confidence interval for the fitted value are

\frac{e^{\hat{g}(x) \pm z_{1-\alpha/2}\hat{\mathrm{SE}}[\hat{g}(x)]}}{1 + e^{\hat{g}(x) \pm z_{1-\alpha/2}\hat{\mathrm{SE}}[\hat{g}(x)]}}

Using the example at age 50 to demonstrate the calculations, the lower limit is

\frac{e^{-0.260}}{1 + e^{-0.260}} = 0.435

and the upper limit is

\frac{e^{0.740}}{1 + e^{0.740}} = 0.677

We have found that a major mistake often made by persons new to logistic regression modeling is to try to apply estimates on the probability scale to individual subjects. The fitted value computed in π̂(50) is analogous to a particular point on the line obtained from a linear regression. In linear regression, each point on the fitted line provides an estimate of the mean of the dependent variable in a population of subjects with covariate value x. Thus the value of 0.560 in π̂(50) is an estimate of the mean (i.e., proportion) of 50-year-old subjects in the population sampled that have evidence of CHD. Each individual 50-year-old subject either does or does not have evidence of CHD. The confidence interval suggests that this mean could be between 0.435 and 0.677 with 95% confidence.
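The calculations in this example can be checked with a short script. The coefficient, variance, and covariance values below are taken from Tables 2.3 and 2.4; the rest is standard arithmetic, so this is only an illustrative sketch (the analysis in this thesis was carried out with MINITAB and R).

```python
import math

# Estimates taken from Tables 2.3 and 2.4
b0, b1 = -5.309, 0.111             # intercept and AGE coefficient
var_b0, var_b1 = 1.28517, 0.000579
cov_b0_b1 = -0.026677
z = 1.96                           # z_{1-alpha/2} for a 95% interval

def expit(t):
    """Inverse logit: e^t / (1 + e^t)."""
    return math.exp(t) / (1.0 + math.exp(t))

def logit_ci(age):
    """Wald confidence interval for the logit and fitted probability, (2.43)-(2.45)."""
    g = b0 + b1 * age                                        # estimated logit (2.43)
    var_g = var_b0 + age**2 * var_b1 + 2 * age * cov_b0_b1   # equation (2.44)
    se_g = math.sqrt(var_g)
    lo, hi = g - z * se_g, g + z * se_g                      # equation (2.45)
    return g, var_g, (lo, hi), (expit(lo), expit(hi))

g50, var50, (lo, hi), (p_lo, p_hi) = logit_ci(50)
print(round(g50, 3), round(var50, 4))   # logit at age 50 and its variance
print(round(lo, 2), round(hi, 2))       # 95% CI for the logit
print(round(p_lo, 3), round(p_hi, 3))   # 95% CI for the probability
```

Running this reproduces the values derived above up to rounding: a logit of about 0.241, variance 0.065, and probability limits near 0.435 and 0.677.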


Chapter 3

Multiple Logistic Regression Model

In the previous chapter we introduced the logistic regression model in the univariate context. As in the case of linear regression, the strength of a modeling technique lies in its ability to model many variables, some of which may be on different measurement scales. In this chapter we generalize the logistic model to the case of more than one independent variable. This will be referred to as the "multivariable case". Central to the consideration of multiple logistic models will be estimation of the coefficients in the model and testing for their significance.

3.1

The Multiple Logistic Regression Model

Consider a collection of p independent variables denoted by the vector X′ = (x_1, x_2, x_3, ..., x_p). For the moment we assume that each of these variables is at least interval scale. Let the conditional probability that the outcome is present be denoted by P(Y = 1|X) = π(X). The logit of the multiple logistic regression model is given by the equation

g(X) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p

(3.1)

in which case the logistic regression model is

\pi(X) = \frac{e^{g(X)}}{1 + e^{g(X)}}

(3.2)

If some of the independent variables are discrete, nominal scale variables such as race, sex, treatment group, and so forth, it is inappropriate to include them in the model as if they were interval scale variables. The numbers used to represent the levels of these nominal scale variables are merely identifiers and have no numeric significance. In this situation the method of choice is to use a collection of design variables (or dummy variables). Suppose, for example, that one of the independent variables is race, which has been coded as "White", "Black", and "Other". In this case two design variables are necessary.


RACE     D1 (Design variable)   D2 (Design variable)
White    0                      0
Black    1                      0
Other    0                      1

Table 3.1: An example of the coding of the design variables for race coded at three levels


Figure 3.1: Design variables.

One possible coding strategy is as follows: when the respondent is "White", the two design variables, D1 and D2, are both set equal to zero; when the respondent is "Black", D1 is set equal to 1 while D2 equals 0; and when the race of the respondent is "Other", we use D1 = 0 and D2 = 1. Table 3.1 illustrates this coding of the design variables. Most logistic regression software will generate design variables, and some programs have a choice of several different methods. The different strategies for creation and interpretation of design variables are discussed in detail in the next chapter. In general, if a nominal scale variable has k possible values, then k − 1 design variables will be needed. This is true since, unless stated otherwise, all of our models have a constant term. To illustrate the notation used for design variables in this text, suppose that the j-th independent variable x_j has k_j levels. The k_j − 1 design variables will be denoted D_{jl} and the coefficients for these design variables will be denoted β_{jl}, l = 1, 2, ..., k_j − 1. Thus, the logit for a model with p variables, the j-th of which is discrete, would


be

g(X) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \sum_{l=1}^{k_j - 1} \beta_{jl} D_{jl} + \cdots + \beta_p x_p

(3.3)

Using the example in Table 3.1, with x_1 interval scale and race entered through the two design variables,

g(X) = \beta_0 + \beta_1 x_1 + \sum_{l=1}^{2} \beta_{2l} D_{2l} = \beta_0 + \beta_1 x_1 + \beta_{21} D_{21} + \beta_{22} D_{22}

so that

(1) White:  g_1(x) = \beta_0 + \beta_1 x
(2) Black:  g_2(x) = \beta_0 + \beta_1 x + \beta_{21}
(3) Other:  g_3(x) = \beta_0 + \beta_1 x + \beta_{22}

These three equations can be plotted as parallel lines: the slopes are all equal to β_1 and only the intercepts (β_0, β_0 + β_{21}, and β_0 + β_{22}) differ. When discussing the multiple logistic regression model we will, in general, suppress the summation and double subscripting needed to indicate when design variables are being used. The exception will be the discussion of modeling strategies, where we need to use the specific values of the coefficients for the design variables in the method.
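The design-variable bookkeeping can be made concrete with a few lines of code. The coefficient values below are purely hypothetical illustrations, not estimates from this study; the point is that the three race-specific logits share the slope β_1 and differ only in their intercepts.

```python
# Reference-cell coding from Table 3.1: White is the reference level.
DESIGN = {"White": (0, 0), "Black": (1, 0), "Other": (0, 1)}

def logit(x, race, b0, b1, b21, b22):
    """g(X) = b0 + b1*x + b21*D21 + b22*D22, with (D21, D22) from Table 3.1."""
    d21, d22 = DESIGN[race]
    return b0 + b1 * x + b21 * d21 + b22 * d22

# Hypothetical coefficients, for illustration only.
b0, b1, b21, b22 = -1.0, 0.5, 0.3, -0.2

# The slope g(x+1) - g(x) equals b1 for every race: the three logits are parallel.
slopes = {r: logit(1, r, b0, b1, b21, b22) - logit(0, r, b0, b1, b21, b22)
          for r in DESIGN}

# The intercepts are b0, b0 + b21 and b0 + b22 respectively.
intercepts = {r: logit(0, r, b0, b1, b21, b22) for r in DESIGN}
```

With these made-up values the common slope is 0.5 and the intercepts are −1.0, −0.7 and −1.2 for White, Black and Other.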

3.2

Fitting The Multiple Logistic Regression Model

Assume that we have a sample of n independent observations (x_i, y_i), i = 1, 2, ..., n. As in the univariate case, fitting the model requires that we obtain estimates of the vector β′ = (β_0, β_1, ..., β_p). The method of estimation used in the multivariable case is the same as in the univariate situation: maximum likelihood. There will be p + 1 likelihood equations, obtained by differentiating the log likelihood function with respect to the p + 1 coefficients.


Suppose we have a sample of n independent observations (x_{1i}, x_{2i}, ..., x_{pi}, y_i), i = 1, ..., n, where y_i denotes the value of a dichotomous outcome variable and x_{ji} is the value of the j-th independent variable for the i-th subject. Furthermore, assume that the outcome variable has been coded as 0 or 1, representing the absence or presence of the characteristic, respectively. Let π(x_i) denote P(Y = 1|x_i); it follows that the quantity 1 − π(x_i) gives the conditional probability that Y equals zero given x_i, P(Y = 0|x_i). Thus, for those observations where y_i = 1, the contribution to the likelihood function is π(x_i), and for those where y_i = 0, the contribution is 1 − π(x_i). A convenient way to express the contribution of the i-th observation to the likelihood function is through the expression

\pi(x_i)^{y_i}\,[1 - \pi(x_i)]^{1 - y_i}

Since the observations are assumed to be independent, the likelihood function is obtained as the product of these terms:

l(\beta) = \prod_{i=1}^{n} \pi(x_i)^{y_i}\,[1 - \pi(x_i)]^{1 - y_i}

(3.4)

where β′ = (β_0, β_1, ..., β_p). Following the principle of maximum likelihood, we work with the log likelihood:

\ln l(\beta) = \sum_{i=1}^{n} \left\{ y_i \ln \pi(x_i) + (1 - y_i) \ln[1 - \pi(x_i)] \right\}
            = \sum_{i=1}^{n} \left\{ y_i \ln\!\left[\frac{\pi(x_i)}{1 - \pi(x_i)}\right] + \ln[1 - \pi(x_i)] \right\}

By equation 3.2,

\pi(x_i) = \frac{e^{g(x_i)}}{1 + e^{g(x_i)}} \quad \text{and} \quad 1 - \pi(x_i) = \frac{1}{1 + e^{g(x_i)}}

so that

\frac{\pi(x_i)}{1 - \pi(x_i)} = e^{g(x_i)} \quad\Longrightarrow\quad \ln\!\left[\frac{\pi(x_i)}{1 - \pi(x_i)}\right] = g(x_i)

Hence

\ln l(\beta) = \sum_{i=1}^{n} \left\{ y_i\, g(x_i) + \ln\!\left[\frac{1}{1 + e^{g(x_i)}}\right] \right\}

(3.5)

= \sum_{i=1}^{n} \left\{ y_i (\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p) + \ln\!\left[\frac{1}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}\right] \right\}

(3.6)

Differentiating with respect to β_0,

\frac{\partial \ln l(\beta)}{\partial \beta_0} = \sum_{i=1}^{n} \left\{ y_i - \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}} \right\} = \sum_{i=1}^{n} [y_i - \pi(x_i)]

that is,

\frac{\partial \ln l(\beta)}{\partial \beta_0} = \sum_{i=1}^{n} [y_i - \pi(x_i)]

(3.7)

Now differentiating equation 3.5 with respect to β_j (j = 1, ..., p),

\frac{\partial \ln l(\beta)}{\partial \beta_j} = \sum_{i=1}^{n} \left\{ y_i x_{ij} - x_{ij}\,\frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}} \right\} = \sum_{i=1}^{n} x_{ij} [y_i - \pi(x_i)]

that is,

\frac{\partial \ln l(\beta)}{\partial \beta_j} = \sum_{i=1}^{n} x_{ij} [y_i - \pi(x_i)]

(3.8)

Setting the derivatives in 3.7 and 3.8 equal to zero,

\frac{\partial \ln l(\beta)}{\partial \beta_0} = 0 \quad \text{and} \quad \frac{\partial \ln l(\beta)}{\partial \beta_j} = 0

gives the likelihood equations

\sum_{i=1}^{n} [y_i - \pi(x_i)] = 0

(3.9)

\sum_{i=1}^{n} x_{ij} [y_i - \pi(x_i)] = 0

(3.10)


for j = 1, ..., p. As in the univariate model, the solution of the likelihood equations requires special software, which is available in most, if not all, statistical packages. Let β̂ denote the solution to these equations. Thus, the fitted values for the multiple logistic regression model are π̂(x_i), the values of the expression in equation 3.2 computed using β̂ and x_i. In the previous chapter only brief mention was made of the method for estimating the standard errors of the estimated coefficients. Now that the logistic regression model has been generalized both in concept and notation to the multivariable case, we consider estimation of standard errors in more detail. The method of estimating the variances of the estimated coefficients follows from the well-developed theory of maximum likelihood estimation. This theory states that the estimators are obtained from the matrix of second partial derivatives of the log likelihood function. These partial derivatives have the following form:

\frac{\partial^2 L(\beta)}{\partial \beta_j^2} = -\sum_{i=1}^{n} x_{ij}^2\, \pi_i (1 - \pi_i)

(3.11)

\frac{\partial^2 L(\beta)}{\partial \beta_j\, \partial \beta_l} = -\sum_{i=1}^{n} x_{ij} x_{il}\, \pi_i (1 - \pi_i)

(3.12)

for j, l = 0, 1, ..., p, where π_i denotes π(x_i). Let the (p + 1) × (p + 1) matrix containing the negatives of the terms given in equations 3.11 and 3.12 be denoted I(β). This matrix is called the observed information matrix. The variances and covariances of the estimated coefficients are obtained from the inverse of this matrix, which we denote Var(β) = I^{-1}(β). We use the notation Var(β_j) to denote the j-th diagonal element of this matrix, which is the variance of β̂_j, and Cov(β_j, β_l) to denote an arbitrary off-diagonal element, which is the covariance of β̂_j and β̂_l. The estimators of the variances and covariances, denoted V̂ar(β̂), are obtained by evaluating Var(β) at β̂. We use v̂ar(β̂_j) and ĉov(β̂_j, β̂_l), j, l = 0, 1, ..., p, to denote the values in this matrix. The estimated standard errors of the estimated coefficients are denoted

\hat{\mathrm{SE}}(\hat{\beta}_j) = \left[\hat{\mathrm{var}}(\hat{\beta}_j)\right]^{1/2}

(3.13)

for j = 0, 1, ..., p. We will use this notation in developing methods for coefficient testing and confidence interval estimation. A formulation of the information matrix that will be useful when discussing model fitting and assessment of fit is Î(β̂) = X′VX, where X is an n by (p + 1) matrix containing the data for each subject and V is an n by n diagonal matrix with general element


π̂(x_i)[1 − π̂(x_i)]. That is, the matrix X is

X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}

(3.14)

and the matrix V is

V = \begin{pmatrix} \hat{\pi}_1(1-\hat{\pi}_1) & 0 & \cdots & 0 \\ 0 & \hat{\pi}_2(1-\hat{\pi}_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \hat{\pi}_n(1-\hat{\pi}_n) \end{pmatrix}

(3.15)

Before proceeding further we present an example that illustrates the formulation of a multiple logistic regression model and the estimation of its coefficients, using a subset of the variables from the data for the low birth weight study. The code sheet for the full data set is given in Table 2.6. The goal of this study was to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2500 grams). Data were collected on 189 women, n_1 = 59 of whom had low birth weight babies and n_0 = 130 of whom had normal birth weight babies. Four variables thought to be important were:

(i) Age
(ii) Weight of the mother at her last menstrual period
(iii) Race
(iv) Number of physician visits during the first trimester of pregnancy.

In this example, the variable race has been recoded using the two design variables in Table 3.1. The results of fitting the logistic regression model to these data are shown in Table 3.2, in which the estimated coefficients for the two design variables for race are indicated by RACE2 and RACE3. The estimated logit is given by the following expression:

\hat{g}(x) = 1.295 - 0.024 \times \mathrm{AGE} - 0.014 \times \mathrm{LWT} + 1.004 \times \mathrm{RACE2} + 0.433 \times \mathrm{RACE3} - 0.049 \times \mathrm{FTV}

The fitted values are obtained by substituting the estimated logit, ĝ(x), into equation 3.2.
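The maximum likelihood fitting just described, solving the likelihood equations 3.9 and 3.10 with the observed information matrix X′VX, is what packages such as MINITAB and R do internally. A bare-bones Newton-Raphson sketch on synthetic data (the low birth weight data themselves are not reproduced here; all data and coefficient values below are made up for illustration) might look as follows:

```python
import math
import random

def score_and_info(X, y, beta):
    """Score vector (3.9)-(3.10) and observed information matrix X'VX (3.11)-(3.12)."""
    n, p1 = len(X), len(X[0])
    pi = [1.0 / (1.0 + math.exp(-sum(b * xj for b, xj in zip(beta, row))))
          for row in X]
    score = [sum(X[i][j] * (y[i] - pi[i]) for i in range(n)) for j in range(p1)]
    info = [[sum(X[i][j] * X[i][k] * pi[i] * (1.0 - pi[i]) for i in range(n))
             for k in range(p1)] for j in range(p1)]
    return score, info

def solve(A, b):
    """Gaussian elimination for the small (p+1) x (p+1) system A x = b."""
    m = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(m):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * ac for a, ac in zip(M[r], M[c])]
    return [M[r][m] / M[r][r] for r in range(m)]

def fit_logistic(X, y, iters=25):
    """Newton-Raphson: beta <- beta + (X'VX)^{-1} X'(y - pi), from beta = 0."""
    beta = [0.0] * len(X[0])
    for _ in range(iters):
        score, info = score_and_info(X, y, beta)
        beta = [b + s for b, s in zip(beta, solve(info, score))]
    return beta

# Synthetic example: an intercept column plus two covariates.
random.seed(1)
true_beta = [-0.5, 1.0, 0.7]          # made-up coefficients
X = [[1.0, random.gauss(0, 1), random.gauss(0, 1)] for _ in range(300)]
y = [1 if random.random() < 1.0 / (1.0 + math.exp(-sum(b * xj for b, xj in zip(true_beta, row))))
     else 0 for row in X]
beta_hat = fit_logistic(X, y)
```

At convergence the score vector is numerically zero, i.e. the likelihood equations 3.9 and 3.10 are satisfied; this sketch assumes a well-behaved data set with no separation.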

3.3

Testing For The Significance Of The Model

There are three methods for assessing the significance of the model:

(1) Likelihood ratio test
(2) Wald test
(3) Score test


Variable    Coeff.    Std.Err    Z        P > |z|
AGE         -0.024    0.0337     -0.71    0.480
LWT         -0.014    0.0065     -2.18    0.029
RACE2        1.004    0.4979      2.02    0.044
RACE3        0.433    0.3622      1.20    0.232
FTV         -0.049    0.1672     -0.30    0.768
Constant     1.295    1.0714      1.21    0.227

Table 3.2: Estimated coefficients for a multiple logistic regression model using the variables AGE, weight at last menstrual period (LWT), RACE and number of first trimester physician visits (FTV) from the low birth weight study.

3.4

Likelihood Ratio Test For Testing For The Significance Of The Model

Once we have fit a particular multiple (multivariable) logistic regression model, we begin the process of model assessment. As in the univariate case presented in Chapter 2, the first step in this process is to assess the overall significance of the variables in the model. The likelihood ratio test for overall significance of the p coefficients for the independent variables in the model is performed in exactly the same manner as in the univariate case. The test is based on the statistic G given in equation 2.32. The only difference is that the fitted values, π̂_i, under the model are based on the vector containing p + 1 parameters, β̂. Under the null hypothesis that the p "slope" coefficients for the covariates in the model are equal to zero, the distribution of G will be chi-square with p degrees of freedom. Consider the fitted model whose estimated coefficients are given in Table 3.2; for that model, the value of the log likelihood was calculated using the MINITAB software. The test statistic is

G = 2\left\{ \sum_{i=1}^{n} \left[\, y_i \ln(\hat{\pi}_i) + (1 - y_i)\ln(1 - \hat{\pi}_i) \,\right] - \left[\, n_1 \ln(n_1) + n_0 \ln(n_0) - n \ln(n) \,\right] \right\}

(3.16)

The hypotheses are

H_0 : β_1 = β_2 = · · · = β_p = 0
H_1 : β_j ≠ 0 for at least one j, j = 1, ..., p.

The log likelihood of the fitted model is shown at the bottom of the MINITAB output for Table 3.2: L = −111.286. The log likelihood for the constant-only model may be obtained by evaluating the second term of equation 3.16, or equivalently by fitting the constant-only model, with n = 189, n_0 = 130


Variable    Coeff     Std.Err    Z        P > |Z|
LWT         -0.015    0.0064     -2.31    0.018
RACE2        1.081    0.4881      2.22    0.027
RACE3        0.481    0.3567      1.35    0.178
Constant     0.806    0.8452      0.95    0.340

Table 3.3: Estimated coefficients for a multiple logistic regression model using the variables LWT and RACE from the low birth weight study.

and n_1 = 59. Thus the value of the likelihood ratio test statistic from equation 3.16 is

G = 2[-111.286 - [(59 \ln 59 + 130 \ln 130) - 189 \ln 189]] = 12.099

We compare this value with the chi-square distribution with p = 5 degrees of freedom. The P-value for the test is

P[\chi^2(5) > 12.099] = 0.034 < 0.05

which is significant at the α = 0.05 level.

Conclusion: We reject the null hypothesis and conclude that at least one, and perhaps all, of the p coefficients are different from zero, an interpretation analogous to that in multiple linear regression. If our goal is to obtain the best fitting model while minimizing the number of parameters, the next logical step is to fit a reduced model containing only those variables thought to be significant, and compare it to the full model containing all the variables. The results of fitting the reduced model are given in Table 3.3. The difference between the two models is the exclusion of the variables AGE and FTV from the full model. The likelihood ratio test comparing these models is obtained using the definition of G given in equation 2.37. It has a chi-square distribution with 2 degrees of freedom under the hypothesis that the coefficients for the excluded variables are equal to zero:

H_0 : β_AGE = β_FTV = 0
H_1 : at least one of β_AGE, β_FTV ≠ 0.

The value of the test statistic comparing the models in Tables 3.2 and 3.3 is

G = -2[(-111.630) - (-111.286)] = 0.688


which, with 2 degrees of freedom, has a P-value of P[χ²(2) > 0.688] = 0.709. Since the P-value is large, exceeding 0.05, we conclude that the reduced model is as good as the full model. Thus there is no advantage to including AGE and FTV in the model. However, we must not base our models entirely on tests of statistical significance. Whenever a categorical independent variable is included in (or excluded from) a model, all of its design variables should be included (or excluded); to do otherwise implies that we have recoded the variable. For example, if we only include the design variable D1 as defined in Table 3.1, then RACE is entered into the model as a dichotomous variable coded as Black or not Black. If k is the number of levels of a categorical variable, then the contribution of this variable to the degrees of freedom of the test will be k − 1. For example, if we exclude RACE from the model, and RACE is coded at three levels using the design variables shown in Table 3.1, then there would be 2 degrees of freedom for the test, one for each design variable.
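The two likelihood ratio tests above reduce to a few lines of arithmetic. The log likelihoods and counts are those reported in the text; the chi-square critical values 11.07 (5 df) and 5.99 (2 df) at the 0.05 level are standard table values.

```python
import math

# Reported log likelihoods (models of Tables 3.2 and 3.3) and sample counts
ll_full = -111.286       # AGE, LWT, RACE2, RACE3, FTV
ll_reduced = -111.630    # LWT, RACE2, RACE3 only
n, n1, n0 = 189, 59, 130

# Constant-only model log likelihood: n1 ln n1 + n0 ln n0 - n ln n
ll_null = n1 * math.log(n1) + n0 * math.log(n0) - n * math.log(n)

G_overall = 2 * (ll_full - ll_null)      # ~ chi-square with 5 df under H0
G_nested = 2 * (ll_full - ll_reduced)    # ~ chi-square with 2 df (AGE, FTV dropped)

# Compare with the 0.05 critical values: 11.07 (5 df) and 5.99 (2 df)
overall_significant = G_overall > 11.07   # reject H0 that all slopes are zero
drop_age_ftv = G_nested < 5.99            # reduced model fits as well as the full one
```

This reproduces G = 12.10 for the overall test (significant) and G = 0.688 for the nested comparison (not significant), matching the conclusions drawn above.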

3.5

Wald Test For Testing For The Significance

Because of the multiple degrees of freedom we must be careful in our use of the Wald (W) statistic to assess the significance of the coefficients. For example, if the W statistics for both coefficients of a two-level design variable exceed 2, then we could conclude that the design variables are significant. The multivariable analog of the Wald test is obtained from the following vector-matrix calculation:

W = \hat{\beta}' \left[\hat{\mathrm{Var}}(\hat{\beta})\right]^{-1} \hat{\beta} = \hat{\beta}' (X'VX) \hat{\beta}

where β̂′ = (β̂_0, β̂_1, ..., β̂_p) is the vector of the p + 1 estimated coefficients, X is the n × (p + 1) data matrix of 3.14, and V is the n × n diagonal matrix of 3.15 with general element π̂_i(1 − π̂_i). Since X′VX is (p + 1) × (p + 1), W is a scalar. The hypotheses are

H_0 : β_j = 0 for all j = 0, 1, ..., p
H_1 : β_j ≠ 0 for at least one j.

Under the hypothesis that each of the p + 1 coefficients is equal to zero, W is distributed as chi-square with p + 1 degrees of freedom. Tests for just the p slope coefficients


are obtained by eliminating β̂_0 from β̂ and the relevant row (first or last) and column (first or last) from X′VX. Since evaluation of this test requires the capability to perform vector-matrix operations and to obtain β̂, there is no gain over the likelihood ratio test of the significance of the model.

3.6

Confidence interval estimation

In Chapter 2 we discussed confidence interval estimators for the coefficients, the logit and the logistic probabilities for the simple logistic regression model. The methods used for confidence interval estimators in a multiple variable model are essentially the same. The confidence interval estimator for the logit is a bit more complicated for the multiple variable model than the result presented in 2.45. The basic idea is the same; there are simply more terms involved in the summation. It follows from 3.1 that a general expression for the estimator of the logit for a model containing p covariates is

\hat{g}(x) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p

(3.17)

An alternative way to express the estimator of the logit in 3.17 is through vector notation as ĝ(x) = x′β̂, where the vector β̂′ = (β̂_0, β̂_1, ..., β̂_p) denotes the estimator of the p + 1 coefficients and the vector x′ = (x_0, x_1, ..., x_p) represents the covariates in the model, with x_0 = 1. Then

\hat{\mathrm{Var}}[\hat{g}(x)] = \mathrm{Var}[\hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_p x_p] = \sum_{j=0}^{p} x_j^2\, \hat{\mathrm{var}}(\hat{\beta}_j) + \sum_{j=0}^{p-1} \sum_{k=j+1}^{p} 2 x_j x_k\, \hat{\mathrm{cov}}(\hat{\beta}_j, \hat{\beta}_k)

We can express this result much more concisely by using the matrix expression for the estimator of the variances of the estimated coefficients. From the expression for the observed information matrix, we have

\hat{\mathrm{Var}}(\hat{\beta}) = (X'VX)^{-1}

(3.18)

It follows from 3.18 that an equivalent expression for the estimator in 3.17 is

\hat{\mathrm{Var}}[\hat{g}(x)] = x'\, \hat{\mathrm{Var}}(\hat{\beta})\, x = x' (X'VX)^{-1} x

(3.19)

Fortunately, all good logistic regression software packages provide the option for the user to create a new variable containing the estimated values of 3.19, or the standard error, for


            LWT         RACE2      RACE3     Constant
LWT         0.000041
RACE2      -0.000647    0.2382
RACE3       0.000036    0.0532     0.1272
Constant   -0.005211    0.0226    -0.1035    0.7143

Table 3.4: Estimated covariance matrix of the estimated coefficients in Table 3.3

all subjects in the data set. This feature eliminates the computational burden associated with the matrix calculations in 3.19 and allows the user to routinely calculate fitted values and confidence interval estimates. However, it is useful to illustrate the details of the calculations. Using the model in Table 3.3, the estimated logit for a 150 pound white woman is

\hat{g}(\mathrm{LWT}=150, \mathrm{RACE}=\mathrm{White}) = 0.806 - 0.015 \times 150 + 1.081 \times 0 + 0.481 \times 0 = -1.444

and the estimated logistic probability is

\hat{\pi}(\mathrm{LWT}=150, \mathrm{RACE}=\mathrm{White}) = \frac{e^{-1.444}}{1 + e^{-1.444}} = 0.191

Conclusion: The interpretation of the fitted value is that the estimated proportion of low birth weight babies among 150 pound white women is 0.191. In order to estimate the variance of this estimated logit we need the estimated covariance matrix shown in Table 3.4. Writing the logit as

\hat{g}(\mathrm{LWT}, \mathrm{RACE}) = \hat{\beta}_0 + \hat{\beta}_1 \times \mathrm{LWT} + \hat{\beta}_2 \times \mathrm{RACE2} + \hat{\beta}_3 \times \mathrm{RACE3}

and applying the variance formula with RACE2 = RACE3 = 0 and LWT = 150, all terms involving the race design variables vanish, leaving

\hat{\mathrm{Var}}[\hat{g}(\mathrm{LWT}=150, \mathrm{RACE}=\mathrm{White})] = \hat{\mathrm{var}}(\hat{\beta}_0) + (150)^2\, \hat{\mathrm{var}}(\hat{\beta}_1) + 2 \times 150 \times \hat{\mathrm{cov}}(\hat{\beta}_0, \hat{\beta}_1)
= 0.7143 + (150)^2 \times 0.000041 + 2 \times 150 \times (-0.0052)
= 0.0768


The standard error is

\hat{\mathrm{SE}}[\hat{g}(\mathrm{LWT}=150, \mathrm{RACE}=\mathrm{White})] = \sqrt{0.0768} = 0.2771

Since ĝ(x) is approximately N(−1.444, (0.2771)²), the endpoints of the 95% confidence interval for the logit are

-1.444 \pm 1.96 \times 0.2771 = (-1.987, -0.901)

and the associated confidence interval for the fitted value is (0.120, 0.289).
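The variance calculation above is just the quadratic form x′V̂ar(β̂)x of 3.19 with x′ = (1, 150, 0, 0). The sketch below reproduces it from the entries of Tables 3.3 and 3.4, with ĉov(β̂_0, β̂_1) rounded to −0.0052 as in the worked expansion above.

```python
import math

# Coefficients (Table 3.3), ordered (Constant, LWT, RACE2, RACE3)
beta = [0.806, -0.015, 1.081, 0.481]

# Estimated covariance matrix (Table 3.4), same ordering, made symmetric.
cov = [
    [0.7143,  -0.0052,    0.0226,   -0.1035],
    [-0.0052,  0.000041, -0.000647,  0.000036],
    [0.0226,  -0.000647,  0.2382,    0.0532],
    [-0.1035,  0.000036,  0.0532,    0.1272],
]

# Covariate vector for a 150 pound white woman: x' = (1, LWT, RACE2, RACE3)
x = [1.0, 150.0, 0.0, 0.0]

g = sum(b * xj for b, xj in zip(beta, x))          # estimated logit
var_g = sum(x[j] * cov[j][k] * x[k]                # x' Var(beta) x, equation 3.19
            for j in range(4) for k in range(4))
se_g = math.sqrt(var_g)
lo, hi = g - 1.96 * se_g, g + 1.96 * se_g          # 95% CI for the logit

def expit(t):
    return math.exp(t) / (1.0 + math.exp(t))

p_lo, p_hi = expit(lo), expit(hi)                  # 95% CI for the fitted value
```

This reproduces the logit −1.444, the variance 0.0768, and the probability interval of roughly (0.120, 0.289).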


Chapter 4

Interpretation Of The Fitted Logistic Regression Model

Introduction

In Chapters 2 and 3 we discussed the methods for fitting and testing the significance of the logistic regression model. After fitting a model, the emphasis shifts from the computation and assessment of significance of the estimated coefficients to the interpretation of their values. Strictly speaking, an assessment of the adequacy of the fitted model should precede any attempt at interpreting it. Thus, we begin this chapter assuming that a logistic regression model has been fit, that the variables in the model are significant in either a clinical or statistical sense, and that the model fits according to some statistical measure of fit. The interpretation of any fitted model requires that we be able to draw practical inferences from the estimated coefficients in the model. The question being addressed is: what do the estimated coefficients in the model tell us about the research questions that motivated the study? For most models this involves the estimated coefficients for the independent variables. On occasion the intercept coefficient is of interest, but this is the exception, not the rule. The estimated coefficients for the independent variables represent the slope (i.e., rate of change) of a function of the dependent variable per unit change in the independent variable. Thus interpretation involves two issues:

(1) Determining the functional relationship between the dependent variable and the independent variable.


(2) Appropriately defining the unit of change for the independent variable.

The first step is to determine what function of the dependent variable yields a linear function of the independent variables. This is called the link function. In the case of a linear regression model, it is the identity function, since the dependent variable, by definition, is linear in the parameters. (For those unfamiliar with the term "identity function", it is the function y = y.) In the logistic regression model the link function is the logit transformation

g(x) = \ln\!\left[\frac{\pi(x)}{1 - \pi(x)}\right] = \beta_0 + \beta_1 x

For a linear regression model, recall that the slope coefficient β_1 is equal to the difference between the value of the dependent variable at x + 1 and the value at x, for any value of x. For example, if y(x) = β_0 + β_1 x, it follows that β_1 = y(x + 1) − y(x). In this case the interpretation of the coefficient is relatively straightforward, as it expresses the resulting change in the measurement scale of the dependent variable for a unit change in the independent variable. For example, if in a regression of weight on height of male adolescents the slope is 5, then we would conclude that an increase of 1 inch in height is associated with an increase of 5 pounds in weight. In the logistic regression model, the slope coefficient represents the change in the logit corresponding to a change of one unit in the independent variable (i.e., β_1 = g(x + 1) − g(x)). Proper interpretation of the coefficient in a logistic regression model depends on being able to place meaning on the difference between two logits. Interpretation of this difference is discussed in detail on a case-by-case basis, as it relates directly to the definition and meaning of a one-unit change in the independent variable.

4.1

Dichotomous Independent Variable

We begin our consideration of the interpretation of logistic regression coefficients with the situation where the independent variable is nominal scale and dichotomous (i.e., measured at two levels). This case provides the conceptual foundation for all other situations. We assume that the independent variable x is coded as either zero or one. The difference in the logit for a subject with x = 1 and x = 0 is

g(x) = \beta_0 + \beta_1 x
g(1) - g(0) = (\beta_0 + \beta_1) - \beta_0 = \beta_1

The algebra shown in this equation is rather straightforward. We present it in this level of detail to emphasize that the first step in interpreting the effect of a covariate in a model is to


                          Independent Variable (X)
Outcome Variable (Y)      x = 1                                    x = 0
y = 1                     π(1) = e^{β0+β1}/(1 + e^{β0+β1})         π(0) = e^{β0}/(1 + e^{β0})
y = 0                     1 − π(1) = 1/(1 + e^{β0+β1})             1 − π(0) = 1/(1 + e^{β0})
Total                     1.0                                      1.0

Table 4.1: Values of the logistic regression model when the independent variable is dichotomous.

express the desired logit difference in terms of the model. In this case the logit difference is equal to β_1. In order to interpret this result we need to introduce and discuss a measure of association termed the odds ratio. The possible values of the logistic probabilities may be conveniently displayed in the 2 × 2 table shown in Table 4.1. The odds of the outcome being present among individuals with x = 1 is defined as π(1)/[1 − π(1)]. Similarly, the odds of the outcome being present among individuals with x = 0 is defined as π(0)/[1 − π(0)]. The odds ratio, denoted OR, is defined as the ratio of the odds for x = 1 to the odds for x = 0, and is given by the equation

\pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}

(4.1)

\mathrm{OR} = \frac{\pi(1)/[1 - \pi(1)]}{\pi(0)/[1 - \pi(0)]}

(4.2)

Substituting the model probabilities from Table 4.1,

\mathrm{OR} = \frac{\left(\dfrac{e^{\beta_0+\beta_1}}{1+e^{\beta_0+\beta_1}}\right) \Big/ \left(\dfrac{1}{1+e^{\beta_0+\beta_1}}\right)}{\left(\dfrac{e^{\beta_0}}{1+e^{\beta_0}}\right) \Big/ \left(\dfrac{1}{1+e^{\beta_0}}\right)} = \frac{e^{\beta_0+\beta_1}}{e^{\beta_0}} = e^{\beta_1}

Hence, for logistic regression with a dichotomous independent variable coded 1 and 0, the relationship between the odds ratio and the regression coefficient is

\mathrm{OR} = e^{\beta_1}

(4.3)

This simple relationship between the coefficient and the odds ratio is the fundamental reason why logistic regression has proven to be such a powerful analytic research tool. The odds ratio is a measure of association which has found wide use, especially in epidemiology, as it approximates how much more likely (or unlikely) it is for the outcome to be


                       AGED (X)
Outcome CHD (Y)       ≥ 55 (1)    < 55 (0)    Total
Present (1)           21          22          43
Absent (0)            6           51          57
Total                 27          73          100

Table 4.2: Cross-classification of AGE dichotomized at 55 years and CHD for 100 subjects.

present among those with x = 1 than among those with x = 0. For example, if y denotes the presence or absence of lung cancer and x denotes whether the person is a smoker, then ÔR = 2 estimates that lung cancer is twice as likely to occur among smokers as among nonsmokers in the study population. As another example, suppose y denotes the presence or absence of heart disease and x denotes whether or not the person engages in regular strenuous physical exercise. If the estimated odds ratio is ÔR = 0.5, then the occurrence of heart disease is one half as likely among those who exercise as among those who do not, in the study population. The interpretation given for the odds ratio is based on the fact that in many instances it approximates a quantity called the relative risk, which is equal to the ratio π(1)/π(0). It follows from 4.2 that the odds ratio approximates the relative risk if [1 − π(0)]/[1 − π(1)] ≈ 1, which holds when π(x) is small for both x = 1 and x = 0. An example may help to clarify what the odds ratio is and how it is computed from the results of a logistic regression program or from a 2 × 2 table. In many examples of logistic regression encountered in the literature we find that a continuous variable has been dichotomized at some biologically meaningful cutpoint.

Example: We consider the previous example of the data displayed in Table 1.1 and create a new variable, AGED, which takes on the value 1 if the age of the subject is greater than or equal to 55 and zero otherwise. The result of cross-classifying the dichotomized age variable with the outcome variable CHD is presented in Table 4.2. For the pairs (x_i, y_i), the likelihood is the product of the terms

l(\beta) = \prod_{i=1}^{n} \pi(x_i)^{y_i} [1 - \pi(x_i)]^{1 - y_i}

(4.4)

The data in Table 4.2 tell us that there were 21 subjects with (x = 1, y = 1), 22 with (x = 0, y = 1), 6 with (x = 1, y = 0), and 51 with (x = 0, y = 0). Hence, for these data, the likelihood function in 4.4 simplifies to

l(\beta) = \pi(1)^{21} \times [1 - \pi(1)]^{6} \times \pi(0)^{22} \times [1 - \pi(0)]^{51}


(4.5)

Use of a logistic regression program yields the maximum likelihood estimates of β0 and β1. The estimate of the odds ratio is ÔR = e^2.094 = 8.1. Readers who have had some previous experience with the odds ratio undoubtedly wonder why a logistic regression package was used to obtain the maximum likelihood estimate of the odds ratio, when it could have been obtained directly from the cross-product ratio of Table 4.2, namely

ÔR = (21/6)/(22/51) = 8.1

β̂1 = ln[(21/6)/(22/51)] = 2.094

We emphasize here that logistic regression is, in fact, regression, even in this simplest possible case. The fact that the data may be formulated in terms of a contingency table provides the basis for interpreting the estimated coefficient as the log of an odds ratio.

Along with the point estimate of a parameter, it is a good idea to use a confidence interval estimate to provide additional information about the parameter value. The odds ratio, OR, is usually the parameter of interest in a logistic regression due to its ease of interpretation. However, its estimate, ÔR, tends to have a skewed sampling distribution, because the possible values of ÔR range between 0 and ∞, with the null value equal to 1. In theory, for large enough sample sizes, the distribution of ÔR is normal. Unfortunately, this sample size requirement typically exceeds that of most studies. Hence, inferences are usually based on the sampling distribution of

ln(ÔR) = β̂1 ∼ N(β1, SE(β̂1))

which tends to follow a normal distribution for much smaller sample sizes. A 100(1 − α)% confidence interval (CI) estimate for the odds ratio is obtained by first calculating the end points of a confidence interval for the coefficient β1,

exp[β̂1 ± z(1−α/2) SE(β̂1)]

As an example, consider the estimation of the odds ratio for the dichotomized variable AGE_D. The point estimate is ÔR = 8.1 and the end points of a 95% confidence interval are

exp(2.094 ± 1.96 × 0.529) = (2.9, 22.9)

Conclusion: This interval is typical of the confidence intervals seen for odds ratios when the point estimate exceeds 1; the interval is skewed to the right. It suggests that CHD among those 55 and older in the study population could be as little as 2.9 times or as much as 22.9 times more likely than among those under 55, at the 95 percent level
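These numbers can be verified directly from the cell counts. The following Python sketch is illustrative only (the thesis itself used MINITAB and R for its analysis):

```python
import math

# Cell counts: (x=1,y=1)=21, (x=1,y=0)=6, (x=0,y=1)=22, (x=0,y=0)=51
a, b, c, d = 21, 6, 22, 51

or_hat = (a * d) / (b * c)                 # cross-product odds ratio
beta1 = math.log(or_hat)                   # maximum likelihood estimate of beta1
se = math.sqrt(1/a + 1/b + 1/c + 1/d)      # estimated standard error of beta1

z = 1.96                                   # z quantile for a 95% interval
lo = math.exp(beta1 - z * se)
hi = math.exp(beta1 + z * se)
print(round(or_hat, 1), round(beta1, 3), round(se, 3))  # 8.1 2.094 0.529
print(round(lo, 1), round(hi, 1))                       # 2.9 22.9
```

The standard error here uses the familiar sum of reciprocal cell counts for the log odds ratio of a 2 × 2 table, which reproduces the interval (2.9, 22.9) quoted above.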


of confidence.

We illustrate these computations in detail, as they demonstrate the general method for computing estimates of odds ratios in logistic regression. The estimate of the log of the odds ratio for any independent variable at two different levels, say x = a versus x = b, is the difference between the estimated logits computed at these two values,

ln[ÔR(a, b)] = ĝ(x = a) − ĝ(x = b) = (β̂0 + β̂1 × a) − (β̂0 + β̂1 × b) = β̂1 × (a − b)

ln[ÔR(a, b)] = β̂1 × (a − b)

(4.6)

The estimate of the odds ratio is obtained by exponentiating this logit difference,

ÔR(a, b) = exp[β̂1 × (a − b)]

(4.7)

This expression is equal to exp(β̂1) only when (a − b) = 1. In (4.6) and (4.7) the notation ÔR(a, b) is used to represent the odds ratio

ÔR(a, b) = [π̂(x = a)/(1 − π̂(x = a))] / [π̂(x = b)/(1 − π̂(x = b))]

(4.8)

When a = 1 and b = 0 we let ÔR = ÔR(1, 0). Some software packages offer a choice of methods for coding design variables. The "zero-one" coding used so far in this section is frequently referred to as reference cell coding. There are two methods for coding the cells:

(1) Reference cell coding method
(2) Deviation from means coding method

Reference Cell Coding Method

The reference cell method typically assigns the value of zero to the lower code for x and one to the higher code. For example, if SEX was coded as 1 = male and 2 = female, then the resulting design variable under this method, D, would be coded 0 = male and 1 = female. The variable SEX is then treated as if it were interval scaled.


Sex (code)    Design variable
Male (1)      0
Female (2)    1

Table 4.3: Illustration of the coding of the design variable using the reference cell method.

Sex (code)    Design variable
Male (1)      -1
Female (2)    1

Table 4.4: Illustration of the coding of the design variable using the deviation from means method.

Deviation From Means Coding Method

Another coding method is frequently referred to as deviation from means coding. This method assigns the value of −1 to the lower code and the value of 1 to the higher code. The coding for the variable SEX discussed above is shown in Table 4.4. Suppose we wish to estimate the odds ratio of female versus male when deviation from means coding is used. We do this by using the general method shown in (4.6) and (4.7): the estimated log odds ratio is the difference between the estimated logits computed at the two values of the design variable,

ln[ÔR(female, male)] = ĝ(x = female) − ĝ(x = male)
= ĝ(D = 1) − ĝ(D = −1)
= [β̂0 + β̂1 × (1)] − [β̂0 + β̂1 × (−1)]
= 2β̂1

and the estimated odds ratio is

ÔR(female, male) = exp(2β̂1)

Thus, if we had simply exponentiated the coefficient from the computer output we would have obtained the wrong estimate of the odds ratio. This points out quite clearly that we must pay close attention to the method used to code the design variables. The method of coding also influences the calculation of the end points of the confidence interval. For the above example, using the deviation from means coding, the estimated standard error needed for confidence interval estimation is SE(2β̂1), which is 2 × SE(β̂1). Thus the end points of the confidence interval are

exp[2β̂1 ± z(1−α/2) × 2SE(β̂1)]

In general, the end points of the confidence interval for the odds ratio given in (4.7) are

exp[β̂1 × (a − b) ± z(1−α/2) × |a − b| × SE(β̂1)]
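As a quick numerical check of the two coding schemes, the following Python sketch uses made-up 2 × 2 counts (not data from the text). For a saturated model on a single dichotomous covariate the fitted logits equal the observed log odds, so no fitting routine is needed:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

# Made-up 2x2 counts for illustration only:
# outcome present/absent for males and females.
present = {"male": 10, "female": 30}
absent  = {"male": 40, "female": 20}

# For a saturated model the fitted logits equal the observed log odds.
g_male   = logit(present["male"] / (present["male"] + absent["male"]))
g_female = logit(present["female"] / (present["female"] + absent["female"]))

# Reference cell coding, D = 0 (male), 1 (female): b1 = g(D=1) - g(D=0)
b1_ref = g_female - g_male

# Deviation from means coding, D = -1 (male), 1 (female):
# g(D=1) - g(D=-1) = 2*b1, so b1 is half the log odds ratio.
b1_dev = (g_female - g_male) / 2

odds_ratio = (present["female"] * absent["male"]) / (present["male"] * absent["female"])
print(math.exp(b1_ref), math.exp(2 * b1_dev), odds_ratio)  # all equal 6.0
```

Under reference cell coding, exponentiating β̂1 recovers the cross-product odds ratio; under deviation from means coding one must exponentiate 2β̂1, exactly as derived above.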


where |a − b| is the absolute value of (a − b). Since we can control how we code our dichotomous variables, we recommend that, in most situations, they be coded as 0 or 1 for analysis purposes. Each dichotomous variable is then treated as an interval scale variable.

Summary: For a dichotomous variable the parameter of interest is the odds ratio. An estimate of this parameter may be obtained from the estimated logistic regression coefficient, regardless of how the variable is coded. This relationship between the logistic regression coefficient and the odds ratio provides the foundation for interpretation of all logistic regression results.

CHD status    White   Black        Hispanic     Other        Total
Present       5       20           15           10           50
Absent        20      10           10           10           50
Total         25      30           25           20           100
Odds ratio    1.0     8.0          6.0          4.0
95% CI        -       (2.3,27.6)   (1.7,21.3)   (1.1,14.9)
ln(ÔR)        0.0     2.08         1.79         1.39

Table 4.5: Cross-classification of hypothetical data on RACE and CHD status for 100 subjects.

4.2

Polychotomous Independent Variable

Suppose that instead of two categories the independent variable has k > 2 distinct values. For example, we may have variables that denote the county of residence within a state, the clinic used for primary health care within a city, or race. Each of these variables has a fixed number of discrete values and the scale of measurement is nominal. In this section we present methods for creating design variables for polychotomous independent variables. The choice of a particular method depends to some extent on the goals of the analysis and the stage of model development. We begin by extending the method presented in Table 4.1 for a dichotomous variable. For example, suppose that in a study of CHD the variable RACE is coded at four levels, and that the cross-classification of RACE by CHD yields the data in Table 4.5. These data are hypothetical and have been formulated for ease of computation. The extension to a situation where the variable has more than four levels is not conceptually different, so all examples in this section use k = 4.


RACE (code)    RACE(2)   RACE(3)   RACE(4)
White (1)      0         0         0
Black (2)      1         0         0
Hispanic (3)   0         1         0
Other (4)      0         0         1

Table 4.6: Specification of the design variables for RACE using reference cell coding with white as the reference group.

Variable    Coefficient   Std.Err   z       P > |z|
RACE(2)     2.079         0.6325    3.29    0.001
RACE(3)     1.792         0.6466    2.78    0.006
RACE(4)     1.386         0.6708    2.07    0.039
CONSTANT    -1.386        0.5000    -2.77   0.006
Log likelihood = -62.2937

Table 4.7: Results of fitting the logistic regression model to the data in Table 4.5 using the design variables in Table 4.6.

At the bottom of Table 4.5, the odds ratio is given for each race, using white as the reference group. For example, for Hispanic the estimated odds ratio is

(15 × 20)/(5 × 10) = 6

The reference group is indicated by a value of 1 for the odds ratio. These same estimates of the odds ratio may be obtained from a logistic regression program with an appropriate choice of design variables. The method for specifying the design variables involves setting all of them equal to zero for the reference group, and then setting a single design variable equal to 1 for each of the other groups. This is illustrated in Table 4.6. Use of any logistic regression program with design variables coded as shown in Table 4.6 yields the estimated logistic regression


coefficients given in Table 4.7. The logit for this model is

ĝ(x) = β̂0 + Σ(j=1 to 3) β̂j Dj = β̂0 + β̂1 RACE(2) + β̂2 RACE(3) + β̂3 RACE(4)

The log odds ratio for Black versus White is the logit difference

ln[ÔR(Black, White)] = ĝ(Black) − ĝ(White)
= [β̂0 + β̂1 × 1 + β̂2 × 0 + β̂3 × 0] − [β̂0 + β̂1 × 0 + β̂2 × 0 + β̂3 × 0]
= β̂1

and similarly

ln[ÔR(Hispanic, White)] = ĝ(Hispanic) − ĝ(White)
= [β̂0 + β̂1 × 0 + β̂2 × 1 + β̂3 × 0] − [β̂0 + β̂1 × 0 + β̂2 × 0 + β̂3 × 0]
= β̂2

These agree with the cross-product ratios computed from Table 4.5:

ÔR(Black, White) = (20 × 20)/(5 × 10) = 8,      β̂1 = ln(8) = 2.079
ÔR(Hispanic, White) = (15 × 20)/(5 × 10) = 6,   β̂2 = ln(6) = 1.792


RACE (code)    RACE(2)   RACE(3)   RACE(4)
White (1)      -1        -1        -1
Black (2)      1         0         0
Hispanic (3)   0         1         0
Other (4)      0         0         1

Table 4.8: Specification of the design variables for RACE using deviation from means coding.

ln[ÔR(Other, White)] = ĝ(Other) − ĝ(White) = β̂3

ÔR(Other, White) = (10 × 20)/(5 × 10) = 4,      β̂3 = ln(4) = 1.386

The estimated variance of β̂1 is the sum of the reciprocals of the four cell counts entering its cross-product ratio,

vâr(β̂1) = 1/5 + 1/20 + 1/20 + 1/10 = 0.4

SE(β̂1) = [vâr(β̂1)]^(1/2) = 0.6325

We begin by computing the confidence limits for the log odds ratio (the logistic regression coefficient) and then exponentiate these limits to obtain limits for the odds ratio. In general, the limits for a 100(1 − α)% CI for a coefficient are of the form

β̂j ± z(1−α/2) × SE(β̂j)
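The estimates and confidence intervals reported at the bottom of Table 4.5 can be reproduced from the cell counts alone. A small Python sketch (illustrative only; the original analysis used MINITAB and R):

```python
import math

# Table 4.5 counts: CHD present/absent by race (white is the reference group)
present = {"white": 5, "black": 20, "hispanic": 15, "other": 10}
absent  = {"white": 20, "black": 10, "hispanic": 10, "other": 10}

z = 1.96
results = {}
for race in ("black", "hispanic", "other"):
    # cross-product odds ratio of this race versus the white reference cell
    or_hat = (present[race] * absent["white"]) / (present["white"] * absent[race])
    beta = math.log(or_hat)                     # estimated coefficient
    se = math.sqrt(1 / present[race] + 1 / absent[race]
                   + 1 / present["white"] + 1 / absent["white"])
    ci = (round(math.exp(beta - z * se), 1), round(math.exp(beta + z * se), 1))
    results[race] = (or_hat, round(beta, 3), ci)

print(results)
# black: (8.0, 2.079, (2.3, 27.6)); hispanic: (6.0, 1.792, (1.7, 21.3));
# other: (4.0, 1.386, (1.1, 14.9)) -- matching Table 4.5
```

Each comparison against the reference cell behaves like its own 2 × 2 table, which is why the coefficients in Table 4.7 coincide with the log cross-product ratios.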

4.3

Deviation From Means Coding Method

The second method of coding design variables is called deviation from means coding. This coding expresses the effect of each category as the deviation of the "group mean" from the "overall mean". The estimated coefficients obtained using deviation from means coding may be used to estimate the odds ratio for one category relative to a reference category. The equation for the estimate is more complicated than the one obtained using reference cell coding, but it provides an excellent example of the basic principle of using the logit difference to compute an odds ratio. To illustrate this we calculate the log odds ratio of Black versus White using the coding for the design variables given in Table 4.8. The logit


difference is as follows.

ĝ(x) = β̂0 + Σ(j=1 to 3) β̂j Dj = β̂0 + β̂1 RACE(2) + β̂2 RACE(3) + β̂3 RACE(4)

ln[ÔR(Black, White)] = ĝ(Black) − ĝ(White)
= [β̂0 + β̂1 × 1 + β̂2 × 0 + β̂3 × 0] − [β̂0 + β̂1 × (−1) + β̂2 × (−1) + β̂3 × (−1)]
= 2β̂1 + β̂2 + β̂3

ln[ÔR(Black, White)] = 2β̂1 + β̂2 + β̂3

(4.9)

To obtain a confidence interval we must estimate the variance of the sum of the coefficients in (4.9). In this example, the estimator is

vâr{ln[ÔR(Black, White)]} = 4 × vâr(β̂1) + vâr(β̂2) + vâr(β̂3) + 4côv(β̂1, β̂2) + 4côv(β̂1, β̂3) + 2côv(β̂2, β̂3)

The evaluation of (4.9) for the current example gives

ln[ÔR(Black, White)] = 2(0.765) + 0.477 + 0.072 = 2.079

The estimate of the variance is obtained by evaluating the expression above which, for the current example, yields

vâr{ln[ÔR(Black, White)]} = 4(0.351)² + (0.362)² + (0.385)² + 4(−0.031) + 4(−0.040) + 2(−0.0444) = 0.40

and the standard error is

SE{ln[ÔR(Black, White)]} = 0.6325

We note that the values of the estimated log odds ratio, 2.079, and the estimated standard error, 0.6325, are identical to the values of the estimated coefficient and standard error for the first design variable in Table 4.7. This is expected, since the design variables used to obtain the estimated coefficients in Table 4.7 were formulated specifically to yield the log odds ratio relative to the white race category.


Variable    Coefficient   Std.Err   z       P > |z|
RACE(2)     0.765         0.3506    2.18    0.029
RACE(3)     0.477         0.3623    1.32    0.188
RACE(4)     0.072         0.3846    0.19    0.852
CONSTANT    -0.072        0.2189    -0.33   0.742
Log likelihood = -62.2937

Table 4.9: Results of fitting the logistic regression model to the data in Table 4.5 using the design variables in Table 4.8.
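To see where the deviation from means coefficients in Table 4.9 come from, the following Python sketch (an illustrative computation, not the author's code) recovers them directly from the cell counts of Table 4.5: for this saturated model the constant is the mean of the four group logits, each coefficient is a group's deviation from that mean, and the Black versus White log odds ratio is then 2β̂1 + β̂2 + β̂3.

```python
import math

# Observed CHD logits (log odds) by race, from the counts in Table 4.5
logits = {
    "white": math.log(5 / 20),
    "black": math.log(20 / 10),
    "hispanic": math.log(15 / 10),
    "other": math.log(10 / 10),
}

# Under deviation from means coding, the constant of the saturated model is
# the mean of the group logits and each coefficient is a deviation from it.
b0 = sum(logits.values()) / len(logits)
b1 = logits["black"] - b0      # coefficient for RACE(2)
b2 = logits["hispanic"] - b0   # coefficient for RACE(3)
b3 = logits["other"] - b0      # coefficient for RACE(4)

log_or = 2 * b1 + b2 + b3      # Black vs White log odds ratio, equation (4.9)
print(round(b0, 3), round(b1, 3), round(b2, 3), round(b3, 3))  # Table 4.9 values
print(round(log_or, 3), round(math.exp(log_or), 1))            # 2.079 and 8.0
```

The combination 2β̂1 + β̂2 + β̂3 collapses algebraically to logit(Black) − logit(White), so the odds ratio of 8 from Table 4.5 is recovered exactly.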

4.4

Continuous Independent Variable

When a logistic regression model contains a continuous independent variable, interpretation of the estimated coefficient depends on how the variable is entered into the model and on its particular units. For the purpose of developing the method to interpret the coefficient for a continuous variable, we assume that the logit is linear in the variable. Under this assumption, the equation for the logit is g(x) = β0 + β1x. It follows that the slope coefficient, β1, gives the change in the log odds for an increase of 1 unit in x, that is,

g(x) = β0 + β1x
g(x + 1) = β0 + β1(x + 1)
g(x + 1) − g(x) = β1

Most often a change of 1 unit is not clinically interesting. For example, a 1 year increase in age or a 1 mmHg increase in systolic blood pressure may be too small to be considered important, while a change of 10 years or 10 mmHg might be considered more useful. On the other hand, if the range of x is from zero to 1, then a change of 1 is too large and a change of 0.01 may be more realistic. Hence, to provide a useful interpretation for continuous scale covariates we need to develop a method for point and interval estimation for an arbitrary change of c units in the covariate. The log odds ratio for a change of c units in x is obtained from the logit difference,

g(x) = β0 + β1x
g(x + c) = β0 + β1(x + c)
g(x + c) − g(x) = c × β1

and the associated odds ratio is obtained by exponentiating this logit difference,

OR(c) = OR(x + c, x) = exp(cβ1)


An estimate may be obtained by replacing β1 with its maximum likelihood estimate, β̂1. An estimate of the standard error needed for confidence interval estimation is obtained by multiplying the estimated standard error of β̂1 by c. Hence the end points of the 100(1 − α)% confidence interval estimate of OR(c) are

exp[c × β̂j ± z(1−α/2) × c × SE(β̂j)],    j = 1, ..., p

Since both the point estimate and the end points of the confidence interval depend on the choice of c, the particular value of c should be clearly specified in all tables and calculations. The rather arbitrary nature of the choice of c may be troubling to some. To provide the reader of an analysis with a clear indication of how the risk of the outcome changes with the variable in question, changes in multiples of 5 or 10 may be most meaningful and easily understood. As an example, consider the univariable model in Table 1.3, in which a logistic regression of CHD status on AGE using the data of Table 2.1 was reported. The resulting estimated logit was

ĝ(AGE) = −5.310 + 0.111 × AGE

(4.10)

The estimated odds ratio for an increase of 10 years in age is

ÔR(10) = exp(10 × 0.111) = 3.03

This indicates that for every increase of 10 years in age, the odds of CHD are estimated to be 3.03 times larger. The validity of such a statement is questionable in this example, since the additional risk of CHD for a 40 year old compared to a 30 year old may be quite different from the additional risk of CHD for a 60 year old compared to a 50 year old. This is an unavoidable dilemma when continuous covariates are modeled linearly in the logit. If it is believed that the logit is not linear in the covariate, then grouping and the use of dummy variables should be considered. The end points of a 95% confidence interval for this odds ratio are

exp(10 × 0.111 ± 1.96 × 10 × 0.024) = (1.90, 4.86)

Results similar to these may be placed in tables displaying the results of a fitted logistic regression model. In summary, the interpretation of the estimated coefficient for a continuous variable is similar to that of a nominal scale variable: an estimated log odds ratio. The primary difference is that a meaningful change must be defined for the continuous variable.
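The point and interval estimates of OR(c) are easy to script. A Python sketch (illustrative only) using the AGE coefficient β̂1 = 0.111 and SE(β̂1) = 0.024 from the text:

```python
import math

def or_for_change(beta, se, c, z=1.96):
    """Point and interval estimate of OR(c) = exp(c*beta) for a c-unit change."""
    point = math.exp(c * beta)
    lo = math.exp(c * beta - z * c * se)
    hi = math.exp(c * beta + z * c * se)
    return point, lo, hi

# Coefficient and standard error for AGE from the fitted model in the text
point, lo, hi = or_for_change(beta=0.111, se=0.024, c=10)
print(round(point, 2), round(lo, 2), round(hi, 2))  # approximately 3.03, 1.90, 4.86
```

Changing c rescales both the point estimate and the interval, which is why c must always be reported alongside the odds ratio.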


4.5

The Multivariable Model

In the previous sections we discussed the interpretation of an estimated logistic regression coefficient in the case where there is a single variable in the fitted model. We now consider a multivariable analysis for a more comprehensive modeling of the data. One goal of such an analysis is to statistically adjust the estimated effect of each variable in the model for differences in the distributions of, and associations among, the other independent variables. Applying this concept to a multivariable logistic regression model, we may surmise that each estimated coefficient provides an estimate of the log odds ratio adjusting for all other variables included in the model. A full understanding of the estimates of the coefficients from a multivariable logistic regression model requires that we have a clear understanding of what is actually meant by the term adjusting, statistically, for other variables. We begin by examining adjustment in the context of a linear regression model, and then extend the concept to logistic regression. The multivariable situation we examine is one in which the model contains two independent variables:

(1) one dichotomous
(2) one continuous

but primary interest is focused on the effect of the dichotomous variable. This situation is frequently encountered in epidemiologic research when an exposure to a risk factor is recorded as being either present or absent, and we wish to adjust for a variable such as age. The analogous situation in linear regression is called analysis of covariance. For example, suppose we wish to compare the mean weight of two groups of boys. It is known that weight is associated with many characteristics, one of which is age. Assume that on all characteristics except age the two groups have nearly identical distributions. If the age distribution is also the same for the two groups, then a univariate analysis would suffice and we could compare the mean weight of the two groups directly.
This comparison would provide us with a correct estimate of the difference in weight between the two groups. The statistical model that describes the situation in Figure 4.1 states that the value of weight, w, may be expressed as

w = β0 + β1x + β2a    (4.11)

where x = 0 for group 1, x = 1 for group 2, and a denotes age. In this model the parameter β1 represents the true difference in weight between the two groups and β2 represents the rate of change in weight per year of age. Suppose that the



Figure 4.1: Comparison of the weight of two groups of boys with different distributions of age.

Variable   Group 1 Mean   Group 1 Std.Dev   Group 2 Mean   Group 2 Std.Dev
PHY        0.36           0.485             0.80           0.404
AGE        39.60          5.272             47.34          5.259

Table 4.10: Descriptive statistics for two groups of 50 men on AGE and on whether they had seen a physician (PHY) (1 = Yes, 0 = No) within the last months.

mean age of group 1 is ā1 and the mean age of group 2 is ā2. This situation is described graphically in Figure 4.1, where it is assumed that the relationship between age and weight is linear, with the same significant nonzero slope in each group. Comparison of the mean weight of group 1 to the mean weight of group 2 amounts to a comparison of w1 to w2. In terms of the model this difference is

w2 = β0 + β1 + β2ā2
w1 = β0 + β2ā1
w2 − w1 = β1 + β2(ā2 − ā1)

Thus the comparison involves not only the true difference between the groups, β1, but also a component, β2(ā2 − ā1), which reflects the difference between the ages of the groups. The process of statistically adjusting for age involves comparing the two groups at some common value of age. The value usually used is the mean of the combined groups which, for the example, is denoted by ā in Figure 4.1. In terms of the model this yields a comparison of


w4 to w3,

w4 = β0 + β1 + β2ā
w3 = β0 + β2ā
w4 − w3 = β1 × 1 + β2(ā − ā) = β1

which is the true difference between the two groups. In theory, any common value of age would yield the same difference between the two lines. The choice of the overall mean makes sense for two reasons: it is biologically reasonable, and it lies within the range for which we believe the association between age and weight is linear and constant within each group.

Consider the same situation shown in Figure 4.1, but instead of weight being the dependent variable, assume the outcome is dichotomous and that the vertical axis denotes the logit. That is, under the model the logit is given by the equation

g(x, a) = β0 + β1x + β2a

A univariate comparison obtained from the 2 × 2 table cross-classifying outcome and group would yield a log odds ratio approximately equal to β1 + β2(ā2 − ā1). This would incorrectly estimate the effect of group due to the difference in the distribution of age. To adjust for age we compare the logits at the common value ā; this logit difference is g(x = 1, ā) − g(x = 0, ā) = β1. Thus, the coefficient β1 is the log odds ratio that we would expect to obtain from a univariate comparison if the two groups had the same distribution of age.

The univariate log odds ratio for group 2 versus group 1 computed from Table 4.10 is

ln(ÔR) = ln(0.8/0.2) − ln(0.36/0.64) = 1.962
ÔR = 7.11

We can also see that there is a considerable difference in the age distributions of the two groups, the men in group 2 being on average more than 7 years older than those in group 1. We would guess that much of the apparent difference in the proportion of men seeing a physician might be due to age. Analyzing the data with a bivariate model, using a coding of GROUP = 0 for group 1 and GROUP = 1 for group 2, yields the estimated logistic regression shown in Table 4.11. The age-adjusted odds ratio is ÔR = exp(1.263) = 3.54. Thus, much of the apparent difference between the two groups is, in fact, due to differences in age. Let us examine this adjustment in more detail using Figure 4.1. An approximation to the


Variable    Coefficient   Std.Err   z       P > |z|
GROUP       1.263         0.5361    2.36    0.018
AGE         0.107         0.0465    2.31    0.021
CONSTANT    -4.866        1.9020    -2.56   0.011
Log likelihood = -54.8292

Table 4.11: Results of fitting the logistic regression model to the data summarized in Table 4.10.

unadjusted odds ratio is obtained by exponentiating the difference w2 − w1. In terms of the fitted logistic regression model shown in Table 4.11, this difference is

w2 − w1 = (β̂0 + β̂1 + β̂2ā2) − (β̂0 + β̂2ā1) = β̂1 + β̂2(ā2 − ā1)
= [−4.866 + 1.263 + 0.107(47.34)] − [−4.866 + 0.107(39.60)]
= 1.263 + 0.107(47.34 − 39.60)

The value of this approximate odds ratio is

e^(1.263 + 0.107(47.34 − 39.60)) = 8.09

The discrepancy between 8.09 and the actual unadjusted odds ratio, 7.11, arises because the calculation of 8.09 is based on the difference in the average logit, while the crude odds ratio is approximately equal to a calculation based on the average estimated logistic probability for the two groups. The age-adjusted odds ratio is obtained by exponentiating the difference w4 − w3, which is equal to the estimated coefficient for GROUP. In this example the difference is

w4 − w3 = (β̂0 + β̂1 + β̂2ā) − (β̂0 + β̂2ā) = β̂1
= [−4.866 + 1.263 + 0.107(43.47)] − [−4.866 + 0.107(43.47)]
= 1.263

Bachand and Hosmer (1999) compare two different sets of criteria for defining a covariate to be a confounder. They show that the numerical approach used in this section, examining the change in the magnitude of the coefficient for the risk factor from logistic regression models fit with and without the potential confounder, requires that the model relating the outcome to both risk factor and confounder be correctly specified. The method of adjustment when the variables are all dichotomous, polychotomous, continuous, or a mixture of these is identical to that just described for the dichotomous-continuous variable case. For example, suppose that instead of treating age as continuous it was dichotomized using a cutpoint of 45 years. To obtain the age-adjusted effect of group we fit the bivariate model containing the two dichotomous variables and calculate a logit difference at the two


levels of group and a common value of the dichotomous variable for age. This procedure is similar for any number and mix of variables. Adjusted odds ratios are obtained by comparing individuals who differ only in the characteristic of interest, holding the values of all other variables constant.
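The unadjusted and age-adjusted comparisons above can be reproduced from the coefficients in Table 4.11. A Python sketch (illustrative only):

```python
import math

# Fitted coefficients from Table 4.11 (GROUP, AGE model)
b0, b_group, b_age = -4.866, 1.263, 0.107
mean_age = {1: 39.60, 2: 47.34}
a_bar = (mean_age[1] + mean_age[2]) / 2          # common age 43.47

# Approximate unadjusted log odds ratio: compare logits at each group's own mean age
unadj = (b0 + b_group + b_age * mean_age[2]) - (b0 + b_age * mean_age[1])

# Age-adjusted log odds ratio: compare logits at the common age, which
# cancels the age terms and leaves the GROUP coefficient
adj = (b0 + b_group + b_age * a_bar) - (b0 + b_age * a_bar)

print(round(math.exp(unadj), 2))  # approximately 8.09
print(round(math.exp(adj), 2))    # approximately 3.54
```

The age terms cancel only when both logits are evaluated at the same age, which is exactly the sense in which exp(β̂1) is an adjusted odds ratio.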

4.6

Interaction And Confounding

In the last section we saw how the inclusion of additional variables in a model provides a way of statistically adjusting for potential differences in their distributions. The term confounder is used by epidemiologists to describe a covariate that is associated with both the outcome variable of interest and a primary independent variable or risk factor. When both associations are present, the relationship between the risk factor and the outcome variable is said to be confounded. In this section we introduce the concept of interaction and show how we can control for its effect in the logistic regression model. In addition, we illustrate with an example how confounding and interaction may affect the estimated coefficients in the model.

If the association between the covariate (i.e., age) and the outcome variable is the same within each level of the risk factor (i.e., group), then there is no interaction between the covariate and the risk factor. Graphically, the absence of interaction yields a model with two parallel lines, one for each level of the risk factor variable. In general, the absence of interaction is characterized by a model that contains no second or higher order terms involving two or more variables. When interaction is present, the association between the risk factor and the outcome variable differs, or depends in some way, on the level of the covariate. That is, the covariate modifies the effect of the risk factor. Epidemiologists use the term effect modifier to describe a variable that interacts with a risk factor. If, in the previous example, the logit is linear in age for the men in group 1, then interaction implies that the logit for group 2 does not follow a line with the same slope. In theory, the association in group 2 could be described by almost any model except one with the same slope as the logit for group 1.
Figure 4.2 presents the graphs of three different logits. In this figure, 4 has been added to each of the logits to make plotting more convenient. The graphs of these logits are used to explain what is meant by interaction. Consider an example where the outcome variable is the presence or absence of CHD, the risk factor is sex, and the covariate is age. Suppose that the line labeled l1 corresponds to the logit for females as a function of age, and line l2 represents the logit for males.



Figure 4.2: Plot of the logits under three different models showing the presence and absence of interaction.

Model   Constant   SEX     AGE     SEX × AGE   Deviance   G
1       0.060      1.981   -       -           419.816    -
2       -3.3374    1.356   0.082   -           407.78     12.036
3       -4.216     4.239   0.103   -0.062      406.392    1.388

Table 4.12: Estimated logistic regression coefficients, deviance, and the likelihood ratio test statistic (G) for an example showing evidence of confounding but no interaction (n = 400).

These two lines are parallel to each other, indicating that the relationship between age and CHD is the same for males and females. In this situation there is no interaction, and the log odds ratio for sex (male versus female), controlling for age, is given by the difference between lines l2 and l1, l2 − l1. This difference is equal to the vertical distance between the two lines, which is the same for all ages.

Suppose instead that the logit for males is given by the line l3. This line is steeper than the line l1 for females, indicating that the relationship between age and CHD among males is different from that among females. When this occurs we say there is an interaction between age and sex. The estimate of the log odds ratio for sex (male versus female) controlling for age is still given by the vertical distance between the lines, l3 − l1, but this difference now depends on the age at which the comparison is made. Thus, we cannot estimate the odds ratio for sex without first specifying the age at which the comparison is being made. In other words, age is an effect modifier.


Model   Constant   SEX      AGE     SEX × AGE   Deviance   G
1       0.201      2.386    -       -           376.712    -
2       -6.672     1.2774   0.166   -           338.688    38.024
3       -4.825     -7.838   0.121   0.205       330.654    8.034

Table 4.13: Estimated logistic regression coefficients, deviance, and the likelihood ratio test statistic (G) for an example showing evidence of both confounding and interaction (n = 400).

Results of Tables 4.12 and 4.13

Tables 4.12 and 4.13 present the results of fitting a series of logistic regression models to two different sets of hypothetical data. The variables in each of the data sets are the same: SEX, AGE, and the outcome variable CHD. In addition to the estimated coefficients, the deviance for each model is given. Recall that the change in the deviance, G, may be used to assess the significance of coefficients for variables added to the model. An interaction is added to the model by creating a variable equal to the product of the value of SEX and the value of AGE. Some programs have syntax that automatically creates interaction variables in a statistical model, while others require the user to create them through a data modification step.

Examining the results in Table 4.12, we see that the estimated coefficient for the variable SEX changed from 1.981 in model 1 to 1.356, a 46 percent decrease, when AGE was added in model 2. Hence, there is clear evidence of a confounding effect due to AGE. When the interaction term SEX × AGE is added in model 3, we see that the change in the deviance is only 1.388 which, when compared to the chi-square distribution with 1 degree of freedom, yields a p-value of 0.24, clearly not significant. Note that the coefficient for SEX changed from 1.356 to 4.239. This is not surprising, since the inclusion of an interaction term, especially one involving a continuous variable, usually produces fairly marked changes in the estimated coefficients of the dichotomous variables involved in the interaction. Thus, when an interaction term is present in the model we cannot assess confounding via the change in a coefficient. For these data we would prefer model 2, which suggests that AGE is a confounder but not an effect modifier.

The results in Table 4.13 show evidence of both confounding and interaction due to AGE. Comparing model 1 and model 2, we see that the coefficient for SEX changes from 2.386 to 1.277, an 87 percent decrease.
When the age by sex interaction is added to the model we see that the change in the deviance is 8.034 with a p-value of 0.005. Since the change in the deviance is significant, we prefer model 3 to model 2 and should regard AGE as both a confounder and an effect modifier. The net result is that any estimate of the odds ratio for SEX should be made with reference to a specific age.
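The p-values quoted for G can be checked without a statistics package: for 1 degree of freedom, the upper tail of the chi-square distribution is erfc(√(x/2)). A Python sketch:

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 df.
    For 1 df, P(X > x) = erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(x / 2))

# Likelihood ratio statistics G from Tables 4.12 and 4.13
print(round(chi2_sf_1df(1.388), 2))   # ~0.24: interaction not significant
print(round(chi2_sf_1df(8.034), 3))   # ~0.005: interaction significant
```

The identity holds because a chi-square variable with 1 df is the square of a standard normal, so its upper tail is twice the normal tail beyond √x.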


4.7

Estimation Of Odds Ratios In The Presence Of Interaction

In the previous section we showed that when there is interaction between a risk factor and another variable, the estimate of the odds ratio for the risk factor depends on the value of the variable that is interacting with it. In this situation we may not be able to estimate the odds ratio by simply exponentiating an estimated coefficient. One approach that always yields the correct model-based estimate is to:

(1) Write down the expressions for the logit at the two levels of the risk factor being compared.
(2) Algebraically simplify the difference between the two logits and compute its value.
(3) Exponentiate the value obtained in step 2.

As a first example, we develop the method for a model containing only two variables and their interaction. In this model, denote the risk factor by F, the covariate by X, and their interaction by F × X. The logit at F = f and X = x is

g(f, x) = β0 + β1f + β2x + β3f × x    (4.12)

Assume we want the odds ratio comparing two levels of F, F = f1 versus F = f0, at X = x. Following the three-step procedure, first we evaluate the expressions for the two logits, yielding

g(f1, x) = β0 + β1f1 + β2x + β3f1 × x
g(f0, x) = β0 + β1f0 + β2x + β3f0 × x

Second, we compute and simplify their difference to obtain the log odds ratio,

ln[OR(F = f1, F = f0, X = x)] = g(f1, x) − g(f0, x)
= (β0 + β1f1 + β2x + β3f1 × x) − (β0 + β1f0 + β2x + β3f0 × x)
= β1(f1 − f0) + β3 × x(f1 − f0)

Third, we obtain the odds ratio by exponentiating the difference obtained in step 2, yielding

OR = exp[β1(f1 − f0) + β3 × x(f1 − f0)]

(4.13)

The expression for the log odds ratio in (4.13) does not simplify to a single coefficient. Instead it involves two coefficients, the difference in the values of the risk factor, and the value of the interaction variable. The estimator of the log odds ratio is obtained by replacing the parameters in (4.13) with their estimators. To obtain a confidence interval, we calculate the end points of the confidence interval for

69

the log-0dds ratio and then exponentiate the end points is the estimater of the variance of the estimater of the log odds ratio in 4.13. Using methods for calculating the variance of a sum we obtain the following estimater. # " ˆ = (f1 − f0 )2 × vˆar(βˆ1 ) (4.14) vˆar ln[OR(F = f1 , F = f0 , X = x)] +[x(f1 − f0 )]2 var(βˆ3 ) 2

+2x(f1 − f0 ) × cov(β1 , β3 )

(4.15) (4.16)

The end points of a 100(1 − α)% confidence interval estimator for the log-odds ratio are

[β̂1(f1 − f0) + β̂3 x(f1 − f0)] ± z(1−α/2) ŜE{ln[ÔR(F = f1, F = f0, X = x)]}, (4.17)

where the standard error in (4.17) is the positive square root of the variance estimator in (4.14). We obtain the end points of the confidence interval estimator for the odds ratio by exponentiating the end points in (4.17).

The estimators of the log-odds ratio and its variance simplify when F is a dichotomous risk factor. If we let f1 = 1 and f0 = 0, then the estimator of the log-odds ratio is

ln[ÔR(F = 1, F = 0, X = x)] = β̂1 + β̂3 x, (4.18)

the estimator of the variance is

v̂ar{ln[ÔR(F = 1, F = 0, X = x)]} = v̂ar(β̂1) + x² v̂ar(β̂3) + 2x côv(β̂1, β̂3), (4.19)

and the end points of the confidence interval are

(β̂1 + β̂3 x) ± z(1−α/2) ŜE{ln[ÔR(F = 1, F = 0, X = x)]}. (4.20)

Example: We consider a logistic regression model using the low birth weight data described in Section 1.6, containing the variable AGE and a dichotomous variable, LWD, based on the weight of the mother at the last menstrual period. This variable takes on the value 1 if LWT < 110 pounds and is zero otherwise. The results of fitting a series of logistic regression models are given in Table 4.14. Using the estimated coefficient for LWD in model 1, we estimate the odds ratio as exp(1.054) = 2.87. The results shown in Table 4.14 indicate that AGE is not a strong confounder (Δβ̂% = 4.2), but it does interact with LWD (P = 0.076). Thus, to assess the risk of low weight at the last menstrual period correctly, we must include the interaction of this variable with the woman's age, because the odds ratio is not constant over age. An effective way to see the presence of interaction is via a graph of the estimated logit under model 3 in Table 4.14. This is shown in Figure 4.3.


Model   Constant   LWD      AGE      LWD×AGE   ln L(β̂)    G      P
0       -0.790                                 -117.34
1       -1.054      1.054                      -113.12    8.44   0.004
2       -0.027      1.010   -0.044             -112.14    1.96   0.160
3        0.774     -1.944   -0.080    0.132    -110.57    3.14   0.076

Table 4.14: Estimated logistic regression coefficients, log-likelihood, the likelihood ratio test statistic (G), and the P-value for the change, for models containing LWD and AGE from the low birth weight data (n = 189).

            Constant      LWD           AGE           LWD×AGE
Constant    0.828
LWD        -0.828         2.975
AGE        -0.352E-02    -0.353E-01     0.157E-02
LWD×AGE    -0.352E-01    -0.128        -0.157E-02     0.573E-02

Table 4.15: Estimated covariance matrix for the estimated parameters in model 3 of Table 4.14.

The upper line in Figure 4.3 corresponds to the estimated logit for women with LWD = 1 and the lower line is for women with LWD = 0. Separate plotting symbols have been used for the two LWD groups. The estimated log-odds ratio for LWD = 1 versus LWD = 0 at AGE = x from (4.18) is equal to the vertical distance between the two lines at AGE = x. Note in Figure 4.3 that none of the women in the low weight group, LWD = 1, are older than about 33 years. Thus we should restrict our estimates of the effect of low weight to the range of 14 to 33 years. Based on these observations we estimate the effect of low weight at 15, 20, 25 and 30 years of age. Using (4.18) and the results for model 3, the estimated log-odds ratio for low weight at the last menstrual period for a woman of AGE = a is

ln[ÔR(LWD = 1, LWD = 0, AGE = a)] = −1.944 + 0.132a. (4.21)

In order to obtain the estimated variance we must first obtain the estimated covariance matrix. Since the covariance matrix is symmetric, most logistic regression software packages print the results in a form similar to that shown in Table 4.15. The estimated variance of the log-odds ratio given in (4.21) is obtained from (4.19) and is

v̂ar{ln[ÔR(LWD = 1, LWD = 0, AGE = a)]} = 2.975 + a² × 0.0057 + 2a × (−0.128). (4.22)


Age       15            20            25            30
ÔR        1.04          2.01          3.90          7.55
95% CI    0.29, 3.79    0.91, 4.44    1.71, 8.88    1.95, 29.19

Table 4.16: Estimated odds ratios and 95 percent confidence intervals for LWD, controlling for AGE.

Values of the estimated odds ratio and 95% confidence interval (CI), computed from (4.21) and (4.22) for several ages, are given in Table 4.16. The results in Table 4.16 demonstrate that the effect of LWD on the odds of having a low birth weight baby increases exponentially with age. The results also show that the increase in risk is significant for low weight women 25 years and older. In particular, low weight women of age 30 are estimated to have a risk about 7.5 times that of women of the same age who are not low weight, and with 95% confidence the increase in risk could be as little as 2 times or as much as 29 times.
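The entries of Table 4.16 can be reproduced from model 3 of Table 4.14 together with the covariance matrix of Table 4.15, using (4.18)–(4.20). The Python sketch below is only an illustration (the thesis used MINITAB and R); small discrepancies from the published limits are expected because the coefficients and covariances are rounded to the digits shown in the tables:

```python
import math

# Model 3 of Table 4.14: coefficients of LWD and of LWD x AGE.
b1, b3 = -1.944, 0.132
# From Table 4.15: var(b1), var(b3) and cov(b1, b3).
v1, v3, c13 = 2.975, 0.573e-2, -0.128
z = 1.96  # normal quantile for a 95% confidence interval

for age in (15, 20, 25, 30):
    log_or = b1 + b3 * age                    # (4.18)
    var = v1 + age**2 * v3 + 2 * age * c13    # (4.19)
    se = math.sqrt(var)
    lo = math.exp(log_or - z * se)            # (4.20), exponentiated
    hi = math.exp(log_or + z * se)
    print(age, round(math.exp(log_or), 2), round(lo, 2), round(hi, 2))
```

The point estimates agree with Table 4.16 to within rounding; the interval at age 30 shows the same order-of-magnitude spread (roughly 2 to 29) reported in the text.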


Chapter 5
Model-Building Strategies and Methods for Logistic Regression

In the previous chapters we focused on estimating, testing, and interpreting the coefficients in a logistic regression model. The examples discussed were characterized by having few independent variables, and there was perceived to be only one possible model. While there may be situations where this is the case, it is more typical that there are many independent variables that could potentially be included in the model. Hence, we need to develop a strategy and associated methods for handling these more complex situations. The goal of any method is to select those variables that result in a "best" model within the scientific context of the problem. In order to achieve this goal we must have:

(1) A basic plan for selecting the variables for the model.

(2) A set of methods for assessing the adequacy of the model, both in terms of its individual variables and its overall fit.

We suggest a general strategy that considers both of these areas. Successful modeling of a complex data set is part science, part statistical methods, and part experience and common sense. It is our goal to provide the reader with a paradigm that, when applied thoughtfully, yields the best possible model within the constraints of the available data.

5.1 Variable selection

The criteria for including a variable in a model may vary from one problem to the next and from one scientific discipline to another. The traditional approach to statistical model building involves seeking the most parsimonious model that still explains the data. The rationale for minimizing the number of variables in the model is that the resultant model is more likely to be numerically stable and is more easily generalized. The more variables included in a model, the greater the estimated standard errors become, and the more dependent the model becomes on the observed data. Epidemiologic methodologists suggest


including all clinically and intuitively relevant variables in the model, regardless of their "statistical significance". The rationale for this approach is to provide as complete a control of confounding as possible within the given data set. This is based on the fact that it is possible for individual variables not to exhibit strong confounding but, when taken collectively, considerable confounding can be present in the data. The major problem with this approach is that the model may be "overfit", producing numerically unstable estimates. Overfitting is typically characterized by unrealistically large estimated coefficients and/or estimated standard errors. This may be especially troublesome in problems where the number of variables in the model is large relative to the number of subjects and/or when the overall proportion responding (y = 1) is close to either 0 or 1.

There are several steps one can follow to aid in the selection of variables for a logistic regression model. The process of model building is quite similar to the one used in linear regression.

(1) The selection process should begin with a careful univariable analysis of each variable. For nominal, ordinal, and continuous variables with few integer values, we suggest this be done with a contingency table of outcome (y = 0, 1) versus the k levels of the independent variable. The likelihood ratio chi-square test with k − 1 degrees of freedom is exactly equal to the value of the likelihood ratio test for the significance of the coefficients for the k − 1 design variables in a univariable logistic regression model that contains that single independent variable. Since the Pearson chi-square test is asymptotically equivalent to the likelihood ratio chi-square test, it may also be used. In addition to the overall test, for those variables exhibiting at least a moderate level of association, it is a good idea to estimate the individual odds ratios (along with confidence limits) using one of the levels as the reference group.
For continuous variables, the most desirable univariable analysis involves fitting a univariable logistic regression model to obtain the estimated coefficient, the estimated standard error, the likelihood ratio test for the significance of the coefficient, and the univariable Wald statistic. An alternative analysis, which is equivalent at the univariable level, may be based on the two-sample t-test. Descriptive statistics available from a two-sample t-test analysis generally include the group means, standard deviations, the t-statistic, and its p-value. The similarity of this approach to the logistic regression analysis follows from the fact that the univariable linear discriminant function estimate of the logistic regression coefficient is

(x̄1 − x̄0)/s²p = (t/sp) √(1/n1 + 1/n0), (5.1)

and that the linear discriminant function and the maximum likelihood estimate of the logistic regression coefficient are usually quite close when the independent variable is approximately normally distributed within each of the outcome groups, y = 0, 1.
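The identity in (5.1) is easy to verify numerically. The following Python sketch (illustrative data only; the thesis used MINITAB and R) computes the pooled standard deviation and the two-sample t statistic and checks that (x̄1 − x̄0)/s²p equals (t/sp)√(1/n1 + 1/n0):

```python
import math
from statistics import mean, variance

# Illustrative covariate values in the two outcome groups y = 0 and y = 1.
x0 = [2.1, 3.4, 2.9, 3.8, 2.5]
x1 = [4.0, 4.6, 3.9, 5.1]
n0, n1 = len(x0), len(x1)

# Pooled variance and the two-sample t statistic.
sp2 = ((n0 - 1) * variance(x0) + (n1 - 1) * variance(x1)) / (n0 + n1 - 2)
sp = math.sqrt(sp2)
t = (mean(x1) - mean(x0)) / (sp * math.sqrt(1 / n1 + 1 / n0))

# Both sides of (5.1): the linear discriminant estimate of the slope.
lhs = (mean(x1) - mean(x0)) / sp2
rhs = (t / sp) * math.sqrt(1 / n1 + 1 / n0)
print(lhs, rhs)
```

The two sides agree by construction; the point of (5.1) is that the familiar t-test output already contains the discriminant-function estimate of the logistic slope.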


Thus the univariable analysis based on the t-test should be useful in determining whether the variable should be included in the model, since the p-value should be of the same order of magnitude as that of the Wald statistic or likelihood ratio test from logistic regression.

For continuous covariates, we may wish to supplement the evaluation of the univariable logistic fit with some sort of smoothed scatterplot. This plot is helpful not only in ascertaining the potential importance of the variable and the possible presence and effect of extreme (large or small) observations, but also its appropriate scale. One simple and easily computed form of a smoothed scatterplot was illustrated in Figure 1.2. Other, more complicated methods that have greater precision are available. Kay and Little (1987) illustrate the use of a method proposed by Copas (1983). This method requires computing a smoothed value of the response variable for each subject that is a weighted average of the values of the outcome variable over all subjects. The weight for each subject is a continuous decreasing function of the distance between the value of the covariate for the subject under consideration and its value for all other subjects. For example, for the covariate value xi of the ith subject we compute

ȳi = Σj w(xi, xj) yj / Σj w(xi, xj), summing over j = 1, …, n, (5.2)

where w(xi, xj) represents a particular weight function. For example, if we use STATA's scatterplot smooth command, ksm, with the weight option and bandwidth k, then

w(xi, xj) = {1 − (|xi − xj|/Δ)³}³, (5.3)

where Δ is defined so that the maximum value of the weight is 1 and the two indices defining the summation, i_l and i_u, include the k percent of the n subjects with x values closest to xi. Other weight functions are possible, as well as additional smoothing using locally weighted least squares regression, called lowess in some packages.

(2) Upon completion of the univariable analysis, we select variables for the multivariable analysis. Any variable whose univariable test has a p-value < 0.25 is a candidate for the multivariable model, along with all variables of known clinical importance. Once the variables have been identified, we begin with a model containing all of the selected variables. Our recommendation that the 0.25 level be used as a screening criterion for variable selection is based on work by Bendel and Afifi (1977) on linear regression and on work by Mickey and Greenland (1989) on logistic regression. These authors


show that use of a more traditional level (such as 0.05) often fails to identify variables known to be important. Use of the higher level has the disadvantage of including variables that are of questionable importance at the model building stage. For this reason, it is important to review all variables added to a model critically before a decision is reached regarding the final model.

As noted above, the issue of variable selection is made complicated by different analytic philosophies as well as by different statistical methods. One school of thought argues for the inclusion of all scientifically relevant variables in the multivariable model, regardless of the results of the univariable analyses. In general, the appropriateness of the decision to begin the multivariable model with all possible variables depends on the overall sample size and the number in each outcome group relative to the total number of candidate variables. When the data are adequate to support such an analysis, it may be useful to begin the multivariable modeling from this point. However, when the data are inadequate, this approach can produce a numerically unstable multivariable model. In this case the Wald statistics should not be used to select variables, because of the unstable nature of the results. Instead, we should select a subset of variables based on the results of the univariable analyses and refine the definition of "scientifically relevant".

Another approach to variable selection is to use a stepwise method, in which variables are selected either for inclusion or exclusion from the model in a sequential fashion based solely on statistical criteria. There are two main versions of the stepwise procedure:

(a) Forward selection with a test for backward elimination.

(b) Backward elimination followed by a test for forward selection.

These algorithms define the stepwise procedures in logistic regression.
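A minimal sketch of the forward selection version is given below, assuming the maximized log-likelihood of each candidate model is available from a fitter. Here the log-likelihoods are simply stored in a table: the intercept-only, LWD-only and LWD+AGE values are taken from Table 4.14, while the AGE-only value (−116.30) is hypothetical. The p-value of each likelihood ratio statistic G is computed from the chi-square distribution (1 df) via the complementary error function:

```python
import math

# Maximized log-likelihoods by model (keys are sorted variable tuples).
# Three values come from Table 4.14; the AGE-only value is made up.
LOGLIK = {
    (): -117.34,
    ("AGE",): -116.30,
    ("LWD",): -113.12,
    ("AGE", "LWD"): -112.14,
}

def lr_pvalue(ll_small, ll_big, df=1):
    """Likelihood ratio test: G = 2(ll_big - ll_small), p from chi-square."""
    g = 2.0 * (ll_big - ll_small)
    # chi-square(1) survival function via the normal tail; df = 2 uses exp(-G/2).
    return math.erfc(math.sqrt(g / 2.0)) if df == 1 else math.exp(-g / 2.0)

def forward_select(candidates, p_enter=0.25):
    """Repeatedly add the candidate with the smallest LR p-value < p_enter."""
    selected = []
    while True:
        best = None
        for v in candidates:
            if v in selected:
                continue
            key_old = tuple(sorted(selected))
            key_new = tuple(sorted(selected + [v]))
            p = lr_pvalue(LOGLIK[key_old], LOGLIK[key_new])
            if p < p_enter and (best is None or p < best[1]):
                best = (v, p)
        if best is None:
            return selected
        selected.append(best[0])

print(forward_select(["LWD", "AGE"]))
```

Backward elimination works the same way in reverse: starting from the full model, repeatedly delete the variable whose removal gives the largest LR p-value above a removal threshold.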
The stepwise approach is useful and intuitively appealing in that it builds models in a sequential fashion and allows for the examination of a collection of models that might not otherwise have been examined. "Best subsets selection" is a selection method that has not been used extensively in logistic regression. Stepwise, best subsets, and other mechanical selection procedures have been criticized because they can yield a biologically implausible model [Greenland (1989)] and can select irrelevant, or noise, variables [Flack and Chang (1987), Griffiths and Pope (1987)]. The problem is not the fact that the computer can select such models, but rather that the analyst fails to scrutinize the resulting model carefully and reports such a result as the final, best model. The wide availability and ease with which stepwise methods can be used has undoubtedly reduced some analysts to the role of assisting the computer in model selection, rather than the more appropriate alternative. It is


only when the analyst understands the strengths, and especially the limitations, of the methods that these methods can serve as useful tools in the model-building process. The analyst, not the computer, is ultimately responsible for the review and evaluation of the model.

(3) Following the fit of the multivariable model, the importance of each variable included in the model should be verified. This should include (a) an examination of the Wald statistic for each variable and (b) a comparison of each estimated coefficient with the coefficient from the model containing only that variable. Variables that do not contribute to the model based on these criteria should be eliminated and a new model should be fit. The new model should be compared to the old, larger model using the likelihood ratio test. Also, the estimated coefficients for the remaining variables should be compared with those from the larger model. In particular, we should be concerned about variables whose coefficients have changed markedly in magnitude. This indicates that one or more of the excluded variables was important in the sense of providing a needed adjustment of the effect of a variable that remained in the model. This process of deleting, refitting, and verifying continues until it appears that all of the important variables are included in the model and those excluded are clinically and/or statistically unimportant. At this point, we suggest that any variable not selected for the original multivariable model be added back into the model. This step can be helpful in identifying variables that, by themselves, are not significantly related to the outcome but make an important contribution in the presence of other variables. We refer to the model at the end of step (3) as the preliminary main effects model.

(4) Once we have obtained a model that we feel contains the essential variables, we should look more closely at the variables in the model.
The question of the appropriate categories for discrete variables should have been addressed at the univariable stage. For continuous variables we should check the assumption of linearity in the logit. Assuming linearity at the variable selection stage is common practice, since at that stage the objective is only to decide whether a particular variable should be in the model. Graphs of several different relationships between the logit and a continuous independent variable are shown in Figure 5.1. The figure illustrates the cases where the logit is:

(a) Linear

(b) Quadratic

(c) Some other nonlinear continuous relationship

(d) Binary


Figure 5.1: Different types of models for the relationship between the logit and a continuous variable.

where there is a cutpoint above and below which the logit is constant. In each of the situations described in Figure 5.1, fitting a linear model would yield a significant slope. Once the variable is identified as important, we can obtain the correct parametric relationship or scale in the model refinement stage. The exception to this would be the rare instance where the function is U-shaped. We refer to the model at the end of step (4) as the main effects model.

(5) Once we have refined the main effects model and ascertained that each of the continuous variables is scaled correctly, we check for interactions among the variables in the model. In any model, an interaction between two variables implies that the effect of one of the variables is not constant over levels of the other. For example, an interaction between sex and age implies that the slope coefficient for age is different for males and females. The final decision as to whether an interaction term should be included in a model should be based on statistical as well as practical considerations. Any interaction term in the model must make sense from a clinical perspective. We address the clinical plausibility issue by creating a list of possible pairs of variables in the model that have some scientific basis to interact with each other. The interaction variables are created as the arithmetic product of the pairs of main effect variables. We add the interaction variables, one at a time, to the model containing


all the main effects and assess their significance using a likelihood ratio test. We feel that interactions must contribute to the model at traditional levels of statistical significance. Inclusion of an interaction term that is not significant typically increases the estimated standard errors without changing the point estimates. In general, for an interaction term to alter both point and interval estimates, the estimated coefficient for the interaction term must be statistically significant.

In step (1) we mentioned that one way to examine the scale of a covariate is to use a scatterplot smooth, plotting the results on the logit scale. Unfortunately, scatterplot smoothing methods are not easily extended to multivariable models and thus have limited applicability in the model refinement step. However, it is possible to extend the grouping type of smooth shown in Figure 2.2 to multivariable models. The procedure is easily implemented within any logistic regression package and is based on the following observation. The difference, adjusted for the other model covariates, between the logits for two different groups is equal to the value of an estimated coefficient from a fitted logistic regression model that treats the grouping variable as categorical. We have found that the following implementation of the grouped smooth is usually adequate for purposes of visually checking the scale of a continuous covariate. First, using the descriptive statistics capabilities of most any statistical package, obtain the quartiles of the distribution of the variable. Next, create a categorical variable with 4 levels using three cutpoints based on the quartiles. Other grouping strategies can be used, but one based on quartiles seems to work well in practice. Fit the multivariable model, replacing the continuous variable with the 4-level categorical variable. To do this, 3 design variables must be used, with the lowest quartile serving as the reference group.
Following the fit of the model, plot the estimated coefficients versus the midpoints of the groups. In addition, plot a coefficient equal to zero at the midpoint of the first quartile. To aid in the interpretation we connect the four plotted points. Visually inspect the plot and choose the most logical parametric shape(s) for the scale of the variable. The next step is to refit the model using the possible parametric forms suggested by the plot and choose one that is significantly different from the linear model and makes clinical sense. It is possible that two or more different parameterizations of the covariate will yield similar models, in the sense that they are significantly different from the linear model. However, it is our experience that one of the possible models will be more appealing clinically, thus yielding more easily interpreted parameter estimates.
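The quartile-based grouped smooth can be sketched as follows. This Python fragment (illustrative data; the thesis used MINITAB and R) only builds the 4-level categorical version of the covariate and the three design variables, with the lowest quartile as the reference group; the logistic fit itself, and the plot of estimated coefficients against group midpoints, are left to a statistics package:

```python
from statistics import quantiles

# Illustrative continuous covariate (e.g. AGE for the study subjects).
x = [14, 17, 19, 20, 21, 22, 23, 24, 25, 26, 28, 29, 31, 33, 36, 45]

# Three cutpoints from the quartiles of the distribution.
q1, q2, q3 = quantiles(x, n=4)

def quartile_group(v):
    """Level 0..3 of the categorical version (0 = lowest quartile)."""
    return sum(v > c for c in (q1, q2, q3))

# Three design variables per subject; the lowest quartile [0, 0, 0]
# serves as the reference group.
design = [[int(quartile_group(v) == g) for g in (1, 2, 3)] for v in x]
print((q1, q2, q3), design[0], design[-1])
```

After fitting the model with these three design variables, each estimated coefficient is plotted against the midpoint of its quartile group, with a zero coefficient plotted at the midpoint of the reference quartile.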


Another more analytic approach is to use the method of fractional polynomials.

5.2 Fractional polynomials

The method of fractional polynomials was developed by Royston and Altman (1994) to suggest transformations. We wish to determine what value of x^p yields the best model for the covariate. In theory, we could incorporate the power, p, as an additional parameter in the estimation procedure; however, this greatly increases the complexity of the estimation problem. Royston and Altman propose replacing full maximum likelihood estimation of the power by a search through a small but reasonable set of possible values. Hosmer and Lemeshow (1999) provide a brief introduction to the use of fractional polynomials when fitting a proportional hazards regression model, and this material provides the basis for our discussion of its application to logistic regression. The method may be used with a multivariable logistic regression model but, for the sake of simplicity, we describe the procedure using a model with a single continuous covariate. The logit that is linear in the covariate is

g(x, β) = β0 + xβ1, (5.4)

where β denotes the vector of model coefficients. One way to generalize this function is to specify it as

g(x, β) = β0 + Σ_{j=1}^{J} Fj(x)βj. (5.5)

The functions Fj(x) are a particular type of power function. The value of the first function is F1(x) = x^{p1}. In theory the power p1 could be any number, but in most applied settings it makes sense to use something simple. Royston and Altman (1994) propose restricting the power to be among those in the set Ω = {−2, −1, −0.5, 0, 0.5, 1, 2, 3}, where p1 = 0 denotes the log of the variable. The remaining functions are defined as

Fj(x) = x^{pj}           for pj ≠ pj−1,
Fj(x) = Fj−1(x) ln(x)    for pj = pj−1,

for j = 2, 3, …, again restricting the powers to those in Ω. For example, if we choose J = 2 with p1 = 0 and p2 = −0.5 then, using (5.5), the logit is g(x, β) = β0 + F1(x)β1 + F2(x)β2 with F1(x) = ln(x) and F2(x) = 1/√x, that is,

g(x, β) = β0 + ln(x)β1 + (1/√x)β2.
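The fractional polynomial basis functions Fj(x) can be sketched directly. In the Python fragment below (an illustration only), p = 0 is taken as ln(x) and a repeated power pj = pj−1 multiplies the previous function by ln(x), as in the definition above; a full implementation would fit a logistic model for every pair of powers in Ω and keep the one with the largest log-likelihood:

```python
import math

OMEGA = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)

def fp_terms(x, powers):
    """Fractional polynomial basis F1(x), F2(x), ... for the given powers."""
    terms = []
    prev_p = None
    for p in powers:
        if p == prev_p:
            terms.append(terms[-1] * math.log(x))  # repeated power: Fj = F(j-1) ln x
        elif p == 0:
            terms.append(math.log(x))              # p = 0 denotes log(x)
        else:
            terms.append(x ** p)
        prev_p = p
    return terms

# J = 2 with p1 = 0, p2 = -0.5: F1 = ln(x), F2 = 1/sqrt(x)
print(fp_terms(4.0, (0, -0.5)))
# J = 2 with the repeated power p1 = p2 = 2: F1 = x^2, F2 = x^2 ln(x)
print(fp_terms(4.0, (2, 2)))
```

Searching over all single powers and all pairs of powers in OMEGA gives the J = 1 and J = 2 fractional polynomial models described in the text.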


As another example, if we choose J = 2 with the repeated power p1 = p2 = 2, then the logit is

g(x, β) = β0 + x²β1 + x² ln(x) β2.

The model is quadratic in x when J = 2 with p1 = 1 and p2 = 2. Again, we could allow the covariate to enter the model with any number of functions, J, but in most applied settings an adequate transformation may be found with J = 1 or 2.

Example: As an example of the model-building process, consider the analysis of the UMARU IMPACT Study (UIS). The study is described in Section 2.6 and a code sheet for the data is shown in Table 2.8. Briefly, the goal of the analysis is to determine whether there is a difference between the two treatment programs after adjusting for potential confounding and interaction variables. One outcome of considerable public health interest is whether or not a subject remained drug free for at least one year from randomization to treatment (DFREE in Table 2.8). A total of 147 of the 575 subjects (25.57%) considered in the analyses in this text remained drug free for at least one year. The analyses in this chapter are primarily designed to demonstrate specific aspects of logistic model building. The results of fitting the univariable logistic regression models to these data are given in Table 5.1. In this table we present, for each variable listed in the first column, the following information:

(1) The estimated slope coefficient(s) for the univariable logistic regression model containing only this variable.

(2) The estimated standard error of the estimated slope coefficient.

Variable   Coeff.   Std.Err   ÔR      95% CI          G       P
AGE         0.018   0.0153    1.20*   (0.89, 1.62)    1.40    0.237
BECK       -0.008   0.0103    0.96+   (0.87, 1.06)    0.64    0.425
NDRGTX     -0.075   0.0247    0.93    (0.88, 0.97)    11.84   0.001
IVHX2      -0.481   0.2657    0.62    (0.37, 1.04)    13.35   0.001
IVHX3      -0.775   0.2166    0.46    (0.30, 0.70)
RACE        0.459   0.2110    1.58    (1.04, 2.39)    4.62    0.032
TREAT       0.437   0.1931    1.55    (1.06, 2.26)    5.18    0.023
SITE        0.264   0.2034    1.30    (0.87, 1.94)    1.67    0.197

Table 5.1: Univariable logistic regression models for the UIS (n = 575).
* Odds ratio for a 10-year increase in AGE.
+ Odds ratio for a 5-point increase in BECK.


(3) The estimated odds ratio, obtained by exponentiating the estimated coefficient. For the variable AGE the odds ratio is for a 10-year increase, and for BECK for a 5-point increase, since a change of 1 year or 1 point would not be clinically meaningful.

(4) The 95% CI for the odds ratio.

(5) The likelihood ratio test statistic, G, for the hypothesis that the slope coefficient is zero. Under the null hypothesis this quantity follows the chi-square distribution with 1 degree of freedom, except for the variable IVHX, where it has 2 degrees of freedom.

(6) The significance level of the likelihood ratio test.

With the exception of the Beck score, there is evidence that each of the variables has some association (P < 0.25) with the outcome, remaining drug free for at least one year (DFREE). The covariate recording history of intravenous drug use (IVHX) is modeled via two design variables using "1 = Never" as the reference code; thus its likelihood ratio test has two degrees of freedom. We begin the multivariable model with all variables but BECK. The results of fitting the multivariable model are given in Table 5.2.

Variable          Coeff.     Std.Err   z       P > |z|
AGE                0.050     0.0173    2.91    0.004
NDRGTX            -0.062     0.0256   -2.40    0.016
IVHX2             -0.603     0.2873   -2.10    0.036
IVHX3             -0.733     0.2523   -2.90    0.004
RACE               0.226     0.2233    1.01    0.311
TREAT              0.443     0.1993    2.22    0.026
SITE               0.149     0.2172    0.68    0.494
Constant          -2.408     0.5548   -4.34    <0.001
Log-likelihood  -309.6241

Table 5.2: Results for the multivariable model containing the covariates significant at the 0.25 level in Table 5.1.

The results in Table 5.2, when compared to Table 5.1, indicate weaker associations for some covariates when controlling for the other variables. In particular, the significance level of the Wald test for the coefficient of SITE is P = 0.494 and for RACE is P = 0.311. Strict adherence to conventional levels of statistical significance would dictate that we consider a smaller model, deleting these covariates. However, because subjects were randomized to treatment within site, we keep SITE in the model. On consultation with our colleagues we were advised that race is an important control variable; thus, on the basis of subject matter considerations, we keep RACE in the model.
The next step in the modeling process is to check the scale of the continuous covariates in the model, AGE and NDRGTX in this case. One approach to developing the order in


which to check scale is to rank the continuous variables by their respective significance levels. The results in Table 5.2 suggest that we consider AGE first and then NDRGTX.


Chapter 6
Descriptive Data Analysis

A detailed data analysis is used to present the results and basic characteristics of a distribution in an accurate and attractive form, so that someone can get a rough idea about the corresponding distribution very easily by inspecting the report. First, I studied the reasons for the requirement of installing a teller machine within the university premises for public usage, and here I wish to perform a data analysis of these collected data. To do this, I have chosen the entire university community as the population of my study, and I begin by giving a rough idea about the distribution of the lecturers, students and non-academic staff of the Wellamadama premises.

Situation                         Size of the population
Academic staff                    185
Temporary & demonstrator staff    110
Non-academic staff                399
Security staff                    45
Internal students                 5109
External students                 704
Total                             6552

Now, let us analyse this entire distribution structure in detail according to the distinct faculties and disciplines. First, let us have a close look at the structure of the academic staff. Basically, this can be divided into two parts:
(1.) Lecturers
(2.) Others


Figure 6.1: Size of the University population

All lecturers working at each faculty can be put into the first category, and all the remaining permanent and temporary academics fall into the second category. First, let us consider the lecturers of the faculties of Humanities & Social Sciences, Management & Finance, Science, and Fisheries & Marine Science. This set can be shown in the following table according to faculty.

Faculty                        Academic staff    Percentage
H & SS                         80                43.24
Management & Finance           28                15.14
Science                        67                36.22
Fisheries & Marine Science     10                5.41
Total                          185               100.00
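The percentage column of the table above is plain arithmetic (each faculty count divided by the total). As a quick check in Python (the analyses in the thesis were done in MINITAB and R):

```python
# Faculty counts taken from the academic staff table above.
staff = {
    "H & SS": 80,
    "Management & Finance": 28,
    "Science": 67,
    "Fisheries & Marine Science": 10,
}
total = sum(staff.values())  # 185
# Percentage share of each faculty, rounded to two decimals.
shares = {f: round(100 * n / total, 2) for f, n in staff.items()}
print(total, shares)
```

The computed shares match the percentage column of the table.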

When these data are investigated, we can clearly see that a considerable number of lecturers work in the Arts and Science faculties. The reason for this is that most of the students of the university study in these two faculties. On the other hand, only a few lecturers work in the Management & Finance and Fisheries & Marine Science faculties, owing to the smaller numbers of students. Next, we can study the distribution of the temporary and demonstration staff of the university.

Faculty                        Demonstration staff    Percentage
H & SS                         17                     15.45
Management & Finance           5                      4.55
Science                        75                     68.18
Fisheries & Marine Science     13                     11.82
Total                          110                    100.00

Figure 6.2: Academic staff

When we study the above graph, it is clear that the distribution of the temporary and demonstration staff is completely different from the distribution of the academic staff. Here we can see that the other academic staff of the Science faculty is much larger than that of the other faculties of the university. The reason is that the Science faculty needs a considerable number of temporary staff members to guide the students in their practical sessions. Next, let us analyse the distribution of the non-academic staff of the university.

Faculty                        Non-academic staff
Administration                 262
H & SS                         39
Management & Finance           10
Science                        84
Fisheries & Marine Science     4
Total                          399

If we analyse the above graph, we can see that the most of the non-academic members are working under the administration branch. They are distributed as the, (1.) Administrative officers (2.) Clerical staff (3.) Technical officers (4.) Minor and other staff members


Figure 6.3: Temporary and demonstration staff

within the Finance branch, the Administration branch, the Library and the faculties. Students of the university are considered the major entity, and it is important to study their distribution within the university. Basically, we can divide them into two parts as follows: (1.) Internal students (2.) External students. Internal students study in the Arts, Science, Management and Fisheries & Marine Science faculties, and most of them stay at the university hostels. Moreover, most of them are of roughly the same age. External students, however, have their own residences and belong to different age groups, job categories and social statuses. Up to this point, I have described the distribution and nature of the academic, non-academic and student entities of the Wellamadama premises. But my main intention was to study the requirement for the installation of a teller machine within the university premises for public usage. To do this, I chose a sample of 500 entities out of the total collection of 7500, taking care to choose them from each and every faculty. Here I considered the savings accounts of the customers and their connections with the government and private banks.

Owns a savings account   No. of savings accounts   Percentage
Yes                      484                       96.8
No                       16                        3.2
Total                    500                       100
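The faculty-by-faculty selection described above amounts to stratified random sampling with proportional allocation. The sketch below is illustrative only: the stratum sizes are hypothetical stand-ins, not the study's actual group counts.

```python
import random

random.seed(2008)  # for a reproducible illustration

# Hypothetical stratum sizes (stand-ins for the real population counts),
# chosen to sum to the 7500 total quoted in the text
strata = {"students": 5800, "academic": 300, "non_academic": 400,
          "temporary_academic": 200, "security_and_other": 800}
population = sum(strata.values())   # 7500
sample_size = 500

# Proportional allocation: each stratum contributes in proportion to its size
allocation = {g: round(sample_size * n / population) for g, n in strata.items()}

# Simple random sample of member indices within each stratum
sample = {g: random.sample(range(n), k=allocation[g]) for g, n in strata.items()}
```

With these sizes the rounded allocations happen to sum to exactly 500; in general a largest-remainder adjustment may be needed after rounding.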

Figure 6.4: Non-academic staff

Figure 6.5: Number of savings accounts for the sample taken from the university premises

Here the data were split into students and the academic & non-academic sector, for the convenience of the analysis.

Has a savings account     Yes   No   Total
Student                   297   15   312
Academic & Non-academic   187   1    188
Total                     484   16   500

According to the above graphs, it is clear that about 97% of the sample maintain accounts in the government and private banks. Some of them maintain their accounts in the People's Bank and Bank of Ceylon branches attached to the university. Now I try to analyse these data.


Figure 6.6: Analysis of savings accounts for the sample taken from the university premises

First I separated the savings account holders from my sample, and the obtained set was then split into holders of accounts at the university People's Bank and at the university Bank of Ceylon. Furthermore, those sets were split into academic, non-academic and student groups. Following these steps, I was able to give a much clearer aspect to my work.

University bank   No. of accounts   Percentage
People's Bank     210               42.0
Bank of Ceylon    68                13.6
Both              82                16.4
No                140               28.0
Total             500               100

Now let us consider the savings accounts maintained by the students.

University bank           People's Bank   Bank of Ceylon   Both   No    Total
Student                   144             20               30     118   312
Academic & Non-academic   66              48               52     22    188
Total                     210             68               82     140   500
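The margins of this cross-tabulation can be verified directly from its cells; a small sketch using the counts from the table above:

```python
# Rows: groups; columns: (People's Bank, Bank of Ceylon, Both, No account)
table = {
    "student": [144, 20, 30, 118],
    "academic_and_non_academic": [66, 48, 52, 22],
}
row_totals = {group: sum(cells) for group, cells in table.items()}
col_totals = [sum(col) for col in zip(*table.values())]
grand_total = sum(row_totals.values())
# row totals 312 and 188, column totals 210, 68, 82, 140, grand total 500
```

The margins agree with the group sizes (312 students, 188 staff) and the bank-wise totals reported above.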

Among them, a considerable number of students maintain their accounts in the People's Bank attached to the university, rather than in the Bank of Ceylon. The reason is that they have to retrieve their bursaries and Mahapola scholarship funds through the People's Bank. It is also remarkable to see the connection between the academic/non-academic staff and the banks attached to the university: the staff are the major customers of both these banks. We can see this clearly by studying the following facts. To do this, I chose all the staff members who keep their savings accounts in these banks.


Figure 6.7: University staff and students who maintain their accounts with the university People's Bank and Bank of Ceylon, according to my sample

Bank             Accounts   Money (Rs)
People's Bank    215        49,449,300.00
Bank of Ceylon   228        3,954,310.00
Total            443        53,430,610.00

According to the above graphs, we can say that the greater number of staff members maintain their savings accounts in the Bank of Ceylon. Now let us get some idea of the capital these accounts bring to the banks. By comparing the above two figures, we can say that the People's Bank carries the biggest part of the circulation of money. So I decided to study this further, and divided the above set into two parts, academic and non-academic.

Bank             Academic   Non-academic   Total
People's Bank    122        93             215
Bank of Ceylon   58         170            228

According to the above graphs, we can say that most of the academic staff members maintain their accounts in the People's Bank, while most of the non-academic staff members maintain their accounts in the Bank of Ceylon attached to the university.

Bank             Academic (Rs)   Non-academic (Rs)   Total (Rs)
People's Bank    3,653,590.00    1,335,750.00        49,499,300.00
Bank of Ceylon   1,957,860.00    1,996,450.00        3,954,310.00

Figure 6.8: Analysis of university staff and students who maintain their accounts with the university People's Bank and Bank of Ceylon, according to my sample

Generally, the academic salary scale is a little higher than the non-academic salary scale. Given this, it is natural to observe this kind of large capital in the People's Bank relative to the Bank of Ceylon. Now I try to study the nature of the savings accounts maintained by the sample members outside the banks attached to the university.

Bank             No. of accounts   Percentage
People's Bank    95                18.6
Bank of Ceylon   72                14.4
Commercial       52                10.4
Seylan           14                2.8
Sampath          55                11.0
Other            212               42.8
Total            500               100

By dividing this sample further (as academic, non-academic and student), the distribution becomes much clearer.

Bank             Student   Academic & Non-academic   Total
People's Bank    73        22                        95
Bank of Ceylon   57        15                        72
Commercial       36        16                        52
Seylan           7         7                         14
Sampath          25        30                        55
Other            115       97                        212
Total                                                500

According to the above information, we can say that many of them maintain savings accounts in the private banks as well as in the government banks. We can see this situation very sharply among the students.

Figure 6.9: Number of accounts held at the university banks

High competition, and the banks' efforts to introduce new accounts for the young generation, are the major reasons for this. The private sector has a well-developed computer network, especially the Commercial and Sampath banks, so it is possible to receive any amount of money at any time, in any part of the island. Since most of the university students stay away from their homes, they have become used to maintaining their accounts in the private banks. At present, even though the Seylan Bank is very popular among the business community, it is not so popular among ordinary people. We can see this fact in the above graphs. So far I have studied the customers who maintain their accounts in the government and private banks. But about 4% of the sample members do not maintain any kind of savings account; this amounts to 15 students and one non-academic staff member out of the whole sample. In order to investigate this situation, I first studied the effect of their place of residence.

Place of living                 No. of members
In the Matara urban area        10
Outside the Matara urban area   4
Hostel inside the university    2
Hostel outside the university   16

According to the above figures, we can say that most of them (10) live around the Matara town, and they may take pocket money from home to fulfil their daily needs. Because


Figure 6.10: How the deposits are divided between the university banks

of these facts, sometimes they did not want to maintain a savings account. Moreover, in the case of our non-academic member, he was unaware of these savings accounts. So it is clear that if it is possible to make them aware of these accounts, they will definitely open their own savings accounts. Next I would like to analyse the data further according to the teller cards maintained by the sample members. But before that, I have to recall how the sample was partitioned. First I randomly selected a sample of 500 entities out of the total collection. Then the sample was divided into two parts: members who maintain savings accounts and those who do not. Their tendency to maintain teller cards differs from one member to another, so I decided to investigate the reasons for this. The following table shows the number of sample members who do and do not maintain teller card facilities.

Obtained teller card   No. of people   Percentage
Yes                    380             78.51
No                     104             21.49
Total                  484             100.00

In order to study their requirements in detail and in a more convenient way, the sample was split into academic & non-academic members and students.


Figure 6.11: Number of accounts held at the university banks

Obtained teller card      Yes   No    Total
Student                   252   45    297
Academic & Non-academic   128   59    187
Total                     380   104   484

According to the above graphs, most of the sample members have obtained the teller facility. Relative to the academic and non-academic staff members, a considerable number of students have obtained this facility, since most of them stay away from home: they have to manage their bursary and pocket money throughout the month, they have become used to depositing money in the banks for security, and so it is reasonable for them to use teller cards. But when we consider the case of the non-academic staff, we can see some kind of collapse in their teller card usage, and there may be a variety of reasons behind this. One possibility is that they may think that, since money can be retrieved at any time via the teller machine, it could be wasted. On the other hand, many of them are afraid of new technology. I then decided to analyse this case according to the banks whose teller machines they use.

Bank                       Obtained teller card
University People's Bank   261
Other bank                 78
Both                       41
No card                    104
Total                      484


Figure 6.12: How the deposits are divided between the university banks

Figure 6.13: University staff and students who maintain their accounts with the government and private banks, according to my sample

In order to study this case further, I split the above sample into students and academic & non-academic staff.

Group                     Uni. People's Bank   Other bank   Both   No card   All
Student                   163                  53           20     59
Academic & Non-academic   98                   25           21     45
Total                     261                  78           41     104       484

According to the above graphs, it is clear that the academic as well as the non-academic staff members tend to obtain teller facilities from the People's Bank attached to the university. On the other hand, most of the students maintain a private bank teller card because of the facilities those banks provide, and they use their People's Bank savings accounts only to retrieve their bursaries and Mahapola scholarships. But most of them wish to


Figure 6.14: University staff and students who maintain their accounts with the government and private banks, according to my sample

Figure 6.15: Number of members who do not have savings accounts

withdraw these accounts when they pass out of the university, and that is not so good from the point of view of the government banks. From the analysis above, I found that even though most of the university population use the teller facilities provided by the People's Bank attached to the university, it is not enough. This situation may arise for the following reasons. (1.) Since the teller machine is installed outside the university premises, people have to go outside whenever they want to retrieve money; but these people are very busy, and this is not convenient at all. (2.) Frequent failures of the machine. (3.) The journey between retrieving money and returning to the university is not so secure.


Figure 6.16: Sample members who have obtained the teller facility

(4.) The teller machine is installed a little way from the university premises, and very frequent clashes occur between the university students and the young men living in the village, so most of the university people are somewhat scared to withdraw money in the evening. As one example, a few months ago someone attempted a robbery here. Because of these things, I decided to investigate the requirement for the installation of a teller machine within the university premises. The following are the ideas revealed by my sample members about their needs.

At what place are you interested?   No. interested
Inside the university               472
Outside the university              28
Total                               500

With a view to analysing this further, I split my sample into academic & non-academic members and students.

Group                     Inside the university   Outside the university   Total
Student                   297                     15                       312
Academic & Non-academic   175                     13                       188
Total                     472                     28                       500

According to the above graphs, it is clear that most of them want a teller machine installed within the university premises, and there are many reasons behind this. (1.) It is very convenient to retrieve money. (2.) It saves time. (3.) It is much more secure.


Figure 6.17: Whether the sample members have obtained the teller facility

Figure 6.18: Where the sample members have obtained the teller facility

Because of these things, girls staying within the university premises could also retrieve money without any hesitation. While the majority of the sample want a teller machine installed within the university premises, there are a few who do not wish for such a facility. This minority thinks that with the availability of this facility, their expenses would increase without any control. From my point of view, holding these kinds of ideas harms the majority a great deal. In my study, I paid special attention to the Bank of Ceylon branch situated within the university premises. Now I try to analyse the sample members who would be ready to use the facilities offered by a Bank of Ceylon teller machine.


Figure 6.19: Where the sample members have obtained the teller facility

Figure 6.20: Whether the new teller machine should be within or outside the university area


Figure 6.21: Whether the new teller machine should be within or outside the university area

Figure 6.22: The number of customers who are interested in opening a new account

Are you interested in a new account?   No. of customers
Interested                             196
Already having an account              140
Not interested                         164
Total                                  500

If we analyse the above graphs, we can clearly see that there are about 140 customers who have been dealing with the Bank of Ceylon for a long time and who wish to get the teller facilities. Moreover, there are nearly 200 new customers (196) who would open savings accounts. This is very beneficial for the bank.


Figure 6.23: Are you interested in opening a new account?

Figure 6.24: Who is interested in the new teller facility?

Are you interested in opening a new account?

Group                Interested   Having   Not interested   Total
Student              141          56       115              312
Academic staff       11           11       7                29
Temporary staff      14           4        30               48
Non-academic staff   31           69       11               111
Total                                                       500

The above graph also indicates the members who do not wish to deal with the bank. Most of them are final-year students and temporary academic staff: they are going to leave the university very soon, so they do not want to maintain accounts any longer.

Group                Would use   Would not use   Total
Student              197         115             312
Academic staff       22          7               29
Temporary staff      17          31              48
Non-academic staff   100         11              111
Total                336         164             500

The installation of a teller machine within the university premises would also provide a lot of indirect benefits for the banking sector. Already the majority of the academic and non-academic members are ready to obtain credit card facilities from the bank, given the availability of the teller machine mentioned above. Since most of their salary scales are good enough, it would be possible to issue credit cards with higher limits. So it is very worthwhile to study the customers who are willing to deal with the banks. As an example, if we observe the column chart for fisheries biology, it is clear that most of them wish to get the teller facilities within the university premises, because most of them are a little scared to go into Matara town. So finally, considering every aspect I have analysed, we can decide that it is highly suitable to install a teller machine within the university premises.


Chapter 7

Discussion

Man has used diverse strategies for providing his basic needs, such as food, clothing and housing, since the beginning of civilization. Although the exchange of goods and services was practised in the past, the situation changed with the introduction of the exchange unit called "money". The unit of money subsequently became the exchange unit for both goods and services. The industrial revolution produced diverse services, and as a result the demand for money also increased. This high demand for money led to the rapid spread of banks and other financial services all around the world. With the help of modern technology, current electronic banking systems are capable of providing money for anyone, anywhere in the world. In Sri Lanka, some recently established private banks have brought vast changes to the banking and financial sector of the country. The competition among these banks for advancement over one another has produced many benefits for the customers.

The University of Ruhuna, Sri Lanka is situated on peaceful and elegant premises close to Matara town, in the southern region. A large number of residential and non-residential people from all over the country access the university premises daily for both academic and non-academic purposes. For example, the daily visitors represent distant districts like Jaffna and Kilinochchi, while some others come from the Badulla, Anuradhapura and Polonnaruwa districts. Most of these regular visitors (particularly students) reside in university hostels, while some others stay in private lodgings around the university. To serve the financial and banking needs of this diverse community of Ruhuna university, both government and private banks are functioning in this context.
Being a student of the University of Ruhuna, and thus a member of its community, I was really interested in carrying out research on the banking needs of this community, which would produce guidance for a more sophisticated banking system within the university premises.

Figure 7.1: Number of accounts held at the university banks

As an approach to the current study, it was worth studying the arrangements that served the early banking needs of the University of Ruhuna, which was established 28 years ago in Wellamadama, Matara. Initially, the university kept its trust in government banks, opening an internal branch of the Bank of Ceylon within the university premises. Both academics and non-academics then started dealing with this internal Bank of Ceylon branch for their financial purposes. By the 1980s, another Sri Lankan bank, the People's Bank, had achieved significant involvement in the Sri Lankan economy, launching branches throughout the country, and it established an external People's Bank branch adjacent to the university premises. As a result of the more interesting banking services and more advantageous account systems of this newly established bank, people of the university community who had previously dealt with the internal Bank of Ceylon branch started opening at least one savings account with this external People's Bank. This new trend diverted most academics and non-academics from the internal Bank of Ceylon branch to the external People's Bank branch. This situation is clearly illustrated in the following charts and figures. Figure 7.1 seems to show that the Bank of Ceylon still holds many of the university accounts, although Figure 7.2 shows the reality, where the division of money between the two banks clearly illustrates the later failure of the Bank of Ceylon within the university banking context.


Figure 7.2: How the deposits are divided between the university banks

In particular, the more efficient banking services of the People's Bank have moved many of the academics' accounts from the internal Bank of Ceylon branch to the external People's Bank. The Bank of Ceylon is experiencing a great loss of income, because the academic accounts exchange great sums of money, the academics belonging to the highest salary scale within the university community. Therefore, the Bank of Ceylon could increase its profit if it recovered the academics' trust in the bank.

As the second part, I thought of studying the financial exchanges of the university students, who are the next prominent component of the university community. The interviews with students revealed that they have placed their trust mostly in private banks, for example the Commercial and Sampath banks, which are rich in the latest technology. The reason for this particular attraction is that these two banks offer facilities for quick money transfers to anywhere in the country, an essential need for students who depend on money from their parents, credited to the students' accounts from distant areas. Many students use the People's Bank branch only for obtaining their Mahapola scholarship instalments and bursaries, because it is the only bank with authority over Mahapola and bursary transactions. To my mind, the two government banks are highly unlucky not to have been capable of holding the trust of the students, who form the life blood of a living country.

In my opinion, studying the reasons for the People's Bank overcoming the Bank of Ceylon within the university premises may be a guide for future advancements in university banking and financial services.

Figure 7.3: Number of accounts held at the university banks

In this context, the behaviour of the university community comes into play: most academics, non-academics and students are fairly busy with their daily work, so they are unable to spend their working hours on banking. The overlap of their working hours with the banks' opening hours caused this problem, and they were in need of a 24-hour banking service, at least for withdrawing money from their accounts. The People's Bank focused on this issue and installed a teller machine at the external People's Bank branch several years ago, facilitating easy withdrawal of money both day and night. Although this was a real help for students, some other issues associated with this teller machine are also worth considering. Conflicts between students and the village community are frequent, and during such conflict periods it is unsafe for the students to move outside the university premises and hostels. Such conflicts therefore limit the students' access to the outside teller machine. This is particularly so in the evenings, when the students mostly use the teller machine and can also be attacked by the village boys. Robberies of withdrawn money are also not rare on the way between the university premises and the outside teller machine. Since the teller machine is installed so close to the Matara - Kataragama road, many outsiders also compete with students for the withdrawal of money. The majority of the Matara area is Sinhalese, and therefore the students who have come from the northern and eastern areas (where some conflicts are going on) are wary of leaving the university premises for the external teller machine in the evenings


Figure 7.4: Number of accounts held at the university banks

Figure 7.5: University staff and students who maintain their accounts with the government and private banks, according to my sample


Figure 7.6: Who is interested in the new teller facility?

because they may be questioned by either villagers or security personnel. The above-mentioned reasons imply the need to install an internal teller machine within the university premises. As the first step of my study, the university community, comprising around 5000 people, was selected as the study population. The population was then divided into different sectors (as described under the data analysis) and a sample of 500 was selected for the study. Except for 16 of the sample, all the others were currently dealing with one bank or the other. The study shows that the best place to install an internal teller machine is the internal Bank of Ceylon branch. By installing this internal teller machine, the Bank of Ceylon branch may re-attract a large part of the university community. Except for 164 of my sample of 500, all the others agreed to start banking with the internal Bank of Ceylon branch. The majority of those who disagreed were final-year students, who will complete their university education in the near future and then leave the university. Also, the academics who disagreed were mostly temporary tutors whose appointments will be terminated at the year's end. Therefore, I suggest installing a teller machine at the internal Bank of Ceylon branch, approving the ideas of the majority of the current study population.

The methodology called "logistic regression" is capable of isolating the most effective factors out of several factors affecting a given issue. For example, in my study the following customer facts were considered as factors that may potentially affect interest in an internal teller machine within the university premises.

• Nationality
• Gender


• Lodging
• Banking with other banks and satisfaction with their services
• Satisfaction with the teller machine service they currently experience
• Whether an internal teller machine would be easier, safer and more time-efficient to use

Then the logistic regression was conducted: factors with high p-values were removed, and the remaining factors were studied. Thus, in my study, GENDER had p = 0.4562 > 0.25, so this factor was removed and the remaining factors were re-examined. Second, the study was carried out in Minitab, and the TC factor (p = 0.2562 > 0.25) was removed. This factor-reduction process is called the "step-down" (backward elimination) method, and the factors ultimately remaining, whose p-values fall below the threshold, are considered to affect the study problem significantly. The results of the study will be described further under the conclusion section. The logistic transformation equation obtained at the end of the data analysis can be used for clearer and more reasonable results.
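The step-down elimination described above can be sketched as a simple loop: drop the factor with the largest p-value, refit, and repeat until every remaining p-value falls below the 0.25 threshold. The GENDER and TC p-values below are those quoted in the text; the rest are hypothetical stand-ins, and unlike a real refit, the surviving p-values here are not re-estimated after each removal:

```python
REMOVAL_THRESHOLD = 0.25

# p-values for each candidate factor; GENDER (0.4562) and TC (0.2562) are
# quoted in the text, the others are hypothetical placeholders for the
# Minitab output.
p_values = {"GENDER": 0.4562, "TC": 0.2562, "HABOC": 0.0010,
            "IDNA": 0.0100, "PLOUH": 0.0300}

remaining = dict(p_values)
removed = []
while remaining:
    factor, p = max(remaining.items(), key=lambda item: item[1])
    if p <= REMOVAL_THRESHOLD:
        break                    # every surviving factor is significant enough
    removed.append(factor)       # drop the weakest factor first ...
    del remaining[factor]        # ... then (in practice) refit and re-check
```

With these inputs the loop removes GENDER and then TC, mirroring the order described in the text.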


Chapter 8

Conclusion

8.1 Results

For a long time, people have been used to setting aside their main requirements, such as food, clothing and money, after their daily usage, with a view to using them in the future. Step by step, with the use of money, the concept of banking became popular all over the world, and it improved rapidly with the new outcomes of the technological revolution, so that teller and credit card facilities were introduced for the convenience of the public. It is clear that many people in our university also keep in contact with the banking sector, and most of them like to keep their accounts in the government banks rather than the private banks. A considerable number of these people use teller machines and credit cards. But most of the university people face a lot of trouble, because these teller machines are installed in the urban areas. So my main intention was to investigate this matter. About 3500 people come daily into the university premises for their academic and official work. The goal of this study was to survey the requirement for a teller machine within the university premises for public usage. We prepared two questionnaires to collect data, and interviewed more than 500 people around the university area in the Wellamadama premises, comprising academic and non-academic staff, internal and external students and security staff.

The goal of the analysis was to identify who was interested in an ATM facility on the university premises, and why they were interested in this facility. The following variables were tracked in a computer file called "INT-ATM". The variable "PLACE" was treated as a categorical variable in the regression analysis, so three dummy variables were needed to distinguish the four places of living. These variables were defined so that "living inside the Matara urban area" (PLMA) was the referent


Variable                          Code                        Abbreviation
Identification code               ID-Student                  IDS
                                  ID-Non-Academic             IDNA
                                  ID-Academic                 IDA
                                  ID-Temporary Academic       IDTA
Place of living                   Inside the Matara area      PLMA
                                  Outside the Matara area     PLOM
                                  Hostel inside university    PLIUH
                                  Hostel outside university   PLOUH
Gender                            1-Female, 0-Male            GENDER
Having a savings account in BOC   1-Yes, 0-No                 HABOC
Having the teller facility        1-Yes, 0-No                 TC
Interested in a BOC ATM on the
university premises               1-Yes, 0-No                 INTEREST

Table 8.1: Variables for the model INTEREST vs IDA, IDS, IDTA, PLOM, GENDER, PLIUH, PLOUH, HABOC, TC

group, as follows:



PLMA  = 1 if living inside the Matara area, 0 otherwise
PLOM  = 1 if living outside the Matara area, 0 otherwise
PLIUH = 1 if living in a hostel inside the university, 0 otherwise
PLOUH = 1 if living in a hostel outside the university, 0 otherwise

The variable "ID" was also treated as a categorical variable, so dummy variables were needed to distinguish the four sections to which the members belong. These variables were defined so that "Students" (IDS) was the referent group, as follows:

IDS  = 1 if a student, 0 otherwise
IDA  = 1 if academic staff, 0 otherwise
IDTA = 1 if temporary academic staff, 0 otherwise
IDNA = 1 if non-academic staff, 0 otherwise
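The dummy coding defined above can be written as a small helper; a sketch in Python (the level names are illustrative labels, not codes taken from the questionnaire):

```python
def encode_place(place):
    """Three dummies distinguish the four places of living; the referent
    level, 'inside_matara' (PLMA), is the row of all zeros."""
    return {"PLOM": int(place == "outside_matara"),
            "PLIUH": int(place == "hostel_inside_university"),
            "PLOUH": int(place == "hostel_outside_university")}

def encode_id(member_type):
    """Likewise for the identification code, with students (IDS) as referent."""
    return {"IDNA": int(member_type == "non_academic"),
            "IDA": int(member_type == "academic"),
            "IDTA": int(member_type == "temporary_academic")}
```

For a student living inside the Matara area, every dummy is zero, so the model's intercept alone describes that referent profile.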

The following block of edited computer output comes from fitting Minitab's logistic procedure for the dichotomous outcome variable INTEREST on the predictors GENDER, TC, HABOC and the ID and PLACE dummy variables. The logit form of the model being fitted is

logit[Pr(Y = 1)] = β0 + β1(IDNA) + β2(IDA) + β3(IDTA) + β4(PLOM) + β5(PLIUH) + β6(PLOUH) + β7(GENDER) + β8(HABOC) + β9(TC)

where Y denotes the dependent variable INTEREST.

Table 8.2: INTEREST vs IDA, IDS, IDTA, PLOM, GENDER, PLIUH, PLOUH, HABOC, TC (variable codes as in Table 8.1)

The results of fitting the univariable logistic regression models to these data are given in the results table. In this table we present, for each variable listed in the first column, the following information. (1) The estimated slope coefficient(s) for the univariable logistic regression model containing only this variable. (2) The estimated standard error of the estimated slope coefficient. (3) Normal values of the variables. (4) The significance level of the likelihood ratio test. (5) The estimated odds ratio, obtained by exponentiating the estimated coefficient. (For a continuous variable, the odds ratio would be quoted for a meaningful increment rather than a one-unit change.)


(6) The 95% confidence interval for the odds ratio.

Using the given table, we now focus on the information provided under the heading "Analysis of maximum likelihood estimators". From this information, we can see that the ML coefficients obtained for the fitted model are

β̂0 = 1.6810, β̂1 = 0.9152, β̂2 = 0.2731, β̂3 = −1.0998, β̂4 = −0.1135, β̂5 = −0.6987, β̂6 = −1.4646, β̂7 = −0.2563, β̂8 = 1.5276, β̂9 = −0.8361,

so that the fitted model is given (in logit form) by

logit[Pr(Y = 1)] = 1.6810 + 0.9152(IDNA) + 0.2731(IDA) − 1.0998(IDTA) − 0.1135(PLOM) − 0.6987(PLIUH) − 1.4646(PLOUH) − 0.2563(GENDER) + 1.5276(HABOC) − 0.8361(TC)

Based on this fitted model and the information provided in the computer output, we can compute the estimated odds ratios and the estimated p-values for the factors affecting interest in an ATM within the university premises for public usage. We do this using the previously stated rule (forward selection with a test for backward elimination) on the adjusted p-values for the 0-1 variables. With the exception of GENDER, there is evidence that each of the variables has some association with the outcome; GENDER's p-value (0.293) exceeds 0.25, so it was removed from the factors affecting interest in an ATM within the university premises. The results in Table 8.3, when compared to Table 8.2, indicate weaker associations for some covariates when controlling for other variables.
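Items (5) and (6) above can be illustrated numerically: the odds ratio is the exponential of a coefficient, and its 95% confidence interval exponentiates the coefficient plus or minus 1.96 standard errors. The coefficient below is β̂1 (IDNA) from the fitted model; the standard error 0.25 is a hypothetical stand-in, since the Minitab output is not reproduced here:

```python
import math

beta_idna = 0.9152   # fitted coefficient for IDNA, from the model above
se_idna = 0.25       # hypothetical standard error (illustration only)

odds_ratio = math.exp(beta_idna)                  # about 2.50
ci_low = math.exp(beta_idna - 1.96 * se_idna)
ci_high = math.exp(beta_idna + 1.96 * se_idna)
# on this reading, non-academic staff would have roughly 2.5 times the odds
# of being interested, relative to the student referent group
```

The interval is asymmetric around the odds ratio because the exponentiation is applied after, not before, the symmetric normal interval on the coefficient scale.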
The new reduced model is written in logit form as

logit[Pr(Y = 1)] = β0 + β1(IDNA) + β2(IDA) + β3(IDTA) + β4(PLOM) + β5(PLIUH) + β6(PLOUH) + β7(HABOC) + β8(TC)

From the computer output we can see that the ML coefficients obtained for the new fitted model are

β̂0 = 1.510, β̂1 = 0.9230, β̂2 = 0.3067, β̂3 = −1.1366, β̂4 = −0.1302, β̂5 = −0.7866, β̂6 = −1.4239, β̂7 = 1.5263, β̂8 = −0.8070


Variable                                Code                           Abbreviation
Identification code                     ID-Student                     IDS
                                        ID-Non Academic                IDNA
                                        ID-Academic                    IDA
                                        ID-Temporary Academic          IDTA
Place of living                         Inside the Matara area         PLMA
                                        Outside the Matara area        PLOM
                                        Hostel inside University       PLIUH
                                        Hostel outside University      PLOUH
Gender                                  1-Female, 0-Male               GENDER
Having savings account in BOC           1-Yes, 0-No                    HABOC
Having teller facility                  1-Yes, 0-No                    TC
Interested in BOC ATM facility
  in University premises                1-Yes, 0-No                    INTEREST

Table 8.3: INTEREST Vs IDA, IDS, IDTA, PLOM, PLIU, PLOU, HABOC, TC

So the new fitted model is given (in logit form) by

logit[Pr(Y = 1)] = 1.510 + 0.9230(IDNA) + 0.3067(IDA) − 1.1366(IDTA) − 0.1302(PLOM) − 0.7866(PLIUH) − 1.4239(PLOUH) + 1.5263(HABOC) − 0.8070(TC)

By using equations 2.4, 2.5 and 2.6, we get:

ln[ p̂i / (1 − p̂i) ] = β0 + β1(IDNA) + β2(IDA) + β3(IDTA) + β4(PLOM) + β5(PLIUH) + β6(PLOUH) + β7(HABOC) + β8(TC)

p̂i / (1 − p̂i) = e^{β0 + β1(IDNA) + β2(IDA) + β3(IDTA) + β4(PLOM) + β5(PLIUH) + β6(PLOUH) + β7(HABOC) + β8(TC)}

The logit transformation is given by

p̂i = e^{β0 + β1(IDNA) + β2(IDA) + β3(IDTA) + β4(PLOM) + β5(PLIUH) + β6(PLOUH) + β7(HABOC) + β8(TC)} / (1 + e^{β0 + β1(IDNA) + β2(IDA) + β3(IDTA) + β4(PLOM) + β5(PLIUH) + β6(PLOUH) + β7(HABOC) + β8(TC)})

The fitted values are given, using equation 2.6, by

π̂(x) = e^{β0 + β1(IDNA) + β2(IDA) + β3(IDTA) + β4(PLOM) + β5(PLIUH) + β6(PLOUH) + β7(HABOC) + β8(TC)} / (1 + e^{β0 + β1(IDNA) + β2(IDA) + β3(IDTA) + β4(PLOM) + β5(PLIUH) + β6(PLOUH) + β7(HABOC) + β8(TC)})
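The log-odds, odds, and probability forms above are algebraically equivalent: exponentiating the log-odds gives the odds, and p = e^η / (1 + e^η) inverts back to the same linear predictor. A quick numerical check (with an arbitrary value for the linear predictor, not a value from the study) is:

```python
import math

eta = 0.5                                  # arbitrary value of the linear predictor
p = math.exp(eta) / (1.0 + math.exp(eta))  # probability form: p = e^eta / (1 + e^eta)
odds = p / (1.0 - p)                       # odds form: p / (1 - p) = e^eta
log_odds = math.log(odds)                  # logit form: ln[p / (1 - p)] recovers eta

print(round(p, 5), round(odds, 5), round(log_odds, 5))
```

The recovered log-odds equals the starting linear predictor, confirming that the three expressions describe the same fitted model.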


By using the previous results, we can substitute the estimated values into the logistic transformation:

π̂(x) = e^{1.510 + 0.9230(IDNA) + 0.3067(IDA) − 1.1366(IDTA) − 0.1302(PLOM) − 0.7866(PLIUH) − 1.4239(PLOUH) + 1.5263(HABOC) − 0.8070(TC)} / (1 + e^{1.510 + 0.9230(IDNA) + 0.3067(IDA) − 1.1366(IDTA) − 0.1302(PLOM) − 0.7866(PLIUH) − 1.4239(PLOUH) + 1.5263(HABOC) − 0.8070(TC)})

8.2 Conclusion

Compared with the other subjects, the non-academic staff (IDNA) were more than twice as likely to be interested in the requirement of a teller machine facility within the university premises for public usage, with an odds ratio (OR) of 2.52 and a confidence interval (CI) of 1.21 to 5.24. The subjects already holding accounts at the Bank of Ceylon were more than four times as likely to be interested (OR = 4.61; CI 2.46 to 8.63). In contrast, the temporary and demonstrator staff were about three times less likely to be interested in this BOC teller machine (OR = 0.33; CI 0.17-0.66). The academic staff and students were also less interested; however, they were interested in the requirement of a People's Bank teller machine facility within the university premises for public usage.
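The odds ratios quoted in this conclusion can be checked against the fitted coefficients, since OR = e^β̂. The sketch below uses the coefficient values reported in the full model above; agreement with the reported odds ratios is up to rounding (for example, the non-academic staff value of 2.52 matches exp(0.9230) from the reduced model, while the full model gives exp(0.9152) ≈ 2.50):

```python
import math

# Coefficients from the fitted full model reported earlier
beta = {"IDNA": 0.9152, "IDTA": -1.0998, "HABOC": 1.5276}

for name, b in beta.items():
    print(name, round(math.exp(b), 2))  # odds ratio = exp(coefficient)
```

Exponentiating gives approximately 2.50 for IDNA, 0.33 for IDTA, and 4.61 for HABOC, consistent with the values stated above.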


Chapter 9 Appendix


List of Figures

1.1 Example Data set-I . . . 4
1.2 Example Data set-II . . . 4
1.3 Motivation for the least-squares regression line . . . 5
1.4 Lines A and B both satisfying the criterion Σ(yi − ŷi) = 0 . . . 6
1.5 The least-squares procedure minimizes the sum of the squares of the residuals ei . . . 7
1.6 Examples of possible population regression lines . . . 9
1.7 Depth of water vs water temperature . . . 10
1.8 Quadratic model: μ = β0 + β1x + β2x^2 . . . 14
1.9 Cubic model: μ = β0 + β1x + β2x^2 + β3x^3 . . . 15
2.1 Scatterplot of CHD by AGE for 100 subjects . . . 19
2.2 Plot of the percentage of subjects with CHD in each age group . . . 20
3.1 Design variables . . . 36
4.1 Comparison of the weight of two groups of boys with different distributions of age . . . 63
4.2 Plot of the logits under three different models showing the presence and absence of interaction . . . 67
5.1 Different types of models for the relationship between the logit and a continuous variable . . . 78
6.1 Size of the University population . . . 85
6.2 Academic staff . . . 86
6.3 Temporary and demonstrator staff . . . 87
6.4 Non-academic staff . . . 88
6.5 Number of savings accounts for the sample taken out of university premises . . . 88
6.6 Analysis of savings accounts for the sample taken out of university premises . . . 89
6.7 University staff & students maintaining their accounts with the university People's Bank & Bank of Ceylon, according to my sample . . . 90
6.8 Analysis of university staff & students maintaining their accounts with the university People's Bank & Bank of Ceylon, according to my sample . . . 91
6.9 Number of accounts held in the university banks . . . 92
6.10 How the payment money is divided among the university banks . . . 93
6.11 Number of accounts held in the university banks . . . 94
6.12 How the payment money is divided among the university banks . . . 95
6.13 University staff and students maintaining their accounts with the government & private banks, according to my sample . . . 95
6.14 University staff and students maintaining their accounts with the government & private banks, according to my sample . . . 96
6.15 Number who do not have savings accounts . . . 96
6.16 Sample members who have obtained the teller facility . . . 97
6.17 Whether the sample members have obtained the teller facility . . . 98
6.18 Where the sample members have obtained the teller facility . . . 98
6.19 Where the sample members have obtained the teller facility . . . 99
6.20 New teller machine to be within or outside the university area . . . 99
6.21 New teller machine to be within or outside the university area . . . 100
6.22 The number of customers who are interested in opening a new account . . . 100
6.23 Are you interested in opening a new account . . . 101
6.24 Who are interested in the new teller facility . . . 101
7.1 Number of accounts held in the university banks . . . 104
7.2 How the payment money is divided among the university banks . . . 105
7.3 Number of accounts held in the university banks . . . 106
7.4 Number of accounts held in the university banks . . . 107
7.5 University staff and students maintaining their accounts with the government & private banks, according to my sample . . . 107
7.6 Who are interested in the new teller facility . . . 108

List of Tables

1.1 Example Data set-I . . . 3
1.2 Example Data set-II . . . 4
1.3 Computations for finding β0 and β1 . . . 8
2.1 Frequency table of AGE group by CHD . . . 20
2.2 . . . 21
2.3 Results of fitting the logistic regression model to the data in Table 2.1 . . . 27
2.4 Estimated covariance matrix of the estimated coefficients in Table 2.3 . . . 34
3.1 An example of the coding of the design variables for race, coded at three levels . . . 36
3.2 Estimated coefficients for a multiple logistic regression model using the variables AGE, weight at last menstrual period (LWT), RACE, and number of first-trimester physician visits (FTV) from the low birth weight study . . . 42
3.3 Estimated coefficients for a multiple logistic regression model using the variables LWT and RACE from the low birth weight study . . . 43
3.4 Estimated covariance matrix of the estimated coefficients in Table 3.3 . . . 46
4.1 Values of the logistic regression model when the independent variable is dichotomous . . . 50
4.2 Cross-classification of AGE dichotomized at 55 years and CHD for 100 subjects . . . 51
4.3 Illustration of the coding of the design variable using the reference cell method . . . 54
4.4 Illustration of the coding of the design variable using the deviation from means method . . . 54
4.5 Cross-classification of hypothetical data on RACE and CHD status for 100 subjects . . . 55
4.6 Specification of the design variables for RACE using reference cell coding with white as the reference group . . . 56
4.7 Results of fitting the logistic regression model to the data in Table 4.5 using the design variables in Table 4.6 . . . 56
4.8 Specification of the design variables for RACE using deviation from means coding . . . 58
4.9 Results of fitting the logistic regression model to the data in Table 4.5 using the design variables in Table 4.8 . . . 60
4.10 Descriptive statistics for two groups of 50 men on AGE and whether they had seen a physician (PHY) (1=Yes, 0=No) within the last months . . . 63
4.11 Results of fitting the logistic regression model to the data summarized in Table 4.10 . . . 65
4.12 Estimated logistic regression coefficients, deviance, and the likelihood ratio test statistic (G) for an example showing evidence of confounding but no interaction (n=400) . . . 67
4.13 Estimated logistic regression coefficients, deviance, and the likelihood ratio test statistic (G) for an example showing evidence of confounding but no interaction (n=400) . . . 68
4.14 Estimated logistic regression coefficients, deviance, the likelihood ratio test statistic (G), and the p-value for the change for models containing LWD and AGE from the low birth weight data (n=189) . . . 71
4.15 Estimated covariance matrix for the estimated parameters in model 3 of Table 4.14 . . . 71
4.16 Estimated odds ratios and 95 percent confidence intervals for LWD, controlling for AGE . . . 72
5.1 Univariable logistic regression models for the USI (n=575) . . . 81
5.2 Results for a multivariable model containing the covariates significant at the level of Table 5.1 . . . 82
8.1 INTERESTED Vs IDA, IDS, IDTA, PLOM, GENDER, PLIUH, PLOUH, HABOC, TC . . . 111
8.2 INTERESTED Vs IDA, IDS, IDTA, PLOM, GENDER, PLIU, PLOU, HABOC, TC . . . 112
8.3 INTEREST Vs IDA, IDS, IDTA, PLOM, PLIU, PLOU, HABOC, TC . . . 114


Bibliography

[1] Albright, Winston and Zappe, Schaum's Outline of Theory and Problems of Business Statistics, Schaum's Outline Series, McGraw-Hill
[2] Amemiya, T. and Powell, Applied Statistics with Microsoft Excel, Duxbury Thomson Learning
[3] Berk, K. and Carey, Wayne L. Winston, Christopher Zappe, Data Analysis and Decision Making with Microsoft Excel, Thomson Brooks/Cole
[4] Carver, Data Analysis with MINITAB 12, Duxbury Thomson Learning
[5] Conway, D. and Roberts, Regression Analysis in Employment Discrimination Cases, Statistics and the Law, New York, NY: Wiley, 1986
[6] David W. Hosmer, Applied Logistic Regression (3rd ed.), John Wiley and Sons
[7] Freund, R. and Littell, SAS System for Regression Analysis (2nd ed.), Duxbury Thomson Learning
[8] Hildebrand, Statistical Thinking for Managers (4th ed.), Duxbury Thomson Learning
[9] Kleinbaum, Kupper & Muller, Applied Regression Analysis and Other Multivariable Methods (2nd ed.), Duxbury Press, Cole Publishing Company
[10] Kleinbaum, Kupper & Muller, Applied Regression Analysis and Other Multivariable Methods (3rd ed.), Duxbury Press, Cole Publishing Company
[11] Lehmann, Zeitz, Statistical Explorations with Microsoft Excel, Duxbury Press, 1990
[12] Hildebrand, Statistical Thinking for Managers (4th ed.), Duxbury Thomson Learning
[13] Kleinbaum, Kupper & Muller, Applied Regression Analysis and Other Multivariable Methods (2nd ed.), Duxbury Press, Cole Publishing Company
[14] State College, MINITAB User's Guide, Release 12 for Windows, State College, PA: Minitab, Inc.
[15] Shiffler, Adams, Introductory Business Statistics with Computer Applications (2nd ed.), Duxbury Thomson Learning, 1995
[16] Terry Dielman, Applied Regression Analysis for Business and Economics, Thomson Brooks/Cole
