Personal Learning Paper On Quantitative Techniques
Prepared By:
Submitted To: Prof. Priyabrata Nayak
Contents

Introduction
History
Data Collecting
Arranging Data using the Data Array
Arranging Data using the Frequency Distribution
Central Tendency
Dispersion
Skewness
Measures of Central Tendency: Mean, Median, Mode
Probability
Probability Distribution
Discrete Probability Distribution
Sampling
Central Limit Theorem
Standard Error
Sample Size
Estimation
Point Estimation
Interval Estimation
Regression Analysis
Correlation
Coefficient of Determination
Introduction to Statistics
The word statistics means different things to different people. To a football fan, statistics are rushing, passing, and first-down numbers; to the Chargers' coach in the second example, statistics is the chance that the Giants will throw the pass over center. To the manager of a power station, statistics are the amounts of pollution being released into the atmosphere. To the Food and Drug administrator in our third example, statistics is the likely percentage of undesirable effects in the population using the new prostate drug. To the Community Bank in the fourth example, statistics is the chance that Sarah will repay her loan on time. Each of these people is using the word correctly, yet each uses it in a different way. All of them are using statistics to help them make decisions.
History
The word statistik comes from the Italian word statista (meaning "statesman"). It was first used by Gottfried Achenwall (1719-1772), a professor at Marlborough and Göttingen. Dr. E. A. W. Zimmerman introduced the word statistics into England. Its use was popularized by Sir John Sinclair in his work Statistical Account of Scotland (1791-1799). Long before the eighteenth century, however, people had been recording and using data.
Data Collecting
Statisticians select their observations so that all relevant groups are represented in the data. To determine the potential market for a new product, for example, analysts may study 100 consumers in a certain geographical area. Analysts must be certain that this group contains people representing variables such as income level, race, education, and neighborhood. Past data is used to make decisions about the future.
Arranging Data using the Data Array
The data array is one of the simplest ways to present data. It arranges the data in ascending or descending order.
Table 1-1 Sample of Daily Production in Yards of 30 Carpet Looms
16.2 15.7 16.4 15.4 16.4
15.8 16.0 15.2 15.7 16.6
15.8 16.2 15.9 15.9 15.6
15.8 16.1 15.9 16.0 15.6
16.3 16.8 15.9 16.3 16.9
15.6 16.0 16.8 16.0 16.3
Table 1-2 Data Array of Daily Production in Yards of 30 Carpet Looms
15.2 15.4 15.6 15.6 15.6
15.7 15.7 15.8 15.8 15.8
15.9 15.9 15.9 15.9 16.0
16.0 16.0 16.0 16.1 16.2
16.2 16.3 16.3 16.3 16.4
16.4 16.6 16.8 16.8 16.9
Table 1-1 contains the raw data, and Table 1-2 rearranges the same data in ascending order as a data array.

Advantages of the data array:
We can quickly locate the lowest and highest values in the data.
We can easily divide the data into sections.
We can see whether any values appear more than once in the array.
We can observe the distance between succeeding values in the data.
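A minimal Python sketch (with the loom data transcribed from Table 1-1) showing that a data array is simply the raw data sorted:

    # Raw daily production (yards) for 30 carpet looms, from Table 1-1.
    raw = [16.2, 15.7, 16.4, 15.4, 16.4,
           15.8, 16.0, 15.2, 15.7, 16.6,
           15.8, 16.2, 15.9, 15.9, 15.6,
           15.8, 16.1, 15.9, 16.0, 15.6,
           16.3, 16.8, 15.9, 16.3, 16.9,
           15.6, 16.0, 16.8, 16.0, 16.3]

    # The data array is the same values in ascending order (Table 1-2).
    array = sorted(raw)
    print("lowest:", array[0], "highest:", array[-1])   # 15.2 and 16.9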
Arranging Data using the Frequency Distribution

In statistics, a frequency distribution is a graph or data set organized to show the frequency of occurrence of each possible outcome of a repeatable event observed many times. Simple examples are election returns and test scores listed by percentile. A frequency distribution can be graphed as a histogram or pie chart. For large data sets, the stepped graph of a histogram is often approximated by the smooth curve of a distribution function (called a density function when normalized so that the area under the curve is 1). The famed bell curve, or normal distribution, is the graph of one such function. Frequency distributions are particularly useful in summarizing large data sets and assigning probabilities.
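As a rough sketch of how such a distribution is tallied, the sorted loom data (`array` from the sketch above) can be grouped into classes; the class width of 0.3 yards is an illustrative choice, not one fixed by the text:

    # Tally the sorted loom data ('array' from the sketch above) into classes.
    # Work in tenths of a yard so class boundaries are exact integers.
    counts = {}
    for x in array:
        t = round(x * 10)                  # e.g. 15.8 yards -> 158 tenths
        k = (t - 152) // 3                 # class index; width = 3 tenths (0.3 yd)
        lo = (152 + 3 * k) / 10            # lower class limit, back in yards
        counts[lo] = counts.get(lo, 0) + 1

    for lo in sorted(counts):              # boundary values fall in the upper class
        print(f"{lo:.1f}-{lo + 0.3:.1f}: {counts[lo]} looms")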
Central Tendency

A measure that indicates the typical value of a distribution. The mean and the median are examples of measures of central tendency.
Dispersion

A term used in statistics that refers to the spread of a set of values around a mean or average level. In finance, dispersion is used to measure the volatility of different types of investment strategies. Returns that have wide dispersions are generally seen as more risky because they have a higher probability of closing dramatically lower than the mean. In practice, standard deviation is the tool that is generally used to measure the dispersion of returns.
Skewness

The degree to which a distribution departs from symmetry about its mean value. In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable. Roughly speaking, a distribution has positive skew (right-skewed) if the right (higher-value) tail is longer or fatter, and negative skew (left-skewed) if the left (lower-value) tail is longer or fatter. The two are often confused, since most of the mass of a right-skewed (or left-skewed) distribution lies to the left (or right) of its respective tail.
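A minimal sketch making both dispersion and skewness concrete (the data list is illustrative, and population formulas are used): the standard deviation measures spread around the mean, and the standardized third moment measures asymmetry:

    import math

    data = [2, 4, 7, 12, 15, 15, 16, 18, 22, 49]    # illustrative values

    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n    # population variance
    sd = math.sqrt(var)                             # dispersion around the mean

    # Moment coefficient of skewness: third central moment / sd^3.
    skew = sum((x - mean) ** 3 for x in data) / n / sd ** 3

    print(f"mean={mean:.2f} sd={sd:.2f} skewness={skew:.2f}")
    # Positive skewness here reflects the long right tail created by the value 49.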
Measures of Central Tendency

The three most common measures of central tendency are the mean, the median, and the mode.
Measures of Central Tendency: Arithmetic Mean

The arithmetic mean is the most common measure of central tendency. It is simply the sum of the numbers divided by the number of numbers. The symbol μ is used for the mean of a population and the symbol M for the mean of a sample. The formula for μ is

    μ = ΣX / N

where ΣX is the sum of all the numbers in the sample and N is the number of numbers in the sample. As an example, the mean of the numbers 1, 2, 3, 6, 8 is (1 + 2 + 3 + 6 + 8) / 5 = 20 / 5 = 4, regardless of whether the numbers constitute the entire population or just a sample from the population. The table, Number of touchdown passes, shows the number of touchdown (TD) passes thrown by each of the 31 teams in the National Football League in the 2000 season. The mean number of touchdown passes thrown is

    μ = ΣX / N = 634 / 31 = 20.4516
Number of touchdown passes:

37  33  33  32  29  28  28  23
22  22  22  21  21  21  20  20
19  19  18  18  18  18  16  15
14  14  14  12  12   9   6
Although the arithmetic mean is not the only "mean" (there is also a geometric mean), it is by far the most commonly used. Therefore, if the term "mean" is used without specifying whether it is the arithmetic mean, the geometric mean, or some other mean, it is assumed to refer to the arithmetic mean.
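A minimal sketch verifying the computation above (data transcribed from the touchdown-pass table):

    # Touchdown passes for the 31 NFL teams (2000 season), from the table above.
    td = [37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20, 20,
          19, 19, 18, 18, 18, 18, 16, 15, 14, 14, 14, 12, 12, 9, 6]

    mean = sum(td) / len(td)      # sum(X) / N = 634 / 31
    print(round(mean, 4))         # 20.4516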
Measures of Central Tendency: Median

The median is also a frequently used measure of central tendency. The median is the midpoint of a distribution: the same number of scores is above the median as below it. For the data in the table, Number of touchdown passes, there are 31 scores. The 16th highest score (which equals 20) is the median, because there are 15 scores below it and 15 scores above it. The median can also be thought of as the 50th percentile. Let's return to the made-up example of the quiz on which you made a 3, discussed previously in the module Introduction to Central Tendency and shown in the table below.

Three possible datasets for the 5-point make-up quiz:
Student        Dataset 1   Dataset 2   Dataset 3
You            3           3           3
John's         3           4           2
Maria's        3           4           2
Shareecia's    3           4           2
Luther's       3           5           1
For Dataset 1, the median is 3, the same as your score. For Dataset 2, the median is 4; your score is below the median, which means you are in the lower half of the class. Finally, for Dataset 3, the median is 2; your score is above the median and therefore in the upper half of the distribution.

Computation of the median: when there is an odd number of numbers, the median is simply the middle number. For example, the median of 2, 4, and 7 is 4. When there is an even number of numbers, the median is the mean of the two middle numbers. Thus, the median of the numbers 2, 4, 7, 12 is (4 + 7) / 2 = 5.5.
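A minimal sketch of both rules, using the standard-library statistics module:

    import statistics

    # Odd count: the middle number.
    print(statistics.median([2, 4, 7]))        # 4
    # Even count: the mean of the two middle numbers.
    print(statistics.median([2, 4, 7, 12]))    # 5.5

    # Median of the touchdown-pass data (31 scores -> the 16th in sorted order).
    td = [37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20, 20,
          19, 19, 18, 18, 18, 18, 16, 15, 14, 14, 14, 12, 12, 9, 6]
    print(statistics.median(td))               # 20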
Measures of Central Tendency: Mode

The mode is the most frequently occurring value. For the data in the table, Number of touchdown passes, the mode is 18, since more teams (4) had 18 touchdown passes than any other number of touchdown passes. With continuous data, such as response time measured to many decimal places, the frequency of each value is one, since no two scores will be exactly the same (see the discussion of continuous variables). Therefore, the mode of continuous data is normally computed from a grouped frequency distribution. The grouped frequency distribution table below shows a grouped frequency distribution for the target response time data. Since the interval with the highest frequency is 600-700, the mode is the middle of that interval (650).

Grouped frequency distribution:
Range        Frequency
500-600      3
600-700      6
700-800      5
800-900      5
900-1000     0
1000-1100    1
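A minimal sketch of the grouped-mode rule just described: find the interval with the highest frequency and take its midpoint. The dictionary simply restates the table above:

    # Grouped frequency distribution from the table: (lower, upper) -> count.
    groups = {(500, 600): 3, (600, 700): 6, (700, 800): 5,
              (800, 900): 5, (900, 1000): 0, (1000, 1100): 1}

    lo, hi = max(groups, key=groups.get)    # interval with the highest frequency
    mode = (lo + hi) / 2                    # midpoint of that interval
    print(mode)                             # 650.0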
Probability

Probability theory is the mathematical study of phenomena characterized by randomness or uncertainty. More precisely, probability is used for modelling situations in which an experiment, realized under the same circumstances, produces different results (typically throwing a die or tossing a coin). Mathematicians and actuaries think of probabilities as numbers in the closed interval from 0 to 1 assigned to "events" whose occurrence or failure to occur is random. Probabilities P(A) are assigned to events A according to the probability axioms. The probability that an event A occurs given the known occurrence of an event B is the conditional probability of A given B; its numerical value is P(A|B) = P(A ∩ B) / P(B) (as long as P(B) is nonzero). If the conditional probability of A given B is the same as the ("unconditional") probability of A, then A and B are said to be independent events. That this relation between A and B is symmetric may be seen more readily by realizing that it is the same as saying P(A ∩ B) = P(A) P(B) when A and B are independent events.
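To make the two formulas concrete, here is a small enumeration sketch (two fair dice are an illustrative choice, not from the text): the events "the sum is 7" and "the first die shows 3" turn out to be independent:

    from itertools import product
    from fractions import Fraction

    # Sample space: all 36 equally likely outcomes of rolling two fair dice.
    omega = list(product(range(1, 7), repeat=2))

    def p(event):              # P(E) = |E| / |omega| for equally likely outcomes
        return Fraction(len(event), len(omega))

    A = [w for w in omega if w[0] + w[1] == 7]   # the sum is 7
    B = [w for w in omega if w[0] == 3]          # the first die shows 3
    AB = [w for w in A if w in B]                # A intersect B

    print(p(AB) / p(B))                # P(A|B) = P(A and B) / P(B) = 1/6
    print(p(A))                        # P(A) = 1/6, the same, so A, B independent
    print(p(AB) == p(A) * p(B))        # True: the symmetric form of independence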
Probability Distribution

Outcomes of an experiment and their probabilities of occurrence. If the experiment were to be repeated any number of times, the same probabilities should also repeat. For example, the probability distribution for the possible number of heads from two tosses of a fair coin (having both a head and a tail) would be as follows:

Number of Heads    Tosses                          Probability of Event
0                  (tail, tail)                    0.25
1                  (head, tail) + (tail, head)     0.50
2                  (head, head)                    0.25
In mathematics and statistics, a probability distribution, more properly called a probability distribution function, assigns to every interval of the real numbers a probability, so that the probability axioms are satisfied. In technical terms, a probability distribution is a probability measure whose domain is the Borel algebra on the reals. A probability distribution is a special case of the more general notion of a probability measure, which is a function that assigns probabilities satisfying the Kolmogorov axioms to the measurable sets of a measurable space. Additionally, some authors define a distribution generally as the probability measure induced by a random variable X on its range: the probability of a set B is P(X⁻¹(B)). However, this section discusses only probability measures over the real numbers.
Discrete Probability Distribution

The binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p. Such a success/failure experiment is also called a Bernoulli experiment or Bernoulli trial; in fact, when n = 1, the binomial distribution is the Bernoulli distribution. The binomial distribution is the basis for the popular binomial test of statistical significance. A typical example is the following: assume 5% of the population is green-eyed, and you pick 500 people randomly. The number of green-eyed people you pick is a random variable X which follows a binomial distribution with n = 500 and p = 0.05 (when picking the people with replacement).
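A minimal sketch of this example, using the standard binomial pmf P(X = k) = C(n, k) · p^k · (1 − p)^(n − k); the particular probabilities printed (such as P(X = 25)) are illustrative queries, not from the text:

    from math import comb

    n, p = 500, 0.05     # 500 people sampled, 5% of the population green-eyed

    def binom_pmf(k):
        # P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
        return comb(n, k) * p ** k * (1 - p) ** (n - k)

    print(binom_pmf(25))                          # P(exactly 25 green-eyed)
    print(sum(binom_pmf(k) for k in range(31)))   # P(at most 30 green-eyed)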
Sampling

In many disciplines, there is often a need to describe the characteristics of some large entity, such as the air quality in a region, the prevalence of smoking in the general population, or the output from a production line of a pharmaceutical company. Due to practical considerations, it is impossible to assay the entire atmosphere, interview every person in the nation, or test every pill. Sampling is the process whereby information is obtained from selected parts of an entity, with the aim of making general statements that apply to the entity as a whole, or an identifiable part of it. Opinion pollsters use sampling to gauge political allegiances or preferences for brands of commercial products, whereas water quality engineers employed by public health departments will take samples of water to make sure it is fit to drink. The process of drawing conclusions about the larger entity based on the information contained in a sample is known as statistical inference.

There are several advantages to using sampling rather than conducting measurements on an entire population. An important advantage is the considerable savings in time and money that can result from collecting information from a much smaller population. When sampling individuals, the reduced number of subjects that need to be contacted may allow more resources to be devoted to finding and persuading nonresponders to participate. The information collected using sampling is often more accurate, as greater effort can be expended on the training of interviewers, more sophisticated and expensive measurement devices can be used, repeated measurements can be taken, and more detailed questions can be posed.
Sampling Definitions

The term "target population" is commonly used to refer to the group of people or entities (the "universe") to which the findings of the sample are to be generalized. The "sampling unit" is the basic unit (e.g., person, household, pill) around which a sampling procedure is planned. For instance, if one wanted to apply sampling methods to estimate the prevalence of diabetes in a population, the sampling unit would be persons, whereas households would be the sampling unit for a study to determine the number of households where one or more persons were smokers. The "sampling frame" is any list of all the sampling units in the target population. Although a complete list of all individuals in a population is rarely available, an alphabetic listing of residents in a community or of registered voters are examples of sampling frames.
Central Limit Theorem

A central limit theorem is any of a set of weak-convergence results in probability theory. They all express the fact that a sum of many independent, identically distributed random variables will tend to be distributed according to a particular "attractor distribution". The most important and famous result is called the Central Limit Theorem, which states that if the summed variables have a finite variance, then the sum will be approximately normally distributed. Since many real processes yield distributions with finite variance, this explains the ubiquity of the normal distribution.
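A quick simulation sketch of this effect (the choice of uniform summands and the sample counts are arbitrary): sums of i.i.d. uniform variables come out close to normal even though each summand is far from normal:

    import random, statistics

    # Sum 48 i.i.d. uniform(0, 1) variables, repeated 10,000 times.
    sums = [sum(random.random() for _ in range(48)) for _ in range(10_000)]

    # For uniform(0, 1): mean 0.5, variance 1/12, so a sum of 48 terms has
    # mean 48 * 0.5 = 24 and standard deviation sqrt(48 / 12) = 2.
    print(statistics.mean(sums))    # close to 24
    print(statistics.stdev(sums))   # close to 2

    # Roughly 68% of the sums should fall within one sd of the mean.
    print(sum(1 for s in sums if 22 < s < 26) / len(sums))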
Standard Error

In statistics, the standard error of a measurement, value, or quantity is the estimated standard deviation of the process by which it was generated, including adjusting for sample size. In other words, the standard error is the standard deviation of the sampling distribution of the sample statistic (such as the sample mean, sample proportion, or sample correlation).
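For the most common case, the sample mean, the standard error is estimated as s/√n, the sample standard deviation divided by the square root of the sample size. A minimal sketch with illustrative data:

    import math, statistics

    sample = [15.2, 15.6, 15.8, 16.0, 16.3, 16.4]   # illustrative measurements
    n = len(sample)
    s = statistics.stdev(sample)                    # sample standard deviation

    se_mean = s / math.sqrt(n)                      # standard error of the mean
    print(round(se_mean, 3))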
Sample Size

Sample size, usually designated N, is the number of repeated measurements in a statistical sample. The measurements are used to estimate a parameter, a descriptive quantity of some population, and N determines the precision of that estimate: larger N gives smaller error bounds of estimation. A typical statement is that one can be 95% sure the true parameter is within ±B of the estimate, where B is an error bound that decreases with increasing N. Such a bounded estimate is referred to as a confidence interval for that parameter.
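Under the usual normal approximation for a mean, the error bound is B = z·σ/√N, so the N needed for a target B can be solved directly. A sketch, where the 95% multiplier z = 1.96 is standard and σ is an assumed, known population standard deviation:

    import math

    z = 1.96          # 95% confidence multiplier (normal approximation)
    sigma = 10.0      # assumed population standard deviation
    B = 2.0           # desired error bound (+/- B around the estimate)

    # From B = z * sigma / sqrt(N), solve for N and round up.
    N = math.ceil((z * sigma / B) ** 2)
    print(N)          # 97 measurements needed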
Estimation

Estimation is the calculated approximation of a result which is usable even if the input data are incomplete, uncertain, or noisy. In statistics, see estimation theory and estimators. In mathematics, approximation or estimation typically means finding upper or lower bounds of a quantity that cannot readily be computed precisely. While initial results may be unusably uncertain, recursively feeding outputs back in as inputs can refine the results until they are approximately accurate, certain, complete, and noise-free.
Point Estimation

In statistics, point estimation involves the use of sample data to calculate a single value (known as a statistic) which is to serve as a "best guess" for an unknown (fixed or random) population parameter. More formally, it is the application of a point estimator to the data. Point estimation should be contrasted with Bayesian methods of estimation, where the goal is usually to compute (perhaps to an approximation) the posterior distributions of parameters and other quantities of interest. The contrast here is between estimating a single point (point estimation) versus estimating a weighted set of points (a probability density function).
Interval Estimation

In statistics, interval estimation is the use of sample data to calculate an interval of possible (or probable) values of an unknown population parameter. The most prevalent forms of interval estimation are confidence intervals (a frequentist method) and credible intervals (a Bayesian method).
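A minimal sketch of a 95% confidence interval for a population mean (the data are illustrative; the normal multiplier 1.96 is appropriate for large samples, while a t multiplier would be used for small ones):

    import math, statistics

    data = [20, 22, 19, 24, 21, 23, 20, 22, 25, 18,
            21, 23, 22, 20, 24, 19, 21, 22, 23, 20]   # illustrative sample

    n = len(data)
    m = statistics.mean(data)
    se = statistics.stdev(data) / math.sqrt(n)        # standard error of the mean

    lo, hi = m - 1.96 * se, m + 1.96 * se             # 95% confidence interval
    print(f"{m:.2f} +/- {1.96 * se:.2f}  ->  ({lo:.2f}, {hi:.2f})")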
Regression Analysis

In statistics, regression analysis is used to model relationships between variables and determine the magnitude of those relationships. The models can be used to make predictions.

Introduction. Regression analysis models the relationship between one or more response variables (also called dependent variables, explained variables, predicted variables, or regressands), usually named Y, and the predictors (also called independent variables, explanatory variables, control variables, or regressors), usually named X1, ..., Xp. Multivariate regression describes models that have more than one response variable.

Types of regression. Simple linear regression and multiple linear regression are related statistical methods for modeling the relationship between two or more random variables using a linear equation. Simple linear regression refers to a regression on two variables, while multiple regression refers to a regression on more than two variables. Linear regression assumes the best estimate of the response is a linear function of some parameters (though not necessarily linear in the predictors). If the relationship between the variables being analyzed is not linear in parameters, a number of nonlinear regression techniques may be used to obtain a more accurate regression.
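A minimal sketch of simple linear regression by least squares (the advertising/sales numbers are made up for illustration):

    # Least-squares fit of y = a + b*x for one predictor (simple linear regression).
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]           # illustrative advertising spend
    ys = [2.1, 3.9, 6.2, 7.8, 10.1]          # illustrative sales

    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n

    # Slope b = sum((x - mx)(y - my)) / sum((x - mx)^2); intercept a = my - b*mx.
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = my - b * mx

    print(f"y = {a:.3f} + {b:.3f} x")        # fitted line
    print(a + b * 6.0)                       # prediction at x = 6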
Correlation

The degree of relationship between business and economic variables, such as cost and volume. Correlation analysis evaluates how consistently the value of one variable changes when the value of the other is changed; on its own it does not establish a cause/effect relationship. A prediction can be made based on the relationship uncovered. An example is the effect of advertising on sales. The degree of correlation is measured statistically by the correlation coefficient (r) and its square, the coefficient of determination (r-squared).
Coefficient of Determination

A statistical measure of goodness of fit. It measures how good the estimated regression equation is, and is designated r² (read as r-squared). The higher the r-squared, the more confidence one can have in the equation. Statistically, the coefficient of determination represents the proportion of the total variation in the y variable that is explained by the regression equation. It has a range of values between 0 and 1. It is computed as

    r² = explained variation / total variation = 1 − SSE / SST

where SSE is the sum of squared errors around the regression line and SST is the total sum of squares around the mean of y.
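A minimal sketch computing r and r² directly from their definitions, reusing the same illustrative data as the regression sketch above:

    import math

    xs = [1.0, 2.0, 3.0, 4.0, 5.0]           # e.g. advertising spend (illustrative)
    ys = [2.1, 3.9, 6.2, 7.8, 10.1]          # e.g. sales (illustrative)

    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n

    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)

    r = sxy / math.sqrt(sxx * syy)           # correlation coefficient
    print(f"r = {r:.4f}, r-squared = {r * r:.4f}")
    # r-squared is the share of the variation in y explained by the fitted line.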