Surveys
What is it?
► A survey is a measurement process used to collect information during a highly structured interview, sometimes with a human interviewer and other times without.
► The questions are carefully chosen or crafted, sequenced, and precisely asked of each participant.
The sources of errors in the communication approach
► Selection or crafting of inappropriate questions
► Asking questions in an inappropriate order
► Use of inappropriate transitions and instructions to elicit information
Interviewer error
► Failure to secure participant cooperation: interviewers may not do a good job of enlisting participants to cooperate
► Failure to record answers accurately and completely
► Failure to consistently execute interview procedures
► Failure to establish an appropriate interview environment
► Falsification of individual answers or whole interviews
► Inappropriate influencing behaviour
► Physical presence bias
Participant error
► For a successful survey, three broad conditions must be met by the participants:
1. They must possess the information being targeted by the investigative questions.
2. They must understand their role in the interview as the provider of accurate information.
3. They must have adequate motivation to cooperate.
Participation-based error
► Three factors influence participation. The participant:
1. Must believe that the experience will be pleasant and satisfying.
2. Must believe that answering the survey is an important and worthwhile use of his or her time.
3. Must dismiss any mental reservations that he or she might have about participation.
Choice of the process
► Refer to the Word document on surveys
Experiments and test markets
► What are experiments?
1. A study involving intervention by the researcher beyond that required for measurement.
2. The usual intervention is to manipulate some variable in the setting and observe how it affects the subjects being studied, i.e. the researcher manipulates the independent variable and then observes whether the hypothesized dependent variable is affected by the intervention.
Conducting an experiment
► A researcher must accomplish certain activities to conduct a successful experiment:
1. Select relevant variables
2. Specify the treatment levels
3. Control the experimental environment
4. Choose the experimental design
5. Select and assign the subjects
6. Pilot test, revise and test
7. Analyze the data
► HYPOTHESIS: A hypothesis is a relational statement: it describes the relationship between two or more variables.
► Treatment levels: In an experiment, participants experience a manipulation of the independent variable called the experimental treatment. The treatment levels of the independent variable are the arbitrary or natural groups the researcher makes within the independent variable of an experiment. E.g. if salary is hypothesized to have an effect on employees' exercising of stock purchase options, it might be divided into high, middle and low ranges to represent three levels of the independent variable.
Miscellaneous terms
► The control group is composed of subjects who are not exposed to the experimental treatment, in contrast to those who receive it.
► When the subjects do not know whether they are receiving the treatment, they are said to be blind.
► When the experimenters also do not know whether they are giving the treatment to the experimental group or to the control group, the experiment is said to be double blind.
Sampling Design
Lucky ones get to work with these
The rest of us mere mortals have to make do with…….
Sampling is choosing which subjects to measure in a research project. Whatever method is used, sampling determines how much and how well the researcher may generalize his or her findings. A bad sample may well render findings meaningless.
Key concepts and terms
Population: The population is the set of people or entities to which findings are to be generalized. The population must be defined explicitly before a sample is taken. Enumerations or censuses are collections of data from every person or entity in the population.
Random sampling is data collection in which every person in the population has a chance of being selected, and that chance is known in advance. Normally this is an equal chance of being selected.
If the data are a random sample, the researcher must report not only the magnitude of the relationships uncovered but also their significance level (the chance that findings this strong would arise from the luck of sampling alone).
The sampling frame is the list of ultimate sampling entities, which may be people, households, organizations, or other units of analysis. The list of registered students may be the sampling frame for a survey of the student body at a university. Telephone directories are often used as sampling frames, for instance, but tend to under-represent the poor (who have fewer or no phones) and the wealthy (who have unlisted numbers). Random digit dialing (RDD) reaches unlisted numbers but not those with no phones, while over-representing households owning multiple phones. In multi-stage sampling, there will be one sampling frame per stage.
Significance is the chance of finding a relationship at least as strong as the one in the data purely through an unlucky sample when no relationship exists in the population; if we took another sample we might find nothing. That is, the significance level is the chance of a Type I error: the chance of concluding we have a relationship when we do not. Social scientists often use the .05 level as a cutoff: if there is a 5% or smaller chance that a relationship this strong would arise from sampling chance alone, we conclude the relationship is real (technically, we reject the null hypothesis that the strength of the relationship is not different from zero).
Ruling out Chance as an Explanation
When an independent variable appears to have an effect, it is very important to be able to state with confidence that the effect was really due to the variable and not just due to chance. Consider a hypothetical experiment on a new antidepressant drug. Ten people suffering from depression were sampled and treated with the new drug (the experimental group); an additional 10 people were sampled from the same population and were treated only with a placebo (the control group). After 12 weeks, the level of depression in all subjects was measured, and it was found that the mean level of depression (on a 10-point scale with higher numbers indicating more depression) was 4 for the experimental group and 6 for the control group.
The most basic question that can be asked here is: "How can one be sure that the drug treatment rather than chance occurrences was responsible for the difference between the groups?" It could be that, by chance, the people who were randomly assigned to the treatment group were initially somewhat less depressed than those randomly assigned to the control group.
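One way to make the "chance" explanation concrete is a permutation test: shuffle the group labels many times and count how often a difference at least as large as the observed one appears. The sketch below is illustrative only. The twenty individual scores are invented so that the group means come out at 4 and 6 as in the example (the slides give only the means), and the function name `permutation_p_value` is hypothetical.

```python
import random

def permutation_p_value(treated, control, n_shuffles=10_000, seed=0):
    """Share of random relabelings whose absolute mean difference is at
    least as large as the observed one."""
    rng = random.Random(seed)
    k = len(treated)

    def mean_gap(values):
        # Absolute difference between the means of the first k values
        # and the remaining values.
        first, rest = values[:k], values[k:]
        return abs(sum(first) / len(first) - sum(rest) / len(rest))

    pooled = list(treated) + list(control)
    observed = mean_gap(pooled)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign subjects to groups at random
        if mean_gap(pooled) >= observed:
            hits += 1
    return hits / n_shuffles

# Invented scores (NOT from the study): means are 4 and 6.
drug = [3, 4, 5, 4, 3, 5, 4, 4, 5, 3]
placebo = [6, 7, 5, 6, 6, 7, 5, 6, 7, 5]
p = permutation_p_value(drug, placebo)
print(f"estimated chance of a gap this large: {p:.4f}")
```

A small value of `p` suggests that chance assignment alone rarely produces a gap this large. With different invented scores the result could be very different.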
Confidence intervals are directly related to significance levels. For a given variable in a given sample, one can compute the standard error; assuming a normal distribution, the 95% confidence interval is the sample estimate plus or minus 1.96 times the standard error. If a very large number of samples were taken, and a (possibly different) estimated mean and corresponding 95% confidence interval was constructed from each sample, then 95% of these confidence intervals would contain the true population value, assuming random sampling. The formulas for calculating confidence intervals, significance levels, standard errors, etc. will be discussed later.
Standard error. If we took several samples of the same thing we would, of course, be able to compute several means, one for each sample. If we computed the standard deviation of these sample means as an estimate of their variation around the true but unknown population mean, that standard deviation of means is called the standard error. Standard error measures the variability of sample means. However, since we normally have only one sample but still wish to assess its variability, we can compute the estimated standard error by this formula: SE = sd/SQRT(n), where sd is the standard deviation for a variable and n is the sample size. That is, we estimate that SE diminishes in inverse proportion to the square root of n: the larger the n, the smaller the SE. Often the estimated standard error is just called 'standard error.'
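As a minimal sketch of the formula SE = sd/SQRT(n) and the plus-or-minus 1.96 rule, the snippet below computes both for a small list of numbers. The data are hypothetical, the function names are made up for illustration, and the code assumes the sample standard deviation (n - 1 in the denominator, as `statistics.stdev` computes it) is the intended sd.

```python
from math import sqrt
from statistics import mean, stdev

def standard_error(values):
    """Estimated standard error of the mean: sd / sqrt(n)."""
    return stdev(values) / sqrt(len(values))

def confidence_interval_95(values):
    """Normal-approximation 95% interval: mean +/- 1.96 * SE."""
    centre = mean(values)
    half_width = 1.96 * standard_error(values)
    return centre - half_width, centre + half_width

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements
se = standard_error(scores)
low, high = confidence_interval_95(scores)
print(f"SE = {se:.3f}, 95% CI = ({low:.2f}, {high:.2f})")
```

For very small samples a t-based multiplier is often used instead of 1.96. The normal value is kept here only to mirror the text.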
Census & sample survey
A census is basically a complete enumeration of all items in the population. Such an inquiry should cover all items, and nothing should be left to chance. But remember that in practice this may not always be true. A census would require a great deal of time, money and resources. For these reasons most studies undertake sample surveys instead.
What happens in a Sample design process
The researcher selects respondents who are representative of the total population. The respondents selected constitute what is called a sample, the process of selecting them is called the sampling technique, and the survey that is conducted is called a sample survey. Arithmetically: if we let the population size be “N” and a part of it be “n” where (n
Two Sorts of Statistics
• Descriptive statistics
  • To describe and summarize the characteristics of the sample
  • Applied in the context of exploratory techniques
• Inferential statistics
  • To infer something about the population from the sample
  • Applied in the context of confirmatory methods
From Descriptive to Inferential
We have to look at some aspects of the data we use first
The most important aspect of inferential statistics is the selection of the sample
A statistic is meaningless if the sample is not representative
We must consider:
Data Acquisition, Quality, & Collection Procedures
Sampling Design & Methods
Data Acquisition
Any descriptive summaries that we form from a data set, and any inferences that we draw from it, fundamentally rely upon the notion that the observations the data record are an accurate reflection of the phenomenon of interest at the time they were taken.
To have any confidence in the usefulness of a dataset, we need to be aware of how the data was collected, and by whom, and use that knowledge to inform our judgment about how sound that source of data is for a given purpose.
• The fundamental distinction we can draw between sources of data is between data that you have collected yourself and data that has been collected by others and archived
  • Collected - In many ways, this is the best sort of data because you can be absolutely certain of the methods used, although this can be expensive
  • Archived - Has the competing merit of already being available, possibly having been collected over a period of time, and others have undertaken the expense of doing so
Collected Data
• Collected Data - a.k.a. primary data, is collected directly by the researcher through experiments, measurements, field surveys etc.
• Benefits: Total certainty as to methods used and error associated with them, can be customized to the research question, the methods can be precisely repeated on another occasion or in another location
• Drawbacks: Collecting data is expensive, there may not be a comparable historical record of similar measurements, gives critics an opportunity to criticize your data collection as well!
• We can further sub-divide collected data into categories that denote the sort of collection procedure used to produce the data:
  • Experimental (controlled experiment) data is produced under repeatable conditions and is presumably an objective description of some phenomenon (often used in physical geography)
  • Non-experimental data, such as interviews or questionnaires, is used to assess more qualitative or subjective ideas or concepts (often applied in the human geography context)
Archived Data
• Archived Data - Data that is already available because it has been collected by someone else
• Benefits: The expense of collecting the data has been absorbed already, the methods used are often a standard approach that allows for inter-comparison with historical records or records for other places
• Drawbacks: One cannot be as sure of the data quality, methods and associated errors here (sometimes metadata is not available), the variable of interest may not be available, or your definition may vary slightly from that used by others
• We can characterize archived data as being internal (meaning it was collected by another member of your organization) or external (meaning it was collected by someone you do not know as well). We can also distinguish:
  • Secondary data, which is obtained directly from those that did collect the data
  • Tertiary data, which we obtain from a third party (sometimes via publication, sometimes not); often this is data which has already been analyzed or transformed somehow
Data Quality
• The further removed we are from those that actually collect and create a data set, the worse off we are when using that data
• The results of any statistical study are only as good as the data that was used, thus the quality of the data is very important because it in turn determines the quality and reliability of descriptions and inferences based upon it
• Data obtained externally should be used only after a serious investigation and consideration of its quality and reliability
Sampling Populations
• Typically, when we collect data, we are somewhat limited in the scope of what information we can reasonably collect
• Ideally, we would enumerate each and every member of a population so we could know its parameters perfectly
• In most cases this is not possible, because of the size of the population (infinite populations?) and associated costs (time, money, etc.)
• Usually it is not necessary, because by collecting data on an appropriate subset of the population we can create statistics that are adequate estimates of population parameters
• Instead, we sample a population, trying to get information about a representative subset of the population
Sampling Concepts
• We must define the sampling unit - the smallest sub-division of the population that becomes part of our sample
• We want to minimize sampling error when we design how we will collect data: Typically the sampling error ↓ as the sample size ↑ because larger samples make up a larger proportion of the population (and a complete census, for example, theoretically has no sampling error)
• We want to try and avoid sampling bias when we design how we will collect data: Bias here refers to a systematic tendency in the selection of members of a population to be included in a sample. To avoid it, any given member of a population should have an equal chance of being included in the sample (for random sampling)
Steps in Sampling
1. Definition of the population - We first need to identify the population we wish to sample, and do so somewhat formally because any inferences we draw are really only applicable to that population
2. Construction of a sampling frame - This involves identifying all the individual sampling units within a population in order that the sample can be drawn from them. In a survey-type study, this could involve procuring a list of all the potential individuals who could be included in a sample.
3. Selection of a sampling design - This is a critical decision about how to collect the sample. We will look at some different sampling designs in the following slides
4. Specification of information to be collected - The formal definition of what data we will collect and how. Often, a pilot sample is conducted to refine the sampling design and specifications, to help minimize biases that only become apparent once the sampling design and specifics are tested
5. Collection of the data - When we have steps 1-4 straight, we go about collecting the sample
Types of Samples (Designs)
• We can distinguish between two families of sampling designs:
  • Non-probability designs are not concerned with being representative by virtue of minimizing bias, are typically used for non-scientific purposes, and are not appropriate for statistical inference studies, although they can be useful in an investigative sense
  • Probability designs aim to be representative of the population they sample, follow rules of randomness in selection to minimize bias, and are those that are used in scientific studies where inferential statistics will be used
Non-probability Sampling Designs
• Some types of non-probability designs:
  • Volunteer sampling - A ‘self-selecting’ sample, which is convenient, but rarely representative
  • Quota sampling - Researchers select individuals to include based on fulfilling counts of sub-groups
  • Convenience sampling - Individuals are included in the sample because they are available/accessible
  • Judgmental or purposive sampling - Those that are chosen to be included in the sample are chosen based upon some preconceived notions of what sorts of individuals would be most appropriate for this investigative purpose (e.g. product testing based on ideas about the market for a product)
Probability Sampling Designs - Random
• Random sampling - In general, we need some degree of randomness in the selection of a sample to be able to draw any meaningful inferences about a population, but in some cases this may conflict with representativeness
• These are drawn in such a way that every unit of a population has an equal chance of being chosen and the selection of one unit has no impact on whether or not another individual will be selected (independence)
• This can be done with or without replacement (which determines whether the same unit can be drawn twice)
• We can generate random numbers using a table or using a computer, and can scale the 0 to 1 values to any required range of values
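A simple random draw, with or without replacement, can be sketched with Python's standard `random` module. The frame of 200 numbered units and both function names are assumptions made for illustration only.

```python
import random

def simple_random_sample(frame, n, replace=False, seed=None):
    """Draw n units so that every unit has an equal chance of selection."""
    rng = random.Random(seed)
    if replace:
        # With replacement: the same unit can be drawn more than once.
        return [rng.choice(frame) for _ in range(n)]
    # Without replacement: each unit can appear at most once.
    return rng.sample(frame, n)

def scale_unit_value(u, low, high):
    """Map a random value u in [0, 1) onto the range [low, high)."""
    return low + u * (high - low)

frame = list(range(1, 201))  # hypothetical frame of 200 unit IDs
without = simple_random_sample(frame, 10, seed=1)
with_repl = simple_random_sample(frame, 10, replace=True, seed=1)
print(without)
print(with_repl)
print(scale_unit_value(random.random(), 1, 201))
```

Fixing `seed` makes a draw repeatable, which helps when the sampling procedure has to be documented or audited.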
Probability Sampling Designs - Systematic
Representative approaches place restrictions on selection:
• Systematic sampling - This approach uses every kth element of the sampling frame, beginning at a randomly chosen point in the frame, e.g. given a sampling frame of size 200, to create a sample of size n = 10 from that frame, select a random point to begin within the frame and then include every 20th value in the systematic sample
• This approach assumes that the ordering of the individuals in the sampling frame is random (i.e. they have not been placed in the frame in some order or grouping), and this should be checked before systematically sampling from a frame
Some problems with systematic sampling:
• The possible values of sample size n are somewhat restricted by the size of the sampling frame, since the interval should divide evenly into the size of the sampling frame
• If the population itself exhibits some periodicity, then a systematic sample is likely to not be representative
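The 200-unit, n = 10 example can be sketched as follows. This version picks the random start within the first interval so that the sample never runs off the end of the frame. That is one common choice rather than the only one, and the function name and frame contents are hypothetical.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Take every kth unit after a random start, where k = len(frame) / n."""
    if n <= 0 or len(frame) % n != 0:
        # The interval should divide evenly into the frame size.
        raise ValueError("sample size must divide the frame size evenly")
    k = len(frame) // n          # sampling interval (20 for 200 / 10)
    rng = random.Random(seed)
    start = rng.randrange(k)     # random start within the first interval
    return [frame[start + i * k] for i in range(n)]

frame = list(range(1, 201))      # hypothetical frame of 200 unit IDs
chosen = systematic_sample(frame, 10, seed=7)
print(chosen)
```

If the frame is sorted or periodic (for example, every 20th entry is a corner house), the units chosen this way share that pattern, which is the periodicity problem noted above.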
Probability Sampling Designs - Stratified
We may need to place restrictions on how we select units for inclusion in a sample to ensure a representative sample.
• Stratified sampling - Divide the population into categories and select a random sample from each of these
• This approach can be used to decrease the likelihood of an unrepresentative sample if the classes/categories/strata are selected carefully (the individuals within a stratum must be very much alike, which means that the population must be able to be divided into relatively homogeneous groups)
• We need to know something about the population in order to make good decisions about stratification
• We can take a stratified sample that is
  • Proportional - Where the random sample drawn from each class/category/stratum is the same fraction of that stratum, so that each sample's size reflects the size of that subpopulation, OR
  • Disproportional - Where the fraction sampled differs between classes/categories/strata. This approach is best used when the sizes of the categories are significantly different, although it can also be applied to mitigate cost issues (i.e. it may be more costly to sample in a swamp than in a grassy field, so we might choose to take fewer samples in the swamp, although this clearly would do nothing to enhance representativeness in our sample)
WARNING:
• A class/category/stratum that is homogeneous with respect to one variable may have high variation with respect to another variable! Thus, stratification must be performed with some foreknowledge of how the sample will be analyzed, and if the sampling is being performed in a preliminary fashion (still seeking the relationships), there is a danger that the stratification will be found to be inappropriate after the fact
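A stratified draw can be sketched as a separate random sample from each stratum. The version below assumes proportional allocation, taken here to mean that the same fraction is sampled from every stratum. The stratum names, sizes, and function name are hypothetical.

```python
import random

def proportional_stratified_sample(strata, fraction, seed=None):
    """Sample the same fraction of units, at random, from every stratum."""
    rng = random.Random(seed)
    sample = {}
    for name, units in strata.items():
        size = round(len(units) * fraction)      # stratum sample size
        sample[name] = rng.sample(units, size)   # without replacement
    return sample

# Hypothetical population of 200 sites split into three land-cover strata.
strata = {
    "grassland": [f"g{i}" for i in range(100)],
    "forest": [f"f{i}" for i in range(60)],
    "swamp": [f"s{i}" for i in range(40)],
}
picked = proportional_stratified_sample(strata, 0.10, seed=3)
for name, units in picked.items():
    print(name, len(units))
```

A disproportional design would replace the single `fraction` with one value per stratum, for instance a lower fraction where sampling is costly.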
Probability Sampling Designs - Cluster
Another sampling approach that subdivides the population into categories is cluster sampling
• Cluster sampling - Divides the population into categories based on convenience rather than some structure designed to promote unbiased representation of a particular variable across all clusters, and sampling is performed within individual clusters
• Certain clusters are selected for intensive study, usually by a random procedure, and the content of each cluster should be heterogeneous (a cross-section of the range of values seen in the whole population), and thus representative
• This is usually applied for reasons of cost and convenience
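Cluster sampling can be sketched as a random choice of whole clusters, followed by study of the units inside the chosen ones. In this simplified version every unit in a selected cluster is included. The neighbourhood names, household labels, and function name are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick n_clusters at random and return all units in each chosen one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # random cluster names
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical neighbourhoods, each holding a list of household IDs.
clusters = {
    "north": ["n1", "n2", "n3"],
    "south": ["s1", "s2"],
    "east": ["e1", "e2", "e3", "e4"],
    "west": ["w1", "w2", "w3"],
}
studied = cluster_sample(clusters, 2, seed=5)
print(studied)
```

Randomly sampling households within each chosen cluster, instead of taking all of them, would turn this into a two-stage design.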
Choosing a Sampling Design
• In a geographic context:
  • Stratified sampling works best if the regions are reasonably homogeneous
  • Cluster sampling works best if the regions are heterogeneous
• From an efficiency point of view (the number of samples required), stratified sampling is best since it can be representative using a smaller number of samples, but if there is no clear means of rational stratification, then clustering might be the way to go
• Many sampling designs are hybrids of approaches (e.g. stratify by ethnic group, cluster to pick neighborhoods, select houses randomly)