Université Paris I
Faye Steiner
INTRODUCTION TO STATA

What is Stata For?
• Handling data (Stata allows creation and transformation of datasets)
• Analysis (descriptive statistics, multivariate analysis (regression, maximum likelihood, etc.), hypothesis testing)
• Graphs
I. The Basics

1. Stata Data Sets

Stata deals with files in which data are organized as in a matrix. For example, where Var denotes a variable and Obs an observation of that variable:

        Var1   Var2   …   VarN
Obs1    #      Yes
Obs2    #      No
…       #      No
ObsN    #      No
Values taken by different variables may be numbers or characters (e.g. Yes). Stata works on the columns of the matrix, so by default a command applied to a set of variables automatically treats the entire set of observations (unless you specify otherwise). For example, if Var1 and Var2 are both numeric and you create Var3 = Var1 + Var2, this creates a variable that sums the values of Var1 and Var2 for all N observations and puts the result in the column Var3.

2. Getting In and Out of Stata

To get in, click on “Intercooled Stata 9” from the Windows Start menu, or on the Stata icon. You will see the screen divided into 4 windows. The “main” one is the Stata “results” window, which displays the results of your commands. At the left are the “variables” and “review” windows. The variables window lists all variables currently loaded in memory (it is empty until you load a dataset). The review window gives a list of previously used commands. The last window is the Stata “command” window: this is where you enter commands.

In order to get out of Stata, at the command prompt, type:
. exit
or, if Stata refuses (because there are unsaved data in memory), enter
. exit, clear
or choose “Exit” from the File menu. Be careful: Stata will drop all data that you have not saved. If you want to keep your data, you must use the save command before exiting:
. save filename

3. Command Handling
When you enter a command, Stata executes it and produces the results or an error message in the results window. Note that most Stata commands can be abbreviated (e.g. “gen” for “generate”).

Some useful commands for dealing with directories and files:
. pwd - tells you where you are
. mkdir "new directory name" - makes a directory with that name
. cd "directory name" - puts you in that directory
. dir - lists the contents of your directory

4. Stata’s Command Syntax

In general, Stata has 2 types of commands:
• Commands that report things about data (e.g. describe, list, summarize…)
• Commands that change the data (e.g. use, drop, generate)

Most commands have a common syntax (and you will see this in Stata help):
. [by varlist_a:] command [varlist_b] [=expression1] [if expression2] [in range] [, options]
(Note that what is written between brackets, [], is optional, and the brackets themselves are not part of the syntax.)

5. Variable Names

Stata allows names of variables to be as long as 32 characters. Blanks are not allowed, and names must begin with a letter or with “_”. It is not a good idea to begin a variable name with “_”, because the variables that Stata itself creates when executing commands (for example _merge) always begin with this character. Note that Stata distinguishes between upper and lower case characters, so for example “Sex” and “sex” are two different variables.

Examples of some acceptable variable names:
income
income_female_consumers
year
dummy1
income1988

6. Operators

Arithmetic:
+    addition
-    subtraction
*    multiplication
/    division
^    power (exponentiation)

For example, the formula $-\dfrac{x + y^{x-y}}{xy}$ is entered in Stata as -(x+y^(x-y))/(x*y).
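As a quick check of the arithmetic operators, the following minimal sketch evaluates the formula above on a one-observation dataset. The values x = 2 and y = 3 are arbitrary and chosen only for illustration; they are not part of the course data.

```stata
* Start from an empty dataset with a single observation
clear
set obs 1

* Arbitrary illustrative values (an assumption, not course data)
generate x = 2
generate y = 3

* -(x + y^(x-y)) / (x*y), written with Stata's arithmetic operators
generate z = -(x + y^(x-y))/(x*y)
list
```

With these values, z is approximately -0.389. Note the parentheses around (x-y) and (x*y): without them, Stata would apply its usual operator precedence and compute a different expression.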
Comparisons (for if/then statements):
>    greater than
<    less than
>=   greater than or equal to
<=   less than or equal to
==   equal
~=   not equal

Logical Operators:
&    and
|    or (note this is keys Alt Gr and 6)
~    not (note this is keys Alt Gr and 2)
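To see how comparison and logical operators combine, here is a small sketch on a made-up dataset. The variable age and the four observations are hypothetical and serve only as an illustration; each generated variable equals 1 where the condition is true and 0 where it is false.

```stata
* Small made-up dataset (hypothetical values, for illustration only)
clear
input age str1 sex
25 "F"
40 "M"
17 "F"
63 "M"
end

* "and": women aged 18 or more
generate adult_f = (age >= 18) & (sex == "F")

* "or": younger than 18 or older than 60
generate young_or_old = (age < 18) | (age > 60)

* "not": everyone for whom sex is not "F"
generate not_f = ~(sex == "F")

list
```

The parentheses are not strictly required here, but they make the order of evaluation explicit.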
7. Working interactively: Keeping Logs

If you are going to work interactively, you need to keep a “log” of your session. This creates a text file that is a transcript of your Stata session. To start the log, type the log command, specifying a filename:
. log using "filename.log"
When you finish, close the log:
. log close

8. .do Files

If you are working interactively, you can simply type a command and execute it by hitting the Enter key. This is practical for short commands (one line). However, you will generally want to create a “.do” file in which you type a series of commands and can then have Stata execute them as a whole. This is recommended because:
• you will have a record of what commands you executed and on what data;
• if you obtain an error message from a particular command, you can easily correct the error and resubmit (re-execute) the program;
• you can easily save your work (e.g. homework1_faye.do, homework2_faye.do).

To create your .do file, type your commands (the Stata program) in a text editor and then save the document with the extension “.do”. For this, I recommend that you use the text editor in Stata. To open the do-file editor from Stata, click on the envelope icon (or go to Window => Do-file Editor). Type your commands, and then save your program with File => Save as, e.g. “homework_faye.do”.

Once you have saved your .do file, you can execute it in three ways:
(i) Click on the execute icon in the do-file editor
(ii) File => Do and then specify the path of your program
(iii) In the interactive command window, type
. do "C:\homework_faye.do"
and you will see your program executed in the results window.

II. Working with Data
1. Loading data into Stata

Typically you will start by loading a dataset into memory. If your dataset is large, you may need to ask Stata to increase the amount of memory available:
. set mem 200m
This gives us 200 megabytes of memory in which to work. We then load our big data file:
. use "big dataset.dta"
The variable names will appear in the variables window when the data are loaded. You can find out more about the dataset with the commands:
. describe (gives each variable's name, storage type, and label)
. browse (displays the dataset, or the subset that you specify, in the Stata browser)
. list variables (displays the values of the specified variables)
. tabulate variables (gives a frequency table of the specified variables)
You may want to create “labels” for your data. See Stata help for more on this.

2. Data Manipulation

The following data-manipulation commands are useful: generate, replace, egen, rename, drop, keep, sort, by, reshape. Of these, you will make extensive use of
. generate
. replace (especially with by)
You use generate to create new variables, and replace to change values of existing ones. Both commands use standard algebraic syntax.
To create a new variable:
. generate newvar = expression
To replace the contents of an old one:
. replace oldvar = expression
You can conditionally replace values by typing:
. replace oldvar = expression1 if expression2

3. Some examples

You can create a dummy variable from the variable “sex” that takes the values “F” or “M”:
. generate dummy_sex = 0
. replace dummy_sex = 1 if sex == "F"
This creates a numeric variable “dummy_sex” that takes the value 0 for men and 1 for women.

Suppose you want to sort the dataset according to the values of a variable in the database. For example, you have a variable “day” and you want to sort all of the other variables in your database in order of “day”. Let’s say your database is called database1. Since sort always acts on the dataset currently in memory, you load the database and then name only the sorting variable:
. use database1
. sort day

Suppose you want to drop all of the observations on men from your database.
You can either “drop” the men, or “keep” the women:
. drop if sex == "M"
. keep if sex == "F"
Note that these are equivalent (as long as sex takes only the values “M” and “F”). Note that in using the conditional “if”, you must use a double equal sign, == (or >= or <=), to test for equality.

4. Missing Values

Missing values of numeric variables are indicated by a period (.); a missing value of a string variable is the empty string (""). For example, to use only observations where a numeric variable’s values are not missing, specify:
. [command] if variable ~= .

5. Commands to Read ASCII Data

To enter data directly from the keyboard, use the edit and input commands. These commands can be used to enter a small amount of data.
To import ASCII data created prior to your Stata session, use the insheet, infile, and infix commands. These commands are very similar to each other. The insheet command is used to import ASCII data that have been created in Excel or any other equivalent software (i.e. there is only one observation per line, and values are separated by commas or tabs). For example, with the option tab for tab-separated data or comma for comma-separated data:
. insheet using "database1.txt", tab
. infile [varlist_a] using "C:\data\database2.txt"
For more information on reading data with or without a Stata “dictionary”, see Stata help.

III. Appending and Merging Datasets

1. Appending data

The command “append” is designed to solve the problem of adding more data to an existing dataset. It combines datasets vertically. With append, you do not specify the names of 2 datasets. Rather, you use one dataset and append the other: the “using” dataset is copied to the end of the one already in memory. For example, suppose you have the following two datasets:

one.dta              two.dta
Var1  Var2           Var1  Var2  Var3
1     2              7     8     11
3     4              9     10    12
5     6
First you must load the dataset that serves as the base, then append the other dataset to it:
. use one, clear
. append using two

This results in the following dataset in memory:

Var1  Var2  Var3
1     2     .
3     4     .
5     6     .
7     8     11
9     10    12
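The append example can be reproduced with the following sketch. It builds the two small datasets with input and saves them as one.dta and two.dta in the current directory; be aware that save with the replace option overwrites any existing files of those names.

```stata
* Build one.dta: 3 observations on Var1 and Var2
clear
input Var1 Var2
1 2
3 4
5 6
end
save one, replace

* Build two.dta: 2 observations on Var1, Var2 and Var3
clear
input Var1 Var2 Var3
7 8 11
9 10 12
end
save two, replace

* Load the base dataset, then copy "two" onto the end of it
use one, clear
append using two
list
```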
The dataset one had 3 observations, and the dataset two had 2 observations, so the resulting dataset has 3 + 2 = 5 observations. Note that since the original “one” dataset does not contain Var3, Stata fills Var3 with missing values for the observations that come from “one”.

2. Merging data

While append combines datasets vertically, adding observations to the end of a dataset, merge combines datasets horizontally, adding variables. There are essentially two forms of merging: the “one-to-one” merge, where Stata just joins two datasets by adding new variables and their values horizontally, observation by observation, and the “match-merge”, where Stata compares observations according to a predetermined criterion (an identifier) and adds the variables and their values where it finds a match in the two datasets. We will typically do “match-merges”, but to understand how Stata merges, it is useful to first look at a “one-to-one” merge and to learn about the “_merge” variable that Stata creates and its values.

One-to-One Merges

Stata’s merge command combines datasets horizontally, observation by observation. Suppose we merge the following datasets:
one.dta              two.dta
Var1  Var2           Var3
1     2              11
3     4              12
5     6

. use one, clear
. merge using two
This returns the following dataset:

Var1  Var2  Var3  _merge
1     2     11    3
3     4     12    3
5     6     .     1
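A minimal sketch that reproduces this one-to-one merge is given below. It uses the merge syntax of Stata 9, as in the rest of this handout; as far as I know, Stata 11 and later still accept this old syntax but prefer the form merge 1:1 _n using two. The files one.dta and two.dta are written to the current directory and overwrite any existing files of those names.

```stata
* Master dataset: 3 observations
clear
input Var1 Var2
1 2
3 4
5 6
end
save one, replace

* Using dataset: 2 observations on a new variable
clear
input Var3
11
12
end
save two, replace

* One-to-one merge: observation k of "one" is joined to observation k of "two"
use one, clear
merge using two
list
tabulate _merge
```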
The dataset “one” is the “master data” and the dataset “two” is the “using data”. Stata creates a new variable “_merge” that indicates what it has done. We can see that Stata horizontally added Var3 to the master data. Since the using data had only 2 observations, Stata fills the third observation for Var3 with a missing value. _merge may take different values that indicate what Stata has done and where the merged data come from:
_merge = 1 => the data appeared only in the master dataset
_merge = 2 => the data appeared only in the using dataset
_merge = 3 => the data appeared in both the master and the using datasets

Now, consider another example, still of a simple one-to-one merge:

one.dta              two.dta
Var1  Var2           Var1  Var3
1     2              100   .
3     4              .     12
5     6              .     .
                     .     200
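. use one, clear
. merge using two

This returns the following dataset:

Var1  Var2  Var3  _merge
1     2     .     3
3     4     12    3
5     6     .     3
.     .     200   2

(_merge reflects whether an observation exists in each dataset, not whether its values are missing, so observation 3, which holds only missing values in “two”, still has _merge = 3.)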
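The following sketch reproduces this second example, so that you can check which values survive the merge. It assumes that “two” is stored with four observations, the third of which contains only missing values; this is my reading of the table above. As before, one.dta and two.dta are overwritten in the current directory.

```stata
* Master dataset: 3 observations
clear
input Var1 Var2
1 2
3 4
5 6
end
save one, replace

* Using dataset: 4 observations; it shares Var1 with the master
clear
input Var1 Var3
100 .
. 12
. .
. 200
end
save two, replace

* For the shared variable Var1, the master values are kept
use one, clear
merge using two
list
```

After the merge, Var1 holds 1, 3, 5 in the first three observations (not 100 or missing), and the fourth observation comes entirely from “two”.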
Here, we can observe several things. First, wherever the two datasets have variables and observations in common, it is the values of the master dataset that are retained! So for example, looking at observation 1 of Var1, the merged dataset includes the value 1, not 100. Similarly, for observation 3, Var1 in the merged dataset takes the value 5, not missing. Second, we can see that for the fourth observation, the merged dataset includes missing values for Var1 and Var2, and _merge takes the value 2, since the original “one” dataset had no fourth observation, while “two” contained a value for Var3.

Match Merges

This is what we will usually want to do, to avoid mixing apples and oranges. The problem that you may have noticed above is that when you do a one-to-one merge, Stata combines the data observation by observation. This works fine if the datasets have the same number of observations, the observations in each dataset correspond to the same person or firm or whatever, and the observations are in the same order. But if any of these conditions is violated, the result is meaningless, and you may not even realize it! Match merges avoid this problem by explicitly identifying the variable (or variables) by which observations are matched.
Let’s see an example. Suppose we now have the following datasets:

one.dta                  two.dta
ID  Var1  Var2           ID  Var3
A   1     2              A   100
B   3     4              B   200
C   5     6              D   300
E   7     8
We now specify ID as the variable by which Stata should match and merge observations:
. use one, clear
. merge ID using two

This returns the following dataset:

ID  Var1  Var2  Var3  _merge
A   1     2     100   3
B   3     4     200   3
C   5     6     .     1
D   .     .     300   2
E   7     8     .     1
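The match merge can be reproduced with the sketch below, again in the Stata 9 syntax used throughout this handout (to my knowledge, newer versions prefer merge 1:1 ID using two). Both datasets are sorted on ID before being saved, and the final sort ID is only there to make sure the listing is in the order of the table above. The files one.dta and two.dta are overwritten in the current directory.

```stata
* Master dataset, sorted on the identifier
clear
input str1 ID Var1 Var2
"A" 1 2
"B" 3 4
"C" 5 6
"E" 7 8
end
sort ID
save one, replace

* Using dataset, also sorted on the identifier
clear
input str1 ID Var3
"A" 100
"B" 200
"D" 300
end
sort ID
save two, replace

* Match merge on ID
use one, clear
merge ID using two
sort ID
list
tabulate _merge
```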
The master dataset has data on ID = A, B, C, and E, while the using dataset has data on Var3 for ID = A, B, and D. Stata did not blindly merge observations horizontally, but smartly matched data on ID = A from the master dataset to data on ID = A from the using dataset, etc. Notice that our identifying variable was unique: each value of ID appears only once in each dataset.

One caveat: in order to do a match merge, all datasets must be sorted according to the matching variable (here ID). So before merging, if the data were not sorted, you would need to sort as follows:
. use two
. sort ID
. save two, replace
. use one
. sort ID
. merge ID using two

A final note: often we may merge data from two datasets and want to use only the observations where we had data from both datasets. First, we can see what happened in our merge by doing tabulate. This will tell us how many observations we have where data came from only “one” (_merge = 1), from only “two” (_merge = 2), or from both “one” and “two” (_merge = 3).
. tabulate _merge
The above returns a frequency table on the variable _merge. Now, if we want to use only the observations for which there were data in both “one” and “two”, we use the _merge variable and either drop the observations where _merge = 1 or 2, or keep the observations where _merge = 3 (the two commands are equivalent):
. drop if _merge == 1 | _merge == 2
. keep if _merge == 3

3. Other useful Stata Basics

Labels

You can give a “label” to a database, a variable, or its values. This is a way to keep a description of a variable in Stata’s memory in addition to just the variable’s name.
To give a database a label:
. label data "name you choose"
To give a variable a label, name the variable and then the label:
. label variable varname "name you choose"
To label the values of a categorical variable, for example the variable sex which contains 1 for men and 2 for women, first define the set of value labels and then attach it to the variable:
. label define sexlabel 1 "male" 2 "female"
. label values sex sexlabel

Some functions with generate (“gen”) and egen

gen y = abs(x)             absolute value
gen y = exp(x)             exponential
gen y = ln(x)              natural logarithm
gen y = log10(x)           base-10 logarithm
gen y = sqrt(x)            square root
gen y = int(x)             integer part (truncation toward zero)
gen y = round(x)           rounding to the nearest integer
gen y = max(x1, …, xn)     largest of the arguments
gen y = min(x1, …, xn)     smallest of the arguments
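To tie the labels and functions together, here is a short sketch on a made-up dataset. The values of sex and x, the label texts, and the names of the new variables are all hypothetical. The last line uses egen with its mean() function, which is not in the list above, as one example of what egen adds to generate: a statistic computed across observations.

```stata
* Made-up data: sex coded 1 (men) / 2 (women), x an arbitrary numeric variable
clear
input sex x
1 2.7
2 -4
1 9
end

* Labels for the dataset, a variable, and the values of a variable
label data "Illustrative dataset"
label variable sex "Sex of respondent"
label define sexlabel 1 "male" 2 "female"
label values sex sexlabel

* Row-by-row functions with generate
gen y_abs   = abs(x)
gen y_int   = int(x)
gen y_round = round(x)
* sqrt() of a negative number is missing
gen y_sqrt  = sqrt(x)

* egen: the mean of x over all observations, repeated in every row
egen x_mean = mean(x)

describe
list
```

After label values, list shows “male” and “female” in place of 1 and 2, although the underlying values of sex are still numeric.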