Advanced Sas Programming Techniques (www.studysas.blogspot.com)

October 2019
PDF

This document was uploaded by user and they confirmed that they have the permission to share it. If you are author or own the copyright of this book, please report to us by using this DMCA report form. Report DMCA

Overview

Download & View Advanced Sas Programming Techniques (www.studysas.blogspot.com) as PDF for free.

More details

Words: 24,918
Pages: 81

Preview
Full text

Advanced SAS Programming Techniques A Workshop Presented to the Alaska Chapter The American Fisheries Society E. Barry Moser Department of Experimental Statistics Louisiana State University and Louisiana State University Agricultural Center Baton Rouge, LA 70803 Phone: 504-388-8376 FAX: 504-388-8344 E-mail: [email protected] September 29-October 3, 1997

Contents 1 Introduction

3

2 The DATA Step

4

2.1 The DATA STEP process : : : : : : : : : : : : 2.1.1 An implicit loop : : : : : : : : : : : : : 2.1.2 RETURN, DELETE, and OUTPUT : : 2.1.3 Compound Statements : : : : : : : : : : 2.1.4 Data Set Options : : : : : : : : : : : : : 2.1.5 DROP, KEEP, and RETAIN : : : : : : 2.2 Input/Output : : : : : : : : : : : : : : : : : : : 2.2.1 List Input : : : : : : : : : : : : : : : : : 2.2.2 Column Input : : : : : : : : : : : : : : : 2.2.3 Pointer Control and Formatted Input : 2.2.4 The PUT Statement : : : : : : : : : : : 2.2.5 SAS Formats and Informats : : : : : : : 2.3 SAS Functions : : : : : : : : : : : : : : : : : : 2.3.1 Mathematical Functions : : : : : : : : : 2.3.2 Random Number Generators : : : : : : 2.3.3 String Functions : : : : : : : : : : : : : 2.3.4 Date and Time Functions : : : : : : : : 2.3.5 PUT and INPUT Functions : : : : : : : 2.4 Looping and Arrays : : : : : : : : : : : : : : : 2.4.1 Univariate and Multivariate Data Views 1

: : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : :

4 4 5 7 8 10 10 10 13 14 18 19 21 21 22 23 24 25 26 27

CONTENTS

2

2.4.2 Indeterminant DO Loops : : : : : : : : : : : : : 2.5 The NULL Data Set : : : : : : : : : : : : : : : : : : : : 2.6 Data Step Examples : : : : : : : : : : : : : : : : : : : : 2.6.1 Simple Random Sampling Without Replacement 2.6.2 Data Recoding : : : : : : : : : : : : : : : : : : :

: : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : :

3 Working With Files

38

3.1 External Files : : : : : : : : : : : : 3.1.1 FTP Access : : : : : : : : : 3.1.2 WWW Access : : : : : : : 3.2 Including External SAS Code : : : 3.3 The SAS Data Library : : : : : : : 3.3.1 The LIBNAME Statement : 3.3.2 Library Procedures : : : : : 3.4 File Import/Export/Transport : : 3.4.1 Import/Export : : : : : : : 3.4.2 Transport : : : : : : : : : : 3.5 The X Files : : : : : : : : : : : : :

: : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : :

4 The Macro Language 4.1 4.2 4.3 4.4

Macro Variables : : Macro Procedures : Bootstrap Example : Cluster Dendrogram

38 44 45 45 45 46 47 51 51 53 55

57 : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : :

5 SAS Special Files 5.1 Autoexec.sas 5.2 Con g.sas : : 5.3 Pro le.sct : :

32 33 35 35 36

57 59 62 66

70 : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : :

6 SAS Internet Tools 6.1 Capturing OUTPUT for the Web :

70 72 75

76 : : : : : : : : : : : : : : : : : : : : : : :

76

Chapter 1

Introduction The SAS1 system, composed of many diverse components, is a very powerful programming environment, data management and data analysis environment, and report generation and graphics presentation environment. This manuscript was developed for a short-course in \advanced SAS." Obviously the coverage will have to be quite limited. The coverage is designed around material that I have encountered through my teaching, research, and statistical consulting work that I believe will be relevant and useful for others dealing with basic data management and statistical analysis needs. This manuscript is not intended as a SAS language or SAS system reference manual, it hardly scratches the surface. Nor is it designed to show how to do statistical analysis with the SAS system. The manuscript will rst focus on the data step, as a lot of the power of the SAS environment can be demonstrated through the data step. Next, the input/output and library system will be discussed. Later the macro language will be introduced. And nally several chapters dealing with various parts of the system, graphics, data analysis procedures, and the internet will be introduced. As this is an \advanced" course, some items will be introduced before they are actually covered in some detail. This was purposefully done so as to avoid completely arti cial-looking or contrived examples (although a few do exist, sorry). Further, since not all of the \basics" are covered, keep copies of the SAS manuals available. The best way to learn the SAS system and to bene t from this course is to experiment with the examples and to create your own. Again, DO NOT hessitate to modify the examples and to create new ones.

SAS, SAS/BASE, SAS/GRAPH, SAS/ACCESS, SAS/ASSIST, SAS/FSP, SAS/INSIGHT, SAS/OR, SAS/ETS, SAS/IntrNet, SAS/IML, and SAS/STAT are registered trademarks or copyrights of SAS Institute, Inc., Cary, NC. 1

3

Chapter 2

The DATA Step 2.1 The DATA STEP process 2.1.1 An implicit loop To understand much of what happens in the data step, one rst needs to understand its overall design. When originally conceived, the SAS data step was designed to get data stored in some \raw" format into the SAS data format and to perform any transformations and computations on that data prior to data analysis with procedures that would follow. Thus, the data step was designed with an implicit loop around the data input. That is, rather than the programmer having to explicitly write a loop around the input code, as would need to be done with FORTRAN (and most other programming languages), the loop was already assumed to be needed, and was, therefore, automatically supplied. At the end of the implied loop, the resulting data, in the form of variables, is output to the new SAS data set. The programmer writes the code needed to process a single observation of data, and the data step will then automatically repeat this same code for each observation, outputing each new observation in turn into the new SAS data set. This same basic process is also followed when a SAS data set is created from an existing SAS data set, such as when several SAS data sets are concatenated or merged together. The example below illustrates the basic looping process using a portion of the ier data set. Title2 "Simple Data Step"; Data One; Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL; Datalines; 7 21 74 2 7 3 0 5 3.5 1.4 57.00 1 9 76 2 3 2 0 1 5.3 3.0 0.00 12 18 74 2 4 1 0 5 5.4 3.4 83.20 2 15 76 2 1 1 0 5 6.0 5.7 111.00 9 13 75 2 2 2 1 5 10.1 23.4 203.00 ; Proc Print Data=One; Run;

4

CHAPTER 2. THE DATA STEP

5

Data Step Examples Simple Data Step OBS

MO

DAY

YR

AR

ST

SEX

AGE

SN

LT

WT

TSL

1 2 3 4 5

7 1 12 2 9

21 9 18 15 13

74 76 74 76 75

2 2 2 2 2

7 3 4 1 2

3 2 1 1 2

0 0 0 0 1

5 1 5 5 5

3.5 5.3 5.4 6.0 10.1

1.4 3.0 3.4 5.7 23.4

57.00 0.00 83.20 111.00 203.00

2.1.2 RETURN, DELETE, and OUTPUT The behavior of the data step loop can be modi ed by several statements. The RETURN statement causes execution of a loop to \return" to a speci c point in a loop. When the loop is the implicit data step loop, execution returns immediately to the beginning of the data step loop. If no OUTPUT statements are contained in the data step, then the RETURN statement also outputs the current observation in whatever stat it is in into the SAS data set. Title2 "RETURN Statement"; Data One; Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL; If TSL=0 Then RETURN; ConvFact=Lt/TSL; Datalines; 7 21 74 2 7 3 0 5 3.5 1.4 57.00 1 9 76 2 3 2 0 1 5.3 3.0 0.00 12 18 74 2 4 1 0 5 5.4 3.4 83.20 2 15 76 2 1 1 0 5 6.0 5.7 111.00 9 13 75 2 2 2 1 5 10.1 23.4 203.00 ; Proc Print Data=One; Run; Data Step Examples RETURN Statement OBS

MO

DAY

YR

AR

ST

1 2 3 4 5

7 1 12 2 9

21 9 18 15 13

74 76 74 76 75

2 2 2 2 2

7 3 4 1 2

SEX

AGE

3 2 1 1 2

0 0 0 0 1

SN 5 1 5 5 5

LT

WT

3.5 5.3 5.4 6.0 10.1

1.4 3.0 3.4 5.7 23.4

TSL

CONVFACT

57.00 0.00 83.20 111.00 203.00

0.061404 . 0.064904 0.054054 0.049754

Notice that the conversion factor, CONVFACT, for the second observation is missing (represented by a period). This observation could be dropped from the data set by several methods. We'll consider a couple to further illustrate the behavior of the data step. The OUTPUT statement can be used to output an observation to a data set or data sets. When it is present in a data step, observations will ONLY BE OUTPUT when the OUTPUT statement is executed. Consider the example below using both the RETURN and OUTPUT statements.

CHAPTER 2. THE DATA STEP

6

Title2 "RETURN and OUTPUT Statements"; Data One; Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL; If TSL=0 Then RETURN; ConvFact=Lt/TSL; OUTPUT; Datalines; 7 21 74 2 7 3 0 5 3.5 1.4 57.00 1 9 76 2 3 2 0 1 5.3 3.0 0.00 12 18 74 2 4 1 0 5 5.4 3.4 83.20 2 15 76 2 1 1 0 5 6.0 5.7 111.00 9 13 75 2 2 2 1 5 10.1 23.4 203.00 ; Proc Print Data=One; Run; Data Step Examples RETURN and OUTPUT Statements OBS

MO

DAY

YR

AR

ST

1 2 3 4

7 12 2 9

21 18 15 13

74 74 76 75

2 2 2 2

7 4 1 2

SEX

AGE

3 1 1 2

0 0 0 1

SN 5 5 5 5

LT

WT

3.5 5.4 6.0 10.1

1.4 3.4 5.7 23.4

TSL

CONVFACT

57.00 83.20 111.00 203.00

0.061404 0.064904 0.054054 0.049754

Note that the second observation was never output to the new SAS data set. Now, what if we would like the observations with zero (or no) total scale length (TSL) to be placed into one data set and the others to be placed into another data set. Let's use the OUTPUT statement to do this for us. Title2 "OUTPUT to Different Data Sets"; Data WithTSL NoTSL; Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL; ConvFact=Lt/TSL; If TSL=0 Then OUTPUT NoTSL; Else OUTPUT WithTSL; Datalines; 7 21 74 2 7 3 0 5 3.5 1.4 57.00 1 9 76 2 3 2 0 1 5.3 3.0 0.00 12 18 74 2 4 1 0 5 5.4 3.4 83.20 2 15 76 2 1 1 0 5 6.0 5.7 111.00 9 13 75 2 2 2 1 5 10.1 23.4 203.00 ; Title3 "With TSL > 0"; Proc Print Data=WithTSL; Run; Title3 "TSL Not Measured"; Proc Print Data=NoTSL; Run;

CHAPTER 2. THE DATA STEP

7

Data Step Examples OUTPUT to Different Data Sets With TSL > 0 OBS

MO

DAY

YR

AR

ST

1 2 3 4

7 12 2 9

21 18 15 13

74 74 76 75

2 2 2 2

7 4 1 2

SEX

AGE

3 1 1 2

0 0 0 1

SN 5 5 5 5

LT

WT

TSL

CONVFACT

3.5 5.4 6.0 10.1

1.4 3.4 5.7 23.4

57.00 83.20 111.00 203.00

0.061404 0.064904 0.054054 0.049754

Data Step Examples OUTPUT to Different Data Sets TSL Not Measured OBS

MO

DAY

YR

AR

ST

SEX

AGE

1

1

9

76

2

3

2

0

SN

LT

WT

TSL

CONVFACT

1

5.3

3

0

.

One problem that we will have here is that the conversion factor will be computed on each and every observation and so we get a divide by zero error on the second observation (TSL=0). Further, since no conversion factor is possible when the scale length is not measured, we should like to drop the CONVFACT variable from the NoTSL data set.

2.1.3 Compound Statements A compound statement is a programming statement that consists of several simple statements. In the SAS Language, compound statements are constructed using the DO and END statements. Often a compound statement is the object of a conditional statement such as the IF statement. Using the Flier data set from the previous example, we can prohibit the divide by zero error and output the observations to the proper data sets using a compound statement. Title2 "Compound Statements"; Data WithTSL NoTSL; Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL; If TSL=0 Then OUTPUT NoTSL; Else Do; ConvFact=Lt/TSL; OUTPUT WithTSL; End; Datalines; 7 21 74 2 7 3 0 5 3.5 1.4 57.00 1 9 76 2 3 2 0 1 5.3 3.0 0.00 12 18 74 2 4 1 0 5 5.4 3.4 83.20 2 15 76 2 1 1 0 5 6.0 5.7 111.00 9 13 75 2 2 2 1 5 10.1 23.4 203.00 ; Title3 "With TSL > 0"; Proc Print Data=WithTSL; Run; Title3 "TSL Not Measured"; Proc Print Data=NoTSL; Run;

CHAPTER 2. THE DATA STEP

8

Data Step Examples Compound Statements With TSL > 0 OBS

MO

DAY

YR

AR

ST

1 2 3 4

7 12 2 9

21 18 15 13

74 74 76 75

2 2 2 2

7 4 1 2

SEX

AGE

3 1 1 2

0 0 0 1

SN

WT

TSL

CONVFACT

3.5 5.4 6.0 10.1

1.4 3.4 5.7 23.4

57.00 83.20 111.00 203.00

0.061404 0.064904 0.054054 0.049754

5 5 5 5

LT

Data Step Examples Compound Statements TSL Not Measured OBS

MO

DAY

YR

AR

ST

SEX

AGE

1

1

9

76

2

3

2

0

SN

LT

WT

TSL

CONVFACT

1

5.3

3

0

.

Unfortunately, the CONVFACT variable is still present in the NoTSL data set, although it was not computed for any observations in this data set. A DROP statement could be used to remove the CONVFACT variable from ALL output data sets, but this is not what we would like either.

2.1.4 Data Set Options There are a number of options that can be speci ed for accessing SAS data sets, including passwords to protect a data set from being read, edited, or written to. These options are speci ed within parenthesis following the data set name and can be used within a data step or a procedure step. Some options that are commonly used are Speci es a list of variables to be retained in a new data set. DROP= Speci es a list of variables to be excluded from a new data set. LABEL= Speci es a label that is to be stored in the SAS data set and will be shown using various data set utilities. RENAME= Permits variable names to be changed. The old names are available in the current data step, while the new names will actually be stored in the new data set. KEEP=

Options that are used for the processing of an existing SAS data set, such as when using SET, MERGE, and UPDATE statements, include Speci es a list of variables to be accessible from an existing data set. DROP= Speci es a list of variables to be inaccessible from an existing data set. FIRSTOBS= Speci es the observation number with which processing should begin. OBS= Speci es the last observation number to be processed, after which processing will stop. KEEP=

CHAPTER 2. THE DATA STEP

9

IN= Names a new variable that will have the value 1 when the current observation is read from the data set and 0 when the current observation is read from another data set.

Now we can use this information to update our program so that we can have a dierent set of variables for the two new data sets. Title2 "DROP Data Set Option"; Data WithTSL NoTSL(DROP=ConvFact); Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL; If TSL=0 Then OUTPUT NoTSL; Else Do; ConvFact=Lt/TSL; OUTPUT WithTSL; End; Datalines; 7 21 74 2 7 3 0 5 3.5 1.4 57.00 1 9 76 2 3 2 0 1 5.3 3.0 0.00 12 18 74 2 4 1 0 5 5.4 3.4 83.20 2 15 76 2 1 1 0 5 6.0 5.7 111.00 9 13 75 2 2 2 1 5 10.1 23.4 203.00 ; Title3 "With TSL > 0"; Proc Print Data=WithTSL; Run; Title3 "TSL Not Measured"; Proc Print Data=NoTSL; Run; Data Step Examples DROP Data Set Option With TSL > 0 OBS

MO

DAY

YR

AR

ST

1 2 3 4

7 12 2 9

21 18 15 13

74 74 76 75

2 2 2 2

7 4 1 2

SEX

AGE

3 1 1 2

0 0 0 1

SN

WT

TSL

CONVFACT

3.5 5.4 6.0 10.1

1.4 3.4 5.7 23.4

57.00 83.20 111.00 203.00

0.061404 0.064904 0.054054 0.049754

5 5 5 5

LT

Data Step Examples DROP Data Set Option TSL Not Measured OBS 1

MO 1

DAY

YR

AR

ST

SEX

AGE

9

76

2

3

2

0

SN

LT

WT

1

5.3

3

TSL 0

Finally, this is what we had wanted. Note that the KEEP= option could also have been used, but would have required that we list each of the variables that we wanted to keep in the new data set. The choice as to KEEP or DROP is usally based upon which list is shorter or easiest to write.

CHAPTER 2. THE DATA STEP

2.1.5 DROP, KEEP, and RETAIN

10

There are a few statements in the SAS data step that are not executable statements and can be placed anywhere within the SAS data step. As with the DROP= and KEEP= data set options, the DROP and KEEP statements are used to specify which variables are to be excluded or included in the new data set or data sets being created. Others include the FORMAT, LENGTH, and ARRAY statements. The RETAIN statement can also be placed anywhere but can serve several roles. Normally, after the variables have been output for an observation at the end of the data step loop, the variables' values are reset to missing before the next observation is processed. The RETAIN statement alters this behavior by not resetting the data values for any variables given in its list of variables. Another use of the RETAIN statement is to give initial values to speci c variables. Examples of the RETAIN statement will be encountered later.

2.2 Input/Output One of the many powerful features of the SAS language is the diversity of methods, and modi cations to them, for inputing data into a SAS data set. We will review several of the more important methods and then consider some options and modi cations that become more and more useful as the data become more complicated to read. For reading data from a spreadsheet, for example, you may want to skip to the section on le import and export on page 51. We will next look at output methods that can be very useful for generating reports that have to be in very speci c formats. The basic statement used for reading raw data from a le is the INPUT statement. It takes as its arguments a list of variables, pointer placement instructions, and informat information. For options that are used to modify the standard behavior of the input process, the INFILE statement is used.

2.2.1 List Input The list input method is the simplest of methods and works in very many situations. Essentially, one speci es the variable names in the order in which they occur in the data set with no pointer placement instructions. The values on a line will be read in order and placed into the variables according to their order in the INPUT statement. This is seen in the \Simple Data Step" example on page 4. The behavior of the input process, however, depends upon having at least as many data values on each line as there are variables in the INPUT statement. If the number of data values is less than the number of variables speci ed in the INPUT statement, then, by default, the input pointer will be moved to the beginning of the next line of input and the remainder of the variables will be lled using data on this new line. The SAS data step parser will report the following information in the SAS LOG: NOTE: SAS went to a new line when INPUT statement reached past the end of a line.

CHAPTER 2. THE DATA STEP

11

This may not be what you have intended. Consider the example below where there are several missing data values scattered throughout the input data set. Note that the data values become mismatched with which variables they should correspond with. Title2 "List Input / Data One; Input Mo Day Yr Ar Datalines; 7 21 74 2 7 1 9 76 2 3 12 18 74 2 15 76 2 1 9 13 75 2 2 1 28 75 2 3 ;

Missing Data"; St Sex Age Sn Lt Wt TSL; 3 2 1 1 2 1

0 0 0 0 1 1

5 3.5 1 5.3 5 5.4 5 6.0 5 5 13.5

3.0 3.4 5.7 23.4 33.2

0.00 83.20 111.00 203.00 242.00

Proc Print Data=One; Run; Data Step Examples List Input / Missing Data OBS

MO

DAY

YR

AR

ST

1 2 3

7 12 9

21 18 13

74 74 75

2 1 2

7 0 2

SEX

AGE

SN

LT

WT

TSL

3 5 2

0.0 5.4 1.0

5.0 3.4 5.0

3.5 83.2 23.4

1 2 203

9 15 1

First of all, the month and day information for observation 2 was used as the weight and total scale length data for observation 1. Second, observe that the remainder of the data for observation 2 was discarded. Third, observation 3 has missing data for the area and station variables, but the sex and age data were used for these variables. Finally, we note that instead of 6 observations, the nal data set has only 3. What has happened is the following. When list input is used, the input processor simply scans a line of data until it comes to non-blank information. It will then place this data into the next variable, in order, from the INPUT statement. It does not matter in which columns the data are placed on the line, except that their left-to-right order corresponds with the order of the variables in the INPUT statement. If insucient values are found to ll the variables, then the processor moves on to the next line of data. When all variables have been lled, the processor stops reading in data and the current line of data being read is, by default, discarded. This is why the remainder of line 2 was not used. You should now be able to duplicate the assignment of the data to the variables by following the above rules. How to modify the list input to handle the missing data? If the missing data all occur at the end of a data line, as might happen with repeated measurements data where dropouts would have no data after dropping out, then the MISSOVER option of the INFILE statement can be used. This option speci es that the pointer is NOT to be moved to the next line if no more values are found on a data line, rather, the remaining variables are to be assigned the value for missing data (by default, a period for numeric data and a blank for character data).

CHAPTER 2. THE DATA STEP

12

Title2 "Missing Data / Missover Option"; Data Trout; /* Flack Lake trout Catch Data*/ Infile Datalines MISSOVER; Input Year Age3-Age9; Datalines; 1975 0 105 674 446 16 2 1976 46 422 838 726 70 4 1977 3 310 1224 1068 65 1978 14 354 1264 1172 69 0 1979 6 429 1222 1067 192 ; Proc Print Data=Trout; Run;

2 4 6

Data Step Examples Missing Data / Missover Option OBS

YEAR

AGE3

AGE4

AGE5

AGE6

AGE7

1 2 3 4 5

1975 1976 1977 1978 1979

0 46 3 14 6

105 422 310 354 429

674 838 1224 1264 1222

446 726 1068 1172 1067

16 70 65 69 192

AGE8 2 4 . 0 .

AGE9 2 4 . 6 .

Notice that we did get 5 observations with the correct data values. In this case the missing values are not truly missing, but rather are zero values. The data step below would change the missing values to zeros. See the section on looping and arrays on page 26 to see how to solve this problem more generally and easily. Title2 "Missing Data / Missover Option"; Data Trout; /* Flack Lake trout Catch Data*/ Infile Datalines MISSOVER; Input Year Age3-Age9; If Age8=. Then Age8=0; If Age9=. Then Age9=0; Datalines; 1975 0 105 674 446 16 2 1976 46 422 838 726 70 4 1977 3 310 1224 1068 65 1978 14 354 1264 1172 69 0 1979 6 429 1222 1067 192 ;

2 4 6

Now what if the missing data are interior to other data line values? In order to continue to use list input, one needs to enter the missing value symbol in the data lines for these values. This is illustrated below. Title2 "List Input / Data One; Input Mo Day Yr Ar Datalines; 7 21 74 2 7 1 9 76 2 3 12 18 74 . . 2 15 76 2 1 9 13 75 2 2 1 28 75 2 3 ;

Missing Data"; St Sex Age Sn Lt Wt TSL; 3 2 1 1 2 1

0 0 0 0 1 1

5 3.5 1 5.3 5 5.4 5 6.0 5 . 5 13.5

. 3.0 3.4 5.7 23.4 33.2

. 0.00 83.20 111.00 203.00 242.00

CHAPTER 2. THE DATA STEP

13

Proc Print Data=One; Run; Data Step Examples List Input / Missing Data OBS

MO

DAY

YR

AR

ST

SEX

AGE

SN

LT

WT

1 2 3 4 5 6

7 1 12 2 9 1

21 9 18 15 13 28

74 76 74 76 75 75

2 2 . 2 2 2

7 3 . 1 2 3

3 2 1 1 2 1

0 0 0 0 1 1

5 1 5 5 5 5

3.5 5.3 5.4 6.0 . 13.5

. 3.0 3.4 5.7 23.4 33.2

TSL . 0.0 83.2 111.0 203.0 242.0

2.2.2 Column Input When the values for variables are placed in speci c columns, as can be produced from a printed spreadsheet or data base program, then column input can be very valuable. In addition, since it only looks for data in the speci ed columns, it is easy to skip over unwanted information and to easily handle missing data. Consider the data set above with the internal missing values, then read the data using column input. If you are using the SAS Program Editor window to edit the data and you have \line numbers" turned on, then you can enter the \cols" line command over any of the line numbers and you will get a reference line to determine the columns from. Title2 "Column Input / Missing Data"; Data One; Input Mo 2-3 Day 6-7 Yr 10-11 Ar 14 St 18 Sex 22 Age 27 Sn 31 Lt 32-36 Wt 37-42 TSL 43-50; /* Some comment lines to help us find the columns */ /* 1 2 3 4 5*/ /*345678901234567890123456789012345678901234567890*/ Datalines; 7 21 74 2 7 3 0 5 3.5 1 9 76 2 3 2 0 1 5.3 3.0 0.00 12 18 74 1 0 5 5.4 3.4 83.20 2 15 76 2 1 1 0 5 6.0 5.7 111.00 9 13 75 2 2 2 1 5 23.4 203.00 1 28 75 2 3 1 1 5 13.5 33.2 242.00 ; Proc Print Data=One; Run; Data Step Examples Column Input / Missing Data OBS

MO

DAY

YR

AR

ST

SEX

AGE

SN

LT

WT

1 2 3 4 5 6

7 1 12 2 9 1

21 9 18 15 13 28

74 76 74 76 75 75

2 2 . 2 2 2

7 3 . 1 2 3

3 2 1 1 2 1

0 0 0 0 1 1

5 1 5 5 5 5

3.5 5.3 5.4 6.0 . 13.5

. 3.0 3.4 5.7 23.4 33.2

TSL . 0.0 83.2 111.0 203.0 242.0

CHAPTER 2. THE DATA STEP

14

One can also mix the various input methods. Thus, the input line could have been written as Input Mo Day Yr Ar 14 St 18 Sex Age Sn Lt 32-36 Wt 37-42 TSL 43-50;

since the date, sex, age, and scale number information were never missing. It is more common to use either all list or all column input.

2.2.3 Pointer Control and Formatted Input There are also a number of special pointer controls that can move the pointer around in the input line or to change its behavior. Similar to column input, the pointer can be positioned at a particular column using @n, then list (or formatted) input can be used to read the data in, where the value n is the column number. One can also use the symbols +n and -n to move the pointer forward or backward along the line. When the data for an observation falls on more than a single physical input line, the pointer can be moved on to additional lines using the symbol #n, where n is the line number relative to the current line. For character data, the symbol & can be used to tell the pointer that a single space does not separate variable values for a speci c character variable. The ier data set is modi ed so that we can illustrate these techniques. The data for a single sh will now be placed on 2 physical lines for input and the rst variable will be the scale reader's name. Title2 "Pointer Control / Multiple Input Lines"; Data One; Length Name $13; Input @1 Name & Mo Day Yr @26 Ar 1. @30 St 1. Sex Age Sn #2 @1 Lt 5.1 @6 Wt 6.1 @12 TSL 8.2; /* Some comment lines to help us find the columns */ /* 1 2 3 4 5*/ /*345678901234567890123456789012345678901234567890*/ Datalines; Bill Smith 7 21 74 2 7 3 0 5 3.5 Bill Smith 1 9 76 2 3 2 0 1 5.3 3.0 0.00 John P. Doe 12 18 74 1 0 5 5.4 3.4 83.20 John P. Doe 2 15 76 2 1 1 0 5 6.0 5.7 111.00 John P. Doe 9 13 75 2 2 2 1 5 23.4 203.00 Bill Smith 1 28 75 2 3 1 1 5 13.5 33.2 242.00 ; Proc Print Data=One; Run; Data Step Examples Pointer Control / Multiple Input Lines OBS 1 2 3 4

NAME Bill Bill John John

Smith Smith P. Doe P. Doe

MO

DAY

YR

AR

ST

SEX

AGE

7 1 12 2

21 9 18 15

74 76 74 76

2 2 . 2

7 3 . 1

3 2 1 1

0 0 0 0

SN 5 1 5 5

LT

WT

3.5 5.3 5.4 6.0

. 3.0 3.4 5.7

TSL . 0.0 83.2 111.0

CHAPTER 2. THE DATA STEP 5 6

John P. Doe Bill Smith

9 1

13 28

75 75

15 2 2

2 3

2 1

1 1

5 5

. 13.5

23.4 33.2

203.0 242.0

Note that we had to leave at least 2 spaces separating the reader's name from the month value since we used the & symbol, otherwise the month value would have been read as part of the name. Secondly, since we did not specify a character format, we needed to de ne the length of the character variable using the LENGTH statement. Otherwise the length would have been determined from the rst value read in. Using formatted character input we could have used the INPUT statement Input @1 Name $Char13. Mo Day Yr @26 Ar 1. @30 St 1. Sex Age Sn #2 @1 Lt 5.1 @6 Wt 6.1 @12 TSL 8.2;

and have dropped the LENGTH statement from the program. SAS also has input formats, called \informats" for reading in other types of data such as social security numbers, phone numbers, and dates and times. We will give some examples of working with SAS dates and times later. The input formats can be explicitly given on the INPUT statement, or can be assigned to the variables using the INFORMAT statement. When reading complex data les controlling the pointer can be very important. Normally after the input statement has been executed for an observation, the physical input line is discarded and the next physical input line is moved into the input buer. Sometimes, however, you would like to read what is on the physical line at speci c points, say looking for special key words, then you use an input statement that depends upon the key word. The \trailing" @ and @@ signs can be used to hold the pointer on the current input line. The single @ sign holds the line until another input statement releases it or until the data step loop restarts. The double @@ sign holds the current line even after the data step loop restarts. They are called \trailing" because the symbol is placed at the very end of the input statement (just before the semicolon).

Example: Multiple observations per input line In this example more than one observation is placed on a single physical line. To keep the example simple, we will assume that there is no missing data, or if missing data occurr they are indicated in the data set by a \." surrounded by a space. The data set consists of lengths and weights of ier sun sh. Title2 "Multiple Observations Per Input Line"; Data One; Input Lt Wt @@; Datalines; 3.5 1.4 5.3 3.0 5.4 3.4 6.0 5.7 10.1 23.4 13.5 33.2 8.7 16.8 11.9 44.4 12.0 40.9 16.0 103.4 9.1 17.9 10.5 29.1 17.2 127.1 17.9 132.7 12.1 47.9 17.2 136.4 17.3 138.3 17.5 134.4 13.2 42.2 16.4 110.0 16.7 101.6 15.3 92.9 15.4 76.3 18.0 125.8 18.5 131.6 ; Proc Print Data=One; Run;

CHAPTER 2. THE DATA STEP

16

Data Step Examples Multiple Observations Per Input Line OBS 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

LT

WT

3.5 5.3 5.4 6.0 10.1 13.5 8.7 11.9 12.0 16.0 9.1 10.5 17.2 17.9 12.1 17.2 17.3 17.5 13.2 16.4 16.7 15.3 15.4 18.0 18.5

1.4 3.0 3.4 5.7 23.4 33.2 16.8 44.4 40.9 103.4 17.9 29.1 127.1 132.7 47.9 136.4 138.3 134.4 42.2 110.0 101.6 92.9 76.3 125.8 131.6

Example: Conditional input In this example, we have data on lengths and weights of sh collected on dierent sampling dates. To keep the amount of typing to a minimun, the data were coded such that the date of collection falls on one line by itself and then on separate lines come the length-weight data pairs measured on that date. Then the next date appears followed by its length-weight data pairs. Since there need not be the same number of sh measured on each date, we must test to see whether the input line contains a date or contains length-weight data. Since all of the date lines contain a \/" while none of the length-weight data lines do, we can test for the presence of the \/" on the input line using the INDEX() function.

CHAPTER 2. THE DATA STEP Title2 "Conditional Input"; Data One; Drop Test; Retain Date; Length Test $8; Input Test @; If Index(Test,'/') Then /* is a date value */ Input @1 Date MMDDYY8.; Else /* is a fish LT WT observation */ Do; Input @1 Lt Wt; Output; End; Datalines; 4/6/97 3.5 1.4 5.3 3.0 5.4 3.4 6.0 5.7 10.1 23.4 13.5 33.2 8.7 16.8 11.9 44.4 12.0 40.9 16.0 103.4 9.1 17.9 5/12/97 10.5 29.1 17.2 127.1 17.9 132.7 12.1 47.9 17.2 136.4 17.3 138.3 17.5 134.4 6/4/97 13.2 42.2 16.4 110.0 16.7 101.6 15.3 92.9 15.4 76.3 18.0 125.8 18.5 131.6 ; Proc Print Data=One; Format Date MMDDYY8.; Run;

17

CHAPTER 2. THE DATA STEP

18

Data Step Examples Conditional Input OBS

DATE

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

04/06/97 04/06/97 04/06/97 04/06/97 04/06/97 04/06/97 04/06/97 04/06/97 04/06/97 04/06/97 04/06/97 05/12/97 05/12/97 05/12/97 05/12/97 05/12/97 05/12/97 05/12/97 06/04/97 06/04/97 06/04/97 06/04/97 06/04/97 06/04/97 06/04/97

LT

WT

3.5 5.3 5.4 6.0 10.1 13.5 8.7 11.9 12.0 16.0 9.1 10.5 17.2 17.9 12.1 17.2 17.3 17.5 13.2 16.4 16.7 15.3 15.4 18.0 18.5

1.4 3.0 3.4 5.7 23.4 33.2 16.8 44.4 40.9 103.4 17.9 29.1 127.1 132.7 47.9 136.4 138.3 134.4 42.2 110.0 101.6 92.9 76.3 125.8 131.6

Note that we needed to retain the date variable so that its value would be maintained through each loop of the data step. The date is only changed when a date value is found on the input line. Also note that in this example the date format for the DATE variable was speci ed in the PROC PRINT section rather than in the data step. If the FORMAT statement is placed in the data step, then the format will be used by future procedures. If it is placed in a procedure step, then the format is local to that procedure and will not carry on to future procedures.

2.2.4 The PUT Statement The PUT statement provides the output interface for the SAS data step. It works very similar to the INPUT statement in that a list of variables and constants can be given, or pointer control can be used to position the pointer to get data placed into speci c positions, or a combination of them can be used. Additionally, SAS formats (to be discussed later on page 19) can also be used. This all provides for a very powerful report writing tool. A simple example will be used to illustrate some of this methodology.

CHAPTER 2. THE DATA STEP

19

Title2 "Report Writing"; Data One; Input @1 Name $Char13. Mo Day Yr @26 Ar 1. @30 St 1. Sex Age Sn #2 @1 Lt 5.1 @6 Wt 6.1 @12 TSL 8.2; File PRINT; /* Send to the OUTPUT window */ If _N_ = 1 Then Put // @10 "Flier Sunfish Scale Report"; PUT @5 Name $13. @20 Mo 2. '/' Day 2. '/' Yr 2. @30 "Area=" Ar @41 St= / @5 Sex Age Sn Lt 10.3 Wt 10.0 TSL 10.2; Datalines; Bill Smith 7 21 74 2 7 3 0 5 3.5 Bill Smith 1 9 76 2 3 2 0 1 5.3 3.0 0.00 John P. Doe 12 18 74 1 0 5 5.4 3.4 83.20 John P. Doe 2 15 76 2 1 1 0 5 6.0 5.7 111.00 John P. Doe 9 13 75 2 2 2 1 5 23.4 203.00 Bill Smith 1 28 75 2 3 1 1 5 13.5 33.2 242.00 ; Data Step Examples Report Writing Flier Sunfish Scale Report Bill Smith 7/21/74 Area=2 ST=7 3 0 5 3.500 . . Bill Smith 1/ 9/76 Area=2 ST=3 2 0 1 5.300 3 0.00 John P. Doe 12/18/74 Area=. ST=. 1 0 5 5.400 3 83.20 John P. Doe 2/15/76 Area=2 ST=1 1 0 5 6.000 6 111.00 John P. Doe 9/13/75 Area=2 ST=2 2 1 5 . 23 203.00 Bill Smith 1/28/75 Area=2 ST=3 1 1 5 13.500 33 242.00

There are several features to notice in this particular example, besides its very unattractive appearance. Note that string constants can be printed in the report simply by enclosing the data within quotes. Note also that the output format does not have to be the same as the input format. Further, the name of a variable along with its value can be obtained simply by listing the variable's name on the PUT statement followed immediately by the equals sign. The pointer control symbol / is used to move the pointer to the next line. The FILE statement can also be used to redirect the output report to a le or device such as a printer.

2.2.5 SAS Formats and Informats The SAS System provides a large number of built-in formats, as well as informats. The date and time formats can be especially useful. When date data are input using a date informat, the date is actually stored internally as a numerical value representing the number of days

CHAPTER 2. THE DATA STEP

20

since a speci c date. We'll see what that value is in a moment. What this means for us is several things. First, it becomes easy to sort the data by date without having to worry about a character values such as \01/19/97" being sorted before \05/03/92". Secondly, it is easy to compute the number of days between any two SAS dates, simply take the dierence between them. Thirdly, the output format can be changed from that used upon input. Let's look at an example. Title2 "SAS Dates"; Data One; Retain Now; If _N_=1 Then Now=Today(); /* Get the current day */ Input Name $Char10. Birthday mmddyy8.; If Birthday=. Then Birthday=0; Bday=Birthday; Sday=Birthday; MDYday=Birthday; DaysOld=Now-Birthday; Format Bday WeekDatX29. Sday Date7. Now MDYday mmddyy8.; Datalines; Bob 12/08/87 Erica 3/15/92 Sammy 06/7/57 Keith . ; Proc Sort Data=One; By Birthday; Run; Proc Print Data=One NoObs; Var Name MDYday Sday Bday Birthday Now DaysOld; Run; Data Step Examples SAS Dates NAME Sammy Keith Bob Erica

MDYDAY

SDAY

06/07/57 01/01/60 12/08/87 03/15/92

07JUN57 01JAN60 08DEC87 15MAR92

1

BDAY BIRTHDAY Friday, 7 June Friday, 1 January Tuesday, 8 December Sunday, 15 March

1957 1960 1987 1992

-938 0 10203 11762

NOW DAYSOLD 09/16/97 09/16/97 09/16/97 09/16/97

14711 13773 3570 2011

Now we discover that SAS dates are relative to January 1, 1960 as this is the date corresponding to the SAS date of zero. Note that because we did not assign a SAS date format to the BIRTHDAY variable, that the actual SAS date value was printed. The example also demonstrated the SAS formats WEEKDATXn. and DATEn., where n is the format width. If the width is not sucient to write out the entire names, as requested, then abbreviations will be used where possible. Decimal values of SAS dates can also be used to contain internal values for time. See the BASE SAS documentation for these formats. SAS date-time values are entered in a date set as MM/DD/YY:hh:mm:ss where MM/DD/YY is the date, while hh is the hour, mm is the minutes, and ss is the seconds in the time. When typed within the data step, such as in an IF statement, a date is enclosed in quotes and followed by the letter d, such as "09/29/97"d while times are followed by the letter t, such as "8:30"t. Note that hour takes on the values 0 through 23, where 0 is midnight. Formats can also be constructed using PROC FORMAT. Once constructed, these formats can be used as with any other format. These formats provide a very nice way for coding data very simply for input, but then producing reports with very nice labels for values. We'll create a format for the SEX variable used in the ier data and use it to write out the labels

CHAPTER 2. THE DATA STEP

21

on the printout. Title2 "Proc FORMAT"; Proc Format; Value Sex 1="Male" 2="Female" 3="Unknown"; Run; Data One; Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL; Format Sex Sex.; Datalines; 7 21 74 2 7 3 0 5 3.5 1.4 57.00 1 9 76 2 3 2 0 1 5.3 3.0 0.00 12 18 74 2 4 1 0 5 5.4 3.4 83.20 2 15 76 2 1 1 0 5 6.0 5.7 111.00 9 13 75 2 2 2 1 5 10.1 23.4 203.00 ; Proc Print Data=One; Run; Data Step Examples Proc FORMAT OBS

MO

DAY

YR

AR

ST

1 2 3 4 5

7 1 12 2 9

21 9 18 15 13

74 76 74 76 75

2 2 2 2 2

7 3 4 1 2

SEX Unknown Female Male Male Female

AGE

SN

LT

WT

0 0 0 0 1

5 1 5 5 5

3.5 5.3 5.4 6.0 10.1

1.4 3.0 3.4 5.7 23.4

TSL 57.00 0.00 83.20 111.00 203.00

2.3 SAS Functions The SAS system provides a very large number of functions for computing a wide diversity of things. Some of the more common functions that are encountered in basic data analysis and management are described below. Keep in mind that there are many more functions and function families than are described herein.

2.3.1 Mathematical Functions ABS(value) returns

the absolute value of the numeric argument. COS(value) returns the cosine in radians of value. EXP(value) returns the constant e raised to the power given by value. INT(value) returns the integer part of a real number. LOG(value) returns the natural logarithm of value. LOG10(value) returns the base 10 logarithm of value. MOD(value,divisor) returns the integer remainder when value is divided by divisor.

CHAPTER 2. THE DATA STEP

22

ROUND(value,decimals) returns value

rounded o to the nearest value based upon the decimals value. For example, ROUND(123.456,0.1) returns 123:5 and ROUND(123.456,10.0) returns 120. If the decimals value is omitted, a value of 1 is assumed. SIN(value) returns the sine of value. SQRT(value) returns the square root of value. SUM(v1,v2,...,vn) returns the sum of the non-missing values contained in the argument list.

2.3.2 Random Number Generators All of the random number generators require a seed to start them. A seed of zero can be used to seed the generator with a value derived from the system clock. For a given positive integer seed, a generator will return exactly the same random number series. There are other generators available, and through programming and the use of existing generators, variates from other distributions can be generated. returns a standard normal random variate. seed is the value which speci es the position within the psuedorandom number stream the variates are selected from. RANBIN(seed,n,p) returns a binomial random variate from the binomial distribution with parameters n and p. RANNOR(seed) is the same as the NORMAL(seed) function. RANPOI(seed,lambda) returns a Poisson random variate from the Poisson distribution with parameter lambda. RANUNI(seed) returns a continuous uniform random variate from the interval (0; 1). UNIFORM(seed) is the same as RANUNI(seed).

NORMAL(seed)

A simple example to generate a standard normal variate Z, and from it a normal variate Y with mean u and standard deviation s is given below. Assume u=5 and s=3. Title2 "Normal Random Variates"; Data One; Drop I u s; Retain u 5 s 3; Do I=1 To 25; Z=Normal(0); /* Use system clock as seed */ Y=u + s*Z; Output; End; Run; Proc Print Data=One; Run; Proc Means Data=One Mean Std;

CHAPTER 2. THE DATA STEP

23

Var Z Y; Run; Data Step Examples Normal Random Variates OBS

Z

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

2.01531 0.58587 0.18383 -1.08207 -1.87971 -0.87702 0.63108 1.53379 -0.34128 0.47535 1.26282 0.64412 2.11873 0.25801 0.08347 -1.37542 -1.00606 -0.68600 -0.91837 -1.28510 -1.69460 -1.37999 0.35292 0.76663 0.00566

Y 11.0459 6.7576 5.5515 1.7538 -0.6391 2.3689 6.8932 9.6014 3.9761 6.4261 8.7885 6.9323 11.3562 5.7740 5.2504 0.8737 1.9818 2.9420 2.2449 1.1447 -0.0838 0.8600 6.0588 7.2999 5.0170

Data Step Examples Normal Random Variates Variable Mean Std Dev -----------------------------------Z -0.0643208 1.1334234 Y 4.8070375 3.4002701 ------------------------------------

2.3.3 String Functions String functions can be very useful for the processing of complex data sets and for subsetting data sets according to values contained within strings. Some commonly used string functions include: COMPRESS(string) returns a new string

the end of the string.

INDEX(string,value) returns

that has blanks removed and then padded at

the position in the string where the rst character of value begins within the string. If value is not contained within the string, then the function returns zero. Thus, it is very useful for checking for the presence or absence

CHAPTER 2. THE DATA STEP

24

of a certain value in a string variable, such as testing that a string variable contains the last name of a particular person. LEFT(string) will left align a character string by removing any leading blanks from the string. Note that the string's length is not changed by this function. LENGTH(string) returns the \length" of a string. Here, the \length" is de ned to be the position of the right-most non-blank character in the string, rather than the number of characters reserved for storage of the string. LOWCASE(string) converts all uppercase characters to lowercase characters in string. RIGHT(string) will right align a character string by removing trailing blanks from the end of the string and inserting them at the beginning of the string. Thus, it does not change the length of the string. SCAN(string,n,delimiters) returns the nth word from the character string string, where words are delimited by the characters in delimiters. If delimiters is omitted from the function, then blanks and most punctuation and special characters are used as the delimiters. Consult the SAS help or SAS/BASE documentation. If there are fewer words in the string than given by n, then a blank character string is returned. SUBSTR(string,start,n) returns a substring or part of string beginning with the character at the position start in the string and continuing for n characters. If n is omitted, then the remainder of the string is extracted. SUBSTR(string1,start)=string2 replaces the characters in string1 beginning at position start in string1 with string2. TRIM(string) returns a new string whose trailing blanks have been removed and whose length corresponds with the position of the last non-blank character in string. A blank string, however, is returned as a string with one blank character. TRIMN(string) is like TRIM(string), but a blank string is returned as a null string (length of zero). UPCASE(string) returns a new string will any lowercase characters replaced with their uppercase counterparts.

2.3.4 Date and Time Functions with no argument returns the current system date. DATETIME() with no argument returns the current system date and time as a SAS date-time value. DHMS(date,hour,minute,second) returns a SAS date-time value by combining a SAS date value with the hour, minute, and second values. MDY(month,day,year) returns a SAS date from month, day, and year values.

DATE()

CHAPTER 2. THE DATA STEP

25

with no argument returns the current system time. TODAY() is the same as DATE(). WEEKDAY(date) returns an integer from 1 to 7, 1=Sunday, 7=Saturday, corresponding to the day of the week for the SAS date date.

TIME()

2.3.5 PUT and INPUT Functions The PUT(value,format.) and INPUT(value,informat.) functions permit input/output (I/O) to variables rather than to some I/O device. The value can be a variable or a constant value, and the format or informat should conform to the value's type. The example below uses the date function MDY(), the string functions COMPRESS(), SCAN(), and UPCASE(), the I/O function PUT(), and the numeric function LOG(). It also uses the string operator || that concatenates or joins two strings together. Title2 "SAS Functions"; Data One; Length Name $13 Chardate $8; Input @1 Name & Mo Day Yr @26 Ar 1. @30 St 1. Sex Age Sn #2 @1 Lt 5.1 @6 Wt 6.1 @12 TSL 8.2; SASdate=MDY(Mo,Day,Yr); Chardate=Compress(Put(Mo,Z2.)||"/"||Put(Day,Z2.)||"/"||Put(Yr,Z2.)); SASdate2=Input(Chardate,mmddyy8.); If Wt NE . Then LnWt=Log(Wt); Lastname=Upcase(Scan(Name,3," ")); If Lastname=" " Then Lastname=Upcase(Scan(Name,2," ")); Format SASdate SASdate2 mmddyy8.; Datalines; Bill Smith 7 21 74 2 7 3 0 5 3.5 Bill Smith 1 9 76 2 3 2 0 1 5.3 3.0 0.00 John P. Doe 12 18 74 1 0 5 5.4 3.4 83.20 John P. Doe 2 15 76 2 1 1 0 5 6.0 5.7 111.00 John P. Doe 9 13 75 2 2 2 1 5 23.4 203.00 Bill Smith 1 28 75 2 3 1 1 5 13.5 33.2 242.00 ; Proc Print Data=One; Var Name Lastname Mo Day Yr SASdate Chardate SASdate2 Wt; Run; Data Step Examples SAS Functions OBS 1 2 3 4 5 6

NAME Bill Bill John John John Bill

Smith Smith P. Doe P. Doe P. Doe Smith

LASTNAME

MO

DAY

YR

SASDATE

CHARDATE

SASDATE2

WT

SMITH SMITH DOE DOE DOE SMITH

7 1 12 2 9 1

21 9 18 15 13 28

74 76 74 76 75 75

07/21/74 01/09/76 12/18/74 02/15/76 09/13/75 01/28/75

07/21/74 01/09/76 12/18/74 02/15/76 09/13/75 01/28/75

07/21/74 01/09/76 12/18/74 02/15/76 09/13/75 01/28/75

. 3.0 3.4 5.7 23.4 33.2

CHAPTER 2. THE DATA STEP

26

Note the use of the Zn. format to supply the leading zeros needed to mimic the look of the MMDDYYn. format. Recognize that time calculations and ordering (sorting) can be made with the SASDATE and SASDATE2 variables, while the CHARDATE varible is simply a character representation of the date. If we had wanted to \ ag" all observations with a date prior to "07/04/75"d, then it is easy using the SAS date variables, If SASDATE < "07/04/75"d Then Flag="PRIOR"; Else Flag="AFTER";

while much more programming is needed if we use the character variable. Basically, we would have to create our own \SAS date" representation of the character variable for such comparisons.

2.4 Looping and Arrays In a number of circumstances the same task needs to be performed multiple times over a set of observations, variables, or times. The implicit loop of the data step is already seen to provide a loop around the observations in an input data set. SAS arrays can be used to facilitate looping over a set variables. The ARRAY statement lists the names of variables to be treated as a set such that they can be referenced through an indexing variable to an array. The syntax of the ARRAY statement is ARRAY arrayname{number|* {,number,number,...}} list_of_variables

where arrayname is the name for the array and {number|*} is either the number of variables in the that dimension of the array or is \*" indicating that all variables listed are to be used in the one dimensional array. Using the Flier sun sh data, let's assume that we wished to create a set of new variables that were the logarithmic transformation of a set of the original variables. Rather than write a number of assignment statements, we can use arrays, a single assignment statement, and a loop to complete the task. The following SAS code illustrates the brute force method that we wish to avoid. Title2 "Brute Force Approach to Repetitive Task"; Data One; Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL; If TSL=0 Then DELETE; LnLt=Log(Lt); LnWt=Log(Wt); LnTSL=Log(TSL); Datalines; 7 21 74 2 7 3 0 5 3.5 1.4 57.00 1 9 76 2 3 2 0 1 5.3 3.0 0.00 12 18 74 2 4 1 0 5 5.4 3.4 83.20 2 15 76 2 1 1 0 5 6.0 5.7 111.00 9 13 75 2 2 2 1 5 10.1 23.4 203.00 ;

CHAPTER 2. THE DATA STEP

27

Now let's re-write this program using arrays and the basic \do" loop. Title2 "Arrays and Do Loop"; Data One(DROP=I); Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL; If TSL=0 Then DELETE; Array RawVars{3} Lt Wt TSL; Array NewVars{3} LnLt LnWt LnTSL; Do I=1 To 3; NewVars(I)=Log(RawVars(I)); End; Datalines; 7 21 74 2 7 3 0 5 3.5 1.4 57.00 1 9 76 2 3 2 0 1 5.3 3.0 0.00 12 18 74 2 4 1 0 5 5.4 3.4 83.20 2 15 76 2 1 1 0 5 6.0 5.7 111.00 9 13 75 2 2 2 1 5 10.1 23.4 203.00 ;

The data sets produced by each of these programs are the same. However, imagine that instead of 3 variables that needed to be transformed, there were many more. Multidimensional arrays are also allowed simply by specifying multiple subscript sizes.

2.4.1 Univariate and Multivariate Data Views In many instances measurements are made at the same location or on the same individuals through time. These repeated measures data can be viewed and analyzed using both univariate and multivariate approaches. In the multivariate approach, each measurement made on the same individual is treated as a dierent variable, while in the univariate approach, each measurement is treated as a separate observation made on the same individual. In the latter case, one variable is used to identify the individual while another is used to hold the value of the measurement. Below we will treat the Flack lake trout data in both multivariate and univariate formats. This data set has catch data reported on trout aged 3 through 9 years old for the years 1968 through 1979. First consider the multivariate view of the data where each catch number for each age is kept in separate variables. Title2 "Multivariate Data View"; Data TroutM; /* Flack Lake Trout Catch Data*/ Input Year Age3-Age9; Datalines; 1968 13 129 646 954 99 1969 19 169 416 1031 243 1970 40 354 606 479 152 1971 32 606 1424 644 157 1972 0 226 1178 1156 116 1973 2 165 593 982 428 1974 53 209 560 410 30 1975 0 105 674 446 16 1976 46 422 838 726 70 1977 3 310 1224 1068 65 1978 14 354 1264 1172 69 1979 6 429 1222 1067 192 ; Proc Print Data=TroutM; Run;

19 47 18 23 16 22 0 2 4 0 0 0

4 18 7 17 5 11 4 2 4 0 6 0

CHAPTER 2. THE DATA STEP

28

Data Step Examples Multivariate Data View OBS

YEAR

AGE3

1 2 3 4 5 6 7 8 9 10 11 12

1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979

13 19 40 32 0 2 53 0 46 3 14 6

AGE4

AGE5

AGE6

AGE7

129 169 354 606 226 165 209 105 422 310 354 429

646 416 606 1424 1178 593 560 674 838 1224 1264 1222

954 1031 479 644 1156 982 410 446 726 1068 1172 1067

99 243 152 157 116 428 30 16 70 65 69 192

AGE8

AGE9

19 47 18 23 16 22 0 2 4 0 0 0

4 18 7 17 5 11 4 2 4 0 6 0

Now, let's rearrange the data into the univariate view, where one variable will contain the age of the catch and another will contain the number caught of that age. We will start with the raw data and use a loop to read in the data. Title2 "Multivariate To Univariate Data View I"; Data TroutU; /* Flack Lake trout Catch Data*/ Input Year @; Do Age=3 To 9; Input Number @; Output; End; Datalines; 1968 13 129 646 954 99 19 4 1969 19 169 416 1031 243 47 18 1970 40 354 606 479 152 18 7 1971 32 606 1424 644 157 23 17 1972 0 226 1178 1156 116 16 5 1973 2 165 593 982 428 22 11 1974 53 209 560 410 30 0 4 1975 0 105 674 446 16 2 2 1976 46 422 838 726 70 4 4 1977 3 310 1224 1068 65 0 0 1978 14 354 1264 1172 69 0 6 1979 6 429 1222 1067 192 0 0 ; Proc Print Data=TroutU(Obs=22); Run;

CHAPTER 2. THE DATA STEP

29

Data Step Examples Multivariate To Univariate Data View I OBS

YEAR

AGE

NUMBER

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22

1968 1968 1968 1968 1968 1968 1968 1969 1969 1969 1969 1969 1969 1969 1970 1970 1970 1970 1970 1970 1970 1971

3 4 5 6 7 8 9 3 4 5 6 7 8 9 3 4 5 6 7 8 9 3

13 129 646 954 99 19 4 19 169 416 1031 243 47 18 40 354 606 479 152 18 7 32

Here we only listed the rst 22 observations of the data set to illustrate the format of the univariate view. Title2 "Multivariate To Univariate Data View II"; Data TroutU; Drop Age3-Age9; Array Ages{3:9} Age3-Age9; Set TroutM; Do Age=3 To 9; Number=Ages(Age); Output; End; Run; Proc Print Data=TroutU; Run;

This data and print step will produce the same output as the previous one. In this instance we are accessing a SAS data set that already has the data in the multivariate view. Notice that the ARRAY statement speci es the beginning and ending value of the array index (3 and 9). Notice also that the DROP statement is needed to remove the \old" variables AGE3 through AGE9 from the new data set.

PROC TRANSPOSE Before leaving this section it is worth looking at a procedure developed to convert between the univariate and multivariate data views. Although not appropriate for all problems, it can be very useful for many. PROC TRANSPOSE takes an input data set, and based upon

CHAPTER 2. THE DATA STEP

30

some structure commands, creates a new data set with a dierent con guration. For this example we will again use the lake trout data and we will input it into the multivariate view. Then PROC TRANSPOSE will be called to convert it to the univariate view. The DATA= and OUT= options on the PROC TRANSPOSE statement specify the input and output SAS data sets, respectively. The VAR statement lists the variables that will be transposed. The BY statement instructs PROC TRANSPOSE to treat each year separately, i.e., we want to transpose all of the values of the speci ed variables before moving on to the next year. Title2 "PROC TRANSPOSE"; Data TroutM; /* Flack Lake Trout Catch Data*/ Input Year Age3-Age9; Datalines; 1968 13 129 646 954 99 19 1969 19 169 416 1031 243 47 1970 40 354 606 479 152 18 1971 32 606 1424 644 157 23 1972 0 226 1178 1156 116 16 1973 2 165 593 982 428 22 1974 53 209 560 410 30 0 1975 0 105 674 446 16 2 1976 46 422 838 726 70 4 1977 3 310 1224 1068 65 0 1978 14 354 1264 1172 69 0 1979 6 429 1222 1067 192 0 ; Proc Transpose Data=TroutM Out=TroutU; By Year Notsorted; Var Age3-Age9; Run; Proc Print Data=TroutU; Run;

4 18 7 17 5 11 4 2 4 0 6 0

Data Step Examples PROC TRANSPOSE OBS

YEAR

_NAME_

COL1

1 2 3 4 5 6 7 8 9 10 11 12 13 14

1968 1968 1968 1968 1968 1968 1968 1969 1969 1969 1969 1969 1969 1969

AGE3 AGE4 AGE5 AGE6 AGE7 AGE8 AGE9 AGE3 AGE4 AGE5 AGE6 AGE7 AGE8 AGE9

13 129 646 954 99 19 4 19 169 416 1031 243 47 18

Only the rst 14 observations are listed. The NAME variable contains the names of the original variables, while the COL1 variable contains the values that those variables held. There are some options on the PROC TRANSPOSE procedure line that can be used to make the variable names more attractive. However, we will use the data step below to make the data set look like one that we might have created from scratch if our original intent was

CHAPTER 2. THE DATA STEP

31

to have a univariate view. This will make the reverse transpose to follow more realistic looking. Note the use of the INPUT() and SUBSTR() functions to convert values such as AGE4 into a numeric value, here, 4. Then we dropped the NAME variable as it is no longer needed. We also renamed the COL1 variable to NUMBER. /* Make the data set look more like one that would have come from reading the data directly into the univariate view. This makes the example a little more realistic. */ Data TroutU; Set TroutU; Age=Input(Substr(_NAME_,4),2.); Rename Col1=Number; Drop _NAME_; Run; Proc Print Data=TroutU; Run; Data Step Examples PROC TRANSPOSE OBS

YEAR

NUMBER

AGE

1 2 3 4 5 6 7 8 9 10 11 12 13 14

1968 1968 1968 1968 1968 1968 1968 1969 1969 1969 1969 1969 1969 1969

13 129 646 954 99 19 4 19 169 416 1031 243 47 18

3 4 5 6 7 8 9 3 4 5 6 7 8 9

Again, only the rst 14 observations are listed here. In transposing the data set back, we would like to have the same variable names as before. This would have been easy had we not adjusted the data set as above. However, that is not particularly realistic, since you would not likely transpose a data set and then reverse the transpose exactly as given later. Since variable names cannot be numbers, we need to have a way to handle the numeric values for AGE. PROC TRANSPOSE will use formats when available and this will be our way out. For a very large number of levels of AGE, it might be more productive to construct a character variable (like NAME ) containing the names of the variables that we wish to create. The ID statement gives the variable containing the names of the new variables. Since we used a FORMAT statement for AGE, the formatted values will be used.

CHAPTER 2. THE DATA STEP

32

Proc Format; Value Ages 0="AGE0" 1="AGE1" 2="AGE2" 3="AGE3" 4="AGE4" 5="AGE5" 6="AGE6" 7="AGE7" 8="AGE8" 9="AGE9"; Proc Transpose Data=TroutU Out=TroutM; By Year Notsorted; Var Number; Id Age; Format Age Ages.; Run; Proc Print Data=TroutM; Run; Data Step Examples PROC TRANSPOSE OBS

YEAR

_NAME_

AGE3

1 2 3 4 5 6 7 8 9 10 11 12

1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979

NUMBER NUMBER NUMBER NUMBER NUMBER NUMBER NUMBER NUMBER NUMBER NUMBER NUMBER NUMBER

13 19 40 32 0 2 53 0 46 3 14 6

AGE4

AGE5

AGE6

AGE7

129 169 354 606 226 165 209 105 422 310 354 429

646 416 606 1424 1178 593 560 674 838 1224 1264 1222

954 1031 479 644 1156 982 410 446 726 1068 1172 1067

99 243 152 157 116 428 30 16 70 65 69 192

AGE8

AGE9

19 47 18 23 16 22 0 2 4 0 0 0

4 18 7 17 5 11 4 2 4 0 6 0

With the exception of the NAME variable, the data set looks like the data set we started with. Thus, it is relatively easy to go either direction with PROC TRANSPOSE when dealing with univariate and multivariate views.

2.4.2 Indeterminant DO Loops There are occasions when a loop is needed, but it is not known in advance of the loop, how many iterations will be needed. This is usually determined from the data themselves, or by some other mechanism that triggers the end of the looping process. The SAS data step has DO WHILE and DO UNTIL statements to handle indeterminant loops. The DO UNTIL loop is always executed at least once while the DO WHILE loop is executed only if the WHILE condition is met. The end of the scope of the loop is given by an END statement. As an example, consider a data set that has as physical data lines, a part-time employee's name and a list of the hours worked by the employee. We would like to create a data set that gives the employee name and hours such that each hour value is treated as a separate observation. Since the number of values on each physical line is unknown, we we read until there is nothing left to read on the line. The MISSOVER option is used to keep the pointer from moving to the next line to read new data.

CHAPTER 2. THE DATA STEP

33

Title2 "DO UNTIL Loop"; Data Pay; Length Employee $15; Infile Datalines MISSOVER; Input Employee & Hours @; Do Until (Hours=.); Output; Input Hours @; End; Datalines; Bob Jones 3.5 8 8 2 3 Erin Walsh 6 7.5 1.5 Tom N. Smith 2 3 3 1 6 7 3 ; Proc Print; Run; Data Step Examples DO UNTIL Loop OBS

EMPLOYEE

HOURS

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Bob Jones Bob Jones Bob Jones Bob Jones Bob Jones Erin Walsh Erin Walsh Erin Walsh Tom N. Smith Tom N. Smith Tom N. Smith Tom N. Smith Tom N. Smith Tom N. Smith Tom N. Smith

3.5 8.0 8.0 2.0 3.0 6.0 7.5 1.5 2.0 3.0 3.0 1.0 6.0 7.0 3.0

The use of the single @ sign keeps the pointer on the same line until all data for an employee have been read. Then the loop exits and the data step begins again, but because the single @ sign was used, the previous line gets discarded and a new one is worked on.

2.5 The NULL Data Set In some problems the power of the SAS data step is needed, but no new or changed SAS data set will be produced. This happens, for example, when the data step is used to produce a report, for system management tasks such as part of an interactive program, or to obtain information to be used by the macro language. The reserved data set name _NULL_ is used as the name for the \new" data set when one is not wanted. As an example with the ier data set, let's say that we wanted to list all observations in a data set for which the sex of the sh is female and that we wanted the average weight of these sh printed at the end of the report.

CHAPTER 2. THE DATA STEP

34

Title2 "Null Data Set"; Data _NULL_; File Print; Infile Datalines EOF=EOF; If _N_=1 Then Put " DATE WEIGHT"; Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL; If Sex=2; Date=MDY(Mo,Day,Yr); Put Date MMDDYY8. +2 Wt 8.1; N+1; TotWt+Wt; Return; Eof: AveWt=TotWt/N; Put // "Average Weight =" AveWt; Return; Datalines; 7 21 74 2 7 3 0 5 3.5 1.4 57.00 1 9 76 2 3 2 0 1 5.3 3.0 0.00 12 18 74 2 4 1 0 5 5.4 3.4 83.20 2 15 76 2 1 1 0 5 6.0 5.7 111.00 9 13 75 2 2 2 1 5 10.1 23.4 203.00 ... lines omitted ... ;

DATE 01/09/76 09/13/75 04/18/76 01/28/75 04/18/76 04/18/75 03/22/75 04/19/75 06/24/75

WEIGHT 3.0 23.4 16.8 40.9 17.9 47.9 101.6 76.3 125.8

Data Step Examples Null Data Set

Average Weight =50.4

This example also demonstrates several other features of the SAS language that have not yet been discussed. The 2 lines N+1; and TOTWT+WT; are called sum statements. They take the value or value of the variable to the right of the plus sign and add it to the value of the variable to the left of the plus sign. They are not exactly like, for example, N=N+1; because when the data step loop begins again, unless N is retained it will lose its value. This does not happen with variables in sum statements. They are automatically retained. Note also the strange IF SEX=2; statement that appears to be missing the THEN clause. This statement is equivalent to the one IF SEX NE 2 THEN DELETE;. It is called a subsetting IF statement. Lastly is the data step \subroutine." The data step does not have subroutines that are separated from the data step, rather they are contained within it and all variables are global to the main and subroutine sections. These \subroutines" behave like goto statements, but control can return to the code immediately following the call. These subroutines are called from input/output coditions, such as above where the condition is an \End of File"

CHAPTER 2. THE DATA STEP

35

condition and the INFILE option EOF is used to point to a subroutine to be executed when the condition is met, or from LINK and GOTO statements. If the LINK statement is used, control returns to the spot immediately following the LINK statement, while the GOTO statement simply redirects program execution through the subroutine and control usually returns to the top of the data step loop. Note the use of the RETURN statement. The rst use prevents execution from continuing into the subroutine, while the second marks the end of the subroutine. Had the rst RETURN been left out, a \running" average of the sh weights would have been generated for the females.

2.6 Data Step Examples 2.6.1 Simple Random Sampling Without Replacement This example will demonstrate how to take a simple random sample (SRS) without replacement from a frame or list of the population under study. Here, we will treat our ier data set as a population and take a simple random sample from it. For real problems, the frame might be a list of names and addresses of persons on a license sales list. Those selected from the list would then be sent a questionnaire. The example demonstrates several features covered in this chapter, as well as showing some very simple macro language features. Title2 "Simple Random Sample"; Data FLIER; Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL; Datalines; ... data go here ... ; %Let Size=12; /* Define The Sample Size */ Data SRS; Drop Seed K NN Prob; Retain Seed 123459; /* Random Number Seed */ Retain K &Size; /* Sample Size */ NN=N; /* Population Size */ Call Symput("PopSize",Trim(Left(Put(NN,20.)))); I=0; Do Until ((I>=N) or (K=0)); I+1; Set FLIER Point=I NObs=N; Prob=K/NN; If Ranuni(Seed) <= Prob Then /* Select This One */ Do; Output; K=K-1; /* We Need One Less In The Sample */ End; NN=NN-1; /* We Have One Less To Choose From */ End; Stop; /* We Are Done */ Run; Title3 "Sample of Size &Size From a Population of Size &Popsize"; Proc Print Data=SRS; Run;

CHAPTER 2. THE DATA STEP

36

Data Step Examples Simple Random Sample Sample of Size 12 From a Population of Size 25 OBS

MO

DAY

YR

AR

ST

SEX

AGE

SN

LT

1 2 3 4 5 6 7 8 9 10 11 12

7 1 2 9 1 4 1 4 2 4 2 1

21 9 15 13 28 18 28 18 23 18 8 28

74 76 76 75 75 76 75 76 75 75 75 75

2 2 2 2 2 3 2 3 2 2 2 2

7 3 1 2 3 3 3 2 3 1 3 3

3 2 1 2 1 1 1 2 1 2 1 1

0 0 0 1 1 2 2 3 3 4 4 6

5 1 5 5 5 5 5 5 5 5 4 1

3.5 5.3 6.0 10.1 13.5 11.9 16.0 9.1 17.2 12.1 17.5 18.5

WT 1.4 3.0 5.7 23.4 33.2 44.4 103.4 17.9 127.1 47.9 134.4 131.6

TSL 57.00 0.00 111.00 203.00 242.00 244.00 307.80 171.40 346.00 266.60 357.75 409.00

This particular method uses unequal probability methods in the selection of each element into the sample, but because each sample of size 12 has the same probability of selection as every other sample of size 12, the sample is a simple random sample. Since we do not know when the 12 elements will be selected, it could be the rst 12 or the last 12 observations, we use an indeterminant loop, the DO UNTIL loop. Note also that the data step loop is executed only once here. Otherwise, multiple samples of size 12 would have been taken. The STOP statement is used to keep the data step from looping again by simply stopping execution of the data step.

2.6.2 Data Recoding Reversing a Likert scale A Likert scale is often used in questionnaires to measure a degree of belief or agreement with an idea or statement. If one wanted to measure the degree of satisfaction with a shing trip, a questionnaire might ask Would you consider this to be the best shing trip that you have had in the last 5 years? 1=Strongly Disagree 2=Disagree Somewhat 3=Neutral 4=Agree Somewhat 5=Strongly Agree A typical way of analyzing a collection of such types of questions asked of the same individuals is to create a \scale" which is often the simple sum of the scores (1-5) over each question. However, in many questionnaires, some questions may be worded in such a way that a 5 means \positive" or \agree", while for others, a 5 means \negative" or \disagree".

CHAPTER 2. THE DATA STEP

37

Thus, some questions may need the Likert scale reversed. This is very easily accomplished using a mathematical transformation. Let's make the example more interesting by stating that variables Q1, Q20, Q32, and Q184 need reverse coding. Data One; Input Subject Q1-Q200; Array Reverse {*} Q1 Q20 Q32 Q184; Do I=1 To 4; Reverse(I)=6-Reverse(I); End; Datalines; data follow here

Using formats For some problems the data recoding can be particularly complicated. Either a simple mathematical transformation is not possible (as could be done above), or one may need to convert between numeric and character data. The IF-THEN-ELSE statements or CASE expression can be used for these purposes, but often require a lot of programming. Often times the PUT() and INPUT() functions can be used to simplify this process. Assume that we have a data set on vegetation collected from transects run across a marsh in which the species of plant and coverage along a 5m stick are recorded. To facilitate data entry, only a 2-character abbreviation for the species is input. However, the scientist would like the complete name in the data set. The example below illustrates one approach to accomplishing this. Title2 "Data Recoding"; Proc Format; Value $Veg "wg"="Wire Grass" "sp"="Spartina patens" "br"="Bull Rush" "wi"="Widgeon Grass"; Run; Data One; Input Sp $ Coverage @@; Length Species $20.; Species=Put(Sp,$Veg.); Drop Sp; Datalines; wi 0.4 wi 0.6 wg 1.2 br 0.2 sp 4.8 ; Proc Print Data=One; Run; Data Step Examples Data Recoding OBS

COVERAGE

1 2 3 4 5

0.4 0.6 1.2 0.2 4.8

SPECIES Widgeon Grass Widgeon Grass Wire Grass Bull Rush Spartina patens

Chapter 3

Working With Files The SAS System works with data stored in a special format called a SAS data set. Although it is possible for SAS to work with data in other program formats using SAS/VIEWS and \data engines", it is much more common to work with data in the SAS data set format. The SAS data set is specially formatted to permit the SAS System to quickly and easily work with the data. For example, SAS procedures can determine whether the data have been sorted without rst having to read the entire data set.

3.1 External Files Often the rst step in a SAS program is the input of \raw" data into a SAS data set for analysis. Later in the program, we may wish to create an external le or report for use elsewhere. For these actions we must learn about some of the interfaces that the SAS system has with external (non-SAS) les. For input into a data step, external les can typically be referenced using several dierent methods, some of which depend upon the platform. On the WIN95 and Unix platforms, the lename for the \raw" data can simply be placed in the INFILE statement. Assume that the le of interest is named raw.dat and is stored in the subdirectory fish. On the WIN95 platform we could use the statement Infile "c:\fish\raw.dat";

while on the Unix platform it might be referenced as Infile "fish/raw.dat";

where the fish subdirectory is relative to our current working directory on Unix. Alternatively, the FILENAME statement can be used to link a le reference to the actual le. This can make programs much easier to use on multiple platforms and to update and modify later. The basic statement looks like FILENAME

leref "path-and- le-name";

Using the WIN95 le given earlier a skeleton program would look like 38

CHAPTER 3. WORKING WITH FILES

39

FILENAME fish "c:\fish\raw.dat"; Data one; Infile fish; Input ....; Run;

Notice that the INFILE statement is still needed, but it speci es the le reference rather than the actual le name. The FILENAME statement can also be used to reference more than one le in a subdirectory. Study the following example. FILENAME fish "c:\fish"; /* specify the subdirectory only */ Data Halibut; Infile Fish(Halibut.dat); Input ...; Run; Data Coho; Infile Fish(Coho.dat); Input ...; Run;

The SAS system assumes that data les will have the extension .DAT, and so the extension can be dropped from the INFILE statement. Thus the code could have been written, FILENAME fish "c:\fish"; /* specify the subdirectory only */ Data Halibut; Infile Fish(Halibut); Input ...; Run; Data Coho; Infile Fish(Coho); Input ...; Run;

An example of these methods is shown for the ier data set. The orignal data set was also broken into 2 pieces called FLIERS1.DAT and FLIERS2.DAT, and the variable headings were removed from each of these two data sets. To better illustrate the results, the SAS log for the code is shown. The rst method uses only the INFILE statement. 238 239 240 241 242

Title2 "External File Reference With Infile"; Data Flier; Infile "c:\projects\alaska97\flier.dat" Firstobs=2; Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL; Run;

NOTE: The infile "c:\projects\alaska97\flier.dat" is: FILENAME=c:\projects\alaska97\flier.dat, RECFM=V,LRECL=256 NOTE: 664 The The NOTE: The NOTE: The 243 244 245

records were read from the infile "c:\projects\alaska97\flier.dat". minimum record length was 102. maximum record length was 102. data set WORK.FLIER has 664 observations and 11 variables. DATA statement used 0.77 seconds.

Proc Print Data=Flier(Obs=10); Run;

NOTE: The PROCEDURE PRINT used 0.11 seconds.

CHAPTER 3. WORKING WITH FILES

40

Next, the le reference is moved to the FILENAME statement. 247 248 249 250 251 252

Title2 "FILENAME File Reference"; Filename Fish "c:\projects\alaska97\flier.dat"; Data Flier; Infile Fish Firstobs=2; Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL; Run;

NOTE: The infile FISH is: FILENAME=c:\projects\alaska97\flier.dat, RECFM=V,LRECL=256 NOTE: 664 The The NOTE: The NOTE: The

records were read from the infile FISH. minimum record length was 102. maximum record length was 102. data set WORK.FLIER has 664 observations and 11 variables. DATA statement used 0.48 seconds.

The FILENAME statement can also be used to refer to directories. When directories are used, the INFILE statement is then used to select the member to process. This method also permits several directories to be concatenated together to search for les, and the same le reference can be used for multiple les, and for reading and writing. 254 255 256 257 258 259

Title2 "FILENAME Directory Reference"; Filename Fish "c:\projects\alaska97"; Data Flier; Infile Fish(Flier.dat) Firstobs=2; Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL; Run;

NOTE: The infile library FISH is: DIRECTORY=c:\projects\alaska97 NOTE: The infile FISH(Flier.dat) is: DIRECTORY=c:\projects\alaska97, MEMBERNAME=c:\projects\alaska97\Flier.dat, RECFM=V,LRECL=256 NOTE: A total of 664 records were read from the infile library FISH. The minimum record length was 102. The maximum record length was 102. NOTE: 664 records were read from the infile FISH(Flier.dat). The minimum record length was 102. The maximum record length was 102. NOTE: The data set WORK.FLIER has 664 observations and 11 variables. NOTE: The DATA statement used 0.59 seconds.

Since the SAS system assumes that data les end with .DAT, the sux can be dropped from the member name. However, for names that do not have a sux, the name should be enclosed within quotes. 261 262 263 264

Data Flier; Infile Fish(Flier) Firstobs=2; Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL; Run;

NOTE: The infile library FISH is: DIRECTORY=c:\projects\alaska97 NOTE: The infile FISH(Flier) is: DIRECTORY=c:\projects\alaska97,

CHAPTER 3. WORKING WITH FILES

41

MEMBERNAME=c:\projects\alaska97\Flier.DAT, RECFM=V,LRECL=256 NOTE: A total of 664 records were read from the infile library FISH. The minimum record length was 102. The maximum record length was 102. NOTE: 664 records were read from the infile FISH(Flier). The minimum record length was 102. The maximum record length was 102. NOTE: The data set WORK.FLIER has 664 observations and 11 variables. NOTE: The DATA statement used 0.66 seconds.

Sometimes the data that we would like to work with exists in more than one physical raw le. For example, the catch data might be kept in separate spreadsheets for each harvest year. To analyze the data together, the data sets must be concatenated. Below, the two

ier data sets (the original data set split into 2 pieces) will be input and concatenated using 3 dierent data steps. Note that the le reference from above is being reused. 266 267 268 269 270

Title2 "Data Set Concatenation of Files"; Data Flier1; Infile Fish(Fliers1); Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL; Run;

NOTE: The infile library FISH is: DIRECTORY=c:\projects\alaska97 NOTE: The infile FISH(Fliers1) is: DIRECTORY=c:\projects\alaska97, MEMBERNAME=c:\projects\alaska97\Fliers1.DAT, RECFM=V,LRECL=256 NOTE: A total of 298 records were read from the infile library FISH. The minimum record length was 102. The maximum record length was 102. NOTE: 298 records were read from the infile FISH(Fliers1). The minimum record length was 102. The maximum record length was 102. NOTE: The data set WORK.FLIER1 has 298 observations and 11 variables. NOTE: The DATA statement used 0.44 seconds. 271 272 273 274

Data Flier2; Infile Fish(Fliers2); Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL; Run;

NOTE: The infile library FISH is: DIRECTORY=c:\projects\alaska97 NOTE: The infile FISH(Flier2) is: DIRECTORY=c:\projects\alaska97, MEMBERNAME=c:\projects\alaska97\Fliers2.DAT, RECFM=V,LRECL=256 NOTE: A total of 366 records were read from the infile library FISH. The minimum record length was 102. The maximum record length was 102. NOTE: 366 records were read from the infile FISH(Fliers2). The minimum record length was 102. The maximum record length was 102.

CHAPTER 3. WORKING WITH FILES

42

NOTE: The data set WORK.FLIER2 has 366 observations and 11 variables. NOTE: The DATA statement used 0.44 seconds. 275 276 277

Data Flier; Set Flier1 Flier2; Run;

NOTE: The data set WORK.FLIER has 664 observations and 11 variables. NOTE: The DATA statement used 0.5 seconds.

As the FILENAME statement can concatenate directories together, it can also concatenate physical les together. The method is to put the list of lenames within parenthesis, each separated by a comma. 279 Title2 "FILENAME Concatenation of Files"; 280 Filename Fish ("c:\projects\alaska97\fliers1.dat","c:\projects\alaska97\fliers2.dat"); 281 Data Flier; 282 Infile Fish; 283 Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL; 284 Run; NOTE: The infile FISH is: FILENAME=c:\projects\alaska97\fliers1.dat, RECFM=V,LRECL=256 NOTE: The infile FISH is: FILENAME=c:\projects\alaska97\fliers2.dat, RECFM=V,LRECL=256 NOTE: 298 The The NOTE: 366 The The NOTE: The NOTE: The

records were read from the infile FISH. minimum record length was 102. maximum record length was 102. records were read from the infile FISH. minimum record length was 102. maximum record length was 102. data set WORK.FLIER has 664 observations and 11 variables. DATA statement used 0.66 seconds.

File references can also be listed as 66 Title2 "List All Defined File References"; 67 Filename _ALL_ List; NOTE: Fileref= FISH Physical Name= c:\projects\alaska97\Fliers2.dat c:\projects\alaska97\Fliers1.dat NOTE: Fileref= TMP1 Physical Name= C:\PROJECTS\Alaska97\Files.sas

and can be cleared using 69 Title2 "Clear The FISH File Reference"; 70 Filename Fish Clear; NOTE: Fileref FISH has been deassigned.

Clearing a le reference can free up some memory, but usually is important with complicated programs to insure that an important le is not written over, or the wrong le read as input, due to programming errors. Files can also be created and appended to using techniques similar to the above for input. Basically, de ne a le reference to receive the data and use the FILE statement to direct

CHAPTER 3. WORKING WITH FILES

43

output to the reference. The example below rst writes the FLIER1 data to an external ascii le, then next, this could be at some latter time for example, appends the FLIER2 data to the same external le. 72 73 74 75 76 77 78

Title2 "Creating An External File"; Filename Fish "c:\projects\alaska97"; Data _NULL_; File Fish(Flier1N2); Set Flier1; Put (Mo Day Yr) (3.) (Ar St Sex Age Sn) (2.) (Lt Wt TSL) (7.3); Run;

NOTE: The file library FISH is: DIRECTORY=c:\projects\alaska97 NOTE: The file FISH(Flier1N2) is: DIRECTORY=c:\projects\alaska97, MEMBERNAME=c:\projects\alaska97\Flier1N2.DAT, RECFM=V,LRECL=256 NOTE: A total of 298 records were written to the file library FISH. The minimum record length was 40. The maximum record length was 40. NOTE: 298 records were written to the file FISH(Flier1N2). The minimum record length was 40. The maximum record length was 40. NOTE: The DATA statement used 0.6 seconds. 80 81 82 83 84 85

Title2 "Appending To An Existing External File"; Data _NULL_; File Fish(Flier1N2) Mod; Set Flier2; Put (Mo Day Yr) (3.) (Ar St Sex Age Sn) (2.) (Lt Wt TSL) (7.3); Run;

NOTE: The file library FISH is: DIRECTORY=c:\projects\alaska97 NOTE: The file FISH(Flier1N2) is: DIRECTORY=c:\projects\alaska97, MEMBERNAME=c:\projects\alaska97\Flier1N2.DAT, RECFM=V,LRECL=256 NOTE: A total of 366 records were written to the file library FISH. The minimum record length was 40. The maximum record length was 40. NOTE: 366 records were written to the file FISH(Flier1N2). The minimum record length was 40. The maximum record length was 40. NOTE: The DATA statement used 0.44 seconds.

A sample of what the output data set looks like is given below. 7 1 12 2 12

21 9 18 15 14

74 76 74 76 75

2 2 2 2 2

7 3 4 1 1

3 2 1 1 1

0 0 0 0 0

5 1 5 5 5

3.500 5.300 5.400 6.000 6.100

1.400 57.000 3.000 0.000 3.400 83.200 5.700111.000 6.600109.000

Notice the use of the grouped formats to associate a single format with several dierent variables that appear together. A slightly larger eld width for TSL appears needed. To

CHAPTER 3. WORKING WITH FILES

44

append the second data set to the rst, it was necessary to specify the MOD option on the FILE statement.

3.1.1 FTP Access The FILENAME statement can also be used to access, modify, and create information stored or to be stored on other computers using TCP/IP network methods. The FTP option can be used to open an FTP ( le transfer protocol) connection with another computer. Data can then be read from the FTP server or can be written to the FTP server, just as if the le were stored on the local machine. The syntax for writing the le flier.dat to the Unix1 computer at LSU for user bill and to prompt for the password would be Title2 "FTP Access To Write A File"; Filename Out FTP "flier.dat" /* Name of data set host="unix1.sncc.lsu.edu" /* Host name user="bill" /* User login name prompt /* Prompt for password cd="/u/bill/alaska/stuff" /* Change directory first rcmd="ascii" /* Remote command to exec recfm=v; /* User variable record length Data _NULL_; File Out; Set Flier1; Put (Mo Day Yr) (3.) (Ar St Sex Age Sn) (2.) (Lt Wt TSL) (7.3); Run;

*/ */ */ */ */ */ */

This technique can also be used to input not only les via FTP, but to retrieve any information that can be obtained via FTP. To get a directory listing from the remote machine above, we might write Title2 "FTP Access To List A Directory"; Filename Dir FTP " " ls host="unix1.sncc.lsu.edu" user="bill" prompt cd="/u/bill/alaska/stuff"; Data _NULL_; File Log; Infile Dir; Input; Put _INFILE_; Run;

/* /* /* /* /* /*

Null data set (required) FTP command Host name User login name Prompt for password Change directory first

*/ */ */ */ */ */

The data step simply reads then writes the directory listing to the SAS Log. Note that the data could have been placed into a variable or variables, and then processed using the data step.

CHAPTER 3. WORKING WITH FILES

3.1.2 WWW Access

45

Not to be outdone by the FTP access method, the FILENAME statement can also reference a URL (uniform resource locator), otherwise known as a web page. The page might be a hypertext document or might be a le being distributed via a WWW server. In the example below, we will assume that the ier data set is stored at the URL http://www.stat.lsu.edu/faculty/moser/flier.dat

and we wish to read these data into our program. Title2 "HTTP Access To Read A File"; Filename Fish URL "http://www.stat.lsu.edu/faculty/moser/flier.dat"; Data Flier; Infile Fish Firstobs=2; Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL; Run;

Some important aspects of the FTP and URL access, is that data may be stored in some \distant" location, yet reachable by network, and be accessed at runtime. This could be very important for real-time data, such as from data loggers, or can be important when many people are accessing the data and and single, master copy is needed. From a systems programmer perspective, it means that the SAS system can be used to develop an interface to the internet.

3.2 Including External SAS Code As you will begin to see later, we may wish to reuse the same SAS code in many dierent programs. This may be especially true of SAS macros that we may have written to solve various problems. In other cases, the program that we are working on is very large and we would like to maintain the program in separate pieces or les. SAS code external to the currently executing SAS session can be included into the session using the %INCLUDE statement. The argument to the statement can either be a le name or a le reference. Notice that if a le reference is used, then all of the powerful methods described above for reading in data are available. It is assumed that SAS program le names end with the .sas extension. A couple of examples are given below. %Include "c:\projects\alaska97\fishdemo.sas" / Nosource2; Filename Fishdemo URL "http://www.stat.lsu.edu/faculty/moser/fishdemo.sas"; %Include Fishdemo;

The NOSOURCE2 option to the %INCLUDE statement is used to suppress printing the included code to the SAS log.

3.3 The SAS Data Library SAS data sets are stored in SAS libraries. When the SAS System was initially being developed on the IBM mainframes, a SAS library was a specially formatted MVS data set

CHAPTER 3. WORKING WITH FILES

46

whose members were SAS data sets. Now, a SAS library may be nothing more than a reference to a subdirectory in a lesystem on a PC or a Unix workstation, and may contain much more than SAS data sets. The idea behind the library is to organize and reference a collection of similar objects, such as data sets, in a common way. If I have several dierent data sets on shing eort, say one data set for each year since 1975, I may want to keep the data in each year separated, but be able to nd all of the data very easily and be able to combine any or all of it very easily as well. On a Windows PC you can identify the SAS data sets by the sux \ssd", while on an AIX Unix workstation the les have the sux \ssd001". Because they are in a special format, they are not to be read or edited using non-SAS tools.

3.3.1 The LIBNAME Statement The LIBNAME statement is used to specify the name and location of a SAS library. It has the basic format LIBNAME libname "path";

where \libname" is some 1-8 character name and \path" is the path in the lesystem where the library is to be kept. Upon startup, the SAS System internally issues a LIBNAME statement de ning a library named WORK which it uses as the default library. By default, at the end of the SAS session, the WORK library is cleaned of all SAS les. Thus, to keep any SAS data sets permanent, you should store them in a library that you specify. To do this, a two-level data set name is used in the SAS program. A two-level name has two names which are separated by a period. The rst name is the library name de ned in the LIBNAME statement, and the second name is the member or data set name. /* Point to the subdirectory containing the raw effort data */ FILENAME raw "c:\temp\effort97.csv"; /* Point to the SAS Library to store the effort data */ LIBNAME effort "c:\fishing\effort"; DATA effort.year1997; LENGTH location $30.; INFILE raw delimiter=","; INPUT date mmddyy8. location & effort; RUN;

Here, the raw data are input into the SAS data set effort.year1997, which can be used later in the same program or in some future program for analysis, without the need for the above data step. For example, the SAS code below could be used to print the data set at some future time. LIBNAME effort "c:\fishing\effort"; PROC PRINT DATA=effort.year1997; RUN;

CHAPTER 3. WORKING WITH FILES

47

3.3.2 Library Procedures

There are several procedures for working with SAS libraries. The procedures permit update of the information about a data set and utility operations such as copying, deletion or renaming a member. You can also append one member to another.

PROC DATASETS does many things related to the management of data sets. The usual way to operate with this procedure is in full-screen mode (the default). In full-screen mode you can easily modify parameters of the data set, copy, rename, and delete data sets, and list various parameters associated with them. A couple of these tasks are shown below in a non-full-screen session. PROC DATASETS

1777 1778

Title2 "Proc Datasets"; Proc Datasets Library=Work Nofs; -----Directory----Libref: WORK Engine: V612 Physical Name: C:\SAS\SASWORK\#TD16077 #

Name

Memtype

Indexes

1779 1780

1 CURSTAT CATALOG 2 FLIER DATA 3 FLIER1 DATA 4 FLIER2 DATA 5 SASMACR CATALOG Modify Flier(Label="James Geaghan Flier Data Set"); Run;

1781 1782

Delete Flier1; Run;

NOTE: Deleting WORK.FLIER1 (memtype=DATA). 1783 Quit;

CHAPTER 3. WORKING WITH FILES

48

PROC CONTENTS The PROC CONTENTS procedure produces a printout of information about the structure of a data set, such as the variable names, types, lengths, labels, and formats, and the number of observations in the data set, etc. This listing can also be produced from within PROC DATASETS. 1785 1786 1787

Title2 "Proc Contents"; Proc Contents Data=Work.Flier; Run; Proc Contents CONTENTS PROCEDURE

Data Set Name: Member Type: Engine: Created: Last Modified: Protection: Data Set Type: Label:

WORK.FLIER DATA V612 18:55 Thursday, September 25, 1997 18:55 Thursday, September 25, 1997 James Geaghan Flier Data Set

Observations: Variables: Indexes: Observation Length: Deleted Observations: Compressed: Sorted:

664 11 0 88 0 NO NO

-----Engine/Host Dependent Information----Data Set Page Size: Number of Data Set Pages: File Format: First Data Page: Max Obs per Page: Obs in First Data Page:

8192 8 607 1 92 73

-----Alphabetic List of Variables and Attributes----# Variable Type Len Pos -----------------------------------7 AGE Num 8 48 4 AR Num 8 24 2 DAY Num 8 8 9 LT Num 8 64 1 MO Num 8 0 6 SEX Num 8 40 8 SN Num 8 56 5 ST Num 8 32 11 TSL Num 8 80 10 WT Num 8 72 3 YR Num 8 16

This type of listing can be very important for documentation purposes and for nding out what is in a permanent SAS data set.

CHAPTER 3. WORKING WITH FILES

49

PROC COPY PROC COPY is used to copy SAS data set members (and catalogs) from one data library to another. This is useful for making backups. When combined with import/export engines, it can also be used to convert data from one form into another. The example below simply copies a data set from one library to another. 1789 Libname Perm "c:\temp"; NOTE: Libref PERM was successfully assigned as follows: Engine: V612 Physical Name: c:\temp 1790 Proc Copy In=Work Out=Perm; 1791 Select Flier; 1792 Run; NOTE: Copying WORK.FLIER to PERM.FLIER (MEMTYPE=DATA). NOTE: The data set PERM.FLIER has 664 observations and 11 variables.

PROC APPEND is normally used to append one SAS data set to another SAS data set called the base. If the base is a new data set, then a simple copy is used. In the example below, the FLIER2 data set is appended to the original FLIER data set. PROC APPEND

1795 1796 1797

Title2 "Proc Append"; Proc Append Base=Perm.Flier Data=Flier2; Run;

NOTE: Appending WORK.FLIER2 to PERM.FLIER. NOTE: 366 observations added. NOTE: The data set PERM.FLIER has 1030 observations and 11 variables.

PROC DELETE The DELETE procedure simply deletes a SAS data set. 1799 1800 1801

Title2 "Proc Delete"; Proc Delete Data=Perm.Flier; Run;

NOTE: Deleting PERM.FLIER (memtype=DATA).

CHAPTER 3. WORKING WITH FILES

50

PROC CATALOG The CATALOG procedure is necessary for working with SAS catalogs. Catalogs contain a wide variety of information. Rather than data, per se, they contain settings for various aspects of the SAS system. They may contain the help les used by SAS. They can contain information that you have created for a SAS/AF program. We will look at what might show up in as user's SAS Pro le. 2361 2362 2363 2364 2365

Title2 "Proc Catalog"; Proc Catalog Cat=sasuser.profile; Contents; Run; Quit; Files and Libraries Proc Catalog Contents of Catalog SASUSER.PROFILE

# Name

Type

Date

Description

1 2 3 4 5

AFGO KEYS SLIST SLIST WSAVE

09/08/97 04/06/97 09/25/97 06/25/97 09/25/97

Function Key Definitions User profile VIDEO: player preferences

AF DMKEYS PASSIST _VPLAY_ MRUWSAVE

Some of these catalogs might be accessible using functions found in PROC CATALOG, but typically the parameters that they contain are to be set using the programs that these catalogs are associated with (such as the Display Manager). This procedure is more commonly used in its interactive mode.

CHAPTER 3. WORKING WITH FILES

3.4 File Import/Export/Transport

51

3.4.1 Import/Export Many users enter their data using spreadsheets or data base software such as Microsoft Excel, Lotus 1-2-3, or Borland Paradox. If you have SAS/ACCESS installed, special import/export lters are available to directly access and use the data stored in several other software formats. A SAS/AF application is provided on the File->Import and File->Export pull-down menu that automates the programming of this task.

The task of importing an Excel spreadsheet containing age and growth information of Flier sun sh, courtesy of Dr. James Geaghan, LSU, where the rst non-blank row of the spreadsheet contains the variable names (SAS compatible) and the data follow in the remaining rows, is given below using PROC ACCESS. PROC ACCESS DBMS=EXCEL; CREATE work.xcell.ACCESS; PATH='C:\fishing\Flierdat.xls'; WORKSHEET='flier'; GETNAMES YES; SCANTYPE=YES; CREATE work.xcell.VIEW; SELECT ALL; RUN; DATA work.flier; SET work.xcell; RUN;

The PROC ACCESS step creates a view into the data set and speci es the particular worksheet to import, whether variable names are to be gotten from the worksheet, etc. The library name WORK is included to show that the data view is actually two catalog members within the libarary member XCELL. Note that the data step is the process that actually converts

CHAPTER 3. WORKING WITH FILES

52

the data to a SAS data set. It is possible to use the data directly from the spreadsheet by referencing the data view as in the SET statement. PROC PRINT DATA=work.xcell; RUN;

In general, this would not be the most ecient way to access the data as the \conversion" would need to be performed on each data access. It is also possible to export the data to other program formats. PROC DBLOAD is used below to convert the ier data set back into an Excel spreadsheet. PROC DBLOAD DBMS=EXCEL DATA=work.flier; PATH='C:\fishing\newflier.xls'; PUTNAMES YES; LIMIT=0; LOAD; RUN;

Also there are a number of other options for both PROC ACCESS and PROC DBLOAD that can be used to control the variable names, types, data ranges, and other input/output information. What if you do not have SAS/ACCESS? It is not too dicult to input a spreadsheet using the SAS data step. If the data contain no commas, then the \comma separated values" (CSV) format produced by most spreadsheets and many data bases is often convenient. The rst step is to save the spreadsheet into the CSV format (don't remove the original spreadsheet le, the CSV le is just an intermediate step). Next, write a SAS data step to input the data using an INFILE statement containing the DELIMITER="," option. As an example, let's assume that the ier data set has been saved as a CSV le called \ ierdat.csv". A portion of the data set is shown below. Mo,Day,Yr,Ar,St,Sex,Age,Sn,Lt,Wt,TSL,Size1,Size2,Size3,Size4,Size5,Size6,Edge,No 7,21,74,2,7,3,0,5,3.5,1.4,57.00,0.00,0.00,0.00,0.00,0.00,0.00,34.72,1 1,9,76,2,3,2,0,1,5.3,3.0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,2 12,18,74,2,4,1,0,5,5.4,3.4,83.20,0.00,0.00,0.00,0.00,0.00,0.00,47.79,3 2,15,76,2,1,1,0,5,6.0,5.7,111.00,0.00,0.00,0.00,0.00,0.00,0.00,60.97,4 12,14,75,2,1,1,0,5,6.1,6.6,109.00,0.00,0.00,0.00,0.00,0.00,0.00,60.04,5

Now we can write a SAS data step to read in these data. Note that we need to skip over the rst line of the data set as it contains the spreadsheet column headings. FILENAME in "flierdat.csv"; DATA flier; INFILE in FIRSTOBS=2 DELIMITER=","; INPUT Mo Day Yr Ar St Sex Age Sn Lt Wt TSL Size1 Size2 Size3 Size4 Size5 Size6 Edge No; RUN;

If character data were included in the data set, then appropriate \informats" would need to be used for reading those variables. The third-party software produce DBMS/COPY is designed to move data between a number of dierent data formats including SAS data sets. This introduces the need for a \third" program, but can greatly ease the data export/import process, especially when the same data are needed in several dierent formats.

CHAPTER 3. WORKING WITH FILES

53

The SAS/CONNECT product that interconnects a network of PC's, Unix workstations, and/or mainframes, can also be used to move SAS data sets from one machine type to another. This product contains PROC UPLOAD and PROC DOWNLOAD for easily moving SAS data sets, but they can also be used to handle ascii les as well. This product also permits you to execute code and work with data on several dierent platforms at the same time.

3.4.2 Transport To copy a SAS data set from one system to another may lead to trouble if the systems and SAS versions are not exactly compatible with one another. For example, the structure of a SAS data set on an IBM MVS mainframe is quite dierent from that on a WIN95 PC. To move or copy a SAS data set, a transport data library can be created. Since it is a library, it may contain more than one SAS data set. The CPORT and CIMPORT procedures are used, respectively, to create and import a transport data library. The process is to rst create the transport library, transport it to the other host system, then import the library. A real advantage to this methodology is that you can send someone a SAS data set and know exactly what is in the data set, while sending them a \raw" data set requires that they know how to read in the data correctly. In other instances, the data may not exist in its \raw" form, but may have been entered directly into a SAS data set. Below is an example that creates a transport library of the ier data set, then imports it as if we had moved the transport library to another system. First, let's create the transport data library. The LIBRARY argument speci es which SAS data library will be used, the FILE argument species the destination for the transport image, and the MT argument speci es what types of things from the SAS data library are to be transported. Here we speci ed that only SAS data sets are to be considered. 668 669 670 671

Title2 "Create A Transport Data Library"; Filename XPT "c:\projects\alaska97\flier.xpt"; Proc CPort Library=Work File=XPT MT=Data; Run;

NOTE: Proc CPORT begins to transport data set WORK.FLIER NOTE: The data set contains 11 variables and 664 observations. Logical record length is 88. NOTE: Proc CPORT begins to transport data set WORK.FLIER1 NOTE: The data set contains 11 variables and 298 observations. Logical record length is 88. NOTE: Proc CPORT begins to transport data set WORK.FLIER2 NOTE: The data set contains 11 variables and 366 observations. Logical record length is 88. 672 Filename XPT Clear; NOTE: Fileref XPT has been deassigned.

CHAPTER 3. WORKING WITH FILES

54

Now we can reverse the process using the CIMPORT procedure. This would be the same type of code that would be used on the \other" platform. 674 675 676 677

Title2 "Import From A Transport Data Library"; Filename IMPT "c:\projects\alaska97\flier.xpt"; Proc CImport Library=Work File=IMPT; Run;

NOTE: Proc CIMPORT begins to create/update data set WORK.FLIER NOTE: Data set contains 11 variables and 664 observations. Logical record length is 88. NOTE: Proc CIMPORT begins to create/update data set WORK.FLIER1 NOTE: Data set contains 11 variables and 298 observations. Logical record length is 88. NOTE: Proc CIMPORT begins to create/update data set WORK.FLIER2 NOTE: Data set contains 11 variables and 366 observations. Logical record length is 88.

Note that we could couple some of this code with the FTP le access method described earlier to let the SAS system move the le for us. When moving the transport library, use a binary transfer method so that the internal structure of the transport library is not changed.

CHAPTER 3. WORKING WITH FILES

3.5 The X Files

55

The SAS system provides through various methods, access to the operating system and its commands. The X statement can be used to pass a command along to the operating system and have the command executed as if it were given at a system command prompt (such as at an MS-DOS, Unix, or terminal window). The SAS program will wait, by default, until the command has completed. Also by default, the user will have to \manually" close the external shell or program started by the X statement. The program below will use the WIN95 operating system to generate a directory listing of the ier data sets and will redirect the output to a le named flier.dir. The NOXWAIT system option is used to tell the SAS system to continue processing SAS statements once the X statement has completed, while the XSYNC option is used to prevent the SAS system from processing more statements before the command has completed. We then open the flier.dir data set and read its contents and write them to the SAS log. 652 653 653 654 655 656 657 658 659 660 661 662 663 664 665

Options XSync NoXWait; X "dir c:\projects\alaska97\flier*.dat > c:\projects\alaska97\flier.dir" ; Filename Fish "c:\projects\alaska97"; Data _NULL_; Infile Fish("flier.dir") EOF=LF; File Log; Input; If _N_=1 Then Link LF; Put _INFILE_; Return; LF: Put //; Return; Run;

NOTE: The infile library FISH is: DIRECTORY=c:\projects\alaska97 NOTE: The infile FISH("flier.dir") is: DIRECTORY=c:\projects\alaska97, MEMBERNAME=c:\projects\alaska97\flier.dir, RECFM=V,LRECL=256 Volume in drive C is RON RICO Volume Serial Number is 1BF6-1260 Directory of C:\PROJECTS\Alaska97 FLIER FLIERS2 FLIER1N2 FLIERS1

DAT DAT DAT DAT 4 file(s) 0 dir(s)

69,160 09-22-97 5:57p FLIER.dat 38,064 09-22-97 6:20p Fliers2.dat 27,888 09-22-97 9:32p Flier1N2.DAT 30,992 09-22-97 6:19p Fliers1.dat 166,104 bytes 345,735,168 bytes free

NOTE: A total of 11 records were read from the infile library FISH. The minimum record length was 0. The maximum record length was 56. NOTE: 11 records were read from the infile FISH("flier.dir"). The minimum record length was 0. The maximum record length was 56. NOTE: The DATA statement used 0.66 seconds.

CHAPTER 3. WORKING WITH FILES

56

Note that with very little programming we could have read the directory listing into variables in a SAS data set and have processed them in some fashion, say to compute the disk space occupied by the les.

Chapter 4

The Macro Language Macro languages are programming languages that lie on top of another programming language. Their purpose, typically, is to control which code may be compiled at compile time (say, certain aspects of the programming code might be machine dependent) and to do variable substitution. That is, the macro language looks for special symbols that indicate that it is simply to replace this set of special symbols with some other code. The SAS macro language can accomplish these tasks, but generally is used to control the generation of the SAS language itself. In fact, the macro language often writes SAS language code. Thus, entire applications can be developed with the macro language, which upon execution, generates SAS language that might be a mixture of macro commands, data steps, and procedure steps.

4.1 Macro Variables Probably, the easiest way to begin with the SAS macro language is to start with macro variables. A macro variable holds an assigned value and when its reference is encountered, the assigned value is substituted for the reference. In the macro language a macro variable reference looks like &NAME where NAME is some variable name that you have assigned. The assignment of a macro variable is often made through the %LET macro statement. Notice that macro statement elements begin with a percent sign to dierentiate them from SAS language statements. Let's look at a simple example. We will assume that the ier data set has been read in and available. 15 16 17 18 19

%Let DSName=Flier; Title2 "Listing of Data Set &DSNAME"; Title3 'Single Quotes for &DSNAME'; Proc Print Data=&DSName(Obs=10); Run;

NOTE: The PROCEDURE PRINT printed page 1.

57

CHAPTER 4. THE MACRO LANGUAGE

58

The SAS Macro Language Listing of Data Set Flier Single Quotes for &DSNAME OBS

MO

DAY

YR

AR

ST

1 2 3 4 5 6 7 8 9 10

7 1 12 2 12 12 9 12 12 2

21 9 18 15 14 18 13 14 14 15

74 76 74 76 75 74 75 75 75 76

2 2 2 2 2 2 2 2 2 2

7 3 4 1 1 4 2 1 1 1

SEX

AGE

3 2 1 1 1 1 1 1 1 1

0 0 0 0 0 0 0 0 0 0

1

SN

LT

WT

TSL

5 1 5 5 5 4 5 5 5 5

3.5 5.3 5.4 6.0 6.1 6.2 6.4 6.4 6.5 6.8

1.4 3.0 3.4 5.7 6.6 5.8 6.7 7.6 6.8 8.1

57.0 0.0 83.2 111.0 109.0 104.5 122.8 118.4 130.6 118.0

The %LET statement has the name of the variable to receive the macro assignment, the equals sign, then the value to be assigned to the macro variable. Note that the variable is written without any special symbols (at least for now). Also note that the value is given without any quotes. We'll address more complicated issues as we move along. In this particular instance if we had several dierent data sets that we might print, it becomes easy to specify a dierent data set and, at the same time, have its name included in the title. The type of quotes used, however, are very important. You de nitely must use double quotes if you want the macro variable to resolve to its value. Single quotes will print the ampersand and name as written. Normally, there is nothing written on the SAS log to indicate what happened with respect to the macro substitution. To see what actually was substituted we can change one of the system options, SYMBOLGEN, on. Then upon re-running the code we see the following: 21 22 SYMBOLGEN: 23 24 SYMBOLGEN: 25

Options Symbolgen; %Let DSName=Flier; Macro variable DSNAME resolves to Flier Title2 "Listing of Data Set &DSNAME"; Proc Print Data=&DSName(Obs=10); Macro variable DSNAME resolves to Flier Run;

NOTE: The PROCEDURE PRINT printed page 2.

Now we can see what value was substituted for &DSNAME. Let's look at the following code. 27 %Let Somecode=%Str(Proc Print Data=Flier(Obs=10); Run;); 28 %Unquote(&Somecode); SYMBOLGEN: Macro variable SOMECODE resolves to Proc Print Data=Flier(Obs=10); Run; SYMBOLGEN: Some characters in the above value which were subject to macro quoting have been unquoted for printing. NOTE: The PROCEDURE PRINT printed page 3.

Thus, macro variables can actually hold very complicated expressions. The macro quoting function %STR() was used to quote the entire expression so that the internal parentheses and semicolons would not cause any problems. However, when we reference the variable, the value remains quoted, which means that its contents want be executed as normal SAS language statements. To get rid of the special quoting, the macro function %UNQUOTE was

CHAPTER 4. THE MACRO LANGUAGE

59

used. Although these variables are quite powerful with respect to substitution, they have their limits for writing reusable software. Macro procedures give us even more control of code generation and variable substitution.

4.2 Macro Procedures Macro procedures consist of a procedure de nition and a body. The body can be \plain" SAS language code, or can make use of the macro language. In fact, a number of the macro statements are only permitted within the body of a macro. A macro procedure begins with the %MACRO statement and ends with the %MEND statement. Let's rewrite our earlier program as a sas macro named PRT. 30 31 32 33 34 35 SYMBOLGEN: SYMBOLGEN:

%Macro PRT(DSName); Title2 "Listing of Data Set &DSNAME"; Proc Print Data=&DSName(Obs=10); Run; %Mend PRT; %PRT(Flier) Macro variable DSNAME resolves to Flier Macro variable DSNAME resolves to Flier

NOTE: The PROCEDURE PRINT printed page 4.

Again, it is dicult to determine what is actually happening in the macro. The SYMBOLGEN results do help but they appear out of place. This is because macros are actually compiled before they are executed. This is also why the complexity of macro programming can be greater within a macro procedure. You should notice that to perform the same task on a dierent data set would only require another call to the procedure. For example %PRT(Bluegill);

The MPRINT system option can be used to print out the SAS language statements generated by the macro language. The code below turns on this option and turns o the SYMBOLGEN option. 37 Options NoSymbolGen MPrint; 38 %PRT(Flier) MPRINT(PRT): TITLE2 "Listing of Data Set Flier"; MPRINT(PRT): PROC PRINT DATA=FLIER(OBS=10); MPRINT(PRT): RUN; NOTE: The PROCEDURE PRINT printed page 5.

Not only did we get the macro variables resolved, but we also have them listed in the context of the SAS language statements.

CHAPTER 4. THE MACRO LANGUAGE

60

It is also possible to put optional arguments to a macro. This permits you to assign initial values to these macro variables, which can be overwritten when the macro is called. The example below makes the data set name optional. 40 %Macro PRT(DSName=Flier); 41 Title2 "Listing of Data Set &DSNAME"; 42 Proc Print Data=&DSName(Obs=10); 43 Run; 44 %Mend PRT; 45 %PRT; MPRINT(PRT): TITLE2 "Listing of Data Set Flier"; MPRINT(PRT): PROC PRINT DATA=FLIER(OBS=10); MPRINT(PRT): RUN; NOTE: The PROCEDURE PRINT printed page 6.

Note that the macro was called without replacing the data set name. Had we wanted to use the BLUEGILL data set, for example, we would have made the call %PRT(DSNAME=BLUEGILL);

We can even include required and optional parameters in the same macro. The required or positional parameters must be called in the same order as given in the %MACRO statement, while the optional parameters can be given in any order following after the positional parameters. 47 48 49 50 51 52

%Macro Sex(Sex,DSName=Flier); Data &DSName&Sex; Set &DSName; Where (Sex=&Sex); Run; %Mend Sex;

Now we can call the new macro and specify which sex we would like to create the data set for. Notice here the complicated name &DSName&Sex. This is actually two macro names joined together. When executed, both macro names will be resolved then joined together to form a single name. Let's call the macro and look at the results produced. 53 %Sex(1,DSNAME=Flier); MPRINT(SEX): DATA FLIER1; MPRINT(SEX): SET FLIER; MPRINT(SEX): WHERE (SEX=1); MPRINT(SEX): RUN; NOTE: The data set WORK.FLIER1 has 306 observations and 11 variables. NOTE: The DATA statement used 0.44 seconds. 54 %Sex(2); MPRINT(SEX): DATA FLIER2; MPRINT(SEX): SET FLIER; MPRINT(SEX): WHERE (SEX=2); MPRINT(SEX): RUN; NOTE: The data set WORK.FLIER2 has 340 observations and 11 variables. NOTE: The DATA statement used 0.38 seconds. 55 %Sex(3); MPRINT(SEX): DATA FLIER3; MPRINT(SEX): SET FLIER; MPRINT(SEX): WHERE (SEX=3); MPRINT(SEX): RUN;

CHAPTER 4. THE MACRO LANGUAGE

61

NOTE: The data set WORK.FLIER3 has 18 observations and 11 variables. NOTE: The DATA statement used 0.22 seconds. 56 57

Proc Datasets Library=Work; -----Directory----Libref: WORK Engine: V612 Physical Name: C:\SAS\SASWORK\#TD41921

58 59

Run; Quit;

# Name Memtype Indexes ---------------------------1 FLIER DATA 2 FLIER1 DATA 3 FLIER2 DATA 4 FLIER3 DATA 5 SASMACR CATALOG

The macro created each of the 3 data sets, one for each sex category. Now if we knew that we would be doing this, there might be other ways to accomplish this with the macro language. The macro language also has looping statements like the SAS language does. 61 62 63 64 65 66 67 68

%Macro Sex(DSName=Flier); %Do Sex=1 %To 3; Data &DSName&Sex; Set &DSName; Where (Sex=&Sex); Run; %End; %Mend Sex;

which when called generates the code below. 69

%Sex;

MPRINT(SEX): MPRINT(SEX): MPRINT(SEX): MPRINT(SEX):

DATA FLIER1; SET FLIER; WHERE (SEX=1); RUN;

NOTE: The data set WORK.FLIER1 has 306 observations and 11 variables. MPRINT(SEX): MPRINT(SEX): MPRINT(SEX): MPRINT(SEX):

DATA FLIER2; SET FLIER; WHERE (SEX=2); RUN;

NOTE: The data set WORK.FLIER2 has 340 observations and 11 variables. MPRINT(SEX): MPRINT(SEX): MPRINT(SEX): MPRINT(SEX):

DATA FLIER3; SET FLIER; WHERE (SEX=3); RUN;

NOTE: The data set WORK.FLIER3 has 18 observations and 11 variables.

CHAPTER 4. THE MACRO LANGUAGE

4.3 Bootstrap Example

62

In this section we will write a more complicated macro. This macro will compute a Monte Carlo or bootstrap test for the equality of two means (mean weight of the ier sh of males and females) using the ANOVA F statistic as the criterion. Under the null hypothesis, there is only one population and so it doesn't matter which observations you classify as males and which you classify as females. The algorithm will be the following. 1. Compute the test statistic for the original sample. 2. Select a simple random sample with replacement equal to the number of males in the sample and label these observations as males. 3. Select a simple random sample with replacement equal to the number of females in the sample and label these observations as females. 4. Concatenate the two samples together and compute the ANOVA test statistic testing for dierences between the sexes. 5. Save the test statistic then start again at (2) until a large number of iterations have been performed. 6. Rank order the test statistics from smallest to largest and record the rank of the original test statistic. 7. If the original test statistic is within the largest 5% of these test statistics, then reject the null hypothesis and claim that the weight means are dierent between the sexes. The following macro was written to implement this basic scheme. You should be aware that much more ecient schemes exist to complete this task. For example, one could generate all of the Monte Carlo samples (iterations) in a single data step and use BY processing to compute the ANOVAs BY SEX. %Macro MonteT(MaxIter=99); %* Get Rid of Unknowns First; Data _NoUnk_; Set Flier; If Sex NE 3; Run; %* Compute Test Statistic for Observed Data; Proc ANOVA Data=_NOUNK_ NoPrint OutStat=AllStats; Class Sex; Model Wt=Sex; Run; Quit; %* Determine the Number of Males and Females; %* We should preserve the sample sizes; Proc Means Data=_NoUnk_ Noprint; Class Sex; Var Wt; Output Out=Temp N=N; Run;

CHAPTER 4. THE MACRO LANGUAGE %* Get The Results From Above Into Macro Variables; Data _NULL_; Set Temp; Select (Sex); When (1) Call Symput("Males",Put(N,5.)); When (2) Call Symput("Females",Put(N,5.)); Otherwise; End; Run; %Put The Data Set Contains &Males Males.; %Put The Data Set Contains &Females Females.; %* This macro actually selects the sample; %Macro TakeSamp(Sex,Number); Sex=&Sex; Do I=1 To &Number; Obs=Floor(Ranuni(0)*N+1); Set _NoUnk_(Drop=Sex) Point=Obs Nobs=N; Output; End; %Mend TakeSamp; %* This loop controls the sampling; %Do Iter=1 %To &MaxIter; Data BootSamp; %TakeSamp(1,&Males); %TakeSamp(2,&Females); Stop; Run; %* Compute the test statistic based upon the pseudo-sample; Proc ANOVA Data=BootSamp NoPrint OutStat=One; Class Sex; Model Wt=Sex; Run; Quit; %* Append the results to the stats data set; Proc Append Base=AllStats Data=One; Run; %End; %* Print Out the Original Statistics; Title2 "Listing of All ANOVA Statistics"; Proc Print Data=AllStats(Obs=10); Run; %* Keep Only The F Statistics For Sex; Data AllStats; Retain Iter -1; Set AllStats; If _Source_="SEX"; Iter+1; Run; %* Reorder The Runs By F Statistic Value; Proc Sort Data=AllStats; By F; Run;

63

CHAPTER 4. THE MACRO LANGUAGE %* List In Increasing Order of F; %* Find Iter 0 And See Where It Falls In Relation; %* To The Other F Values.; Title2 "Listing of Ordered ANOVA Results"; Title3 "Iter Number 0 Is From Original Analysis"; Proc Print Data=AllStats; Run; Data _NULL_; Set AllStats; If Iter=0 Then Do; MonteP=(_N_-1)/(&MaxIter+1); Put "Monte Carlo P-value is " MonteP 6.4; End; Run; %Mend MonteT;

Some of the rst code generated by the macro is the following. 171

%MonteT;

MPRINT(MONTET): MPRINT(MONTET): MPRINT(MONTET): MPRINT(MONTET):

DATA _NOUNK_; SET FLIER; IF SEX NE 3; RUN;

NOTE: The data set WORK._NOUNK_ has 646 observations and 11 variables. MPRINT(MONTET): MPRINT(MONTET): MPRINT(MONTET): MPRINT(MONTET):

PROC ANOVA DATA=_NOUNK_ NOPRINT OUTSTAT=ALLSTATS; CLASS SEX; MODEL WT=SEX; RUN;

MPRINT(MONTET):

QUIT;

NOTE: The data set WORK.ALLSTATS has 2 observations and 7 variables. MPRINT(MONTET): MPRINT(MONTET): MPRINT(MONTET): MPRINT(MONTET): MPRINT(MONTET):

PROC MEANS DATA=_NOUNK_ NOPRINT; CLASS SEX; VAR WT; OUTPUT OUT=TEMP N=N; RUN;

NOTE: The data set WORK.TEMP has 3 observations and 4 variables. MPRINT(MONTET): MPRINT(MONTET): MPRINT(MONTET): MPRINT(MONTET): MPRINT(MONTET): MPRINT(MONTET): MPRINT(MONTET): MPRINT(MONTET):

DATA _NULL_; SET TEMP; SELECT (SEX); WHEN (1) CALL SYMPUT("Males",PUT(N,5.)); WHEN (2) CALL SYMPUT("Females",PUT(N,5.)); OTHERWISE; END; RUN;

The Data Set Contains The Data Set Contains

306 Males. 340 Females.

64

CHAPTER 4. THE MACRO LANGUAGE

65

MPRINT(MONTET): DATA BOOTSAMP; MPRINT(TAKESAMP): SEX=1; MPRINT(TAKESAMP): DO I=1 TO 306; MPRINT(TAKESAMP): OBS=FLOOR(RANUNI(0)*N+1); MPRINT(TAKESAMP): SET _NOUNK_(DROP=SEX) POINT=OBS NOBS=N; MPRINT(TAKESAMP): OUTPUT; MPRINT(TAKESAMP): END; MPRINT(MONTET): ; MPRINT(TAKESAMP): SEX=2; MPRINT(TAKESAMP): DO I=1 TO 340; MPRINT(TAKESAMP): OBS=FLOOR(RANUNI(0)*N+1); MPRINT(TAKESAMP): SET _NOUNK_(DROP=SEX) POINT=OBS NOBS=N; MPRINT(TAKESAMP): OUTPUT; MPRINT(TAKESAMP): END; MPRINT(MONTET): ; MPRINT(MONTET): STOP; MPRINT(MONTET): RUN; NOTE: The data set WORK.BOOTSAMP has 646 observations and 12 variables. MPRINT(MONTET): PROC ANOVA DATA=BOOTSAMP NOPRINT OUTSTAT=ONE; MPRINT(MONTET): CLASS SEX; MPRINT(MONTET): MODEL WT=SEX; MPRINT(MONTET): RUN; MPRINT(MONTET): QUIT; NOTE: The data set WORK.ONE has 2 observations and 7 variables. MPRINT(MONTET): MPRINT(MONTET):

PROC APPEND BASE=ALLSTATS DATA=ONE; RUN;

NOTE: Appending WORK.ONE to WORK.ALLSTATS. NOTE: 2 observations added. NOTE: The data set WORK.ALLSTATS has 4 observations and 7 variables.

The last 3 steps will continue until all iterations have been completed. Once completed the code generated will be. MPRINT(MONTET): MPRINT(MONTET): MPRINT(MONTET):

TITLE2 "Listing of All ANOVA Statistics"; PROC PRINT DATA=ALLSTATS(OBS=10); RUN;

NOTE: The PROCEDURE PRINT printed page 7. MPRINT(MONTET): MPRINT(MONTET): MPRINT(MONTET): MPRINT(MONTET): MPRINT(MONTET): MPRINT(MONTET):

DATA ALLSTATS; RETAIN ITER -1; SET ALLSTATS; IF _SOURCE_="SEX"; ITER+1; RUN;

NOTE: The data set WORK.ALLSTATS has 100 observations and 8 variables. MPRINT(MONTET): MPRINT(MONTET): MPRINT(MONTET):

PROC SORT DATA=ALLSTATS; BY F; RUN;

NOTE: The data set WORK.ALLSTATS has 100 observations and 8 variables. MPRINT(MONTET): MPRINT(MONTET): MPRINT(MONTET): MPRINT(MONTET):

TITLE2 "Listing of Ordered ANOVA Results"; TITLE3 "Iter Number 0 Is From Original Analysis"; PROC PRINT DATA=ALLSTATS; RUN;

CHAPTER 4. THE MACRO LANGUAGE

66

NOTE: The PROCEDURE PRINT printed pages 8-10. MPRINT(MONTET): MPRINT(MONTET): MPRINT(MONTET): MPRINT(MONTET): MPRINT(MONTET): MPRINT(MONTET): MPRINT(MONTET):

DATA _NULL_; SET ALLSTATS; IF ITER=0 THEN DO; 1-MONTEP=(_N_-1)/(99+1); PUT "Monte Carlo P-value is " MONTEP 6.4; END; RUN;

Monte Carlo P-value is 0.0800

The nal page of the output produced had the following observations, including the original analysis results (ITER=0), OBS 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100

ITER 56 76 79 27 15 8 94 13 4 18 51 5 63 0 26 87 43 6 24 83 14

_NAME_ WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT

_SOURCE_ SEX SEX SEX SEX SEX SEX SEX SEX SEX SEX SEX SEX SEX SEX SEX SEX SEX SEX SEX SEX SEX

_TYPE_ ANOVA ANOVA ANOVA ANOVA ANOVA ANOVA ANOVA ANOVA ANOVA ANOVA ANOVA ANOVA ANOVA ANOVA ANOVA ANOVA ANOVA ANOVA ANOVA ANOVA ANOVA

DF

SS 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

933.89 1063.30 999.67 1281.41 1452.89 1488.45 1658.02 1437.03 1690.12 1827.85 1857.69 1817.26 1926.93 2318.17 2148.51 2675.15 2799.96 3672.40 3720.97 4204.62 5745.28

F 1.56952 1.63018 1.82098 2.03846 2.28557 2.55722 2.69699 2.73300 2.80394 3.08179 3.10496 3.33083 3.53462 3.83827 3.85926 4.32986 4.38115 6.04351 6.30796 8.06348 8.91242

PROB 0.21073 0.20214 0.17767 0.15385 0.13107 0.11028 0.10103 0.09878 0.09452 0.07965 0.07853 0.06846 0.06055 0.05053 0.04990 0.03784 0.03673 0.01422 0.01226 0.00466 0.00294

We get a very similar answer as to what was obtained by the original ANOVA analysis. Many more simulations and we would have obtained a more precise estimate of the sampling distribution of the test statistic. It does take a while for the macro to execute. Its speed can be increased by suppressing some of the printed output. For example, suppress the macro printed output along with the notes. Options NoNotes NoMPrint NoSymbolgen;

4.4 Cluster Dendrogram There are many SAS macros that have been written by users, as well as by SAS Institute sta, to ll \holes" in the collection of SAS procedures. For example, the dendrogram generated by PROC TREE is not particularly readable and is very frustrating to users that are already familiar with a dendrogram. The dendro macro, available at the SAS WWW site, uses SAS/GRAPH to produce a high resolution dendrogram. The following program calls this macro and produces the attached graph.

CHAPTER 4. THE MACRO LANGUAGE

67

*---------------------------------------------------------* | Average Linkage Cluster Analysis. | | Analysis of a subset of the 1980 Uniform Crime Report | | data base. Only cities with populations between 100,000| | and 249,999 are used. The largest of these per state is| | selected for clustering. Reported crimes are | | standardized to total offenses for each city. | *---------------------------------------------------------*; Options PS=55 LS=78 NoDate NoNumber; Proc Format; Value State 1="ALABAMA" 4="CALIFORNIA" 7="DELAWARE" 10="GEORGIA" 13="INDIANA" 16="KENTUCKY" 19="MARYLAND" 22="MINNESOTA" 25="MONTANA" 28="NEW HAMPSHIRE" 31="NEW YORK" 34="OHIO" 37="PENNSYLVANIA" 40="SOUTH DAKOTA" 43="UTAH" 46="WASHINGTON" 49="WYOMING" Run;

2="ARIZONA" 5="COLORADO" 8="WASHINGTON, D.C." 11="IDAHO" 14="IOWA" 17="LOUISIANA" 20="MASSACHUSETTS" 23="MISSISSIPPI" 26="NEBRASKA" 29="NEW JERSEY" 32="NORTH CAROLINA" 35="OKLAHOMA" 38="RHODE ISLAND" 41="TENNESSEE" 44="VERMONT" 47="WEST VIRGINIA" 50="ALASKA"

3="ARKANSAS" 6="CONNECTICUT" 9="FLORIDA" 12="ILLINOIS" 15="KANSAS" 18="MAINE" 21="MICHIGAN" 24="MISSOURI" 27="NEVADA" 30="NEW MEXICO" 33="NORTH DAKOTA" 36="OREGON" 39="SOUTH CAROLINA" 42="TEXAS" 45="VIRGINIA" 48="WISCONSIN" 51="HAWAII";

Title1 'Reported Offenses from the 1980 Uniform Crime Report'; Title2 'for Moderate Sized Cities'; Data Crime; Input state division citysize murder manslght rape robbery assault burglary larceny motor total; Array crimes{8} murder manslght rape robbery assault burglary larceny motor; Do i=1 to 8; /* standardize the reported crimes */ Crimes{i}=crimes{i}/total*100; End; Drop i; Label manslght='manslaughter'; Format state state.; Datalines; 1 6 58040 54 0 144 956 2270 7130 10189 1094 21837 1 6 58730 37 1 58 306 1015 3671 7597 663 13348 1 6 41840 28 0 69 276 1781 4161 7491 827 14633 2 8 56210 10 0 62 193 1273 2773 7947 567 12825 --- Data Deleted Here --46 9 85450 11 0 127 409 1521 4110 10278 869 17325 48 3 52470 4 0 76 244 1136 3646 10125 590 15821 50 9 1900 15 0 117 296 1581 2611 7322 1055 12997 ; Proc Sort; By state citysize; Run; /* Now keep the largest city from each state */ Data crime; Set crime;

CHAPTER 4. THE MACRO LANGUAGE By state; If first.state; Drop citysize; Run; Proc print; Format murder manslght rape robbery assault burglary larceny motor 6.2; Run; Proc Cluster Data=crime Method=Average Outtree=Tree; Var murder manslght rape robbery assault burglary larceny motor; Id state; Run; %EPS; /* my own macro to turn on encapsulated postscript */ %Include "dendro.sas"; %dendro; Quit;

68

CHAPTER 4. THE MACRO LANGUAGE for Moderate Sized Cities

69

Chapter 5

SAS Special Files There are several les that are useful to be aware of and to make use of. The rst is the AUTOEXEC.SAS le.

5.1 Autoexec.sas The AUTOEXEC.SAS le is typically stored in a user's home directory or in the main SAS subdirectory. Once the SAS system initializes, it will, by default, read and execute the statements contained in the AUTOEXEC.SAS le. This makes it a very convenient way to set up your environment with the basic settings that you might like. You can de ne your graphics drivers here as well as printout size, and other options. In developing applications, it can be a way to autostart a program for persons that are non-programmers, but need to use a pre-built SAS application. Below is an example AUTOEXEC.SAS le that I use on my Unix system. Some parts is use frequently and others much less frequently. *************************************************; * AUTOEXEC.SAS *; *************************************************; %Macro CGM; %************************************************; %* CGM options are CGMFRMA - Monochrome *; %* CGMFRGA - Gray Scale *; %* CGMFRCA - Color *; %************************************************; Filename GSASFile '/tmp/sas.cgm'; GOptions Device=CGMFRMA GAccess=GSASFile GSFMode=Replace FText=HWCGM010 HText=1 CText=Black FTitle=HWCGM010 HTitle=1 CTitle=Black; %* FText = HWCGM010 is for Helvetica; %* Set FText to HWCGM001 For SansSerif Font; %Put %Str( ); %Put %Str(NOTE: Graphics Device is: CGMFRMA); %Put %Str(NOTE: Writing Graphics Output To: /tmp/sas.cgm); %Put %Str( ); %Mend CGM;

70

CHAPTER 5. SAS SPECIAL FILES %Macro CGMX; %************************************************; %* Display device is XCOLOR - Then print to CGM *; %* CGM options are CGMFRMA - Monochrome *; %* CGMFRGA - Gray Scale *; %* CGMFRCA - Color *; %************************************************; Filename GSASFile '/tmp/sas.cgm'; GOptions Device=XColor TargetDevice=CGMFRMA GAccess=GSASFile GSFMode=Replace FText=HWCGM010 HText=1 CText=Black; %Put %Str( ); %Put %Str(NOTE: Graphics Device is: XCOLOR); %Put %Str(NOTE: Target Device is: CGMFRMA); %Put %Str(NOTE: Writing Target Output To: /tmp/sas.cgm); %Put %Str( ); %Mend CGMX; %Macro EPS; %************************************************; %* EPS -Generate Encapsulated Postscript Output.*; %************************************************; FILENAME GSASFile "graph.eps"; GOPTIONS Device=PSEPSF TargetDevice=PSEPSF CBack=White Colors=(Black) GAccess=GSASFile NoPrompt GSFMode=Replace; %Put %Str( ); %Put %Str(NOTE: Graphics Device is: PSEPSF); %Put %Str(Note: Graphics Output to: graph.eps); %Put %Str( ); %Mend EPS; %Macro EPSX; %************************************************; %* EPSX-Generate Encapsulated Postscript Output.*; %************************************************; FILENAME GSASFile "graph.eps"; GOPTIONS Device=XCOLOR TargetDevice=PSEPSF CBack=White Colors=(Black) GAccess=GSASFile NoPrompt GSFMode=Replace; %Put %Str( ); %Put %Str(NOTE: Graphics Device is: XCOLOR); %Put %Str(Note: Graphics Output to: graph.eps); %Put %Str( ); %Mend EPSX; %Macro PS; %************************************************; %* PS - Generate Postscript Output For Printing.*; %************************************************; FILENAME GSASFile pipe 'lpr -Pps -h'; GOPTIONS Device=PS1200 TargetDevice=PS1200 CBack=White Colors=(Black) GProlog='252150532D41646F62652D0D0A'x GAccess=GSASFile NoPrompt GSFMode=Replace; %Put %Str( ); %Put %Str(NOTE: Graphics Device is: PS1200); %Put %Str( ); %Mend PS; %Macro PSX;

71

CHAPTER 5. SAS SPECIAL FILES

72

%************************************************; %* PS - Generate Postscript Output For Printing.*; %************************************************; FILENAME GSASFile pipe 'lpr -Pps -h'; GOPTIONS Device=XColor TargetDevice=PS1200 CBack=White Colors=(Black) GProlog='252150532D41646F62652D0D0A'x GAccess=GSASFile NoPrompt GSFMode=Replace; %Put %Str( ); %Put %Str(Note: Graphics Device is: XCOLOR); %Put %Str(NOTE: Hardcopy Graphics Device is: PS1200); %Put %Str( ); %Mend PSX; %Macro PSCX; %************************************************; %* PS - Generate Postscript Output For Printing.*; %************************************************; FILENAME GSASFile pipe 'lpr -Ppsc0 -J/nff/nb'; GOPTIONS Device=XColor TargetDevice=PSCOLOR CBack=White GProlog='252150532D41646F62652D0D0A'x GAccess=GSASFile NoPrompt GSFMode=Replace; %Put %Str( ); %Put %Str(Note: Graphics Device is: XCOLOR); %Put %Str(NOTE: Hardcopy Graphics Device is: PSCOLOR); %Put %Str( ); %Mend PSCX; %Macro HP; %************************************************; %* HP - Generate HPLJS3 Output For Printing. *; %************************************************; FILENAME GSASFile pipe 'lpr -Php1 -J/nff/nb'; GOPTIONS Device=HPLJS3 GAccess=GSASFile NoPrompt GSFMode=Append; %Put %Str( ); %Put %Str(NOTE: Graphics Device is: HPLJS3); %Put %Str( ); %Mend HP; Options /* Forms=SASUSER.PROFILE.DEFAULT.FORM */ LPTYPE=BSD; Libname Insight "/usr/lpp/sas612/samples/insight";

The macros make it easy for me to assign the graphics devices that I want to use. For incorporating graphics into FrameMaker I use the %CGM and %CGMX macros. For incorporating graphics into LATEXI use %EPS and %EPSX. While for direct printing I might use %PS.

5.2 Con g.sas The CONFIG.SAS le contains the con guration data that a user might customize, such as the locations of the SAS system les and memory allocations. It also speci es where the SAS WORK and SASUSER libraries are to be located. On networked systems, this can be very useful for de ning independent con gurations for each user.

CHAPTER 5. SAS SPECIAL FILES

73

An example CONFIG.SAS le is shown below. /* * This file, config.sas, holds default configuration options * for the SAS System. These options are overridden by options on the * command line, or options specified through the SAS612_OPTIONS * environment variable. * */ /* * -maps specifies the pathname of the map datasets used by PROC GMAP. */ -maps !SASROOT/maps /* * -msg specifies the directory where the SAS System will search * for the files containing the text for all error messages * and notes. These messages are stored in an internal format. */ -msg !SASROOT/sasmsg /* * -news specifies the name of a text file that will automatically * be displayed in the log when SAS is invoked. */ -news !SASROOT/misc/base/news /* * -sasautos establishes the path(s) to director(ies) for automatic * macro definitions to be searched by the macro facility when * an unknown macro is referenced. */ -sasautos !SASROOT/sasautos /* * -sashelp specifies the pathname for the directory containing on-line * help and menu screens for the SAS System. */ -sashelp !SASROOT/sashelp /* * -sasuser specifies the pathname for the directory used by the SAS * System as a default place to store files, such as the SAS * user profile catalog. See your SAS System documentation * for more information. */ -sasuser ~/sasuser /* * -work * * * * * */ -work /tmp

specifies where to create the SAS work library. This library is temporary and any SAS data sets created there will be deleted when the system terminates. The unique name 'SAS_workANNNN' is assigned to each SAS work library. 'A' is some letter and 'NNNN' is the hexadecimal representation of the process ID of the SAS process.

/* * -sasscript */

specifies the location to search for SAS/CONNECT scripts.

CHAPTER 5. SAS SPECIAL FILES -sasscript !SASROOT/misc/connect /* * -dms */ -dms

specifies that you are running SAS in Display Manager mode.

/* * * -memsize limits the amount of memory that will be allocated by the * SAS System. */ -memsize 32m /* * * -sortsize limits the amount of memory that will be allocated during * sorting operations. */ -sortsize 16m /* * Default windowing system to use. */ -fsdevice x11 /* * -helpenv specifies that native help should be used when help is * invoked. */ -helpenv helplus /* * -helploc specifies the location of the native help files for helplus. * You would have to specify '-helpenv helplus' to use those files. */ -helploc !SASROOT/X11/native_help /* * -samploc specifies the location of the sample files for helplus. */ -samploc !SASROOT/X11/native_help /* * -path specifies the search path that the SAS System will use * to find the dynamically loaded modules. Each -path * specification indicates one directory. They will be * searched in the order in which they are given. */ -path !SASROOT/sasexe/base -path !SASROOT/sasexe/graph -path !SASROOT/sasexe/stat -path !SASROOT/sasexe/fsp -path !SASROOT/sasexe/af -path !SASROOT/sasexe/insight -path !SASROOT/sasexe/ets -path !SASROOT/sasexe/eis -path !SASROOT/sasexe/iml -path !SASROOT/sasexe/connect -path !SASROOT/sasexe/or

74

CHAPTER 5. SAS SPECIAL FILES -path -path -path -path -path -path -path -path -path -path -path -path -path -path -path

75

!SASROOT/sasexe/qc !SASROOT/sasexe/dbi !SASROOT/sasexe/english !SASROOT/sasexe/fsc !SASROOT/sasexe/gis !SASROOT/sasexe/image !SASROOT/sasexe/lab !SASROOT/sasexe/nvi !SASROOT/sasexe/pub !SASROOT/sasexe/share !SASROOT/sasexe/trader !SASROOT/sasexe/toolkit !SASROOT/sasexe/spectraview !SASROOT/sasexe/unixdb !SASROOT/sasexe/mddbserver

5.3 Pro le.sct The PROFILE.SCT is a SAS catalog that contains lots of information about your SAS session. It will contain your keyboard mapping settings, your window color settings, printer de nition information, and lots of other information of this sort. Typically, its values are modi ed through the various utilities and pull-down menus available in the SAS display manager.

Chapter 6

SAS Internet Tools SAS Institute is developing many products to WEB-enable the SAS system. The products enable the system to, for example, produce HTML output. That is, you can read the SAS output with a web browser. Other products permit the use of HTML forms to interact with SAS data bases. Thus you can issue data base queries against a SAS data set where the commands are obtained from an HTML form. Other products allow an HTML document to invoke and run commands on the local SAS system. This would permit development of web applications that would actually be executed from within SAS.

6.1 Capturing OUTPUT for the Web The easiest way to get started with a web-enabled SAS system is to use the macros for generating HTML output. In the current versions (6.11 and 6.12) the macros must rst be installed. The necessary les can be obtained from the SAS Institute WWW server at the URL http://www.sas.com/rnd/web/intro.html. Once installed it is a relatively straightforward process to calling the OUT2HTM() macro to turn on the capturing of the output (or log) and calling it again to turn o the capturing and to generate the HTML code. An example of capturing a printout of the ier data set is given below. The FORMCHAR option is reset to the values below so that the line plots and tables look properly constructed.

76

CHAPTER 6. SAS INTERNET TOOLS

77

Options PS=55 LS=78 PageNo=1 NoDate; Options Formchar="|----|+|---+=|-/\<>*"; Title1; Data Flier; Infile "c:\projects\alaska97\flierdat.csv" Delimiter="," Firstobs=2; Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL Size1 Size2 Size3 Size4 Size5 Size6 Edge No; Run; %out2htm(capture=on); %*Title1 "Flier Data"; Title1 "Flier Data"; Proc Print Data=Flier Uniform; Run; Proc Plot Data=Flier; Plot Wt*Lt; Run; Quit; %out2htm(capture=off,proploc=sasuser.htmlgen.outprop.slist, encode=n,htmlfile=flier.htm);

Note the use of HTML codes embedded within the TITLE statement so as to modify the default settings. The rst call to OUT2HTM turns capturing on. The second call ends capturing, speci es the properties catalog that will control the appearance of the HTML code, and also speci es the output le to contain the results. It is also possible to specify modi cations to the properties directly in the OUT2HTM macro as below %out2htm(capture=off,encode=n,dface=COURIER,hface=COURIER,htag=PREFORMATTED TEXT, htmlfile=flier.htm,ttag=NO FORMATTING);

The DFACE variable controls the typeface for the data, the HFACE variable controls the typeface for the headings, HTAG controls the tag type(s) that will be used around the headings, and the TTAG controls the tag type(s) around the titles. The ENCODE variable determines whether or not the angle brackets (greater than and less than signs) are encoded so as to print in HTML as angle brackets. To capture the SAS log, use the option WINDOW=LOG. To append HTML output to an existing le, specify OPENMODE=APPEND. There are many other options that are possible. For the settings that I used ( xed-width fonts and black color), the following page was observed in my browser.

CHAPTER 6. SAS INTERNET TOOLS

78

CHAPTER 6. SAS INTERNET TOOLS

79

Since a properties catalog can be constructed with the properties information, once you have found the properties you like for a particular type of report, you may want to enter them into a properties catalog. An easy way to do this is interactively. From within the display manager, issue the following macro call %out2htm(runmode=I);

This will bring up a dialog box within which you can specify the properties catalog to use. Then select the properties button on the dialog box and make whatever modi cations that you have decided upon. The rst dialog box that you encounter looks like the following.

Click on the properties button to move to the next window.

CHAPTER 6. SAS INTERNET TOOLS

80

Then select the TEXT tab to get to the html de nitions for the text.

From this dialog box you can update the properties very easily then save them away. Later when you wish to assign those properties to the SAS output or log, simply refer to this properties list and you'll not have to make a long list of properties in the macro call.

Advanced Sas Programming Techniques (www.studysas.blogspot.com)

Overview

More details

Related Documents

Advanced Sas Programming Techniques (www.studysas.blogspot.com)

Advanced Programming

Introduction To Sas Programming

Programming 1 Sas Essentials

Learn Sas Programming

Advanced Research Techniques Oracle