Week 10
More on Manipulating Data
Unit 4 SAS for Data Management
Week 10: More on Manipulating Data
Welcome. This reading is a wonderful reservoir of tools, such as: how to document data entry errors in the SAS log using the PUT statement, how to select a simple random sample, and how to clean character data. Numerous examples illustrate the specification of the necessary code.
Goals of Week 10: More on Manipulating Data 1. to be competent in using SAS to verify data entry in an easy to read manner; 2. to be aware of the multiple uses of the PUT statement (such as report writing, forms creation, etc) 3. to know how to create sequential study identification numbers; 4. to understand how SAS stores dates and how to do calculations involving dates; 5. to be competent in the informatting and formatting of SAS dates and times; 6. to be aware of the variety of SAS functions that are available; 7. to appreciate that SAS offers multiple ways of working with missing information; 8. to know how to select a simple random sample without replacement from a sampling frame; and 9. to be competent in manipulating character data.
week 10
10.1
Week 10
More on Manipulating Data
Week 10 Outline – More on Manipulating Data Section Topic 1.
Using the PUT Statement ………………..…………….……. …………….. a. b. c. d. e.
2.
Dates and Time in SAS ……….……………………………………..……………. a. b. c. d.
3.
How to Write to a File ………………………………………………………… How to Write Input Data to the Log .………………………………………... Data Quality Assessment – A Good Way to Write Out a Subset of Data . How to Output Data Using Column Format………………………………… How to Use PUT Instruction to Assign Sequential Identification Values Plus Some Bells and Whistles …………………………………………. …
How to Use Informats/Formats to Read/Write Dates ………………………. How to Compute Intervals Between Dates ……………………………… How to Work with a Date as a Constant ………………………………… The TODAY( ) and DATE( ) Functions ………………………………….
Introduction to SAS Functions ………….……………………………………….... a. b. c. d.
Overview of SAS Functions ……………………………………. ………….. Using SAS to Manipulate Missing Values ……………………..…………... How to Compute Age in Completed Years (INT) ………..………………… How to Extract Date and Time Information (MDY,DHMS)… ..……………..
Page 3 3 5 6 9 11 15 15 18 19 20 22 22 23 27 27
4.
How to Select a Simple Random Sample Without Replacement ..…………..
30
5.
Introduction to SAS Character Functions ……………………………..…………….
36
a. b. c. d. e.
week 10
How to Convert to Upper Case (UPCASE) …..……………………………. How to Remove Spaces and Special Characters (COMPRESS) …..…. How to Align Character Data (LEFT, RIGHT) …………………………… How to Extract a String of Characters (SUBSTR)………………………… How to Validate a Character Value (VERIFY) …………………………….
36 37 38 38 39
10.2
Week 10
More on Manipulating Data
1 Using the PUT Statement
There is a variety of ways in SAS to code instructions that will result in the writing or outputting of information •
A DATA statement together with a SET statement outputs data to a SAS data set. (See Unit 4 Week 2; this is the Week 8 reading)
•
A PROC PRINT writes out information in a SAS data set to the results window. The results window can then be saved as a “.lst” file.
•
The PUT statement is another SAS instruction for writing or outputting of information.
a. How to Write to a File Rationale •
It may be of interest to output data to an ASCII (text) file for some other use
General Guidelines •
Use a PUT statement to write information to the location indicated in a preceding FILE statement. Note – A FILE statement does not require a companion libname statement. This is because a FILE statement indicates a full path address already.
•
If no FILE statement precedes, the PUT statement will write the information to the SAS log.
•
Thus, the PUT statement with no preceding FILE statement is the way to write information to the SAS log.
•
Tip - Be aware that there are many options for use with the PUT statement. See the SAS Language Guide or refer to SAS help.
week 10
10.3
Week 10
More on Manipulating Data
Example The PUT statement was introduced in Unit 4, Week 2 as a method for writing/exporting SAS data to an ASCII (text) file using a DATA step: Libname old “z:\river valley\sasdata”;
/* libname tells SAS the location of the source SAS data */
Data _null_;
/* _null_ is a placeholder SAS data set that will not be saved */
set old.screen02;
/* SAS is told to read source data from screen02.sas7bdat */
file “z:\river valley\txtdata\screen02.dta”;
/* Specifies the name and destination of the ascii data set */
put medrec cd4;
/* Specifies the variables that are to be written */
run; In this example the two variables (MEDREC and CD4) would be listed in the file separated by a space.
Additional Tips •
The PUT statement can be used to write information to the log, to the output window or to files and, as a choice of SAS instruction for purposes of writing, the PUT statement has certain advantages.
•
Information “written” with a PUT statement can include variable values with or without variable names, as well as text that you type in.
•
week 10
By default, a PUT statement writes lines to the SAS LOG when no file specification is given.
10.4
Week 10
More on Manipulating Data
b How to Write Input Data to the Log
Rationale •
In some instances (especially when the data set size is reasonable), it may be good documentation to have recorded in the SAS log the data that was actually read in.
•
This may aid in correcting data problems later.
Guidelines •
To write the data to the log as it is read in, use the statement: PUT _INFILE_; after the INPUT statement.
Example ** example writing input data to the LOG **; DATA BOOKS; INFILE 'C:\TEMP\HW4P3.DAT'; INPUT NAME $ DAYS SUBJECT $15.; PUT _INFILE_; RUN;
This example also illustrates use of the SAS instruction _INFILE_ . It refers specifically to data that is read in during the data step.
week 10
10.5
Week 10
More on Manipulating Data
These SAS instructions produced the following listing in the log. 1 DATA BOOKS; 2 INFILE 'C:\SPH\HW6#3.DAT'; 3 INPUT NAME $ DAYS SUBJECT $15.; 4 PUT _INFILE_; 5 RUN; BARBARA 2 ENGLISH CAROL 2 SCIENCE CAROL 3 ENGLISH CAROL 4 MATH DONALD 1 ART JAMES 4 MATH JOYCE 5 HOME ECONOMICS JOYCE 6 SCIENCE MARY 2 MECHANICS PHILIP 1 HOME ECONOMICS PHILIP 3 MATH WILLIAM 2 SCIENCE NOTE: The infile 'C:\TEMP\HW4P3.DAT' is file C:\TEMP\HW4P3.DAT. NOTE: 12 records were read from the infile C:\TEMP\HW4P3.DAT. The minimum record length was 15. The maximum record length was 26. NOTE: The data set WORK.BOOKS has 12 observations and 3 variables. NOTE: The DATA statement used 6.00 seconds.
c. Data Quality Assessment – A Good Way to Write Out a Subset of Observations Rationale •
You may wish to identify subjects who have missing or invalid data for a particular variable, or those subjects with a particular combination of variable values.
•
One way to get such a list is to make a subset of the data for observations with the characteristic of interest, and then to list the data using PROC PRINT. This can require many different data steps plus a procedure step.
•
Alternatively, the list of problem observations can be written to the output file or to the log using PUT statements in the course of a DATA step, as in the example below.
week 10
10.6
Week 10
More on Manipulating Data
Example •
In this example, two uses of the PUT statement are illustrated.
•
In the first, the variables to be written to the log are named with an equal sign ( PUT VARNAME= ). When this form is used the ‘variable name = value’ is printed to the log (VARNAME=value). In the 2nd example, a mixture of text and variable values are used to write information to the log: the
•
PUT statement says to (1) write the value of the variable NAME , (2) write the text enclosed in quotes, (3) the value of the variable WEIGHT, and (4) the text enclosed in quotes. ** use of PUT statement to write to the log **; DATA CLASS; INFILE 'C:\TEMP\PUT2.DAT'; INPUT NAME $ SEX $ AGE HEIGHT WEIGHT; * *
list students with invalid weight *; and change weight to missing *; IF WEIGHT > 200 THEN DO; PUT NAME= WEIGHT= ; PUT NAME 'has invalid weight of ' WEIGHT 'lbs.'; WEIGHT=.; END;
RUN;
Following is the SAS log that was produced. Recall that the PUT statement writes to the LOG by default, unless preceded by a FILE statement. The lines in the SAS log that were produced by the PUT statements are highlighted here. 1 2 3 4 5 6 7 8 9 10 11 12
week 10
DATA CLASS; INFILE 'C:\SPH\691F\PUT2.DAT'; INPUT NAME $ SEX $ AGE HEIGHT WEIGHT; * list students with invalid weight *; * and changes weight to missing *; IF WEIGHT > 200 THEN DO; PUT NAME= WEIGHT= ; PUT NAME 'has invalid weight of ' WEIGHT 'lbs.'; WEIGHT=.; END; RUN;
10.7
Week 10
More on Manipulating Data
NAME=JOHN WEIGHT=995 JOHN has invalid weight of 995 lbs. NOTE: The infile 'C:\TEMP\PUT2.DAT' is file C:\TEMP\PUT2.DAT. NOTE: 19 records were read from the infile C:\TEMP\PUT2.DAT. The minimum record length was 21. The maximum record length was 23. NOTE: The data set WORK.CLASS has 19 observations and 5 variables.
week 10
10.8
Week 10
d
More on Manipulating Data
How to Output Data Using Column Format
Rationale •
It may be that you want to write a list of data to the SAS output file so that it accompanies descriptions or analysis results.
•
Such SAS output files might be descriptions, analysis results, or the reporting of feedback information.
Guidelines •
The PUT statement is analogous to an INPUT statement in this case.
•
All the options available for reading data with INPUT statements can be used for writing data with PUT statements.
Example ** writing data to output window **; DATA CLASS; INFILE 'C:\TEMP\PUT2.DAT'; INPUT NAME $ SEX $ AGE HEIGHT WEIGHT; *
route result of PUT statement to output window *; FILE PRINT;
*
list name, age, sex for females only *; IF SEX='F' THEN PUT NAME 1-8 SEX 10 AGE 12-13; TITLE1 ‘LISTING OF GIRLS IN CLASS:’; RUN;
This would appear in the output window: LISTING OF GIRLS IN CLASS: ALICE BARBARA CAROL
week 10
F 13 F 13 F 14
10.9
Week 10 JANE JANET JOYCE JUDY LOUISE MARY
week 10
More on Manipulating Data F F F F F F
12 15 11 14 12 15
10.10
Week 10
e.
More on Manipulating Data
How to Use the PUT Instruction to Assign Sequential Identification Values Plus Some Bells
and Whistles
Rationale •
A PUT statement can be used to create a header page for surveys or other data collection forms, each pre-printed with a sequential ID number.
•
Tip - Such pre-numbering of forms can prevent assignment of duplicate ID numbers.
Example •
In the example that follows a header page is created for a chart review study that requires reviewing mother’s and infant’s charts.
•
A sequential ID is printed on each form.
•
The form is used to record the names and medical record numbers of the mother and infant, and a place is given for recording the date of chart review, reviewer’s name, as well as date of data entry, and enterer’s name.
This example also illustrates some bells and whistles: •
A DATA _NULL_ statement is used since no data set is to be created
•
A DO loop is used to create the sequential ID numbers -- in this example one hundred forms are created, numbered from 101 to 200.
•
For each form a new page is started, using the PUT _PAGE_ command. Then instructions for the form specifics are given.
•
week 10
The @ symbol is used to define the starting column, just as in an INPUT statement.
10.11
Week 10
More on Manipulating Data
•
The # symbol is used to define a specific line number on the page.
•
Text to be printed, including lines to be drawn must be enclosed in quotes. Use double quotes if the text contains a single quote.
•
The forward slash (/) indicates move to the next line.
• See the SAS Language Guide or SAS HELP, section on PUT statements for more details on line and pointer controls. Note that there is one very long PUT statement that ends with a semi-colon. This PUT statement is listed within a DO loop that creates the sequential ID numbers. **********************************************; ** example of creating sequential ID number **; ** and cover page for data collection form **; **********************************************; OPTIONS nodate nonumber nocenter ls=78 ps=60; DATA _NULL_; * start data step-no file created; FILE 'C:\temp\cover1.lst'; * names file to store results; DO CASEID = 101 TO 200; * creates forms 101 to 200; PUT _PAGE_; * starts new page for each new id; PUT / / / @10 ' _______________________________________ ' / @10 '| |' / @10 '| BAYSTATE MEDICAL CENTER |' / @10 '| FETAL MACROSOMIA STUDY |' / @10 '| |' / @10 '|_______________________________________|' / / @10 'STUDY ID = ' @22 CASEID / / @10 "MOTHER'S NAME: __________________________________ " / @26 '(last, first)' / / @10 "MOHTER'S MEDICAL RECORD NUMBER: __ __ __ __ __ __ " / / @10 'INFANT NAME: ___________________________________ ' / @26 '(last, first)' / /
week 10
10.12
Week 10
More on Manipulating Data
@10 'INFANT MEDICAL RECORD NUMBER: __ __ __ __ __ __ #45 @10 "MOTHER'S CHART REVIEWED BY: ____________________ / @10 'INFANT CHART REVIEWED BY: ____________________ / @10 'FORMS CODED BY: _________________________________ / @10 'DATA ENTERED BY: ________________________________ ; *end of put statement; END; *end of do loop; RUN;
' / on ___ / ___ / ___" / on
___ / ___ / ___' /
on
___ / ___ / ___' /
on
___ / ___ / ___'
Following is the first page written by the above program: _______________________________________ | | | BAYSTATE MEDICAL CENTER | | FETAL MACROSOMIA STUDY | | | |_______________________________________| STUDY ID =
101
MOTHER'S NAME: __________________________________ (last, first) MOHTER'S MEDICAL RECORD NUMBER: __ __ __ __ __ __ INFANT NAME:
___________________________________ (last, first)
INFANT MEDICAL RECORD NUMBER:
week 10
__ __ __ __ __ __
10.13
Week 10
More on Manipulating Data
MOTHER'S CHART REVIEWED BY: _________________ on ___ / ___ / ___ INFANT CHART REVIEWED BY:
__________________ on ___ / ___ / ___
FORMS CODED BY: _____________________________ on ___ / ___ / ___ DATA ENTERED BY: ____________________________ on ___ / ___ / ___ Much more complicated and detailed forms can be generated with PUT statements. This can be particularly useful in follow-up surveys, where a form for each subject in the baseline survey can be preprinted with name, address phone number, ID numbers, etc. Any useful information from the baseline survey can be incorporated and printed using a PUT statement. To do this, the DATA _NULL_; statement would be followed by a SET statement, naming the source data file containing the baseline data, before the DO loop and PUT statement.
week 10
10.14
Week 10
More on Manipulating Data
2 Dates and Times in SAS
SAS stores dates internally using numbers that are related to the date January 1, 1960 • In SAS, a date is recorded as the number of days from January 1, 1960, with Jan 1, 1960 having a value of 0. • Dates after 01/01/1960 have positive values, and dates before this date have negative values. • If you fail to assign a format to a date using a format statement, the number printed will be the number of days since January 1, 1960 – not a particularly useful piece of data to look at – so assign formats to dates!. • A record length of 5 is sufficient for all recent dates (dates after 778 BC) represented as month, day, year since there are at most 5 significant digits. This record length also is sufficient for negative SAS dates.
a. How to Use INFORMATS/FORMATS to Read/Write Dates •
We read dates recorded as month, day, and year. SAS doesn’t know how to do this.
• Therefore, special DATE informats exist to translate this specification of the date to the SAS numeric representation. A • n informat works in a manner similar to a format – it provides instruction for reading values, rather than writing values. • Similarly, since the SAS representation of a date as the number of days from 01/01/1960 is not readily recognized as a month, day and year, formats exit to convert SAS numeric dates to month, day, year format for printing.
week 10
10.15
Week 10
More on Manipulating Data
Example • Data in the following example includes information on birth date, date of surgery and date of postoperative testing. • The dates in this file are in MMDDYY format, in this case with a slash (/) to separate month, day, and year. • The subject birth dates use 10 columns (MMDDYY10. format), with year taking 4 columns • The other dates use 8 columns with year taking 2 columns, or MMDDYY8. format. • Note the INFORMAT statement used before the INPUT statement to define the date formats. • Alternatively, without an informat statement the format for reading the dates can be indicated directly on the input statement, as in
INPUT CASEID @5 DOB MMDDYY10. @16 SURDATE MMDDYY8. @25 TDATE MMDDYY8.;
• In the program that follows, copies of each of the dates are made under new names so that several of the different formats for printing dates can be illustrated. • Note that date formats are always available for use in SAS, without running PROC FORMAT. • However, formats must be assigned to date variables in a FORMAT statement in a data or proc step, like any other formats. • A format should always be assigned to DATE variables in SAS for printing, since looking at the number of days since 01JAN60 is not usually very informative.
week 10
10.16
Week 10
More on Manipulating Data
OPTIONS PAGESIZE=60 LINESIZE=128 nodate nonumber nocenter; ***************************************************; ** read in data using MMDDYY. format for dates **; ***************************************************; DATA DATES; INFILE 'C:\temp\DATE1.DAT'; * how to read dates *; INFORMAT DOB MMDDYY10. SURDATE TDATE MMDDYY8.; INPUT CASEID DOB SURDATE TDATE; ** make copies of dates for demonstrating use of formats**; DOB1=DOB; SURDATE1=SURDATE; TDATE1=TDATE; ** compute days from surgery to testing, ** ** and patient age at surgery **; DAYS = TDATE - SURDATE; AGE = INT((SURDATE - DOB) / 365.25); /* the INT function rounds to integer */ ** **
assign formats to dates **; no format for TDATE1 **; FORMAT DOB MMDDYY10. DOB1 DATE7. SURDATE WORDDATE12. SURDATE1 WEEKDATE23. TDATE WORDDATX12.; RUN; * print results to illustrate formats *; PROC PRINT DATA=DATES; ID CASEID; VAR DOB DOB1 SURDATE SURDATE1 AGE; TITLE1 'LIST 1: EXAMPLE OF DATE FORMATS'; RUN; PROC PRINT DATA=DATES; ID CASEID; VAR SURDATE TDATE TDATE1 DAYS; TITLE1 'LIST 2: EXAMPLE OF DATE FORMATS'; TITLE2 'UNFORMATTED TDATE1 LISTED AS DAYS FROM JAN 1, 1960'; RUN;
week 10
10.17
Week 10
More on Manipulating Data
The output follows. LIST 1: CASEID 101 102 103 104 105 106
EXAMPLE OF DATE FORMATS DOB
DOB1
12/03/1917 03/18/1910 09/21/1930 08/27/1924 10/11/1938 06/14/1919
SURDATE
03DEC17 18MAR10 21SEP30 27AUG24 11OCT38 14JUN19
Jul Jul Jul Jul Jul Jul
17, 19, 22, 26, 29, 30,
SURDATE1
1991 1991 1991 1991 1991 1991
Wednesday, Friday, Monday, Friday, Monday, Tuesday,
Jul Jul Jul Jul Jul Jul
AGE 17, 19, 22, 26, 29, 30,
1991 1991 1991 1991 1991 1991
73 81 60 66 52 72
LIST 2: EXAMPLE OF DATE FORMATS UNFORMATTED TDATE1 LISTED AS DAYS FROM JAN 1, 1960 CASEID 101 102 103 104 105 106
SURDATE Jul Jul Jul Jul Jul Jul
17, 19, 22, 26, 29, 30,
1991 1991 1991 1991 1991 1991
TDATE 4 5 5 13 15 15
Aug Aug Aug Aug Aug Aug
1991 1991 1991 1991 1991 1991
TDATE1
DAYS
11538 11539 11539 11547 11549 11549
18 17 14 18 17 16
b. How to Compute Intervals between Dates
SAS’ system for internal storage of dates makes straightforward computations involving dates. • The advantage of storing dates as a number of days from some reference point lies in the ease of
computation of differences, or durations. • In the above example, the number of days between surgery and post-operative testing can be
computed as a simple difference: DAYS = TDATE - SURDATE;
• The result of a difference between two dates is always in units of days. • If we wanted the difference in weeks this could be computed as:
week 10
10.18
Week 10
More on Manipulating Data WEEKS = (TDATE - SURDATE) / 7;
• In the same manner, AGE (in years) at surgery can be computed as the number of days between surgery date and birth date, divided by 365.25: AGE = (SURDATE-DOB)/365.25;
• In the example in the program above the INT (integer) function was used to round down to the integer part – this is discussed in the section on SAS functions below.
Occasionally, you may need to be mindful of the Y2K problem in working with dates TIP - It’s a good idea to use 4 digits for year whenever possible when reading and writing dates to avoid confusion. In the above example, the birth dates were listed with 4 digit years for this reason. When working with birthdates of subjects in the early 1900’s, dates will be read incorrectly as early 2000’s, if only the final 2 digits of the year are given. These can be corrected using SAS Functions (see following, 3. Introduction to SAS Functions).
TIME and DATETIME formats and informats are available for use in SAS. These can be used when times, and time differences are of interest. An example is given in the section on date and time functions, which follows.
c How to Work with a Date as a Constant • Constant values for dates, times, and datetimes can also be referred to using the DATE. , TIME. or DATETIME. formats.
week 10
10.19
Week 10
More on Manipulating Data
• For example given information on birth date (DOB), to compute a subject's age in years at a specific date, for example Jan 1, 1992, you could use the following statement: AGE1992 = ('01JAN1992'D - DOB)/365.25;
• The date constant must be in DATE9. format (days in 2 digits, month in 3 letter abbreviation, and years in 4 digits) enclosed in single quotes, and followed by the letter d (d or D works), no spaces.
Similar to dates, TIME and DATETIME constants can be defined. See the section on date, time and datetime constants in the chapter on SAS expressions in the SAS Language Guide for more details.
d The TODAY and DATE Functions
The TODAY( ) Function The statement:
TDATE = TODAY();
creates the variable TDATE with the current date on the computer as the value. The word today is followed by empty parentheses.
Example – Computing a subject’s current age: CURR_AGE = (TODAY() - DOB)/365.25;
week 10
10.20
Week 10
More on Manipulating Data
The DATE( ) Function The DATE( ) function gives the same result as the TODAY( ) function. TDATE = DATE();
week 10
10.21
Week 10
3.
More on Manipulating Data
Introduction to SAS Functions
Many special functions for computing and manipulating numeric and character data are available for use in the SAS system. a.
Overview of SAS Functions
The functions are divided into 13 categories: Category
Functions in SAS
Arithmetic
ABSolute value, MIN, MAX, MODulus, SIGN, and others
Truncation
CEIL (smallest integer), FLOOR (largest integer), ROUND, INTeger and others
Mathematical
EXP (power e), LOG (natural log), LOG10 (log base 10)
Trigonometric
ARCOS, ARSIN, ATAN, COS, SIN, SINH, TAN, TANH
Probability
POISSON, PROBBETA, PROBBNML, PROBCHI, PROBF, PROBT, PROBNORM, and others
Quantiles
CINV (chi-square), FINV, GAMINV, PROBIT (inverse of normal), TINV
Sample Statistics
N (# non-miss), MEAN, CV, KURTOSIS, NMISS, RANGE, STD, VAR, STDERR, SUM and others
Random Number
NORMAL, RANBINomial, RANEXP, RANNOR, RANPOI, UNIFORM, RANUNI and others
Financial
COMPOUND, MORT, SAVING, and others
Character
BYTE, LEFT align, LENGTH, SCAN, TRIM, UPCASE, and others
Date and time
DATE (today's date), DATEJUL (Julian to SAS), MDY, MONTH, DAY, YEAR, others
State and Zip code
ZIPSTATE, FIPNAME and others
Special functions
DIFn (difference in lags), INPUT, LAGn, PUT, SOUND, SYMGET For details, see SAS Language Guide.
week 10
10.22
Week 10
More on Manipulating Data
• Functions can provide a simple way of computing complex formulae, such as a variance or probability function. • TIP: Many of the functions handle missing values differently than when computing with expressions. It is important to pay attention to this detail when choosing to compute using SAS functions. Examples of use of functions in a few key categories follow.
b. Using SAS to Manipulate Missing Values Suppose you want to compute the mean of 3 scores for a subject (MEAN function) There’s more than one approach. • This can be done by defining an expression to sum the three scores and divide by 3. • Alternatively, a mean can be computed by naming the three variables in the MEAN function. • TIP - The result is not the same because missing values are handled differently in the two
approaches. Using the first method, the mean score has a missing value if any one of the three scores is missing. Using the MEAN function, the mean score has a value if at least one of the three scores has a non-missing value, as illustrated below:
** EXAMPLE TO SHOW HOW MEAN FUNCTION **; ** HANDLES MISSING VALUES **; DATA EX1; INPUT SCORE1 SCORE2 SCORE3; M1 = (SCORE1 + SCORE2 + SCORE3)/3; M2 = MEAN(SCORE1,SCORE2,SCORE3); CARDS; 1 2 3 . 2 3 1 . 3
week 10
* mean of all 3 scores; * mean of non-missing ;
10.23
Week 10
More on Manipulating Data
. . 3 . . . ; RUN; PROC PRINT DATA=EX1; VAR SCORE1-SCORE3 M1 M2; TITLE1 'MEAN SCORES COMPUTED TWO WAYS'; RUN;
How to Use the MEAN Function. • To use the mean function, the word MEAN is followed by a variable list enclosed in parentheses. • The variable names must be separated by commas, no spaces. • If the variables names all have the same prefix with sequential numbering, an alternative hyphenated list form is available: M2 = MEAN(SCORE1,SCORE2,SCORE3);
or M2 = MEAN(OF SCORE1-SCORE3);
Results of the above program follow. Note that the expression only computes a mean when all values are non-missing, while the MEAN function computes a mean if at least one value is present.
week 10
10.24
Week 10
More on Manipulating Data
MEAN SCORES COMPUTED TWO WAYS OBS 1 2 3 4 5
SCORE1 1 . 1 . .
SCORE2 2 2 . . .
SCORE3 3 3 3 3 .
M1 2 . . . .
M2 2.0 2.5 2.0 3.0 .
Tip – Have a look in the The SAS Language Guide on functions. You may find some kind of computational shortcut by using functions. Example - Computing a mean score for subjects with at least 11 of 14 subscores (the N function, MEAN function) • As part of a study on smoking cessation, subjects were given a 14-question form used to rate a patient on self-efficacy in quitting. • For each question a score of 1-7 was possible. High scores were expected to correlate well with efficacy in quitting. • Investigators were interested in using the mean of the 14 scores as a predictor of quit status at the follow-up periods. • However many subjects had skipped questions on the form so that their data were incomplete. • The study investigators decided that if at least 11 of the 14 scores were available, the mean would be a valid measure of efficacy. • The following program computes a mean for subjects with at least 11 scores, creating a missing value for the others. It makes use of the N function to find the number of non-missing values, and the MEAN function.
week 10
10.25
Week 10
More on Manipulating Data
DATA EFF2; SET EFF1; * find # of non-missing scores using N function *; NSCORE = N(OF SCORE1-SCORE14); * compute mean of available scores when have 11+ *; * using MEAN function *; IF NSCORE>=11 THEN MSCORE=MEAN(OF SCORE1-SCORE14); RUN;
In this example two variables were created: NSCORE the number of non-missing scores, and MSCORE, the mean of the non-missing scores, when 11 or more scores were present.
week 10
10.26
Week 10
c
More on Manipulating Data
How to Compute Age in Completed Years( the INT Function)
• Age is typically reported in completed years. • When computing age at some event from the date of birth and the event date, it is common to want whole numbers only for age in years. • Rounding off, or formatting the age to 0 decimal places would result in rounding up for fractional parts over .5, e.g., 8.67 rounds up to 9 years rather than 8 years, which is not how we usually think of age. • In this case the integer function INT, can be used, to select the integer part of the computed age: AGE = INT( (SURGDATE - DOB)/365.25 );
d.
How to Extract Date and Time Information
• Date and Time functions can be used to create DATE, TIME and DATETIME variables from other values, or to abstract information from DATE and TIME values, such as the day, month or year from a date value. • For example in the Wesson Women's Time Study, dates and times were entered as separate values for day, month, hour and minutes. • The following program gives an example of how these can be converted into DATETIME values, so that the time spent in a unit can be computed as the difference between two DATETIME values. • The example makes use of the MDY function to create a date value from the month, day and year of the study, 1991, and then uses the DHMS function to create a DATETIME value from the Date, Hours, Minutes, and (zero) Seconds.
week 10
10.27
Week 10
More on Manipulating Data
• This example computes the time in hours spent in WETU, the Women's Evaluation and Treatment Unit. • Note that when a time difference is computed from a stored DATETIME value, the result is in units of seconds. OPTIONS PAGESIZE=60 LINESIZE=78; LIBNAME D 'D:\WWTIME'; *************************************; ** COMPUTE TIME SPENT IN WETU **; ** AS DIFFERENCE BETWEEN TRANSFER **; ** AND ARRIVAL TIME **; ** USING WWTIME DATA **; *************************************; DATA WETU(KEEP=WETUTIME W_TRANDT W_ARR_DT); SET D.WWTIME1(OBS=10); * reads only first 10 obs; * compute date from month,day,year *; WARRDATE=MDY(W_ARRMON,W_ARRDAY,1991); WTRADATE=MDY(W_TRAMON,W_TRADAY,1991);
* arrival date; * transfer date;
* compute datetime from date, hours, mins, and secs=0 *; * arrival datetime; W_ARR_DT=DHMS(WARRDATE,W_ARRHRS,W_ARRMIN,0); * transfer datetime; W_TRANDT=DHMS(WTRADATE,W_TRAHRS,W_TRAMIN,0); * compute time spent in WETU as difference adjusted to hrs *; WETUTIME = (W_TRANDT - W_ARR_DT) / 3600; RUN; ** print data assigning datetime formats to transfer **; ** and arrival times, and allowing 6 columns, 2 dec **; ** for printing the time difference in hours **; PROC PRINT DATA=WETU; FORMAT W_TRANDT W_ARR_DT DATETIME13. WETUTIME 6.2; TITLE1 'DURATION, IN HOURS, SPENT IN WETU'; RUN;
week 10
10.28
Week 10
More on Manipulating Data
The resulting output follows. DURATION, IN HOURS, SPENT IN WETU OBS
W_ARR_DT
W_TRANDT
1 2 3 4 5 6 7 8 9 10
09FEB91:06:35 07FEB91:11:25 07FEB91:09:20 07FEB91:05:30 05FEB91:04:30 10FEB91:02:45 06FEB91:18:20 10FEB91:08:50 10FEB91:01:45 08FEB91:08:58
09FEB91:07:28 07FEB91:12:45 07FEB91:10:45 07FEB91:06:35 05FEB91:05:50 10FEB91:04:00 06FEB91:19:35 10FEB91:09:20 10FEB91:03:30 08FEB91:10:53
WETUTIME 0.88 1.33 1.42 1.08 1.33 1.25 1.25 0.50 1.75 1.92
• Note that the time spent in the Women’s Evaluation and Treatment Unit, WETUTIME, is in units of hours, with decimal fractions of hours. • For example 1.25 means one-and-one-quarter hours, or 1 hour and 15 minutes, and 0.50 means a half-hour, or 30 minutes. • SAS TIME formats are meant for use with clock time representations, not durations. If you want these durations printed in hours and minutes, you would need to compute the minutes from the fractional hours, and print these as two variables. The following programming statements could be used: * complete hours as integer part; WETUHRS = INT(WETUTIME); * fractional hours times 60 min/hr ; WETUMIN = (WETUTIME - WETUHRS) * 60;
week 10
10.29
Week 10
4
More on Manipulating Data
How to Select a Simple Random Sample without Replacement: Using Random Number Generating Functions
Rationale – You may have used SAS to create a sampling frame (list of all members of population) and now wish to select a simple random sample from the list. Random number generators enable the sample to be selected. An illustration of a method for selecting a simple random sample without replacement from a population of 10 subjects (with one record per subject) follows. Suppose the subjects are listed in a file as shown below. ID 1 2 3 4 5 6 7 8 9 10
Age 23 29 34 19 29 15 32 32 31 28
Sex M M F F F F M F M M
Using the Uniform ( ) Function to Select a Simple Random Sample Without Replacement Assume we wish to select a simple random sample without replacement of 4 subjects from this list. We can use the Uniform random number generator, which generates a random digit between 0 and 1, to do the selection with the following program.
week 10
10.30
Week 10
More on Manipulating Data
*** selecting a simple random sample ***; *** without replacement ***; libname old 'c:\temp\'; data frame1; input sid age sex $1. @@; cards; 1 23 M 2 29 M 3 34 F 4 19 F 5 29 F 6 15 F 7 32 M 8 32 F 9 31 M 10 28 M ; run; data frame2a; set frame1; retain n 10 k 4; *set starting pop=n and sample size=k ; sample1=0; * set indicator of in sample to 0; prob=k/n; * set probability of selection; rn=uniform(0); * select random number in (0,1); if rn < prob then do; * if rn in (0,rn) ; sample1=1; * set indicator to 1 ; k=k-1; * decrease number to select ; end; n=n-1; * decrease remaining pop size ; run; proc print data=frame2a; var sid age sex sample1 rn prob n k; title1 'Listing after Sample Selection in SAS data set FRAME2a'; run;
• Data are read in on one line using the trailing @@ to hold the line. • The sample is selected in the data set FRAME2A. • The RETAIN statement initializes the total size of the population (n) and the size of the sample (k) that is to be selected. • These variables must be named in a RETAIN statement to keep track of the number of observations left in the population, along with the number of observations remaining to be selected into the sample. • As the sample is selected, the value from the previous observation is held unless a change is specified. week 10
10.31
Week 10
More on Manipulating Data
• The variable SAMPLE1 is an indicator variable with value of "0" for observations not in the sample, and value "1" for observations selected to be in the sample. When selecting a simple random sample without replacement, the proportion of observations to be selected changes as each observation is selected. • This proportion is given by PROB, the remaining number of observations to be selected divided by the remaining number of observations in the population. • To determine whether the particular observation in the data step is to be selected in the sample, a Uniform random number (RN) is selected and compared with PROB. • If the random number is less than PROB, then the observation is selected, and the values of "k" and "n" are both reduced by one. The indicator for the sample, SAMPLE1, is also set to one. If RN is bigger than PROB, only the population size "n" is reduced by one and the sampling continues. A listing of the resulting data follows: Listing after Sample Selection in SAS data set FRAME2a OBS 1 2 3 4 5 6 7 8 9 10
week 10
SID
AGE
SEX
SAMPLE1
RN
PROB
N
K
1 2 3 4 5 6 7 8 9 10
23 29 34 19 29 15 32 32 31 28
M M F F F F M F M M
0 1 0 1 0 0 1 1 0 0
0.75321 0.34198 0.50989 0.41783 0.36324 0.51605 0.02004 0.32841 0.62433 0.59228
0.40000 0.44444 0.37500 0.42857 0.33333 0.40000 0.50000 0.33333 0.00000 0.00000
9 8 7 6 5 4 3 2 1 0
4 3 3 2 2 2 1 0 0 0
10.32
Week 10
More on Manipulating Data
How the Uniform Random Number Generator Uniform(0) Works • The uniform random number generator is a SAS function that has "0" as an argument in the previous program. • The "0" is a special code that indicates that the uniform random number sequence should start based on the date and time that the program is run. • Re-running the program will produce a different set of Uniform random numbers. • Although independent samples should be based on a different set of random numbers, it may be desirable to be able to duplicate the sample selection. This can be done by specifying a 5-7 digit odd integer as the argument in the UNIFORM function. • This is illustrated this below, with the selection of two samples, where a common starting point – seed – for the random number sequence is specified. *** select 2 samples using same SEED ***; data frame2b; retain n2 10 k2 4 seed 1823925; set frame1; sample2=0; prob2=k2/n2; rn2=uniform(seed); if rn2 < prob2 then do; sample2=1; k2=k2-1; end; n2=n2-1; run; *** repeat with same SEED ***; data frame3; retain n3 10 k3 4 seed 1823925; set frame1; sample3=0; prob3=k3/n3; rn3=uniform(seed); if rn3 < prob3 then do; sample3=1;
week 10
10.33
Week 10
More on Manipulating Data
k3=k3-1; end; n3=n3-1; run; ** merge 2 samples by SID and print results **; data all; merge frame2 frame3; by sid; run; proc print data=all; id sid; var sid age sex sample2 rn2 prob2 n2 k2 sample3 rn3 prob3 n3 k3; title1 'Listing after Sample Selection FRAME2-3 SEED=1823925'; run;
week 10
10.34
Week 10
More on Manipulating Data
Listing after Sample Selection FRAME2-3 SEED=1823925 ID AGE SEX SAMPLE2 RN2 1 23 M 0 0.85516 2 29 M 1 0.06069 3 34 F 0 0.54748 4 19 F 0 0.67427 5 29 F 0 0.98183 6 15 F 1 0.19015 7 32 M 1 0.28839 8 32 F 0 0.82554 9 31 M 0 0.56925 10 28 M 1 0.39763
PROB2 N2 K2 SAMPLE3 RN3 PROB3 N3 K3 0.40000 9 4 0 0.85516 0.40000 9 4 0.44444 8 3 1 0.06069 0.44444 8 3 0.37500 7 3 0 0.54748 0.37500 7 3 0.42857 6 3 0 0.67427 0.42857 6 3 0.50000 5 3 0 0.98183 0.50000 5 3 0.60000 4 2 1 0.19015 0.60000 4 2 0.50000 3 1 1 0.28839 0.50000 3 1 0.33333 2 1 0 0.82554 0.33333 2 1 0.50000 1 1 0 0.56925 0.50000 1 1 1.00000 0 0 1 0.39763 1.00000 0 0
Notice that the sample random numbers were generated (RN2 = RN3) when the same ‘seed’ or starting value for the random number generator is used, resulting in selection of the same sample.
week 10
10.35
Week 10
5.
More on Manipulating Data
Introduction to SAS Character Functions • There is an array of SAS functions designed for working with character data. • Tip - Character data can be tricky to work with, since capitalization, spelling, spacing and punctuation differences will all cause values to be read differently. • Many of the character functions can be used to modify values, selectively read parts of character values, or search values for certain character strings. A few helpful examples follow.
a
How to Convert to Upper Case (UPCASE)
Rationale and Example –
• At the time of data entry, use of capitalization may have been inconsistent. As a result, some observations that should be matching will not be read as having the same value. • For example, if data for sex has inconsistently been entered as F, f, M, or m, then a frequency table would indicate 4 different values for sex. Use of the statement: SEX2 = UPCASE(SEX);
would convert all values to uppercase, so only 2 distinct values would appear in the new variable SEX2, F and M. This could also be accomplished with the statements: IF SEX =’f’ THEN SEX =’F’; IF SEX =’m’ THEN SEX =’M’;
• However, when names, addresses or other character data with more than just a few possible values are to be converted, the function would obviously be the preferred method.
week 10
10.36
Week 10
More on Manipulating Data
b How to Remove Spaces and Special Characters (COMPRESS) Rationale – • When spacing or punctuation has been inconsistently used, values will not match. • The Compress function allows you to remove spaces or specified characters. • For example, when names have been entered in one place as ‘Last, First’ and in another as ‘Last First’ without a comma, we might wish to remove the commas. NAME2 = COMPRESS(NAME,’,’);
would create a new variable NAME2 with all commas removed.
Guidelines • To use the Compress function, the form COMPRESS(arg1,arg2); is used, where arg1 is the variable to be compressed, and arg2 specifies the characters to be removed. • If you are listing specific characters, then these must be named with single quotes (or double quotes if you want to remove a single quote character). • Tip - Do not use delimiters (e.g., spaces or commas) between characters to remove, or the space or comma will be considered as characters to remove. For example, NAME2 = COMPRESS(NAME,’ nd
.,’); will remove spaces, periods and commas. If no 2
argument (arg2) is specified, by default
all spaces will be removed from the original value.
week 10
10.37
Week 10
c
More on Manipulating Data
How to Align Character Data (LEFT and RIGHT)
Rationale • Inconsistent data entry, or data derived from different sources can leave you with values that match, except for inconsistent spacing. • The LEFT and RIGHT functions simply move values to be left or right aligned. LEFT removes blanks from the start of a value, and RIGHT moves blanks from the end to the start. For example, if values for sex have been entered as ′Female′ or ′Male
′
Male′, the last 2 values will be read as
or ′
different values, since the spacing or alignment differs. Using the statement: SEX2 = LEFT(SEX);
would left align all values, or change ′
Male′ to ′Male
′.
d How to Extract a String of Characters (SUBSTR) Rationale • This is a great function! When information is encoded within a variable, for example, with an informative ID variable, reading a part of the value becomes of interest. • The substring function (SUBSTR) can be used for this purpose. • In the Anaconda Soil Ingestion Study, informative ID numbers for samples (SID), of the form ‘AnnSdd’ were used, where the first character (A) indicated the particular study, nn the child ID number, S the sample type (F=fecal, N=food), and dd the study day, 01-07. • Sample analysis results were output in electronic form from the laboratory equipment used for analysis. It was necessary to extract information from the sample ID number to determine which child, and which sample type. The following statement will do this:
week 10
10.38
Week 10
More on Manipulating Data
CHILDID = SUBSTR(SID,1,3); STYPE = SUBSTR(SID,4,1);
• The CHILDID is read from the sample ID (SID) starting in position 1, reading for 3 characters. • The sample type (STYPE) is read from SID, starting in position 4, reading 1 character. For example, if SID=’A64N02’ then CHILDID=’A64’ and STYPE=’N’. Guidelines To use SUBSTR(arg1,arg2,arg3), • arg1 names the variable to be read, • arg2 names the starting position, and • arg3 names the number of characters to read.
e
How to Validate a Character Value (VERIFY)
Rationale Verify can be used to check that all values in your data fit a limited set of acceptable values.
How to Use the VERIFY Function (Tip - It is NOT a logical operator) For example, suppose you have data, stored as character codes (CODE), which should take on only the values 0, 1, 2, 3 or 9, the following statements can be used to check for invalid data: CHK = ‘01239’; VCODE = VERIFY(CODE,CHK); IF VCODE NE 0 THEN PUT ‘INVALID CODE: ‘ CODE=
week 10
SID=;
10.39
Week 10
More on Manipulating Data
• The value of VCODE will be 0 if the value of CODE contains only 0,1,2,3, or 9, but will return the first column number where any other value is found. • The IF-THEN statement will write a message to the log with the invalid code and subject ID, when VCODE is not equal to 0.
week 10
10.40