Accessing Data
Last Updated : 29 June, 2004
Center of Excellence
Agenda Creating datasets using DATA step. Infile statement Different input styles. Combining datasets using DATA step
What Is the SAS System? The SAS System is an integrated system of software products that enables you to perform data entry, retrieval, and management report writing and graphics statistical and mathematical analysis business planning, forecasting, and decision support operations research and project management quality improvement applications development.
What Is the SAS System? In addition, you can integrate with SAS many SAS business solutions that enable you to perform large scale business functions, such as Data Warehousing and Data Mining Human Resources Management and Decision Support Financial Management and Decision Support
Overview of Base SAS Software The core of the SAS System is base SAS software SAS language a programming language that you use to manage your data.
SAS procedures software tools for data analysis and reporting.
Macro facility a tool for extending and customizing SAS software programs and for reducing text in your programs.
DATA step debugger a programming tool that helps you find logic problems in DATA step programs.
Output Delivery System (ODS) a system that delivers output in a variety of easy-toaccess formats, such as SAS data sets, listing files, or Hypertext Markup Language (HTML).
Base SAS data access management analysis presentation
Components of the SAS Language SAS Files Files with formats or structures known to SAS. All SAS files reside in a SAS data library. SAS data set is structured in a format that SAS can process.
SAS catalog Many different kinds of information that are used in a SAS job
are stored in SAS catalogs, such as instructions for reading and printing data values, or function key settings that you use in the SAS windowing environment.
SAS stored program contains compiled code that you create and save for repeated
use.
Components of the SAS Language SAS Data Sets SAS data file both describes and physically stores data values. descriptor portion
– describes the contents of the SAS data set to SAS. Data portion
– data that has been collected or calculated. – An observation is a collection of data values that usually relate to a single object. A variable is the set of data values that describe a given characteristic.
SAS data view does not actually store values but create logical SAS data
sets without using the storage space required by SAS data files.
Structure of SAS Data Sets
Descriptor Portion
SAS Data Set General Data Set Information Name Number of Obs. *Label Number of Variables *Date/Time Created Storage Information Information for Each Variable Type Length Position Name *Label *Format *Informat IDNUM
Data Portion
NAME
1351 Farr, Sue 161 212 Moore, Ron 2512 Ruth, G H ... ... ... 5151 Coxe, Susan
WAGECAT
WAGERATE
S S S S
3392.50 5093.75 . 1572.50
S
3163.00
Components of the SAS Language External Files Data files that you use to read and write data, but which are in a structure unknown to SAS. External files can be used for storing raw data that you want to read into a SAS data file SAS program statements procedure output
Database Management System Files SAS software is able to read and write data to and from
other vendors' software, such as many common database management system (DBMS) files. In addition to base SAS software, you must license the SAS/ACCESS software for your DBMS and operating environment.
Components of the SAS Language SAS Language Elements DATA step consists of a group of statements in the SAS language that reads
raw data or existing SAS data sets to create a SAS data set.
PROC step
A group of procedure statements used to analyze data in SAS data sets to produce statistics, tables, reports, charts, and plots, to create SQL queries, and to perform other analyses and operations on your data. They also provide ways to manage and print SAS files.
SAS Macro Facility
a powerful programming tool for extending and customizing your SAS programs, and for reducing the amount of code that you must enter to do common tasks. Macros are SAS files that contain compiled macro program statements and stored text.
What Can the DATA Step Do? You can use the DATA step in the following ways to transform your information: Read from a raw data file into the SAS system. SAS Data Set
Raw Data File
Descriptor
DATA Step
What Can the DATA Step Do? Create multiple SAS data sets in one DATA step.
DATA Step
What Can the DATA Step Do? Combine existing data sets. SAS Data Set 2
SAS Data Set 1
DATA Step
What Can the DATA Step Do? You can also add or augment information in a variety of ways. Create accumulating totals. SaleDate SaleDate
Sale Sale Amt Amt
Mth2Dte Mth2Dte
01APR2001 01APR2001 02APR2001 02APR2001 03APR2001 03APR2001 04APR2001 04APR2001 05APR2001 05APR2001
498.49 498.49 946.50 946.50 994.97 994.97 564.59 564.59 783.01 783.01
498.49 498.49 1444.99 1444.99 2439.96 2439.96 3004.55 3004.55 3787.56 3787.56
What Can the DATA Step Do? Manipulate numeric values.
BirthDay 4253 SAS Function
Age 30
What Can the DATA Step Do? Summarize data sets.
Salary Salary 42000 42000 34000 34000 27000 27000 20000 20000 19000 19000 19000 19000
Div Div HUMRES HUMRES FINACE FINACE FLTOPS FLTOPS FINACE FINACE FINACE FINACE FLTOPS FLTOPS
DATA Step
Div Div
DivSal DivSal
FINACE FINACE FLTOPS FLTOPS HUMRES HUMRES
42000 42000 46000 46000 73000 73000
What Can the DATA Step Do? presenting your data file management And much, much more……………
Words in the SAS Language A word or token in the SAS language is a collection of characters that communicates a meaning to SAS. A word or token ends when SAS encounters one of the following: the beginning of a new token a blank after a name or a number token the ending quotation mark of a literal token.
Each word or token in the SAS language classified into four categories. Names - a series of characters that begin with a letter or an underscore. Ex.: data, _old, yearcutoff, _n_, year_04, descending Literal - consists of 1 to 32,767 characters enclosed in single or double quotation marks ( ‘Bangalore’, “2003-04”, ‘ Wipro”s Plan’, "Report for the Third Quarter" )
Words in the SAS Language Number - in general is composed entirely of numeric digits, with an optional decimal point and a leading plus or minus sign. SAS also recognizes numeric values in the following forms as number tokens: scientific (E-) notation, hexadecimal notation, missing value symbols, and date and time literals. Ex: 1234, -2004, 1.25, 5.4E-1, ’30jun04'd
Special character - is usually any single keyboard character other than letters, numbers, the underscore, and the blank. In general, each special character is a single token, although some twocharacter operators, such as ** and <=, form single tokens. Ex: =, :, @, ‘, +, /
Names in the SAS Language A SAS name is a name token that represents
variables SAS data sets formats or informats SAS procedures options arrays statement labels SAS macros or macro variables SAS catalog entries librefs or filerefs.
There are two kinds of names in SAS. names of elements of the SAS language names supplied by SAS users.
Names in the SAS Language Rules for User-Supplied SAS Names
Members of SAS data libraries (SAS data sets, views, catalogs, indexes) Generation data sets Catalog entries Engines, Librefs, Filerefs, Passwords DATA step variables - 32 DATA step variable labels - 256 DATA step statement labels Arrays - 32 Functions - 16 Formats Informats -7 Macros, Macro variables - 32
- 32 - 28 - 32 -8
- 32
-8
Names in the SAS Language Rules for User-Supplied SAS Names The first character must be a letter (A, B, C, . . ., Z) or underscore (_). Subsequent characters can be letters, numeric digits (0, 1, . . ., 9), or underscores. You can use upper or lowercase letters. SAS processes names as uppercase regardless of how you type them. Blanks cannot appear in SAS names. SAS reserves a few names for automatic variables and variable lists, SAS data sets, and librefs. When creating variables, do not use the names of special SAS automatic variables (for
example, _N_ and _ERROR_) or special variable list names (for example, _CHARACTER_, _NUMERIC_, and _ALL_). When associating a libref with a SAS data library, do not use SASHELP, SASMSG, SASUSER, WORK . When you create SAS data sets, do not use _NULL_, _DATA_, _LAST_.
Names in the SAS Language Rules for User-Supplied SAS Names Special characters, except for the underscore, are not allowed. In filerefs only, you can use the dollar sign ($), pound sign (#), and at sign (@). When assigning a fileref to an external file, do not use: SASCAT. When you create a macro variable, do not use names that begin with SYS
SAS Dates SAS dates are special numeric values representing the number of days between January 1, 1960 and a specified date. 1jan1959
1jan1960
1jan1961
1jan2000
DATE9. Informat -365
0 SAS Date Values
366
14610
SAS Date Values
MMDDYY10. Format 01 01 01 01 /01 /01 /01 /01 /19 /19 /19 /20 59 60 61 00
Standard Data The term standard data refers to character and numeric data that SAS recognizes automatically. Some examples of standard numeric data include 35469.93 3E5 (exponential notation) -46859
Standard character data is any character you can type on your keyboard. Standard character values are always left-justified by SAS.
Nonstandard Data The term nonstandard data refers to character and numeric data that SAS does not recognize automatically. Examples of nonstandard numeric data include 12/12/2012 29FEB2000 4,242 $89,000
Create a SAS Data Set from a Raw Data File E1232 E2341 E3452 E6781 E8321 E1052 E1062 E8172 E1091
15OCT1999 01JUN1997 26OCT1993 16SEP1992 26NOV1996 27FEB1997 10MAY1987 06JAN2000 20AUG1991
raw data file 61065 A contains employee 91688 information for 32639 the level 1 flight 28305 attendants. Use the 40440 raw data file to 39461 41463 create the work.fltat1 40650 SAS data set. 40950
Desired Output Obs
EmpID
Hire Date
1 2 3 4 5 6 7 8 9
E1232 E2341 E3452 E6781 E8321 E1052 E1062 E8172 E1091
14532 13666 12352 11947 13479 13572 9991 14615 11554
Salary
Bonus
61065 91688 32639 28305 40440 39461 41463 40650 40950
3053.25 4584.40 1631.95 1415.25 2022.00 1973.05 2073.15 2032.50 2047.50
The DATA Statement A DATA step always begins with a DATA statement. General form of a DATA statement:
DATA DATASAS-data-set; SAS-data-set;
The DATA statement starts the DATA step and names the SAS data set being created.
The INFILE Statement If you are reading data from a raw data file, you need an INFILE statement. General form of an INFILE statement: INFILE INFILE'raw-data-file' 'raw-data-file' ; ;
The INFILE statement points to the raw data file being read. Options in the INFILE statement affect how SAS reads the raw data file.
The INPUT Statement When you are reading from a raw data file, the INPUT statement follows the INFILE statement. General form of an INPUT statement:
INPUT INPUT variable-specification variable-specification…; …;
The INPUT statement describes the raw data fields and specifies how you want them converted into SAS variables.
Formatted Input The input style tells SAS where to find the fields and how to read them into SAS. INPUT INPUT@n @nvariable-name variable-name informat. informat. ...; ...;
@n - moves the pointer to the starting point of the field. variable-name - names the SAS variable being created. Informat - specifies how many positions to read and how to convert the raw data into a SAS value.
The INPUT Statement Common SAS informats: $w. - reads a standard character field, where w specifies the width of the field in bytes. W - reads a standard numeric field, where w specifies the width of the field in bytes. DATE9. - reads dates in the form 31DEC2012.
The Assignment Statement To create a new variable in the DATA step, use an assignment statement:
variable-name=expression; variable-name=expression;
The assignment statement creates a SAS variable and specifies how to calculate that variable's value.
Create a SAS Data Set from a Raw Data File data work.fltat1; infile 'raw-data-file'; input @1 EmpID $5. @7 HireDate date9. @17 Salary 5.; Bonus=.05*Salary; run;
Create a SAS Data Set from a Raw Data File Partial Log NOTE: 9 records were read from the infile 'fltat1.dat'. The minimum record length was 21. The maximum record length was 21. NOTE: The data set WORK.FLTAT1 has 9 observations and 4 variables.
Overview of DATA Step Processing
Processing the DATA Step The SAS System processes the DATA step in two phases: compilation execution.
When you submit a DATA step for execution, SAS checks the syntax of the SAS statements and compiles them. During the compile phase, SAS creates the following three items input buffer is a logical area in memory into which SAS reads each
record of raw data when SAS executes an INPUT statement.
program data vector (PDV) is a logical area in memory where SAS builds a data set, one
observation at a time. When a program executes, SAS reads data values from the input buffer or creates them by executing SAS language statements.
DATA Step Processing The
data values are assigned to the appropriate variables in the program data vector. From here, SAS writes the values to a SAS data set as a single observation Along with data set variables and computed variables, the PDV contains two automatic variables, _N_ and _ERROR_. The _N_ variable counts the number of times the DATA step begins to iterate. The _ERROR_ variable signals the occurrence of an error caused by the data during execution.
descriptor information is information that SAS creates and maintains about
each SAS data set, including data set attributes and variable attributes. It contains, for example, the name of the data set and its member type, the date and time that the data set was created, and the number, names and data types (character or numeric) of the variables.
DATA Step Processing The flow of action in the Execution Phase of a simple DATA step The DATA step begins with a DATA statement. Each time the DATA statement executes, a new iteration of the DATA step begins, and the _N_ automatic variable is incremented by 1. SAS sets the newly created program variables to missing in the program data vector (PDV). SAS reads a data record from a raw data file into the input buffer, or it reads an observation from a SAS data set directly into the program data vector. You can use an INPUT, MERGE, SET, MODIFY, or UPDATE statement to read a record. SAS executes any subsequent programming statements for the current record.
DATA Step Processing At the end of the statements, an output, return, and reset occur automatically. SAS writes an observation to the SAS data set, the system automatically returns to the top of the DATA step, and the values of variables created by INPUT and assignment statements are reset to missing in the program data vector. Note that variables that you read with a SET, MERGE, MODIFY, or UPDATE statement are not reset to missing here. SAS counts another iteration, reads the next record or observation, and executes the subsequent programming statements for the current observation. The DATA step terminates when SAS encounters the end-of-file in a SAS data set or a raw data file.
DATA step Reading External File Data
data bonus_04; [1] infile 'your-input-file'; [2] input IDnumber name $ salary ; [3] bonus=salary * 0.25; [4] run; [5]
1- Begin the DATA step and create a SAS data set called bonus_04. 2- Specify the external file that contains your data. 3- Read a record and assign values to three variables. 4- Calculate a value for variable bonus. 5- Execute the DATA step.
DATA step Input Styles The INPUT statement reads raw data from instream data lines or external files into a SAS data set and input styles depending on the layout of data values in the records. INPUT, Formatted - Reads input values from specified columns and assigns them to the corresponding SAS variables INPUT, Column - Reads input values with specified informats and assigns them to the corresponding SAS variables INPUT, List - Scans the input data record for input values and assigns them to the corresponding SAS variables INPUT, Named - Reads data values that appear after a variable name that is followed by an equal sign and assigns them to corresponding SAS variables
DATA step An informat is an instruction that SAS uses to read data values into a variable. The INPUT statement with an informat after a variable name is the simplest way to read values into a variable. $w. DATEw.
Reads standard character data Reads date values in the form ddmmmyy or
ddmmmyyyy MMDDYYw.
-
Reads date values in the form mmddyy or
mmddyyyy w.d -
Reads standard numeric data COMMAw.d Removes embedded characters
Accessing Data – List input List input uses a scanning method for locating data values. Data values are must be separated by at least one blank (or other defined delimiter). List input requires only that you specify the variable names and a dollar sign ($), if defining a character variable.
Libname new “/wipro/dw/data”; data new.scores; length name $ 12; input name $ score1 score2; datalines; aaaaaa 1132 1187 bbbbbbbb 1015 1102 cccc 246 357 ; Run;
Modified List Input data scores; infile datalines dsd; input Name : $9. Score1-Score3 Team ~ $25. Div $; datalines; Smith,12,22,46,"Green Hornets, Atlanta",AAA Mitchel,23,19,25,"High Volts, Portland",AAA Jones,09,17,54,"Vulcans, Las Vegas",AA ; run; output Name Score1 Smith 12 22 Mitchel 23 Jones 09 17
Score2 Score3 Team Div 46 "Green Hornets, Atlanta“ AAA 19 25 "High Volts, Portland“ 54 "Vulcans, Las Vegas“ AA
AAA
Modified List Input The : (colon) format modifier enables you to use list input but also to specify an informat after a variable name, whether character or numeric. SAS reads until it encounters a blank column. The ~ (tilde) format modifier enables you to read and retain single quotation marks, double quotation marks, and delimiters within character values. If you want SAS to read consecutive delimiters as though there is a missing value between them, specify the DSD option in the INFILE statment. To read and store a character input value longer than 8 bytes, define a variable's length by using a LENGTH, INFORMAT. Character values cannot contain embedded blanks when the file is delimited by blanks. Fields must be read in order. Data must be in standard numeric or character format.
Data Accessing - Column Input Column input enables you to read standard data values that are aligned in columns in the data records. Specify the variable name, followed by a dollar sign ($) if it is a character variable, and specify the columns in which the data values are located in each record: data scores; infile datalines truncover; input name $ 1-12 score1 17-20 score2 27-30; datalines; 123456789101112131415161718192021222324252627282930 Riley 1132 987 Henderson 1015 1102 ; run;
Data Accessing - Column Input To use column input, data values must be in the same field on all the input lines in standard numeric or character form.
Features of column input include the following Character values can contain embedded blanks. Character values can be from 1 to 32,767 characters long. Placeholders, such as a single period (.), are not required for missing data. Input values can be read in any order, regardless of their position in the record. Values or parts of values can be reread. Both leading and trailing blanks within the field are ignored. Values do not need to be separated by blanks or other delimiters. Use the TRUNCOVER option on the INFILE statement to ensure that SAS handles data values of varying lengths appropriately.
Data Accessing - Formatted Input Formatted input combines the flexibility of using informats with many of the features of column input. By using formatted input, you can read nonstandard data for which SAS requires additional instructions. Formatted input is typically used with pointer controls that enable you to control the position of the input pointer in the input buffer when you read data. data scores; input name $12. +4 score1 comma5. +6 score2 comma5.; datalines; Riley 1,132 1,187 Henderson 1,015 1,102 ;
Data Accessing - Formatted Input Important points about formatted input are Characters values can contain embedded blanks. Character values can be from 1 to 32,767 characters long. Placeholders, such as a single period (.), are not required for missing data. With the use of pointer controls to position the pointer, input values can be read in any order, regardless of their positions in the record. Values or parts of values can be reread. Formatted input enables you to read data stored in nonstandard form, such as packed decimal or numbers with commas.
Data Accessing - Named Input You can use named input to read records in which data values are preceded by the name of the variable and an equal sign (=). The following INPUT statement reads the data lines containing equal signs. data games; input name=$ score1= score2=; datalines; name=abc score1=1132 score2=1187 ; run;
The MISSOVER Option The MISSOVER option prevents SAS from loading a new record when the end of the current record is reached. General form of the INFILE statement with the MISSOVER option: INFILE INFILE‘raw-data-file’ ‘raw-data-file’MISSOVER; MISSOVER;
If SAS reaches the end of the row without finding values for all fields, variables without values are set to missing.
Using the MISSOVER Option data airplanes3; length ID $ 5; infile 'raw-data-file' dlm=',' missover; input ID $ InService : date9. PassCap CargoCap; run; Raw Data File 50001 ,25feb1989,132, 530 50002, 11nov1989,152 50003, 22oct1991,168, 530 50004, 4feb1993,172 50005, 24jun1993, 170, 510 50006, 20dec1994, 180, 520
Using the MISSOVER Option Partial SAS Log NOTE: 6 records were read from the infile 'aircraft3.dat'. The minimum record length was 19. The maximum record length was 26. NOTE: The data set WORK.AIRPLANES3 has 6 observations and 4 variables. NOTE: DATA statement used: real time 0.42 seconds cpu time 0.07 seconds
Missing Values without Placeholders There is missing data represented by two consecutive delimiters. 50001 ,25feb1989,, 540 50002, 11nov1989,132, 530 50003, 22oct1991,168, 530 50004, 4feb1993,172, 550 50005, 24jun1993,, 510 50006, 20dec1994, 180,520
Missing Values without Placeholders By default, SAS treats two consecutive delimiters as one. Missing data should be represented by a placeholder. 5 0 0 0 1 ,25feb1989 , . ,
530
The DSD Option General form of the DSD option in the INFILE statement: INFILE INFILE‘file-name’ ‘file-name’DSD; DSD;
Missing Values without Placeholders The DSD option sets the default delimiter to a comma treats consecutive delimiters as missing values enables SAS to read values with embedded delimiters if the value is surrounded by double quotes.
Using the DSD Option data airplanes4; length ID $ 5; infile 'raw-data-file' dsd; input ID $ InService : date9. PassCap CargoCap; run; 50001 ,25feb1989,, 540 50002, 11nov1989,132, 530 50003, 22oct1991,168, 530 50004, 4feb1993,172, 550 50005, 24jun1993,, 510 50006, 20dec1994, 180,520
INFILE Statement Options Problem
Option
Non-blank delimiters
DLM='delimiter(s)'
Missing data at end of row
MISSOVER
Missing data represented by consecutive delimiters or Embedded delimiters where values are surrounded by double quotes
DSD
These options can be used separately or together in the INFILE statement.
Multiple Records Per Observation Farr, Sue Anaheim, CA 869-7008 Anderson, Kay B. Chicago, IL 483-3321 Tennenbaum, Mary Ann Jefferson, MO 589-9030
A raw data file has three records per employee. Record 1 contains the first and last names, record 2 contains the city and state of residence, and record 3 contains the employee’s phone number.
Desired Output The SAS data set should have one observation per employee. LName
FName
Farr Anderson Tennenbaum
Sue Kay B. Mary Ann
City Anaheim Chicago Jefferson
State CA IL MO
Phone 869-7008 483-3321 589-9030
Multiple INPUT Statements data address; length LName FName $ 20 City $ 25 State $ 2 Phone $ 8; infile 'raw-data-file' dlm=','; Load Record input LName $ FName $; Load Record input City $ State $; Load Record input Phone $; run;
Line Pointer Controls You can also use line pointer controls to control when SAS loads a new record. DATA DATASAS-data-set; SAS-data-set; INPUT INPUTvar-1 var-1var-2 var-2var-3 var-3// var-4 var-4var-5; var-5; additional additionalSAS SAS statements statements
SAS loads the next record when it encounters a forward slash.
Reading Multiple Records Per Observation data address; length LName FName $ 20 City $ 25 State $ 2 Phone $ 8; infile 'raw-data-file' dlm=','; Load Record input LName $ FName $ / Load Record City $ State $ / Load Record Phone $; run;
Reading Multiple Records Per Observation Partial Log
NOTE: 9 records were read from the infile 'addresses.dat'. The minimum record length was 8. The maximum record length was 20. NOTE: The data set WORK.ADDRESS has 3 observations and 5 variables.
Reading Raw Data Files with Multiple Records Per Observation Mixed Record Types
101 USA 1-20-1999 3295.50 3034 EUR 30JAN1999 1876,30 101 USA 1-30-1999 2938.00 128 USA 2-5-1999 2908.74 1345 EUR 6FEB1999 3145,60 109 USA 3-17-1999 2789.10
Reading Raw Data Files with Multiple Records Per Observation Desired Output Sales ID 101 3034 101 128 1345 109
Location
Sale Date
Amount
USA EUR USA USA EUR USA
14264 14274 14274 14280 14281 14320
3295.50 1876.30 2938.00 2908.74 3145.60 2789.10
Reading Raw Data Files with Multiple Records Per Observation The Single Trailing @
The single trailing @ option holds a raw data record in the input buffer until SAS executes an INPUT statement with no trailing @ reaches the bottom of the DATA step. General form of an INPUT statement with the single trailing @:
INPUT INPUT var1 var1 var2 var2 var3 var3 … …@; @;
Reading Raw Data Files with Multiple Records Per Observation Processing the Trailing @
Load next record.
Hold record for next INPUT statement.
input SalesID $ Location $ @; if location='USA' then input SaleDate : mmddyy10. Amount; else if Location='EUR' then input SaleDate : date9. Amount : commax8.;
Multiple Observations Per Record The raw data file RETIRE contains each employee’s identification number and this year’s contribution to his or her retirement plan. Each record contains E00973 1400 E09872 2003 E73150 2400 information for three employees. E45671 4500 E34805 1980 E47200 4371
Desired Output
EmpID
Contrib
E00973 E09872 E73150 E45671 E34805
1400 2003 2400 4500 1980
Multiple Observations Per Record Processing: What Is Required? E00973 1400 E09872 2003 E73150 2400 Read for Obs. 1
Read for Obs. 2
Read for Obs. 3
Process Other Statements
Process Other Statements
Process Other Statements
Output
Output
Output
Multiple Observations Per Record The Double Trailing @ The double trailing @ holds the raw data record across iterations of the DATA step until the line pointer moves past the end of the line. INPUT INPUT var1 var1 var2 var2 var3 var3… …@@; @@;
data work.retire; length EmpID $ 6; infile 'raw-data-file'; input EmpID $ Contrib @@; run; Hold until end of record.
Multiple Observations Per Record Partial Log NOTE: 2 records were read from the infile 'retire.dat'. The minimum record length was 35. The maximum record length was 36. NOTE: SAS went to a new line when INPUT statement reached past the end of a line. NOTE: The data set WORK.RETIRE has 6 observations and 2 variables. The “SAS went to a new line” message is expected because the @@ option indicates that SAS should read until the end of each record.
Multiple Observations Per Record Trailing @ Versus Double Trailing @ Option Trailing @ INPUT var-1... @; Double trailing @ INPUT var-1 ... @@;
Effect Holds raw data record until 1) an INPUT statement with no trailing @ 2) the bottom of the DATA step. Holds raw data records in input buffer until SAS reads past end of line.
Reading Hierarchical Raw Data Files Processing Hierarchical Files •Many files are hierarchical in structure, consisting of a •header record •one or more related detail records. Header Typically, each record contains a field that identifies whether it is a header record or a detail record.
Detail Detail Header Header Detail Header Detail Detail
Reading Hierarchical Raw Data Files Processing Hierarchical Files Heade
Heade You can read a hierarchical file into a SAS data set by creating r1 r and storing Detail observationDetail per detail record the header informatio Variab Variab 1 of each observation. les les Detail 2 Hierarchica Detail l File 3 Heade r2 Detail 1 Heade r3 Detail 1 Detail 2
Heade Detail r 1 SAS Data1Set Heade Detail r1 2 Heade Detail r1 3 Heade Detail r2 1 Heade Detail r3 1 Heade Detail r3 2
Reading Hierarchical Raw Data Files Creating One Observation Per Detail E:Adams:Susan D:Michael:C D:Lindsay:C E:Porter:David D:Susan:S E:Lewis:Dorian D. D:Richard:C E:Dansky:Ian E:Nicholls:James D:Roberta:C E:Slaydon:Marla D:John:S
raw data file The DEPENDANTS has a header record containing the name of the employee and a detail record for each dependant on the employee’s health insurance.
Reading Hierarchical Raw Data Files Desired Output Personnel would like a list of all the dependants and the name of the associated employee.
EmpLName Adams Adams Porter Lewis Nicholls Slaydon
EmpFName Susan Susan David Dorian D. James Marla
DepName Michael Lindsay Susan Richard Roberta John
Relation C C S C C S
Reading Hierarchical Raw Data Files The RETAIN Statement General form of the RETAIN statement:
RETAIN RETAINvariable-name variable-name; ;
The RETAIN statement prevents SAS from reinitializing the values of new variables at the top of the DATA step. This means that values from previous records are available for processing.
Reading Hierarchical Raw Data Files Hold EmpLName and EmpFName data dependants(drop=Type); length EmpLName EmpFName DepName $ 20 Relation $ 1; retain EmpLName EmpFName; infile 'raw-data-file' dlm=':'; input Type $ @; if Type='E' then input EmpLName $ EmpFName $; else do; input DepName $ Relation $; output; end; run;
Example – Modified List Input This example explains Compilation Phase Execution Phase.
Example – Modified List Input Raw Data File 50001 50002 50003 50004 50005 50006
Compile
data airplanes; length ID $ 5; infile 'raw-data-file'; input ID $ InService : date9. PassCap CargoCap; run;
4feb1989 132 530 11nov1989 152 540 22oct1991 90 530 4feb1993 172 550 24jun1993 170 510 20dec1994 180 520
Input Buffer
ID $ 5
PDV
Raw Data File 50001 50002 50003 50004 50005 50006
Compile
4feb1989 132 530 11nov1989 152 540 22oct1991 90 530 4feb1993 172 550 24jun1993 170 510 20dec1994 180 520
data airplanes; length ID $ 5; infile 'raw-data-file'; input ID $ InService : date9. PassCap CargoCap; run;
Input Buffer
ID $ 5
PDV INSERVICE PASSCAP CARGOCAP N N N 8 8 8
...
Raw Data File 50001 50002 50003 50004 50005 50006
Execute
4feb1989 132 530 11nov1989 152 540 22oct1991 90 530 4feb1993 172 550 24jun1993 170 510 20dec1994 180 520
data airplanes; length ID $ 5; infile 'raw-data-file'; input ID $ InService : date9. PassCap CargoCap; run;
Input Buffer
ID $ 5
PDV INSERVICE PASSCAP CARGOCAP N N N 8 8 8
.
.
. ...
Raw Data File 50001 50002 50003 50004 50005 50006
4feb1989 132 530 11nov1989 152 540 22oct1991 90 530 4feb1993 172 550 24jun1993 170 510 20dec1994 180 520
data airplanes; length ID $ 5; infile 'raw-data-file'; input ID $ InService : date9. PassCap CargoCap; run;
Input Buffer 5 0 0 0 1
ID $ 5
4 f e b 1 9 8 9
1 3 2
5 3 0
PDV INSERVICE PASSCAP CARGOCAP N N N 8 8 8
.
.
. ...
Raw Data File 50001 50002 50003 50004 50005 50006
4feb1989 132 530 11nov1989 152 540 22oct1991 90 530 4feb1993 172 550 24jun1993 170 510 20dec1994 180 520
data airplanes; length ID $ 5; infile 'raw-data-file'; input ID $ InService : date9. PassCap CargoCap; run;
Input Buffer 5 0 0 0 1
ID $ 5
50001
4 f eb 1 9 8 9
1 3 2
5 3 0
PDV INSERVICE PASSCAP CARGOCAP N N N 8 8 8
10627 .
. 132
530 . ...
Raw Data File 50001 50002 50003 50004 50005 50006
4feb1989 132 530 11nov1989 152 540 22oct1991 90 530 4feb1993 172 550 24jun1993 170 510 20dec1994 180 520
Implicit return
Input Buffer 5 0 0 0 1
ID $ 5
50001
data airplanes; length ID $ 5; infile 'raw-data-file'; input ID $ InService : date9. PassCap CargoCap; run;
4 f eb 1 9 8 9
1 3 2
Implicit output
5 3 0
PDV INSERVICE PASSCAP CARGOCAP N N N 8 8 8
10627 .
530 . Write out observation to airplanes. . 132
...
Raw Data File 50001 50002 50003 50004 50005 50006
4feb1989 132 530 11nov1989 152 540 22oct1991 90 530 4feb1993 172 550 24jun1993 170 510 20dec1994 180 520
data airplanes; length ID $ 5; infile 'raw-data-file'; input ID $ InService : date9. PassCap CargoCap; run; Implicit output
Input Buffer 5 0 0 0 1
ID $ 5
50001
4 f eb 1 9 8 9
1 3 2
5 3 0
PDV INSERVICE PASSCAP CARGOCAP N N N 8 8 8
10627 .
530 . Write out observation to airplanes. . 132
...
Raw Data File 50001 50002 50003 50004 50005 50006
4feb1989 132 530 11nov1989 152 540 22oct1991 90 530 4feb1993 172 550 24jun1993 170 510 20dec1994 180 520
Implicit return
Input Buffer 5 0 0 0 1
ID $ 5
50001
data airplanes; length ID $ 5; infile 'raw-data-file'; input ID $ InService : date9. PassCap CargoCap; run;
4 f eb 1 9 8 9
1 3 2
5 3 0
PDV INSERVICE PASSCAP CARGOCAP N N N 8 8 8
10627 .
. 132
530 . ...
Raw Data File 50001 50002 50003 50004 50005 50006
4feb1989 132 530 11nov1989 152 540 22oct1991 90 530 4feb1993 172 550 24jun1993 170 510 20dec1994 180 520
data airplanes; length ID $ 5; infile 'raw-data-file'; input ID $ InService : date9. PassCap CargoCap; run;
Input Buffer 5 0 0 0 1
ID $ 5
4 f eb 1 9 8 9
1 3 2
5 3 0
PDV INSERVICE PASSCAP CARGOCAP N N N 8 8 8
.
.
. ...
Raw Data File 50001 50002 50003 50004 50005 50006
4feb1989 132 530 11nov1989 152 540 22oct1991 90 530 4feb1993 172 550 24jun1993 170 510 20dec1994 180 520
data airplanes; length ID $ 5; infile 'raw-data-file'; input ID $ InService : date9. PassCap CargoCap; run;
Input Buffer 5 0 0 0 2
ID $ 5
50002
1 1 n o v 1 98 9
1 5 2
5 4 0
PDV INSERVICE PASSCAP CARGOCAP N N N 8 8 8
10907 .
. 152
540 . ...
Raw Data File 50001 50002 50003 50004 50005 50006
4feb1989 132 530 11nov1989 152 540 22oct1991 90 530 4feb1993 172 550 24jun1993 170 510 20dec1994 180 520
Implicit return
Input Buffer 5 0 0 0 2
ID $ 5
50002
data airplanes; length ID $ 5; infile 'raw-data-file'; input ID $ InService : date9. PassCap CargoCap; run;
1 1 n o v 1 98 9
1 5 2
Implicit output
5 4 0
PDV INSERVICE PASSCAP CARGOCAP N N N 8 8 8
10907 .
540 . Write out observation to airplanes. . 152
...
Raw Data File 50001 50002 50003 50004 50005 50006
4feb1989 132 530 11nov1989 152 540 22oct1991 90 530 4feb1993 172 550 24jun1993 170 510 20dec1994 180 520
data airplanes; length ID $ 5; infile 'raw-data-file'; input ID $ InService : date9. PassCap CargoCap; run; Implicit output
Input Buffer 5 0 0 0 2
ID $ 5
50002
1 1 n o v 1 98 9
1 5 2
5 4 0
PDV INSERVICE PASSCAP CARGOCAP N N N 8 8 8
10907 .
. 152
540 . ...
Raw Data File 50001 50002 50003 50004 50005 50006
4feb1989 132 530 11nov1989 152 540 22oct1991 90 530 4feb1993 172 550 24jun1993 170 510 20dec1994 180 520
Implicit return
Input Buffer 5 0 0 0 2
ID $ 5
50002
data airplanes; length ID $ 5; infile 'raw-data-file'; input ID $ InService : date9. PassCap CargoCap; run;
1 1 n o v 1 98 9
1 5 2
5 4 0
PDV INSERVICE PASSCAP CARGOCAP N N N 8 8 8
10907 .
. 152
540 . ...
Raw Data File 50001 50002 50003 50004 50005 50006
4feb1989 132 530 11nov1989 152 540 22oct1991 90 530 4feb1993 172 550 24jun1993 170 510 20dec1994 180 520
data airplanes; length ID $ 5; infile 'raw-data-file'; input ID $ InService : date9. PassCap CargoCap; run;
Input Buffer 5 0 0 0 2
ID $ 5
Continue processing until 1 1 n o v 1 98 9 1 5 2 5 4 0 end of the raw data file. PDV INSERVICE PASSCAP CARGOCAP N N N 8 8 8
.
.
.
Output of Dataset proc print data=airplanes noobs; run;
ID 50001 50002 50003 50004 50005 50006
In Service
Pass Cap
Cargo Cap
10627 10907 11617 12088 12228 12772
132 152 168 172 170 180
530 540 530 550 510 520
Combining SAS Data Sets Create a Data set from two or more existing data sets by joining observation side-by-side or appends the observations from one data set to another data set. Methods to combine SAS data sets concatenating interleaving one-to-one reading one-to-one merging match merging updating.
Combine SAS data sets
Concatenating Concatenating the data sets appends the observations from one data set to another data set. The DATA step reads DATA1 sequentially until all observations have been processed, and then reads DATA2. Data set COMBINED contains the results of the concatenation.
Combine SAS data sets Interleaving intersperses observations from two or more data sets, based on one or more common variables.
Combine SAS data sets One-to-One Reading and One-toOne Merging. One-to-one reading combines observations from two or more SAS data sets by creating observations that contain all of the variables from each contributing data set. Observations are combined based on their relative position in each data set. The DATA step stops after it has read the last observation from the smallest data set. One-toone merging is similar to a one-toone reading, with two exceptions: you use the MERGE statement instead of multiple SET statements, and the DATA step reads all observations from all data sets.
Combine SAS data sets Match merging combines observations from two or more SAS data sets into a single observation in a new data set based on the values of one or more common variables.
Combine SAS data sets Identifying Data Set Contributors When you read multiple SAS data sets in one DATA step, you can use the IN= data set option to detect which data set contributed to an observation. General form of the IN= data set option:
SAS-data-set(IN=variable) SAS-data-set(IN=variable) where variable is any valid SAS variable name. Variable is a temporary numeric variable with a value of: 0 to indicate false; the data set did not contribute to the current
observation 1 to indicate true; the data set did contribute to the current observation
Combine SAS data sets IN= Data Set Option Transact Num 111 111 113 114 116
Trans D C C D C
Amnt 126.32 560 235 14.56 371.69
Branch Num 111 112 114 115 116
Branch M.G.Road Sivaji Nagar Madiwala Koramangala BTM
A data set named Newtrans shows this week’s transactions. ■ A data set named noactiv shows accounts with no transactions this week. ■ A data set named noacct shows transactions with no matching ■
Combine SAS data sets Num 111 111 113 114 116
Trans D C C D C
Amnt 126.32 560 235 14.56 371.69
Num 111 112 114 115 116
Branch M.G.Road Sivaji Nagar Madiwala Koramangala BTM
Branch data Transact newtrans noactiv (drop=Trans Amnt) noacct (drop=Branch); merge prog2.transact(in=InTrans) prog2.branch(in=InBanks); by ActNum; if InTrans and InBanks then output newtrans; else if InBanks and not InTrans then output noactiv; else if InTrans and not InBanks then output noacct; run;
Combine SAS data sets A data set named Newtrans shows this week’s transactions. Num Trans Amnt Branch 111 111 114 116
D C D C
M.G.Road M.G.Road Madiwala BTM
126.32 560 14.56 371.69
A data set named noactiv shows accounts Num . Branch with no transactions this week 112 115
Sivaji Nagar Koramangala
A data set named noacct shows transactions with no matching account Num Trans Amnt number. 113
C
235
Questions