Sort And Accumulating Totals Last Updated : 29June, 2004
Center of Excellence
Objectives Understand how the SAS System initializes the value of a variable in the PDV. Prevent reinitialization of a variable in the PDV. Create an accumulating variable.
Creating an Accumulating Variable SaleDate
SaleAmt
01APR2001 02APR2001 03APR2001 04APR2001 05APR2001 06APR2001 07APR2001 08APR2001 09APR2001 10APR2001 11APR2001 12APR2001
498.49 946.50 994.97 564.59 783.01 228.82 930.57 211.47 156.23 117.69 374.73 252.73
The SAS data set prog2.daysales contains daily sales data for a retail store. There is one observation for each day in April showing the date (SaleDate) and the total receipts for that day (SaleAmt).
Creating an Accumulating Variable The store manager also wants to see a running total of sales for the month as of each day. Partial Output SaleDate
Sale Amt
Mth2Dte
01APR2001 02APR2001 03APR2001 04APR2001 05APR2001
498.49 946.50 994.97 564.59 783.01
498.49 1444.99 2439.96 3004.55 3787.56
Creating Mth2Dte By default, variables created with an assignment statement are initialized to missing at the top of the DATA step.
Mth2Dte=Mth2Dte+SaleAmt;
An accumulating variable must retain its value from one observation to the next.
The RETAIN Statement General form of the RETAIN statement:
RETAIN RETAINvariable-name variable-name
… …;; The RETAIN statement prevents SAS from reinitializing the values of new variables at the top of the DATA step. Previous values of retained variables are available for processing across iterations of the DATA step.
The RETAIN Statement The RETAIN statement retains the value of the variable in the PDV across iterations of the DATA step initializes the retained variable to missing before the first execution of the DATA step if an initial value is not specified is a compile-time-only statement.
Retain Mth2Dte and Set an Initial Value If you do not supply an initial value, all the values of Mth2Dte will be missing. retain Mth2Dte 0;
Creating an Accumulating Variable
data mnthtot; set prog2.daysales; retain Mth2Dte 0; Mth2Dte=Mth2Dte+SaleAmt; run;
Compile SaleDate 15066 15067 15068 15069 15070
SaleAmt 498.49 946.50 994.97 564.59 783.01
data mnthtot; set prog2.daysales; retain Mth2Dte 0; Mth2Dte=Mth2Dte+SaleAmt; run;
R SALEDATE SALEAMT
MTH2DTE
... ...
Execute SaleDate 15066 15067 15068 15069 15070
SaleAmt 498.49 946.50 994.97 564.59 783.01
data mnthtot; set prog2.daysales; retain Mth2Dte 0; Mth2Dte=Mth2Dte+SaleAmt; run;
R SALEDATE SALEAMT
.
.
MNTH2DTE
0 ... ...
SaleDate 15066 15067 15068 15069 15070
SaleAmt 498.49 946.50 994.97 564.59 783.01
data mnthtot; set prog2.daysales; retain Mth2Dte 0; Mth2Dte=Mth2Dte+SaleAmt; run;
R SALEDATE SALEAMT
15066 .
498.49 .
MNTH2DTE
0 ... ...
SaleDate 15066 15067 15068 15069 15070
SaleAmt 498.49 946.50 994.97 564.59 783.01
data mnthtot; set prog2.daysales; retain Mth2Dte 0; Mth2Dte=Mth2Dte+SaleAmt; run;
0+498.49 SALEDATE SALEAMT
15066 .
498.49 .
R MNTH2DTE
498.49 0 ... ...
SaleDate 15066 15067 15068 15069 15070
data mnthtot; set prog2.daysales; SaleAmt Implicit retain Mth2Dte 0; Implicit Return Output Mth2Dte=Mth2Dte+SaleAmt; 498.49 run; 946.50 994.97 564.59 R 783.01
SALEDATE SALEAMT
15066
498.49
MNTH2DTE
498.49
Write out observation to mnthtot. ... ...
SaleDate 15066 15067 15068 15069 15070
SaleAmt 498.49 946.50 994.97 564.59 783.01
data mnthtot; set prog2.daysales; retain Mth2Dte 0; Mth2Dte=Mth2Dte+SaleAmt; run; Implicit Output
R SALEDATE SALEAMT
15066
498.49
MNTH2DTE
498.49
Write out observation to mnthtot. ... ...
SaleDate 15066 15067 15068 15069 15070
SaleAmt 498.49 946.50 994.97 564.59 783.01
data mnthtot; set prog2.daysales; retain Mth2Dte 0; Mth2Dte=Mth2Dte+SaleAmt; run; Implicit Return
R SALEDATE SALEAMT
15066
498.49
MNTH2DTE
498.49 ... ...
SaleDate SaleDate 15066 15066 15067 15067 15068 15068 15069 15069 15070 15070
SaleAmt SaleAmt 498.49 498.49 946.50 946.50 994.97 994.97 564.59 564.59 783.01 783.01
data mnthtot; set prog2.daysales; retain Mth2Dte 0; Mth2Dte=Mth2Dte+SaleAmt; run;
R SALEDATE SALEAMT
15066
498.49
MNTH2DTE
498.49 ... ...
SaleDate SaleDate 15066 15066 15067 15067 15068 15068 15069 15069 15070 15070
SaleAmt SaleAmt 498.49 498.49 946.50 946.50 994.97 994.97 564.59 564.59 783.01 783.01
data mnthtot; set prog2.daysales; retain Mth2Dte 0; Mth2Dte=Mth2Dte+SaleAmt; run;
R SALEDATE SALEAMT
15066 15067
498.49 946.50
MNTH2DTE
498.49 ... ...
SaleDate SaleDate 15066 15066 15067 15067 15068 15068 15069 15069 15070 15070
SaleAmt SaleAmt 498.49 498.49 946.50 946.50 994.97 994.97 564.59 564.59 783.01 783.01
data mnthtot; set prog2.daysales; retain Mth2Dte 0; Mth2Dte=Mth2Dte+SaleAmt; run;
498.49+946.50 SALEDATE SALEAMT
15066 15067
498.49 946.50
R
MNTH2DTE
1444.99 498.49 ... ...
SaleDate 15066 15067 15068 15069 15070
SaleAmt 498.49 946.50 994.97 564.59 783.01
data mnthtot; set prog2.daysales; retain Mth2Dte 0; Mth2Dte=Mth2Dte+SaleAmt; run; Implicit Return
Implicit Output
R SALEDATE SALEAMT
15067
946.50
MNTH2DTE
1444.99
Write out observation to mnthtot. ... ...
SaleDate 15066 15067 15068 15069 15070
SaleAmt 498.49 946.50 994.97 564.59 783.01
data mnthtot; set prog2.daysales; retain Mth2Dte 0; Mth2Dte=Mth2Dte+SaleAmt; run; Implicit Output
R SALEDATE SALEAMT
15067
946.50
MNTH2DTE
1444.99
Write out observation to mnthtot. ... ...
SaleDate 15066 15067 15068 15069 15070
SaleAmt 498.49 946.50 994.97 564.59 783.01
data mnthtot; set prog2.daysales; retain Mth2Dte 0; Mth2Dte=Mth2Dte+SaleAmt; run; Implicit Return
R SALEDATE SALEAMT
15067
946.50
MNTH2DTE
1444.99 ... ...
SaleDate 15066 15067 15068 15069 15070
SaleAmt 498.49 946.50 994.97 564.59 783.01
data mnthtot; set prog2.daysales; retain Mth2Dte 0; Mth2Dte=Mth2Dte+SaleAmt; run;
R SALEDATE SALEAMT
15067
946.50
MNTH2DTE
1444.99 ... ...
SaleDate 15066 15067 15068 15069 15070
SaleAmt 498.49 946.50 994.97 564.59 783.01
data mnthtot; set prog2.daysales; retain Mth2Dte 0; Mth2Dte=Mth2Dte+SaleAmt; run;
R SALEDATE SALEAMT
15068 15067
946.50 994.97
MNTH2DTE
1444.99 ... ...
SaleDate 15066 15067 15068 15069 15070
SaleAmt 498.49 946.50 994.97 564.59 783.01
data mnthtot; set prog2.daysales; retain Mth2Dte 0; Mth2Dte=Mth2Dte+SaleAmt; run;
1444.99+994.97 SALEDATE SALEAMT
15068 15067
946.50 994.97
R
MNTH2DTE
2439.96 1444.99 ... ...
SaleDate 15066 15067 15068 15069 15070
SaleAmt 498.49 946.50 994.97 564.59 783.01
data mnthtot; set prog2.daysales; retain Mth2Dte 0; Mth2Dte=Mth2Dte+SaleAmt; run; Implicit Return
Implicit Output
R SALEDATE SALEAMT
15068
994.97
MNTH2DTE
2439.96
Write out observation to mnthtot. ... ...
SaleDate 15066 15067 15068 15069 15070
SaleAmt 498.49 946.50 994.97 564.59 783.01
data mnthtot; set prog2.daysales; retain Mth2Dte 0; Mth2Dte=Mth2Dte+SaleAmt; run; Implicit Output
R SALEDATE SALEAMT
15068
994.97
MNTH2DTE
2439.96
Write out observation to mnthtot. ... ...
SaleDate 15066 15067 15068 15069 15070
SaleAmt 498.49 946.50 994.97 564.59 783.01
data mnthtot; set prog2.daysales; retain Mth2Dte 0; Mth2Dte=Mth2Dte+SaleAmt; run; Implicit Return
R SALEDATE SALEAMT
15068
994.97
MNTH2DTE
2439.96 ... ...
SaleDate 15066 15067 15068 15069 15070
SaleAmt 498.49 946.50 994.97 564.59 783.01
data mnthtot; set prog2.daysales; retain Mth2Dte 0; Mth2Dte=Mth2Dte+SaleAmt; run;
Continue processing until end of SAS data set SALEDATE SALEAMT
15068
994.97
R
MNTH2DTE
2439.96
Creating an Accumulating Variable proc print data=mnthtot noobs; format SaleDate date9.; run; Partial PROC PRINT Output SaleDate 01APR2001 02APR2001 03APR2001 04APR2001 05APR2001
Sale Amt
Mth2Dte
498.49 946.50 994.97 564.59 783.01
498.49 1444.99 2439.96 3004.55 3787.56
Accumulating Totals: Missing Values What happens if there are missing values for SaleAmt?
data mnthtot; set prog2.daysales; retain Mth2Dte 0; Mth2dte=Mth2Dte+SaleAmt; run;
Undesirable Output
SaleDate
Sale Amt
Mth2Dte
01APR2001 02APR2001 03APR2001 04APR2001 05APR2001
498.49 . 994.97 564.59 783.01
498.49 . . . .
Missing value
Subsequent values of Mth2Dte are missing ... ...
The Sum Statement When creating an accumulating variable, an alternative to the RETAIN statement is the sum statement. General form of the sum statement:
variable variable ++expression; expression;
The Sum Statement The sum statement creates the variable on the left side of the plus sign if it does not already exist initializes the variable to zero before the first iteration of the DATA step automatically retains the variable adds the value of the expression to the variable at execution ignores missing values.
Accumulating Totals: Missing Values
data mnthtot2; set prog2.daysales2; Mth2Dte+SaleAmt; run;
Accumulating Totals: Missing Values proc print data=mnthtot2 noobs; format SaleDate date9.; run; Partial PROC PRINT Output SaleDate
SaleAmt
Mth2Dte
01APR2001 02APR2001 03APR2001 04APR2001 05APR2001
498.49 . 994.97 564.59 783.01
498.49 498.49 1493.46 2058.05 2841.06 c03s1d1.sas
Objectives Define First. and Last. processing. Calculate an accumulating total for groups of data. Use a subsetting IF statement to output selected observations.
Accumulating Totals for Groups EmpID
Salary
Div
E00004 E00009 E00011 E00036 E00037 E00048 E00077 E00097 E00107 E00123 E00155 E00171
42000 34000 27000 20000 19000 19000 27000 20000 31000 20000 27000 44000
HUMRES FINACE FLTOPS FINACE FINACE FLTOPS APTOPS APTOPS FINACE APTOPS APTOPS SALES
SAS data set The prog2.empsals contains each employee’s identification number (EmpID), salary (Salary), and division (Div). There is one observation for each employee.
Desired Output Human resources wants a new data set that shows total salary paid for each division. Div
DivSal
APTOPS FINACE FLTOPS HUMRES SALES
410000 163000 318000 181000 373000
Grouping the Data A B
E
D
C
You must group the data in the SAS data set before you can perform processing.
Review of the SORT Procedure You can rearrange the observations into groups using the SORT procedure. General form of a PROC SORT step: PROC PROCSORT SORTDATA=input-SAS-data-set DATA=input-SAS-data-set ; ; BY BY BY-variable BY-variable ...; ...; RUN; RUN;
The SORT Procedure The SORT procedure rearranges the observations in a DATA set can sort on multiple variables creates a SAS data set that is a sorted copy of the input SAS data set replaces the input data set by default.
Sorting by Div
proc sort data=prog2.empsals out=salsort; by Div; run;
Processing Data in Groups Div
Salary
APTOPS APTOPS APTOPS FINACE FINACE FINACE FINACE SALES SALES
20000 100000 50000 25000 20000 23000 27000 10000 12000
DivSal 170000
95000 22000
... ...
BY-Group Processing
DATA DATAoutput-SAS-data-set; output-SAS-data-set; SET SETinput-SAS-data-set; input-SAS-data-set; BY BYBY-variable BY-variable … …;; statements> RUN; RUN;
BY-Group Processing
data divsals(keep=Div DivSal); set salsort; by Div; additional SAS statements run;
BY-Group Processing A BY statement in a DATA step creates temporary variables for each variable listed in the BY statement. General form of the names of BY variables in a DATA step:
First.BY-variable First.BY-variable Last.BY-variable Last.BY-variable
First. and Last. Values The First. variable has a value of 1 for the first observation in a BY group; otherwise, it equals 0. The Last. variable has a value of 1 for the last observation in a BY group; otherwise, it equals 0.
Use these temporary variables to conditionally process sorted, grouped, or indexed data.
First. / Last. Example Look Ahead Div
Salary
APTOPS APTOPS APTOPS FINACE FINACE FINACE FINACE SALES SALES
20000 100000 50000 25000 20000 23000 27000 10000 12000
First.Div 1 Last.Div 0
... ...
First. / Last. Example
Div APTOPS APTOPS APTOPS FINACE FINACE FINACE FINACE SALES SALES
Look Ahead Salary First.Div 0 20000 100000 50000 Last.Div 25000 0 20000 23000 27000 10000 12000 ... ...
First. / Last. Example
Div APTOPS APTOPS APTOPS FINACE FINACE FINACE FINACE SALES SALES
Salary Look Ahead 20000 100000 50000 25000 20000 23000 27000 10000 12000
First.Div 0 Last.Div 1
... ...
First. / Last. Example
Div
Salary
APTOPS APTOPS APTOPS FINACE FINACE FINACE FINACE SALES SALES
20000 Look Ahead 100000 50000 25000 20000 23000 27000 10000 12000
First.Div 1 Last.Div 0
... ...
First. / Last. Example
Div
Salary
APTOPS APTOPS APTOPS FINACE FINACE FINACE FINACE SALES SALES
20000 Look Ahead 100000 50000 25000 20000 23000 27000 10000 12000
First.Div 1 Last.Div 0
What Must Happen When? There is a three-step process for accumulating totals: 1. Set the accumulating variable to 0 at the start of each BY group. 2. Increment the accumulating variable with a sum statement (automatically retains). 3. Output only the last observation of each BY group.
Accumulating Totals for Groups 1.
Set the accumulating variable to 0 at the start of each BY group.
data divsals(keep=Div DivSal); set salsort; by Div; if First.Div then DivSal=0; additional SAS statements run;
Accumulating Totals for Groups 2. Increment the accumulating variable with a sum statement (automatically retains). data divsals(keep=Div DivSal); set salsort; by Div; if First.Div then DivSal=0; DivSal+Salary; additional SAS statements run;
First. / Last. Example Div
Salary
DivSal
APTOPS APTOPS APTOPS FINACE FINACE FINACE FINACE SALES SALES
20000 100000 50000 25000 20000 23000 27000 10000 12000
20000 120000 170000 25000 45000 68000 91000 10000 22000
Subsetting IF Statement The subsetting IF defines a condition that the observation must meet to be further processed by the DATA step. General form of the subsetting IF statement:
IF IFexpression; expression; If the expression is true, the DATA step continues processing the current observation. If the expression is false, SAS returns to the top of the DATA step.
Accumulating Totals for Groups
3. Output only the last observation of each BY group. data divsals(keep=Div DivSal); set salsort; by Div; if First.Div then DivSal=0; DivSal+Salary; if Last.Div; run;
Subsetting IF Statement (Review) Initialize InitializePDV. PDV. Execute Executeprogram program statements. statements.
NO Is the condition true?
If condition; Execute Executeadditional additional program programstatements. statements.
Output Outputobservation observationto to SAS SASdata dataset. set.
YES
... ...
Accumulating Totals for Groups Partial Log NOTE: There were 39 observations read from the data set WORK.SALSORT. NOTE: The data set WORK.DIVSALS has 5 observations and 2 variables. NOTE: DATA statement used: real time 0.74 seconds cpu time 0.33 seconds
Accumulating Totals for Groups proc print data=divsals noobs; run; PROC PRINT Output Div DivSal APTOPS FINACE FLTOPS HUMRES SALES
410000 163000 318000 181000 373000
c03s2d1.sas
Input Data EmpID E00004 E00009 E00011 E00036 E00037 E00077 E00097 E00107 E00123 E00155 E00171 E00188 E00196 E00210 E00222 E00236
Salary 42000 34000 27000 20000 19000 27000 20000 31000 20000 27000 44000 37000 43000 31000 250000 41000
Region E W W W E C E E NC W W W C E NC W
Div HUMRES FINACE FLTOPS FINACE FINACE APTOPS APTOPS FINACE APTOPS APTOPS SALES HUMRES APTOPS APTOPS SALES APTOPS
The SAS data set prog2.regsals contains each employee’s ID number (EmpID), salary (Salary), region (Region), and division (Div). There is one observation for each employee.
Desired Output Human resources wants a new data set that shows the total salary paid and the total number of employees for each division in each region. Partial Output Num Region
Div
DivSal
Emps
C E E E E NC NC
APTOPS APTOPS FINACE FLTOPS HUMRES APTOPS FLTOPS
70000 83000 109000 122000 178000 37000 28000
2 3 4 3 5 2 1
Sorting by Region and Div The data must be sorted by Region and Div. Region is the primary sort variable. Div is the secondary sort variable. proc sort data=prog2.regsals out=regsort; by Region Div; run;
Sorting by Region and Div proc print data=regsort noobs; run; Partial PROC PRINT Output Region Div C C E E E E E
APTOPS APTOPS APTOPS APTOPS APTOPS FINACE FINACE
Salary 27000 43000 20000 31000 32000 19000 31000
Multiple BY Variables data regdivsals; set regsort; by Region Div; additional SAS statements run;
Multiple BY Variables: Example Look Ahead Region
Div
C C C E E E NC NC NC NC NC
APTOPS APTOPS APTOPS APTOPS FINACE FINACE FINACE SALES SALES SALES SALES
First.Region 1 First.Div 1 Last.Region 0 Last.Div 0 ... ...
Multiple BY Variables: Example Region
Look Ahead Div
C C C E E E NC NC NC NC NC
APTOPS APTOPS APTOPS APTOPS FINACE FINACE FINACE SALES SALES SALES SALES
First.Region 0 First.Div 0 Last.Region 0 Last.Div 0 ... ...
Multiple BY Variables: Example Region
Div Look Ahead
C C C E E E NC NC NC NC NC
APTOPS APTOPS APTOPS APTOPS FINACE FINACE FINACE SALES SALES SALES SALES
First.Region 0 First.Div 0 Last.Region 1 Last.Div 1 ... ...
Multiple BY Variables: Example Region C C C E E E NC NC NC NC NC
Div Look Ahead APTOPS APTOPS APTOPS APTOPS FINACE FINACE FINACE SALES SALES SALES SALES
First.Region 1 First.Div 1 Last.Region 0 Last.Div 1 ... ...
Multiple BY Variables: Example Region C C C E E E NC NC NC NC NC
Div Look Ahead APTOPS APTOPS APTOPS APTOPS FINACE FINACE FINACE SALES SALES SALES SALES
First.Region 1 First.Div 1 Last.Region 0 Last.Div 1
Multiple BY Variables When you use more than one variable in the BY statement, a change in the primary variable forces Last.BY-variable=1 for the secondary variable. Region C C E E E E
Div APTOPS APTOPS APTOPS APTOPS APTOPS FINACE
First. Region
Last. Region
1 0 1 0 0 0
0 1 0 0 0 0
First. Div 1 0 1 0 0 1
Last.Div 0 1 0 0 1 0
Multiple BY Variables /*Summarize salaries by division*/ data regdivsals(keep=Region Div DivSal NumEmps); set regsort; by Region Div; if First.Div then do; DivSal=0; NumEmps=0; end; DivSal+Salary; NumEmps+1; if Last.Div; run;
Multiple BY Variables Partial Log NOTE: There were 39 observations read from the data set WORK.REGSORT. NOTE: The data set WORK.REGDIVSALS has 14 observations and 4 variables. NOTE: DATA statement used: real time 0.07 seconds cpu time 0.07 seconds
Multiple BY Variables proc print data=regdivsals noobs; run; Partial PROC PRINT Output Region C E E E
Div
DivSal
APTOPS APTOPS FINACE FLTOPS
70000 83000 109000 122000
c03s2d2.sas
Questions