Week 11
SAS Procedures to Summarize Data Unit 5 SAS for Data Description
Week 11: Procedures to Summarize Data
Welcome! Data summarization is a significant portion of data management activities. It serves a variety of purposes, including (1) monitoring and tracking of a study cohort, (2) informing project planning, and (3) cohort description. Data summaries are also useful because they provide clues to data cleaning and analysis.
Data summarization might take the form of a listing of the data, the reporting of averages (and accompanying standard deviations), tabulations of frequency distributions, or graphical summaries (e.g. scatter plots).
The better summaries are those that are self-explanatory. They are well labeled (have titles and variable values identified) and are straightforward to understand. It’s also helpful to accompany the summarization with documentation of the data source (name and version of data set) and the name of program that generated the summary.
The SAS procedures discussed in this reading, Unit 5 week 1 (week 11 of course), are PRINT, MEANS, SUMMARY, UNIVARIATE, TABLUATE, FREQ, FORMS, REPORT, CHART, and PLOT.
Week 11
Page 1 of 58
Week 11
SAS Procedures to Summarize Data
In the Unit 5 week 2 (week 12 of course) reading, the procedures discussed are CHART and PLOT. The Unit 5 week 2 (week 12 of course) reading also includes a brief introduction to using the SAS ANALYST for producing graphics with the SASGRAPH module. These are higher quality graphics than the printer character charts and plots produced with PROCs CHART and PLOT.
TIP - Most procedures produce results that appear in the output window. Along with directly producing output, many of the procedures to be discussed can produce new SAS data sets (output data), which in turn can be used in other procedures, such as PRINT or TABULATE. In this way, it is possible to have more control over the format in which results are printed. Output data sets can also be modified in subsequent DATA steps, to add labels or formats before printing.
TIP – Be sure to see descriptions of these procedures in the SAS Procedures Guide. Learn to use the SAS manuals - there are many options to use with all procedures. These course notes are not all inclusive!
Week 11
Page 2 of 58
Week 11
SAS Procedures to Summarize Data
Goals of Week 11: Procedures to Summarize Data 1. to be competent in using SAS procedures to “write out” data values in a manner that is easy to read; 2. to appreciate the utility of “writing out” data as a preliminary to data quality assessment; 3. to be competent in using SAS procedures to produce frequency distributions and crosstabulations of data that are self-explanatory to an independent reader; and 4. to be (at least a little) competent in using SAS procedures to “write out” data as part of the production of forms (admittedly MS ACCESS might be a better tool in this regard);
Week 11 Outline – Procedures to Summarize Data Section Topic
Page
1.
How to Produce Data Listings (PROC PRINT) ……………………………. a. Printing with variable labels and ID statement …………………..….… b. Printing with a BY statement …………………………………………….. c. Introduction to the WHERE statement: How to Print a Subset ………
4 5 7 13
2.
How to Produce Summary Statistics ………………………………………… a. How to Use PROC MEANS and PROC SUMMARY ……………… b. How to Use PROC UNIVARIATE ………………………………….. c. How to Use PROC TABULATE ……………………………………..
15 15 21 27
3.
How to Produce Frequency Tables, Cross-Tabulations (PROC FREQ)
39
4.
How to Print Forms (PROC FORMS) …….………………………………..
50
5.
How to Use PROC REPORT ……………………………………………….
54
Week 11
Page 3 of 58
Week 11
SAS Procedures to Summarize Data
1. How to Produce Data Listings: PROC PRINT PROC PRINT is a simple way to get a listing of data in a SAS data set. • We have already used PROC PRINT to get a listing of data (see Unit 4 week 1) • More generally, PROC PRINT is used to list: (1) all of the variable values for all of the observations in a SAS data set, (2) some of the variable values for all of the observations in a SAS data set, or (3) some of the variable values for some of the observations in a SAS data set. • There are options in SAS to control the format of the listing. • Selected options for PROC PRINT are illustrated in the examples that follow.
Examples – The examples below use data on two neurological assessment scales used in a cardiopulmonary bypass study. Data on pre-op, post-op and follow-up scores are printed in different ways to illustrate some of the options available for printing in SAS.
Week 11
Page 4 of 58
Week 11
SAS Procedures to Summarize Data
a Printing with variable labels and ID statement Example • The data in this example are arranged with one record per subject. Included in this record are pre, post, and follow-up scores for each of two assessment scales. • Scores are printed for all three periods, for the two scales. • When the keyword LABEL is included in the PROC PRINT statement, the variable labels are used in the column heading instead of the variable name. • TIP - It is also possible to assign new labels in the PROC PRINT procedure. • TIP - A split character can be used when creating variable labels. This character is used to split the labels into two or more lines for printing. To do this: - Instead of writing LABEL on the PROC PRINT statement, - Write SPLIT=’ ’, where the split character (which can be a space) is enclosed within the single quotes. • In this example, new labels using the split character * were defined for all variables. You can see the advantage of using a split: the column width for printing would be determined by the length of the variable label if it were not split. • The VAR statement is used to - select the variables to print, and to - control the order in which variables appear. • When the ID statement is used, no observation number is printed. Instead, the variable named after ID appears in the leftmost column, before the variables in the VAR statement. • IF NO VAR statement is written, ALL of the variables in the dataset are printed, and they are printed in the order in which they are stored.
Week 11
Page 5 of 58
Week 11
SAS Procedures to Summarize Data
Example *__________________________________________________; ** print neurologic summary scores for pre, post, **; ** & follow-up using label and ID options **; ************************************************************** * Use ‘*’ to define split character **; * print only first 10 observations **; PROC PRINT DATA=MNSCORE(OBS=10) SPLIT='*'; ** define labels indicating where to split into two lines **; LABEL PATID=’PATIENT*ID’ MSCORE1='PRE-OP*MATHEW*SCORE' MSCORE2='POST-OP*MATHEW*SCORE' MSCORE3='FOLLOW-UP*MATHEW*SCORE' NSCORE1='PRE-OP*NEURO*SCORE' NSCORE2='POST-OP*NEURO*SCORE' NSCORE3='FOLLOW-UP*NEURO*SCORE'; ** define the variable to put in first column in place of obs number **; ID PATID; ** name variables in order for printing **; VAR MSCORE1 MSCORE2 MSCORE3 NSCORE1 NSCORE2 NSCORE3; TITLE1 'LISTING OF NEUROLOGICAL TOTALS SCORES'; TITLE2 'FOR MATHEW AND NEUROLOGIC STANDARD ASSESSMENT SCALES'; RUN;
The following output is produced: LISTING OF NEUROLOGICAL TOTALS SCORES FOR MATHEW AND NEUROLOGIC STANDARD ASSESSMENT SCALES PATID ID 24 28 60 65 70 40 190 196 210 240
Week 11
MATHEW SCORE 100 99 100 100 99 97 100 100 98 100
PREOP MATHEW SCORE 97 87 100 98 97 99 100 100 85 98
POSTOP FOLLOWUP MATHEW NEURO SCORE SCORE 97 100 98 100 100 100 100 100 100 100 98 95 97 100 99 100 99 90 97 100
PREOP POSTOP FOLLOWUP NEURO NEURO SCORE SCORE 85 85 85 90 100 100 90 100 95 100 100 95 100 90 100 100 80 95 95 90
Page 6 of 58
Week 11
SAS Procedures to Summarize Data
b Printing with a BY statement
Any use of a BY statement must be preceded with a sort of the data!
Use of a BY statement allows you to print data that is sorted. • To print with a BY statement, the data must first be sorted by the BY variable(s). For example, if you wanted to print data that is sorted by LNAME (e.g. LNAME is the variable name for “last name”), the print instruction must be preceded by a sort instruction. This might look something like
proc sort data=temp; proc print data=temp;
by lname; by lname;
run; run;
• When a BY variable is used for printing, the data are grouped under a header line the gives information on the grouping or BY variable. • The second example uses the same data, but in this case it is arranged with multiple records per subject, for pre, post, and follow-up status. Here the data are printed in two ways, first by patient id number, and second by patient status. • Tip - When the same variable is named in both an ID statement and a BY statement, the grouping variable is listed in the first column, and not repeated for subsequent observations. • Each by-group is separated by a blank line; this makes clear the separation of groups.
Week 11
Page 7 of 58
Week 11
SAS Procedures to Summarize Data
• Here, labels have been created with the split character as part of the PROC PRINT instruction; however, this is not required. • Tip - Do include a split character in variable labels; it makes your output easier to read!. • Tip – And while you’re at it, nclude the variable name within the label. For example: LABEL PATID = ‘PATID:*Patient*ID’;
This also enhances readability since it produces the variable
name as well as its accompanying description in the output.
Week 11
Page 8 of 58
Week 11
SAS Procedures to Summarize Data
*************************************************************; ** print neurologic summary scores for pre, post, & **; ** follow-up using label and ID options and BY statements **; *************************************************************; ** create formats for patient status **; PROC FORMAT; VALUE PFMT 1='PRE-OP' 2='POST-OP' 3='FOLLOW-UP'; RUN; ** SORT to print grouped by PATID **; PROC SORT DATA=MNS2; BY PATID; RUN; * define split char, suppress printing obs # with NOOBS; * print grouped by patid; * print first 9 obs ; PROC PRINT DATA=MNS2(OBS=9) SPLIT='*' NOOBS; BY PATID; ** format patient status **; FORMAT PSTATUS PFMT.; ** define labels indicating where to split into two lines **; LABEL MATTOTAL='MATHEW*TOTAL SCORE' NTOTAL='NEUROLOGICAL*TOTAL SCORE' PSTATUS='PATIENT*STATUS'; ** name variables in order for printing **; VAR PSTATUS MATTOTAL NTOTAL; TITLE1 'LISTING OF NEUROLOGICAL TOTALS SCORES'; TITLE2 'FOR MATHEW AND NEUROLOGIC STANDARD ASSESSMENT SCALES'; TITLE3 'LISTED BY SUBJECT'; RUN;
Week 11
Page 9 of 58
Week 11
SAS Procedures to Summarize Data
Output follows. Note that no observation number is listed, due to the option NOOBS on the PROC PRINT statement. LISTING OF NEUROLOGICAL TOTALS SCORES FOR MATHEW AND NEUROLOGIC STANDARD ASSESSMENT SCALES LISTED BY SUBJECT --------------------- PATIENT ID NUMBER =24 --------------------PATIENT STATUS PRE-OP POST-OP FOLLOW-UP
MATHEW TOTAL SCORE
NEUROLOGICAL TOTAL SCORE
100 97 97
100 85 85
--------------------- PATIENT ID NUMBER = 28 -------------------PATIENT STATUS PRE-OP POST-OP FOLLOW-UP
MATHEW TOTAL SCORE
NEUROLOGICAL TOTAL SCORE
99 87 98
100 85 90
--------------------- PATIENT ID NUMBER = 60 -------------------PATIENT STATUS PRE-OP POST-OP FOLLOW-UP
Week 11
MATHEW TOTAL SCORE
NEUROLOGICAL TOTAL SCORE
100 100 100
100 100 100
Page 10 of 58
Week 11
SAS Procedures to Summarize Data
Recall - When the same variable is named in both an ID statement and a BY statement, the grouping variable is listed in the first column, and not repeated for subsequent observations.
Example – The following example uses BY statement and ID statement together, and groups the data by status rather than by patient. ** print the same data grouped by pstatus **; PROC SORT DATA=MNS2; BY PSTATUS; RUN; * print grouped by pstatus ; * define split char ; * ID the pstatus in first column ; PROC PRINT DATA=MNS2 SPLIT='*'; BY PSTATUS; ID PSTATUS; ** assign format for patient status **; FORMAT PSTATUS PFMT.; ** define labels indicating where to split into two lines **; LABEL PATID ='PATIENT*ID NO.' MATTOTAL='MATHEW*TOTAL SCORE' NTOTAL='NEUROLOGICAL*TOTAL SCORE' PSTATUS='PATIENT*STATUS'; ** name variables in order for printing **; VAR PATID MATTOTAL NTOTAL; TITLE1 'LISTING OF NEUROLOGICAL TOTALS SCORES'; TITLE2 'FOR MATHEW AND NEUROLOGIC STANDARD ASSESSMENT SCALES'; TITLE3 'LISTED BY STATUS, ID STATUS'; RUN;
Week 11
Page 11 of 58
Week 11
SAS Procedures to Summarize Data LISTING OF NEUROLOGICAL TOTALS SCORES FOR MATHEW AND NEUROLOGIC STANDARD ASSESSMENT SCALES LISTED BY STATUS, ID STATUS
PATIENT STATUS
PATIENT ID NO.
MATHEW TOTAL SCORE
NEUROLOGICAL TOTAL SCORE
PRE-OP
24 28 60 65 74
100 99 100 100 99
100 100 100 100 100
POST-OP
24 28 60 65 74
97 87 100 98 97
85 85 100 90 95
FOLLOW-UP
24 28 60 65 74
97 98 100 100 100
85 90 100 100 100
Note the variable listed in both the BY and ID statements is listed to the left, and written only once for each group. There are several more options to control printing, including line spacing (e.g., double or single), printing subtotals and column totals • See the PRINT procedure in the SAS Procedures Guide or use the online HELP.
Week 11
Page 12 of 58
Week 11 c
SAS Procedures to Summarize Data
Introduction to the WHERE statement: How to Print a Subset of Observations
The WHERE statement instructs SAS to perform its task on a selected set of observations. • WHERE is used to select which observations will be used in the procedure being performed. • For example, if we are interested in listing only data for the pre-operative assessment, this can be done by creating a subset data file – using a data step to take a subset of data with PSTATUS=1, and then using this new data set in PROC PRINT.
• Alternatively, PROC PRINT can be used with a WHERE statement, with the condition: WHERE PSTATUS=1;
to select the subset of interest. • Note - In the following example, notice the difference in observation numbers that appear on the data listing (if the OBS column is printed). * Part 1: create subset of data and print it *; DATA MNPSTAT1; SET MNS2; IF PSTATUS=1; RUN; * define split char, print subset ; PROC PRINT DATA=MNPSTAT1 SPLIT='*'; FORMAT PSTATUS PFMT.; LABEL PATID ='PATIENT*ID NO.' MATTOTAL='MATHEW*TOTAL SCORE' NTOTAL='NEUROLOGICAL*TOTAL SCORE' PSTATUS='PATIENT*STATUS'; VAR PSTATUS PATID MATTOTAL NTOTAL; TITLE1 'LISTING OF NEUROLOGICAL TOTALS SCORES'; TITLE2 'FOR MATHEW AND NEUROLOGIC STANDARD ASSESSMENT SCALES'; TITLE3 'PRE-OP SCORES ONLY - SUBSET DATA; RUN; * Part 2: print subset using WHERE statement *; PROC PRINT DATA=MNS2 SPLIT='*'; WHERE PSTATUS=1;
Week 11
Page 13 of 58
Week 11
SAS Procedures to Summarize Data
FORMAT PSTATUS PFMT.; LABEL PATID ='PATIENT*ID NO.' MATTOTAL='MATHEW*TOTAL SCORE' NTOTAL='NEUROLOGICAL*TOTAL SCORE' PSTATUS='PATIENT*STATUS'; VAR PSTATUS PATID MATTOTAL NTOTAL; TITLE1 'LISTING OF NEUROLOGICAL TOTALS SCORES'; TITLE2 'PRE-OP SCORES ONLY - USING WHERE STATEMENT; RUN;
LISTING OF NEUROLOGICAL TOTALS SCORES FOR MATHEW AND NEUROLOGIC STANDARD ASSESSMENT SCALES PRE-OP SCORES ONLY - SUBSET DATA
OBS
PATIENT STATUS
1 2 3 4 5
PRE-OP PRE-OP PRE-OP PRE-OP PRE-OP
PATIENT ID NO. 24 28 60 65 74
MATHEW TOTAL SCORE 100 99 100 100 99
NEUROLOGICAL TOTAL SCORE 100 100 100 100 100
---------------------------------------------------------------------------------------------------------------------------LISTING OF NEUROLOGICAL TOTALS SCORES PRE-OP SCORES ONLY - USING WHERE STATEMENT
OBS
PATIENT STATUS
1 4 7 10 13
PRE-OP PRE-OP PRE-OP PRE-OP PRE-OP
PATIENT ID NO. 24 28 60 65 74
MATHEW TOTAL SCORE
NEUROLOGICAL TOTAL SCORE
100 99 100 100 99
100 100 100 100 100
Note that the observation numbers differ. In this case, the data file MNS2 had been previously sorted by patient id and patient status, so that the WHERE statement selected every third observation, corresponding to pre-op status.
Week 11
Page 14 of 58
Week 11 2.
SAS Procedures to Summarize Data
How to Produce Summary Statistics
There are four procedures that provide basic descriptive statistics for continuous variables. • These are MEANS, SUMMARY, UNIVARIATE, and TABULATE. The SAS Procedures Guide provides a detailed comparison of the statistics available in each procedure. • The procedures differ in the choice of statistics that can be produced. They also differ in the formatting of results. • MEANS, SUMMARY and UNIVARIATE can be used to create output datasets containing summary statistics that can be used in other procedures, such as PROCs PRINT and REPORT, which give many options for controlling the formatting of the data.
a.
How to Use PROC MEANS and PROC SUMMARY
• PROCs MEANS and SUMMARY can be used to compute means, minimums, quantiles, maximums, standard deviations and standard errors, range, number of missing values, confidence intervals, plus several other statistics. • CLASS and/or BY statements can be used to compute the statistics separately for subgroups of observations. • For example, when computing statistics on subject AGE, using the statement BY SEX; would provide separate statistics on AGE for males and females, for a data set previously sorted by the variable SEX. • CLASS statements produce separate statistics for subgroups, along with overall statistics for the whole group. For example using the statement CLASS SEX; would produce statistics on AGE for all subjects, as well as for males and females separately. • Note - Use of a CLASS statement does not require a prior sort of the data.
Week 11
Page 15 of 58
Week 11
SAS Procedures to Summarize Data
The primary difference between PROC MEANS and PROC SUMMARY is in the defaults for printing. • PROC MEANS, by default, provides results in the output window, although an output data set containing the summary statistics can be requested. • PROC SUMMARY, by default, produces only an output data set, although results can be requested in the output window using the PRINT option.
The primary difference between CLASS and BY statements is that the format for printing is different. • A BY statement requires previous sorting and produces no overall statistics.
Each of the procedures, MEAN and SUMMARY produce a default set of statistics; however, you can request specific statistics. • This gives you control of the order the statistics appear in on the output, when the statistics are printed. • “How to” - Statistics are requested on the PROC statement, before the first semi-colon. • See the SAS Procedures Manual or the online documentation for details of available statistics. Example - Producing summary statistics for two neurological assessment scales • This example uses produces summary statistics on the neurologic assessment scales using PROC MEANS and the PROC SUMMARY. • Creation of an output data set, subsequently printed with PROC PRINT is also illustrated.
Week 11
Page 16 of 58
Week 11
SAS Procedures to Summarize Data
• TIP - Use the MAXDEC option to control decimal places printed in the output! This is illustrated in the example. If you don’t use the MAXDEC option, the default is 8 places after the decimal point – and no one should have to look at that much nonsense. OPTIONS LINESIZE=78 PAGESIZE=60; LIBNAME CPB 'C:\temp'; *************************************************************; ** get means of neurologic scores by patient status **; ** do this with PROC MEANS and SUMMARY to show options **; *************************************************************; ** create formats for patient status **; PROC FORMAT; VALUE PFMT 1='PRE-OP' 2='POST-OP' 3='FOLLOW-UP'; RUN; ** sort by pstatus **; PROC SORT DATA=MNS2; BY PSTATUS; RUN; * name statistics, allow 2 dec places for printing *; PROC MEANS DATA=MNS2 N MEAN STD STDERR MIN MAX MAXDEC=2; * compute separately for each period; BY PSTATUS; ** format patient status **; FORMAT PSTATUS PFMT.; ** name variables to compute statistics for; VAR MATTOTAL NTOTAL; ** define output data set, variable names for statistics **; OUTPUT OUT=MNMEANS MEAN=MEANMAT MEANN STD=STDMAT STDN STDERR=SE_MAT SE_N MIN=MINMAT MINN MAX=MAXMAT MAXN; TITLE1 'SUMMARY STATISTICS'; TITLE2 'FOR MATHEW AND NEUROLOGIC STANDARD ASSESSMENT SCALES'; TITLE3 'USING PROC MEANS WITH A BY STATEMENT';
Week 11
Page 17 of 58
Week 11
SAS Procedures to Summarize Data
RUN; PROC PRINT DATA=MNMEANS; ID PSTATUS; FORMAT PSTATUS PFMT.; TITLE1 'OUTPUT DATA SET FROM MEANS PROCEDURE WITH BY STATEMENT'; RUN;
Following is the output. SUMMARY STATISTICS FOR MATHEW AND NEUROLOGIC STANDARD ASSESSMENT SCALES USING PROC MEANS WITH A BY STATEMENT --------------------- STATUS PRE, POST OR FOLLOW-UP=PRE-OP --------------------N Obs Variable Label N Minimum Maximum Mean Std Dev Std Error -----------------------------------------------------------------------------------32 MATTOTAL MATHEW TOTAL SCORE 32 95.00 100.00 99.03 1.37 0.24 NTOTAL NEUROLOGICAL TOTAL 32 90.00 100.00 98.43 3.22 0.57 --------------------------------------------------------------------------------------------------------- STATUS PRE, POST OR FOLLOW-UP=POST-OP ----------------------N Obs Variable Label N Minimum Maximum Mean Std Dev Std Error -----------------------------------------------------------------------------------31 MATTOTAL MATHEW TOTAL SCORE 31 73.00 100.00 96.48 5.73 1.03 NTOTAL NEUROLOGICAL TOTAL 31 80.00 100.00 94.67 6.31 1.13 -------------------------------------------------------------------------------------------------------- STATUS PRE, POST OR FOLLOW-UP=FOLLOW-UP ---------------------N Obs Variable Label N Minimum Maximum Mean Std Dev Std Error -----------------------------------------------------------------------------------28 MATTOTAL MATHEW TOTAL SCORE 28 94.00 100.00 98.39 1.39 0.26 NTOTAL NEUROLOGICAL TOTAL 28 85.00 100.00 96.42 4.48 0.84 -----------------------------------------------------------------------------------
OUTPUT DATA SET FROM MEANS PROCEDURE WITH BY STATEMENT PSTATUS
_TYPE_
PRE-OP POST-OP FOLLOW-UP
Week 11
0 0 0
_FREQ_ 32 31 28
MEANMAT
MEANN
STDMAT
STDN
SE_MAT
SE_N
99.03 96.48 98.39
98.43 94.67 96.42
1.37 5.73 1.39
3.22 6.31 4.48
0.24 1.03 0.26
0.56 1.13 0.84
MINMAT 95 73 94
MINN 90 80 85
MAXMAT
MAXN
100 100 100
100 100 100
Page 18 of 58
Week 11
SAS Procedures to Summarize Data
************************************************************; ** repeat, using PROC SUMMARY with a class statement **; ************************************************************; * name input data set, & print results *; * to output window *; PROC SUMMARY DATA=MNS2 PRINT MAXDEC=2 N MEAN STD STDERR MIN MAX ; /* name statistics*/ * compute separately for each period and overall; CLASS PSTATUS; ** format patient status **; FORMAT PSTATUS PFMT.; ** name variables to compute statistics for; VAR MATTOTAL NTOTAL; ** create output for using in subsequent step, giving name **; ** for each statistic for each variable **; OUTPUT OUT=MNSUMM MEAN=MEANMAT MEANN STD=STDMAT STDN STDERR=SE_MAT SE_N MIN=MINMAT MINN MAX=MAXMAT MAXN; RUN; PROC PRINT DATA=MNSUMM; FORMAT PSTATUS PFMT.; ID PSTATUS; TITLE1 'SUMMARY STATISTICS'; TITLE2 'FOR MATHEW AND NEUROLOGIC STANDARD ASSESSMENT SCALES'; TITLE3 'USING SUMMARY WITH A CLASS STATEMENT'; RUN;
Following is the output. SUMMARY STATISTICS FOR MATHEW AND NEUROLOGIC STANDARD ASSESSMENT SCALES USING SUMMARY WITH A CLASS STATEMENT PSTATUS _TYPE_ _FREQ_ MEANMAT MEANN STDMAT STDN SE_MAT SE_N MINMAT MINN MAXMAT 0 PRE-OP 1 POST-OP 1 FOLLOW 1
Week 11
91 32 31 28
97.96 99.03 96.48 98.39
96.53 98.43 94.67 96.42
2.48 1.37 5.73 1.39
4.78 3.22 6.31 4.48
0.26 0.24 1.03 0.26
0.50 0.56 1.13 0.84
73 95 73 94
80 90 80 85
100 100 100 100
MAXN 100 100 100 100
Page 19 of 58
Week 11
SAS Procedures to Summarize Data
• MEANS produces results to the output window, by default, and this is shown, along with a printing of the requested output data set. • SUMMARY produces overall statistics for all observations, in addition to subgroup statistics, when the class statement is used. • In this example, the overall statistics are not particularly meaningful, so the advantages of using a CLASS statement are not important. • Explanation of the _TYPE_ Variable - The _TYPE_ variable is a SAS produced variable. It is produced by both MEANS and SUMMARY when an output dataset is requested. _TYPE_ indicates the level of breakdown. 0 indicates overall statistics; 1 indicates that 1 level of subgrouping is used, etc. When several variables are used in a CLASS statement, such as CLASS SEX PSTATUS; then statistics will be produced for females pre-op/ females postop/ females follow-up which would have _TYPE_=2, for a 2 level breakdown. Statistics for all females and for all males would be produced with _TYPE_=1, as well as for all pre-op, etc. would be produced. • Again, many options are available for controlling the ways in which variable groups and subgroups are defined, which makes these procedures very powerful for summarizing data.
Week 11
Page 20 of 58
Week 11
SAS Procedures to Summarize Data
b How to Use PROC UNIVARIATE PROC UNIVARIATE also produces descriptive statistics for continuous numeric variables –along with greater detail (including selected graphical descriptions) on the distribution of the variables. •
UNIVARIATE can be used to produce percentiles, such as the 10th and 90th percentiles (or any other percentile you desire). There are several options for computing percentiles.
•
Tests for normality of the distribution of the data are also available, along with a normal probability plot, stem-and-leaf and box-and-whisker plots.
• The five smallest and five largest values can also be identified by an ID variable – which is useful when identifying cases with outliers. The number and percent missing values for a variable are also reported. • UNIVARIATE can also be used with a BY statement, for previously sorted data, to produce separate statistics for each group. • Tip – Use a PROC UNIVARIATE with a BY statement to get a side-by-side box-and-whisker plots; these allow you to compare visually the distribution of groups on a variable of interest. • WARNING !!! PROC UNIVARIATE can take a lot of time, and produce tons of pages of output. This is especially true if you are not careful in defining your variable list, or you use a BY variable that has many groups. • WARNING or TIP ??? (you decide) PROC UNIVARIATE will produce statistics for the group (or groups) defined by missing values for the BY variable. This is because missing values define a group. •
Along with producing output, PROC UNIVARIATE can be used to produce an output data set. You can specify any set of percentile values to be included in the output data set can be requested, along with the default set of output variables.
Week 11
Page 21 of 58
Week 11
SAS Procedures to Summarize Data
Example - The example that follows uses data from a study of peri-operative beta blocker use in surgical patients. options pagesize=55 linesize=78 nocenter nodate nonumber; ***********************************************************; ** Example using Proc Univariate **; ***********************************************************; libname bb 'C:\bblocker'; proc univariate plot normal data=bb.bblock1; id counter; var age; title1 'Univariate statistics on Age'; run; proc sort data=bb.bblock1; by gender; run; ** summary stats separately by gender **; proc univariate plot data=bb.bblock1; id counter; by gender; var age; title1 'Grouping by Patient Gender'; run;
Week 11
Page 22 of 58
Week 11
SAS Procedures to Summarize Data
The following output is produced. Univariate statistics on Age The UNIVARIATE Procedure Variable: AGE (AGE) Moments N Mean Std Deviation Skewness Uncorrected SS Coeff Variation
Week 11
158 60.2088608 15.6742532 -0.1982307 611339 26.0331337
Sum Weights Sum Observations Variance Kurtosis Corrected SS Std Error Mean
158 9513 245.682214 -0.7009588 38572.1076 1.24697663
Page 23 of 58
Week 11
SAS Procedures to Summarize Data
Location Mean Median Mode
Basic Statistical Measures Variability
60.20886 62.00000 70.00000
Std Deviation Variance Range Interquartile Range
15.67425 245.68221 72.00000 25.00000
NOTE: The mode displayed is the smallest of 2 modes with a count of 6. Tests for Location: Mu0=0 Test
-Statistic-
-----p Value------
Student's t Sign Signed Rank
t M S
Pr > |t| Pr >= |M| Pr >= |S|
48.28387 79 6280.5
<.0001 <.0001 <.0001
Tests for Normality Test
--Statistic---
-----p Value------
Shapiro-Wilk Kolmogorov-Smirnov Cramer-von Mises Anderson-Darling
W D W-Sq A-Sq
Pr Pr Pr Pr
0.980877 0.085311 0.205514 1.096303
< > > >
W D W-Sq A-Sq
0.0275 <0.0100 <0.0050 0.0073
Quantiles (Definition 5) Quantile 100% Max 99% 95% 90% 75% Q3 50% Median 25% Q1 10% 5% 1% 0% Min
Week 11
Estimate 93 92 85 79 72 62 47 40 35 25 21
Page 24 of 58
Week 11
SAS Procedures to Summarize Data
Extreme Observations ----------Lowest-----------------Highest--------Value COUNTER Obs Value COUNTER Obs 21 25 25 27 31 Stem 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2
1 84 8 109 75
1 84 8 109 75
Leaf 23 5566778 1114 55666778888899999 00000011122222333333444 5556667777788899999 0001122233344 555566666778899 000111112244 5556667777889999 011122223333444 56778999 112 557 1 ----+----+----+----+--Multiply Stem.Leaf by 10**+1 Variable: AGE (AGE)
87 87 88 92 93 # 2 7 4 17 23 19 13 15 12 16 15 8 3 3 1
18 27 102 65 9
18 27 102 65 9
Boxplot | | | | +-----+ | | *--+--* | | | | +-----+ | | | | |
Normal Probability Plot 92.5+ +++* * | **+* * | +** | ***** | *****+ | ****++ | ***++ 57.5+ *** | +** | **** | ***** | ****+ | *** | *+** 22.5+*++ +----+----+----+----+----+----+----+----+----+----+ -2 -1 0 +1 +2
Week 11
Page 25 of 58
Week 11
SAS Procedures to Summarize Data
When a BY statement is used, the same output is generated separately for each group defined by the BY variable. •
In addition, a side-by-side Box-and-Whisker plot to compare all groups is produced, as in the example below.
•
Note - The separate statistics for the 2 gender groups are not shown in the output that follows, as it is too many pages.
Grouping by Patient Gender The UNIVARIATE Procedure Schematic Plots 100 + | | | | 90 + | | | | | | | | | 80 + | | | | | | | | | | +-----+ 70 + +-----+ | | | | | | | | | | | | | | | *-----* 60 + *--+--* | + | | | | | | | | | | | | | | | | 50 + +-----+ | | | | | | | | +-----+ | | | 40 + | | | | | | | | | | | 30 + | | | | | | | | | | 20 + | ------------+-----------+----------GENDER 1 2
Week 11
Page 26 of 58
Week 11
SAS Procedures to Summarize Data
c How to Use PROC TABULATE PROC TABULATE procedure is initially confusing to use but bear with it; it is a powerful tool for producing nicely formatted tables of descriptive statistics for groups and subgroups of classification variables. • In particular, PROC TABULATE can be used to produce formatted tables that can be incorporated into a report directly, or with only minor modification. • Moreover, while all of the statistics available in TABULATE (plus more) can be produced in other procedures, TABULATE provides tremendous flexibility in the formulation of tables • Counts and percentages for categorical variables can also be reported using PROC TABULATE. • TIP - PROC TABULATE can also be used to print tables of results from other procedures (such as regression analyses, correlations, p-values, etc)! This is possible when these results have been stored in output data sets.
There is a higher start up learning time for using PROC TABULATE than the other procedures that produce summary statistics, but it’s worth it!! • The reason is – once you’ve survived the learning time, you are later spared the task of copying numbers, or repetitive cut and paste from crudely formatted output into tables for a report. This is especially true for summary reports that are produced at regular intervals throughout a study.
Week 11
Page 27 of 58
Week 11
SAS Procedures to Summarize Data
PROC TABULATE requires the specification of CLASS (categorical) variables used to form groups and subgroups, and continuous numeric analysis variables (identified on a VAR statement) for which statistics are produced. • A TABLE statement is used to define the rows, columns, and pages of a TABLE, along with the statistics to be produced, and the format for printing them. • LABEL and FORMAT statements can be used to provide more descriptive information for variable names and values. • KEYLABEL can be used to provide more descriptive row and column titles for the statistics requested. For example in the place of N as a column heading for the number of observations, a KEYLABEL statement would allow you to use the phrase "NO. OF OBS". • PROC TABULATE is described in the SAS Procedures Guide. In addition, there is a separate manual devoted entirely to PROC TABULATE. Description of the options is also available in the online documentation.
Following are a few examples of PROC TABULATE. They illustrate some of the different ways of printing the same summary statistics for a data set.
Example • The data for this example come from a study of functional status outcome six months postcardiac catheterization procedure. • Patients completed a functional status questionnaire pre-operatively, and again, 6 months post-operatively.
Week 11
Page 28 of 58
Week 11
SAS Procedures to Summarize Data
• Of interest was a comparison of diabetic cardiac catheterization patients versus non-diabetic cardiac catheterization patients with respect to their change over time in functional status. Functional status was assessed using 2 scores: physical functioning and mental functioning. • In the tables that follow, summary statistics are printed for change in physical and mental function scores. These summary statistics are reported for (1) diabetic patients, (2) nondiabetic patients, (3) groups defined by age group, and (4) the entire study cohort. • The examples that follow serve to illustrate the control in table formatting that is available.
TIP - While it is possible to name and create several tables in a single PROC TABULATE procedure (before the RUN; statement), it is recommended that you request separate TABLE statements; this has to do with the titles statements. • Only one set of titles can be specified for a procedure -- to produce new titles for a new table, the PROC TABULATE, CLASS and VAR statements must be repeated.
Guidelines for specification of the TABLE statement in PROC TABULATE. • There are no default statistics produced with PROC TABULATE. • Table rows - statement factors listed before a comma (,) define the table rows. • Table columns – statement factors listed after the comma (,) define the table columns. • An asterisk is used to separate sub-grouping within rows or columns.
Week 11
Page 29 of 58
Week 11
SAS Procedures to Summarize Data
• How to format the printing of a statistic – To accomplish formatting, the statistic name is followed by a formatting definition. (1) For example, MEAN*F=8.2 requests that the mean be printed, using 8 columns (including 1 for the decimal place) with 2 of the 8 after the decimal place (i.e., #####.##).
(2) Alternatively, a single format can be defined for printing all statistics, by using the phrase FORMAT=8.2 on the PROC TABULATE statement (see final example).
• The keyword ALL is used to get overall statistics in addition to subgroup statistics (e.g., DIAB ALL will produce statistics for diabetics, non-diabetics and overall). * Example 1: Proc Tabulate *; * Statistics in columns, rows defined by class variables *; PROC TABULATE DATA=FD2 ; CLASS DIAB AGEGROUP ; VAR PF2_1 MF2_1;
* define grouping variables *; * define analysis variables *
* define table with rows for diab status, *; * and sub-rows for age group *; TABLES (PF2_1) * ((DIAB ALL)*(AGEGROUP ALL)) , (N*F=5. MEAN*F=8.1 STD*F=8.2 MIN*F=8. MAX*F=8.); * assign formats and labels to variables *; FORMAT DIAB dfmt.; LABEL PF2_1='PF2_1: Physical Function Change' MF2_1='MF2_1: Mental Function Change'; TITLE1 'SUMMARY STATISTICS FOR CHANGE SCORES'; TITLE2 'Example 1: Age Groups within Diabetic Status'; TITLE3 ' Using default column headings'; RUN; * repeat for mental function change, using KEYLABEL
Week 11
*;
Page 30 of 58
Week 11
SAS Procedures to Summarize Data
* separate procedure used to illustrate options, change titles *; PROC TABULATE DATA=FD2 ; CLASS DIAB AGEGROUP ; VAR PF2_1 MF2_1; TABLES (MF2_1) * ((DIAB ALL)*(AGEGROUP ALL)) , (N*F=5. MEAN*F=8.1 STD*F=8.2 MIN*F=8. MAX*F=8.); FORMAT DIAB dfmt.; LABEL PF2_1='PF2_1: Physical Function Change' MF2_1='MF2_1: Mental Function Change'; KEYLABEL ALL=TOTAL N='# of OBS'; TITLE1 'SUMMARY STATISTICS FOR CHANGE SCORES'; TITLE2 'Example 1: Age Groups within Diabetic Status'; TITLE3 ' using KEYLABEL to rename col and row headings'; RUN;
The following output is produced. SUMMARY STATISTICS FOR CHANGE SCORES Example 1: Age Groups within Diabetic Status Using default column headings „ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ…ƒƒƒƒƒ…ƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒ† ‚ ‚ N ‚ Mean ‚ Std ‚ Min ‚ Max ‚ ‡ƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚PF2_1: ‚DIAB ‚AGEGRO-‚ ‚ ‚ ‚ ‚ ‚ ‚Physic-‡ƒƒƒƒƒƒƒ‰UP ‚ ‚ ‚ ‚ ‚ ‚ ‚al ‚Nondia-‡ƒƒƒƒƒƒƒ‰ ‚ ‚ ‚ ‚ ‚ ‚Functi-‚betic ‚<65 ‚ 345‚ 5.7‚ 13.03‚ -29‚ 41‚ ‚on ‚ ‡ƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚Change ‚ ‚>=65 ‚ 287‚ 2.2‚ 11.57‚ -28‚ 36‚ ‚ ‚ ‡ƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚ ‚All ‚ 632‚ 4.1‚ 12.50‚ -29‚ 41‚ ‚ ‡ƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚Diabet-‚AGEGRO-‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ic ‚UP ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‡ƒƒƒƒƒƒƒ‰ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚<65 ‚ 79‚ 1.6‚ 12.16‚ -28‚ 37‚ ‚ ‚ ‡ƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚ ‚>=65 ‚ 98‚ 0.7‚ 11.85‚ -28‚ 36‚ ‚ ‚ ‡ƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚ ‚All ‚ 177‚ 1.1‚ 11.96‚ -28‚ 37‚ ‚ ‡ƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚All ‚AGEGRO-‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚UP ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‡ƒƒƒƒƒƒƒ‰ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚<65 ‚ 424‚ 4.9‚ 12.95‚ -29‚ 41‚ ‚ ‚ ‡ƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚ ‚>=65 ‚ 385‚ 1.8‚ 11.64‚ -28‚ 36‚ ‚ ‚ ‡ƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚ ‚All ‚ 809‚ 3.4‚ 12.44‚ -29‚ 41‚ Šƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒ‹ƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒŒ
Week 11
Page 31 of 58
Week 11
SAS Procedures to Summarize Data
SUMMARY STATISTICS FOR CHANGE SCORES Example 1: Age Groups within Diabetic Status using KEYLABEL to rename col and row headings „ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ…ƒƒƒƒƒ…ƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒ† ‚ ‚# of ‚ ‚ ‚ ‚ ‚ ‚ ‚ OBS ‚ Mean ‚ Std ‚ Min ‚ Max ‚ ‡ƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚MF2_1: ‚DIAB ‚AGEGRO-‚ ‚ ‚ ‚ ‚ ‚ ‚Mental ‡ƒƒƒƒƒƒƒ‰UP ‚ ‚ ‚ ‚ ‚ ‚ ‚Functi-‚Nondia-‡ƒƒƒƒƒƒƒ‰ ‚ ‚ ‚ ‚ ‚ ‚on ‚betic ‚<65 ‚ 345‚ -0.2‚ 11.48‚ -52‚ 32‚ ‚Change ‚ ‡ƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚ ‚>=65 ‚ 287‚ 0.5‚ 9.90‚ -30‚ 28‚ ‚ ‚ ‡ƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚ ‚TOTAL ‚ 632‚ 0.1‚ 10.79‚ -52‚ 32‚ ‚ ‡ƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚Diabet-‚AGEGRO-‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ic ‚UP ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‡ƒƒƒƒƒƒƒ‰ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚<65 ‚ 79‚ -0.2‚ 10.17‚ -26‚ 20‚ ‚ ‚ ‡ƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚ ‚>=65 ‚ 98‚ 0.6‚ 8.85‚ -18‚ 21‚ ‚ ‚ ‡ƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚ ‚TOTAL ‚ 177‚ 0.2‚ 9.45‚ -26‚ 21‚ ‚ ‡ƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚TOTAL ‚AGEGRO-‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚UP ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‡ƒƒƒƒƒƒƒ‰ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚<65 ‚ 424‚ -0.2‚ 11.23‚ -52‚ 32‚ ‚ ‚ ‡ƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚ ‚>=65 ‚ 385‚ 0.6‚ 9.64‚ -30‚ 28‚ ‚ ‚ ‡ƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚ ‚TOTAL ‚ 809‚ 0.1‚ 10.50‚ -52‚ 32‚ Šƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒ‹ƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒŒ
Week 11
Page 32 of 58
Week 11
SAS Procedures to Summarize Data
* Example 2: Proc Tabulate *; * Statistics in rows, groups and subgroups in columns *; PROC TABULATE DATA=FD2 ; CLASS DIAB AGEGROUP ; VAR PF2_1 MF2_1;
* define grouping variables *; * define analysis variables *;
* define table with rows for change scores *; * and statistics and columns for groups, *; * and sub-groups *; TABLES (PF2_1 MF2_1) * (N*F=5. MEAN*F=8.1 STD*F=8.2 MIN*F=8. MAX*F=8.) , ((DIAB ALL)*(AGEGROUP ALL)) ; * assign formats and labels to variables *; * and keylabels to headers *; FORMAT DIAB dfmt.; LABEL PF2_1='PF2_1: Physical Function Change' MF2_1='MF2_1: Mental Function Change'; KEYLABEL ALL=TOTAL N='# of OBS'; TITLE1 'SUMMARY STATISTICS FOR CHANGE SCORES'; TITLE2 'Example 2: Subgroups in Columns'; RUN;
Week 11
Page 33 of 58
Week 11
SAS Procedures to Summarize Data
SUMMARY STATISTICS FOR CHANGE SCORES Example 2: statistics in rows and Subgroups in Columns „ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ† ‚ ‚ DIAB ‚ ‚ ‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ‰ ‚ ‚ ‚ Nondiabetic ‚ Diabetic ‚ TOTAL ‚ ‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒ‰ ‚ ‚ AGEGROUP ‚ ‚ AGEGROUP ‚ ‚ AGEGROUP ‚ ‚ ‚ ‡ƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒ‰ ‡ƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒ‰ ‡ƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒ‰ ‚ ‚ ‚ <65 ‚ >=65 ‚ TOTAL ‚ <65 ‚ >=65 ‚ TOTAL ‚ <65 ‚ >=65 ‚ TOTAL ‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚PF2_1: Physical ‚# of OBS ‚ 345‚ 287‚ 632‚ 79‚ 98‚ 177‚ 424‚ 385‚ 809‚ ‚Function Change ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚Mean ‚ 5.7‚ 2.2‚ 4.1‚ 1.6‚ 0.7‚ 1.1‚ 4.9‚ 1.8‚ 3.4‚ ‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚Std ‚ 13.03‚ 11.57‚ 12.50‚ 12.16‚ 11.85‚ 11.96‚ 12.95‚ 11.64‚ 12.44‚ ‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚Min ‚ -29‚ -28‚ -29‚ -28‚ -28‚ -28‚ -29‚ -28‚ -29‚ ‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚Max ‚ 41‚ 36‚ 41‚ 37‚ 36‚ 37‚ 41‚ 36‚ 41‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚MF2_1: Mental ‚# of OBS ‚ 345‚ 287‚ 632‚ 79‚ 98‚ 177‚ 424‚ 385‚ 809‚ ‚Function Change ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚Mean ‚ -0.2‚ 0.5‚ 0.1‚ -0.2‚ 0.6‚ 0.2‚ -0.2‚ 0.6‚ 0.1‚ ‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚Std ‚ 11.48‚ 9.90‚ 10.79‚ 10.17‚ 8.85‚ 9.45‚ 11.23‚ 9.64‚ 10.50‚ ‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚Min ‚ -52‚ -30‚ -52‚ -26‚ -18‚ -26‚ -52‚ -30‚ -52‚ ‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚Max ‚ 32‚ 28‚ 32‚ 20‚ 21‚ 21‚ 32‚ 28‚ 32‚ Šƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒŒ
Week 11
Page 34 of 58
Week 11
SAS Procedures to Summarize Data
* Example 3: Proc Tabulate *; nd * Statistics in and 1 group in rows, 2 group in columns *; * reset linesize option so table will fit across page *; options ls=175; PROC TABULATE DATA=FD2 ; CLASS DIAB AGEGROUP ; VAR PF2_1 MF2_1; TABLES (PF2_1 MF2_1) * (DIAB ALL) , (AGEGROUP ALL) * (N*F=5. MEAN*F=8.1 STD*F=8.2 MIN*F=8. MAX*F=8.); FORMAT DIAB dfmt.; LABEL PF2_1='PF2_1: Physical Function Change' MF2_1='MF2_1: Mental Function Change'; KEYLABEL ALL=TOTAL N='# of OBS'; TITLE1 'SUMMARY STATISTICS FOR CHANGE SCORES'; TITLE2 'Example 3: Cross-classifying Diab (row) and Age Groups (Col)'; RUN;
Week 11
Page 35 of 58
Week 11
SAS Procedures to Summarize Data
SUMMARY STATISTICS FOR CHANGE SCORES Example 3: Cross-classifying Diab (row) and Age Groups (Col) „ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ† ‚ ‚ AGEGROUP ‚ ‚ ‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ‰ ‚ ‚ ‚ <65 ‚ >=65 ‚ TOTAL ‚ ‚ ‡ƒƒƒƒƒ…ƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒˆƒƒƒƒƒ…ƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒˆƒƒƒƒƒ…ƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒ‰ ‚ ‚# of ‚ ‚ ‚ ‚ ‚# of ‚ ‚ ‚ ‚ ‚# of ‚ ‚ ‚ ‚ ‚ ‚ ‚ OBS ‚ Mean ‚ Std ‚ Min ‚ Max ‚ OBS ‚ Mean ‚ Std ‚ Min ‚ Max ‚ OBS ‚ Mean ‚ Std ‚ Min ‚ Max ‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚PF2_1: Physical ‚DIAB ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚Function Change ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ‰ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚Nondiabetic ‚ 345‚ 5.7‚ 13.03‚ -29‚ 41‚ 287‚ 2.2‚ 11.57‚ -28‚ 36‚ 632‚ 4.1‚ 12.50‚ -29‚ 41‚ ‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚Diabetic ‚ 79‚ 1.6‚ 12.16‚ -28‚ 37‚ 98‚ 0.7‚ 11.85‚ -28‚ 36‚ 177‚ 1.1‚ 11.96‚ -28‚ 37‚ ‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚TOTAL ‚ 424‚ 4.9‚ 12.95‚ -29‚ 41‚ 385‚ 1.8‚ 11.64‚ -28‚ 36‚ 809‚ 3.4‚ 12.44‚ -29‚ 41‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚MF2_1: Mental ‚DIAB ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚Function Change ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ‰ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‚Nondiabetic ‚ 345‚ -0.2‚ 11.48‚ -52‚ 32‚ 287‚ 0.5‚ 9.90‚ -30‚ 28‚ 632‚ 0.1‚ 10.79‚ -52‚ 32‚ ‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚Diabetic ‚ 79‚ -0.2‚ 10.17‚ -26‚ 20‚ 98‚ 0.6‚ 8.85‚ -18‚ 21‚ 177‚ 0.2‚ 9.45‚ -26‚ 21‚ ‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚ ‚TOTAL ‚ 424‚ -0.2‚ 11.23‚ -52‚ 32‚ 385‚ 0.6‚ 9.64‚ -30‚ 28‚ 809‚ 0.1‚ 10.50‚ -52‚ 32‚ Šƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒŒ
Week 11
Page 36 of 58
Week 11
SAS Procedures to Summarize Data
Additional options in TABULATE let you define if and where you want lines in the tables. •
For example you may choose not to have vertical or horizontal separators. This is useful for producing tables for publication – some journals ask that tables be presented without lines.
The next example illustrates reporting counts (N) and percentages (PCTN) using PROC TABULATE. •
Note that all the variables are categorical, or CLASS variables.
•
In this example a single format is defined for the whole table, on the PROC line, rather than a different format for each statistic. This works here, because counts will always be whole numbers, and I’m content to report percents rounded to a whole percent.
•
How to get column percents - Column percents are defined by listing the ROW variables within < > after the keyword PCTN.
•
How to get row percents - To get row percents, the Column variable (DIAB in this case) would be listed (e.g., PCTN) instead of the row variables.
•
The KEYLABEL statement is used to replace the word ‘ALL’ on the output with the word ‘TOTAL’.
Week 11
Page 37 of 58
Week 11
SAS Procedures to Summarize Data
** Example 4: Counts and Percentages **; ** Using only CLASS (categorical) variables **; ** Rows: sex, race, age group **; ** Cols: diabetic status **; ** row percents reported **; PROC TABULATE DATA=FD2 FORMAT=8.0; CLASS DIAB SEX RACE2 AGEGROUP ; TABLES (SEX RACE2 AGEGROUP ALL), (DIAB ALL)*(N PCTN<SEX RACE2 AGEGROUP ALL>='PERCENT') / RTS=18; FORMAT SEX $SEXF. DIAB DIABF. RACE2 RACEF.; KEYLABEL ALL=TOTAL; TITLE1 'DESCRIPTIVE TABLE FOR DEMOGRAPHICS'; TITLE2 'Example 4: Counts and Percentages '; RUN;
DESCRIPTIVE TABLE FOR DEMOGRAPHICS Example 4: Counts and Percentages „ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ† ‚ ‚ DIABETIC*STATUS ‚ ‚ ‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ‰ ‚ ‚ ‚ Nondiabetic ‚ Diabetic ‚ TOTAL ‚ ‚ ‡ƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ…ƒƒƒƒƒƒƒƒ‰ ‚ ‚ N ‚PERCENT ‚ N ‚PERCENT ‚ N ‚PERCENT ‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚SEX ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ‰ ‚ ‚ ‚ ‚ ‚ ‚ ‚Female ‚ 837‚ 31‚ 400‚ 41‚ 1237‚ 34‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚Male ‚ 1875‚ 69‚ 576‚ 59‚ 2451‚ 66‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚RACE2 ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ‰ ‚ ‚ ‚ ‚ ‚ ‚ ‚Caucasian ‚ 2589‚ 95‚ 844‚ 86‚ 3433‚ 93‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚Other ‚ 123‚ 5‚ 132‚ 14‚ 255‚ 7‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚AGEGROUP ‚ ‚ ‚ ‚ ‚ ‚ ‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ‰ ‚ ‚ ‚ ‚ ‚ ‚ ‚<65 ‚ 1566‚ 58‚ 496‚ 51‚ 2062‚ 56‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚>=65 ‚ 1146‚ 42‚ 480‚ 49‚ 1626‚ 44‚ ‡ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒ‰ ‚TOTAL ‚ 2712‚ 100‚ 976‚ 100‚ 3688‚ 100‚ Šƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒ‹ƒƒƒƒƒƒƒƒŒ
Week 11
Page 38 of 58
Week 11
3
Procedures to Summarize Data
How to Produce Frequency Tables and Cross-Tabulations (PROC FREQ)
PROC FREQ produces frequency table summaries of the distributions of discrete numeric or character variables. • PROC FREQ can produce cross-tabulations, as well as one-way frequency listings. • Counts and percentages are produced for each group defined by variable values, or by the crossing of variable values. • Chi-square tests and other measures of association can also be produced by PROC FREQ (these will not be discussed in this course). • PROC FREQ can also be used to produce an output data set in addition to, or in place of results in the output window.
A TABLES statement is used to define the frequency tables. • To produce one-way tables, simply list the variable(s) on the TABLES statement separated by spaces. • To get cross-tabulations, name the variables separated by an asterisk (i.e., VAR1 * VAR2). To define rows - The first named variable (var1) defines the rows. To define columns - The second named variable (var2) defines the columns. • You can list any combination of individual variables, cross-tabulations and multi-way tables on a single TABLES statement. Multiple TABLES statements can be given in the same procedure. • TIP - Only one set of titles may be given for all the tables requested in a PROC FREQ Thus, titles may be the limiting factor in the number of tables requested in a single procedure.
Week 11
Page 39 of 58
Week 11
Procedures to Summarize Data
There are shorthand methods for listing variables in the TABLES statement. • If a set of variables has the same prefix with sequential numbering, name the first and last variable, separated by a single hyphen, no spaces. For example, the statement TABLES score1-score10;
would produce frequency tables for all ten variables, score1 through score10. •
For variables without a common prefix, list the first and last in order as they appear in a data set (use PROC CONTENTS POSITION to know correct order), separated by a double hyphen, no spaces. All the variables listed in this manner must be of the same type, character or numeric. For example, the line TABLES age--insur;
would produce tables for age, insur, and all variables in order between them in the data set, as long as they are all numeric or all categorical.
There are also shorthand ways of listing cross-tabulations. • For example if you want cross-tabulations of sex by a whole set of variables use TABLES SEX * (firstvar--lastvar);
• WARNING!! Be careful requesting a group of variables crossed by another group – listing (var1--varn) * (varlist1--varlistn) – you may end up with a lot more tables than you bargained for!
Week 11
Page 40 of 58
Week 11
Procedures to Summarize Data
• For multi-way tables, (i.e., A*B*C), the first variable defines the table, the second the rows, and the third the columns. For example the statement TABLES SEX * AGEGR * HSMOKE;
would produce two cross-tabulations, one table for males and one for females, of age group against smoking status (light, moderate, heavy). • An alternative way to achieve the result would be to request tables of AGEGR * HSMOKE, using a separate BY SEX statement. This requires that the data set be previously sorted by sex, and would produce different statistics, should you be doing analyses as well as summary tables (e.g., different chi-square tests).
Missing values are NOT included in frequency tables unless the options MISSPRINT or MISSING are used. • Options are listed in the TABLES statement and appear after a slash (/). • If you use the MISSPRINT option - Missing values will be included in the table, BUT - Missing values will not be used in computing percentages. • If you use the MISSING option - Missing values will be included in the table AND - Missing values will be used in computing percentages. • If neither option is specified (default) - Missing values are not included in the table AND - Missing values are not used in computing percentages.
Week 11
Page 41 of 58
Week 11
Procedures to Summarize Data
Example Illustrating the Options for Handling Missing Data • This example uses data from a study of patients transferred from midwifery care during the course of a pregnancy. One reason patients left midwifery care was early loss. If a woman suffered an early loss, more detail about the loss was requested. Thus, for other women this question was not applicable. • Two questions of interest in this study might be: 1) What proportion of all study women had a therapeutic abortion? 2) What proportion of early pregnancy losses were therapeutic abortions? •
Recall that options are listed on a TABLES statement following a slash; the slash appears after the listing of variables.
•
The default leaves missing values out of the table altogether. The MISSPRINT option includes these in the table, so that the difference between those for whom the question is not applicable, and those with missing data can be seen.
•
Note that the WHERE statement is used in this example; this is done so that the PROC FREQ operates on a selected subset of the observations only.
Week 11
Page 42 of 58
Week 11
Procedures to Summarize Data
***********************************************************; * example to illustrate different missing options in FREQ *; * 1st define formats, including formats for missing codes *; ***********************************************************; PROC FORMAT; VALUE EARLYFMT 1='TAB' 2='SAB <11 WEEKS' 3='SAB 12-19 WEEKS' 4='SAB 20-24 WEEKS' .='MISSING' .N='NOT APPLICABLE'; RUN; ** with and without missprint and missing options **; PROC FREQ DATA=CNMT; FORMAT EARLYLOS EARLYFMT.; TABLES EARLYLOS; TITLE1 'CNM TRANSFER STUDY'; TITLE2 ‘DEFAULT OPTION FOR MISSING VALUES’; RUN; PROC FREQ DATA=CNMT; FORMAT EARLYLOS EARLYFMT.; TABLES EARLYLOS / MISSPRINT; TITLE2 ‘MISSPRINT OPTION’; RUN; PROC FREQ DATA=CNMT; FORMAT EARLYLOS EARLYFMT.; TABLES EARLYLOS / MISSING; TITLE2 ‘MISSING OPTION'; RUN; PROC FREQ DATA=CNMT; WHERE EARLYLOS NE .N; FORMAT EARLYLOS EARLYFMT.; TABLES EARLYLOS / MISSING; TITLE2 ‘MISSING OPTION'; TITLE3 ‘WHERE used to select applicable cases only’; RUN;
Week 11
Page 43 of 58
Week 11
Procedures to Summarize Data
CNM TRANSFER STUDY DEfAULT OPTION FOR MISSING VALUES EARLY PREGNANCY LOSS
EARLYLOS
TAB SAB <11 WEEKS SAB 12-19 WEEKS SAB 20-24 WEEKS
Frequency 11 16 19 10
Cumulative
Percent
19.6 28.6 33.9 17.9
Cumulative
Frequency 11 27 46 56
Percent
19.6 48.2 82.1 100.0
Frequency Missing = 286
CNM TRANSFER STUDY MISSPRINT OPTION EARLY PREGNANCY LOSS
EARLYLOS
MISSING NOT APPLICABLE TAB SAB <11 WEEKS SAB 12-19 WEEKS SAB 20-24 WEEKS
Frequency 6 280 11 16 19 10
Percent . . 19.6 28.6 33.9 17.9
Cumulative
Frequency . . 11 27 46 56
Cumulative
Percent . . 19.6 48.2 82.1 100.0
Frequency Missing = 286
Week 11
Page 44 of 58
Week 11
Procedures to Summarize Data
CNM TRANSFER STUDY MISSING OPTION EARLY PREGNANCY LOSS Cumulative Cumulative EARLYLOS Frequency Percent Frequency Percent MISSING 6 1.8 6 1.8 NOT APPLICABLE 280 81.9 286 83.6 TAB 11 3.2 297 86.8 SAB <11 WEEKS 16 4.7 313 91.5 SAB 12-19 WEEKS 19 5.6 332 97.1 SAB 20-24 WEEKS 10 2.9 342 100.0 CNM TRANSFER STUDY MISSING OPTION WHERE used to select applicable cases only EARLY PREGNANCY LOSS EARLYLOS MISSING TAB SAB <11 WEEKS SAB 12-19 WEEKS SAB 20-24 WEEKS
Frequency 6 11 16 19 10
Percent 9.7 17.7 25.8 30.7 16.1
Cumulative Frequency 6 17 33 52 62
Cumulative Percent 9.7 27.4 53.2 83.9 100.0
• In the first two tables the computed percentages are identical, but use of the MISSPRINT option distinguishes the not applicable (did not have an early loss) from those with missing information. From these we can see, for example that of the patients with an early loss of known type, 19.6 percent had therapeutic abortions (TAB). • The third table uses the MISSING option, which includes missing values in computation of percentages. From this table we can see that of all the transferred patients, 3.2 percent had therapeutic abortions. • The final table uses the WHERE statement as well as the MISSING option, to create a table only on patients with applicable data – in this case only those with early loss. Since 6 patients had early loss of unknown type or age, these can now be included in the table to show that among the patients with
Week 11
Page 45 of 58
Week 11
Procedures to Summarize Data
an early loss, 9.7% were of unknown type/age, and that among all patients with early loss, 17.7% were known to have therapeutic abortions. • TIP - The availability of different options is useful, depending on what you want to know.
There are other options to control what is printed in the tables • NOCUM suppresses printing of cumulative frequencies and percentages. This is especially appropriate for nominal data, inasmuch as cumulative frequencies and cumulative percentages don’t make much sense. •
NOFREQ suppresses printing of cell counts, NOROW, NOCOL, and NOPERCENT suppress row, column and overall cell percentages, respectively.
Example Illustrating the Creation of Cross-Tabulations • The first example produces two tables, trimester by age group, and trimester by earlylos. • The second TABLES statement produces separate tables of age group by earlylos for each level of trimester. Note that there is no table for the third trimester, since by definition no one in the third trimester can have an early loss. • This example illustrates choosing to suppress percentagles. ** Example Crosstabulations **; PROC FORMAT; VALUE AGEFMT 1='<15 YRS' 2='15-17 YRS' 3='18-19 YRS' 4='20+ YRS'; VALUE EARLYFMT 1='TAB' 2='SAB <11 WEEKS' 3='SAB 12-19 WEEKS' 4='SAB 20-24 WEEKS' .='MISSING' Week 11
Page 46 of 58
Week 11
Procedures to Summarize Data .N='NOT APPLICABLE';
RUN; ** cross-tabulations **; PROC FREQ DATA=CNMT; FORMAT AGEGR AGEFMT. EARLYLOS EARLYFMT.; *
look at trimester by 2 different factors *; TABLES TRIMES * (AGEGR EARLYLOS);
*
look at tables of age group by earlylos, separately for each trimester *; TABLES TRIMES * AGEGR * EARLYLOS / NOPERCENT NOROW NOCOL; TITLE1 ‘CNM TRANSFER STUDY’; RUN; CNM TRANSFER STUDY TABLE OF TRIMES BY AGEGR TRIMES(TRIMESTER OF PREGNANCY AT TRANSFER) AGEGR(PATIENT AGE AT TRANSFER) Frequency| Percent | Row Pct | Col Pct |<15 YRS |15-17 YR|18-19 YR|20+ YRS | | |S |S | | Total ---------+--------+--------+--------+--------+ 1 | 0 | 9 | 7 | 3 | 19 | 0.00 | 5.20 | 4.05 | 1.73 | 10.98 | 0.00 | 47.37 | 36.84 | 15.79 | | 0.00 | 14.29 | 13.73 | 6.25 | ---------+--------+--------+--------+--------+ 2 | 8 | 19 | 23 | 20 | 70 | 4.62 | 10.98 | 13.29 | 11.56 | 40.46 | 11.43 | 27.14 | 32.86 | 28.57 | | 72.73 | 30.16 | 45.10 | 41.67 | ---------+--------+--------+--------+--------+ 3 | 3 | 35 | 21 | 25 | 84 | 1.73 | 20.23 | 12.14 | 14.45 | 48.55 | 3.57 | 41.67 | 25.00 | 29.76 | | 27.27 | 55.56 | 41.18 | 52.08 | ---------+--------+--------+--------+--------+ Total 11 63 51 48 173 6.36 36.42 29.48 27.75 100.00 Frequency Missing = 169
Week 11
Page 47 of 58
Week 11
Procedures to Summarize Data
TABLE OF TRIMES BY EARLYLOS TRIMES(TRIMESTER OF PREGNANCY AT TRANSFER) EARLYLOS(EARLY PREGNANCY LOSS) Frequency| Percent | Row Pct | Col Pct |TAB |SAB <11 |SAB 12-1|SAB 20-2| | |WEEKS |9 WEEKS |4 WEEKS | Total ---------+--------+--------+--------+--------+ 1 | 3 | 12 | 5 | 0 | 20 | 6.25 | 25.00 | 10.42 | 0.00 | 41.67 | 15.00 | 60.00 | 25.00 | 0.00 | | 50.00 | 92.31 | 26.32 | 0.00 | ---------+--------+--------+--------+--------+ 2 | 3 | 1 | 14 | 10 | 28 | 6.25 | 2.08 | 29.17 | 20.83 | 58.33 | 10.71 | 3.57 | 50.00 | 35.71 | | 50.00 | 7.69 | 73.68 | 100.00 | ---------+--------+--------+--------+--------+ 3 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | . | . | . | . | | 0.00 | 0.00 | 0.00 | 0.00 | ---------+--------+--------+--------+--------+ Total 6 13 19 10 48 12.50 27.08 39.58 20.83 100.00 Frequency Missing = 294
Week 11
Page 48 of 58
Week 11 CNM TRANSFER STUDY TABLE 1 OF AGEGR BY EARLYLOS CONTROLLING FOR TRIMES=1 AGEGR(PATIENT AGE AT TRANSFER)
Procedures to Summarize Data
EARLYLOS(EARLY PREGNANCY LOSS)
Frequency |TAB |SAB <11 |SAB 12-1|SAB 20-2| | |WEEKS |9 WEEKS |4 WEEKS | ----------+--------+--------+--------+--------+ <15 YRS | 0 | 0 | 0 | 0 | ----------+--------+--------+--------+--------+ 15-17 YRS | 1 | 6 | 0 | 0 | ----------+--------+--------+--------+--------+ 18-19 YRS | 2 | 2 | 1 | 0 | ----------+--------+--------+--------+--------+ 20+ YRS | 0 | 1 | 0 | 0 | ----------+--------+--------+--------+--------+ Total 3 9 1 0
Total 0 7 5 1 13
Frequency Missing = 15 TABLE 2 OF AGEGR BY EARLYLOS CONTROLLING FOR TRIMES=2 AGEGR(PATIENT AGE AT TRANSFER)
EARLYLOS(EARLY PREGNANCY LOSS)
Frequency |TAB |SAB <11 |SAB 12-1|SAB 20-2| | |WEEKS |9 WEEKS |4 WEEKS | ----------+--------+--------+--------+--------+ <15 YRS | 1 | 0 | 2 | 1 | ----------+--------+--------+--------+--------+ 15-17 YRS | 2 | 0 | 6 | 0 | ----------+--------+--------+--------+--------+ 18-19 YRS | 0 | 1 | 3 | 0 | ----------+--------+--------+--------+--------+ 20+ YRS | 0 | 0 | 1 | 1 | ----------+--------+--------+--------+--------+ Total 3 1 12 2
Total 4 8 4 2 18
Frequency Missing = 77 TABLE 3 OF AGEGR BY EARLYLOS CONTROLLING FOR TRIMES=3 Effective Sample Size = 0 Frequency Missing = 127
Week 11
Page 49 of 58
Week 11
Procedures to Summarize Data
4. How to Print Forms: PROC FORMS Proc FORMS can be used to print mailing labels, file cards – any printer forms that have a regular pattern, and that use information stored in SAS data sets. • Beware! It may take some trial and error (with settings) to obtain the exact look (e.g. spacing) that you are after. In some instances, it may be easier to export your data from SAS to ACCESS, and use the special features that allow you to specify the mailing label format name, so spacing is defined for you. • There are options within the procedure to define the form dimensions, spacing, indentation, number of units per page, and more. An example for printing mailing labels is shown. • An alternative approach to producing “forms” is to use a PUT statement in a DATA _NULL_; step . When you are printing combinations of text and variable values, this can sometimes be easier to use. The PROC FORMS statement names the data file to use, followed by a series of options to control page and line size, number of forms to print down and across a page, number of lines to skip between forms (in this example, SKIP=2), as well as others options. • LINE statements give a line number, followed by the variable names to be printed on the line. Options for printing follow the slash (/). This example uses the LASTNAME feature to reorder a name given as last, first. The option puts the text after a comma first, followed by the text before the comma, and the comma is not printed. Alternatively, the first and last names could have been read in as separate variables. • The option PACK removes extra spaces that would appear, if a character variable doesn’t use all of the available character variable length – as shown in the second example, without the options. • Note that the statement TITLE1; must be used to remove all titles. Week 11
Page 50 of 58
Week 11
Procedures to Summarize Data
** PROC FORMS EXAMPLE **; ** read in ASCII data **; data mail; input sid name & $20. addr1 & $25. addr2 & $25. zip; cards; 01 Johnson, Lee. R. P.O. Box 243, Montgomery, Al 36113 02 Abbott, Brenda K. 568 Trillion Ct., Denver, CO 80237 03 Rodriquez, Juan 619 Powell Dr., Charleston, SC 29412 04 Stevenson, Mary K. 22 Meredith Blvd., Austin, TX 78702 05 Hawks, Patrick E. Rt. 1, Box 523, Taylorsville, NC 28681 06 Lee, Chen 123 Maple St. Raleigh, NC 27606 07 Weinstein, Joseph M. Rt. 4, Box 466, Dixon, IL 61021 08 Baskowshi, Bonnie G. P.O. Box 42, Sacramento, CA 85841 ; RUN; * print subject list *; proc print data=mail; id sid; var name addr1 addr2 zip; title1 'List of subjects'; RUN; * proc forms with options *; proc forms data=mail skip=2; line 1 name / lastname; line 2 addr1 ; line 3 addr2 zip/pack; title1; run; * without options *; proc forms data=mail; line 1 name ; line 2 addr1 ; line 3 addr2 zip; run;
Week 11
Page 51 of 58
Week 11
Procedures to Summarize Data
List of subjects sid 1 2 3 4 5 6 7 8
name
addr1
addr2
Johnson, Lee. R. Abbott, Brenda K. Rodriquez, Juan Stevenson, Mary K. Hawks, Patrick E. Lee, Chen Weinstein, Joseph M. Baskowshi, Bonnie G.
P.O. Box 243, 568 Trillion Ct., 619 Powell Dr., 22 Meredith Blvd., Rt. 1, Box 523, 123 Maple St. Rt. 4, Box 466, P.O. Box 42,
Montgomery, Al Denver, CO Charleston, SC Austin, TX Taylorsville, NC Raleigh, NC Dixon, IL Sacramento, CA
zip 36113 80237 29412 78702 28681 27606 61021 85841
(First example using options) Lee. R. Johnson P.O. Box 243, Montgomery, Al 36113
Brenda K. Abbott 568 Trillion Ct., Denver, CO 80237
Juan Rodriquez 619 Powell Dr., Charleston, SC 29412
Mary K. Stevenson 22 Meredith Blvd., Austin, TX 78702
Patrick E. Hawks Rt. 1, Box 523, Taylorsville, NC 28681
Chen Lee 123 Maple St. Raleigh, NC 27606
Joseph M. Weinstein Rt. 4, Box 466, Dixon, IL 61021
Bonnie G. Baskowshi P.O. Box 42, Sacramento, CA 85841
Week 11
Page 52 of 58
Week 11
Procedures to Summarize Data
(2nd example without options) Johnson, Lee. R. P.O. Box 243, Montgomery, Al Abbott, Brenda K. 568 Trillion Ct., Denver, CO Rodriquez, Juan 619 Powell Dr., Charleston, SC Stevenson, Mary K. 22 Meredith Blvd., Austin, TX Hawks, Patrick E. Rt. 1, Box 523, Taylorsville, NC Lee, Chen 123 Maple St. Raleigh, NC Weinstein, Joseph M. Rt. 4, Box 466, Dixon, IL Baskowshi, Bonnie G. P.O. Box 42, Sacramento, CA
Week 11
36113
80237
29412
78702
28681
27606
61021
85841
Page 53 of 58
Week 11
Procedures to Summarize Data
5. How to Use PROC REPORT
PROC REPORT is another PROC that is worth the time and effort required to learn. • When used well, reports can be generated directly as SAS output that require little or no further editing before presentation. • PROC REPORT encompasses many of the features of Procs PRINT, MEANS and TABULATE. • Its features allow the presentation of detail (individual observations) and summary data, incorporated into an organized report. • The flexibility in reporting using PROC REPORT is a little better than that for PROC TABULATE. However, both save later “cut and paste” work to create a project document. • TIP – Consider using this procedure when you need to generate regular, multiple status reports. PROC REPORT is handy inasmuch as, once a report has been designed, and the programming statements saved, it can easily be re-generated. • PROC REPORT is also nice for having features that allow comprehensive control over fonts, background and text colors (or output text or html files) as well as the controls it provides over spacing, page breaks and other formatting tools. PROC FORMAT also has features akin to a PUT statement (LINE in Proc Report); this allows you to insert your own text in the report.
Week 11
Page 54 of 58
Week 11
Procedures to Summarize Data
When using PROC REPORT, plan in advance the layout of your report. • A report’s layout is largely determined by the designation of variables into various categories. • For each variable, a DEFINE statement is used to designate the display category for the report:
DISPLAY – A row appears for every observation for variables with this designation. By default, all variables are considered DISPLAY variables, unless another designation is given.
ORDER – A row appears for every observation for variables with this designation, as a display variable, but this designation will order by value.
ACROSS – Variables with this designation determine columns for the report – one for each distinct value of the variable present in the input data. ACROSS variables are comparable to CLASS variables in other procedures, but used to define columns in PROC REPORT.
GROUP – This designation groups on variable values to determine rows in the report, akin to using a CLASS variable in other procedures.
ANALYSIS – Numeric variables that are used for computation of summary statistics, for each cell of a report produced by ACROSS by GROUP designations. Example
• This example uses data from the study of change in functional status, 6 months post-cardiac catheterization among diabetic and non-diabetic patients. This study was described in the section on PROC TABULATE (see, p 27) .
• The example is in three parts. Each demonstrates a few new features. Part 1 - Prints means for physical function change scores by sex by age group by diabetic status. Part 2 - Adds formatting features and more summary statistics, and Part 3 - Displays other statistics and t-test results.
Week 11
Page 55 of 58
Week 11
Procedures to Summarize Data
* example part 1 *; * use NOWD to suppress windows, headskip to skip line after header row*; PROC REPORT DATA=FD2 NOWD HEADSKIP; * define columns for report *; * nominal vars: diab, age group and sex *; * numeric var: change in pf score *; COLUMN DIAB AGEGROUP SEX PF2_1 ; * define nominal variables to create rows, assign formats *; DEFINE DIAB / GROUP FORMAT=DFMT.; DEFINE AGEGROUP / GROUP ; DEFINE SEX / GROUP FORMAT=$SEXF.; * define statistic to print for pf change as the mean *; * assign title to column *; DEFINE PF2_1 / ANALYSIS MEAN 'PF CHANGE MEAN'; TITLE1 'EXAMPLE USING PROC REPORT'; RUN; EXAMPLE USING PROC REPORT
DIAB Diabetic
AGEG ROUP <65 >=65
Nondiabetic
<65 >=65
Week 11
SEX Female Male Female Male Female Male Female Male
PF CHANGE MEAN 1.4254753 1.7350271 0.7417697 0.722176 5.9010155 5.5899143 1.5194106 2.5627356
Page 56 of 58
Week 11
Procedures to Summarize Data
* EXAMPLE part 2: ADDING FORMATTING FEATURES and more statistics *; PROC REPORT DATA=FD2 NOWD HEADSKIP; * Define several columns for pf change, to print several statistics *; COLUMN DIAB AGEGROUP SEX PF2_1=PFN PF2_1=PFMEAN PF2_1=PFMiN PF2_1=PFMAX; * define nominal variables adding column labels and formats *; DEFINE DIAB / GROUP 'DIABETIC STATUS' FORMAT=DFMT.; DEFINE AGEGROUP / GROUP 'AGE GROUP' FORMAT=$6.; DEFINE SEX / GROUP FORMAT=$SEXF.; * define each of the statistics to use for PF change *; * and give column header and format *; DEFINE PFN / ANALYSIS N 'N' FORMAT=3.; DEFINE PFMEAN / ANALYSIS MEAN 'PF CHANGE MEAN' FORMAT=6.1; DEFINE PFMIN / ANALYSIS MIN 'PF CHANGE MIN' FORMAT=6.1; DEFINE PFMAX / ANALYSIS MAX 'PF CHANGE MAX' FORMAT=6.1; * create line breaks after each group *; BREAK AFTER AGEGROUP / SKIP SUPPRESS; BREAK AFTER DIAB / SUPPRESS SKIP; TITLE2 'ADDING FORMATTING FEATURES AND MULTIPLE STATISTICS'; RUN;
EXAMPLE USING PROC REPORT ADDING FORMATTING FEATURES AND MULTIPLE STATISTICS
DIABETIC STATUS Diabetic
Nondiabetic
Week 11
AGE GROUP
SEX
N
PF CHANGE MEAN
PF CHANGE MIN
PF CHANGE MAX
<65
Female Male
34 45
1.4 1.7
-28.1 -20.8
27.3 37.0
>=65
Female Male
40 58
0.7 0.7
-20.5 -27.7
30.8 36.4
<65
Female Male
92 253
5.9 5.6
-29.3 -28.4
41.1 38.0
>=65
Female Male
100 187
1.5 2.6
-24.2 -28.0
25.4 36.3
Page 57 of 58
Week 11
Procedures to Summarize Data
** EXAMPLE part 3:
REPORTING TEST STATISTICS **;
PROC REPORT DATA=FD2 NOWD HEADSKIP; ** Define column variables, including several columns for pf change *; COLUMN DIAB AGEGROUP SEX PF2_1=PFN PF2_1=PFMEAN PF2_1=PFSTDERR PF2_1=PFTEST; ** define nominal DEFINE DIAB / DEFINE AGEGROUP / DEFINE SEX /
vars with column labels and formats **; GROUP 'DIABETIC STATUS' FORMAT=DFMT.; GROUP 'AGE GROUP' FORMAT=$6.; GROUP FORMAT=$SEXF.;
** define statistics for pf change, **; ** including test that mean change is different from 0 **; DEFINE PFN / ANALYSIS N 'N' FORMAT=3.; DEFINE PFMEAN / ANALYSIS MEAN 'PF CHANGE MEAN' FORMAT=6.1; DEFINE PFSTDERR / ANALYSIS STDERR 'STD ERROR PF CHANGE' FORMAT=6.1; DEFINE PFTEST / ANALYSIS PRT 'P-VALUE: MEAN=0' FORMAT=8.4; BREAK AFTER AGEGROUP / SKIP SUPPRESS; BREAK AFTER DIAB / SUPPRESS SKIP; TITLE3 'ADDING TEST OF MEAN CHANGE DIFFERENT FROM ZERO'; RUN; EXAMPLE USING PROC REPORT ADDING FORMATTING FEATURES AND MULTIPLE STATISTICS ADDING TEST OF MEAN CHANGE DIFFERENT FROM ZERO
DIABETIC STATUS Diabetic
Nondiabetic
Week 11
AGE GROUP
SEX
N
PF CHANGE MEAN
STD ERROR PF CHANGE
P-VALUE: MEAN=0
<65
Female Male
34 45
1.4 1.7
2.4 1.6
0.5522 0.2906
>=65
Female Male
40 58
0.7 0.7
1.8 1.6
0.6876 0.6521
<65
Female Male
92 253
5.9 5.6
1.4 0.8
0.0001 0.0001
>=65
Female Male
100 187
1.5 2.6
1.1 0.9
0.1602 0.0039
Page 58 of 58