SAS M2007 Data Mining Conference October 1-2, 2007
Data Preparation for Data Mining in Health Care using SAS
Las Vegas
S. Greg Potts, MBA 1 ©2007 Arkansas Foundation for Medical Care, Inc.
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
Introduction
Data mining practitioners are well aware that most of the total effort required to complete a data mining project is not spent in the “trendier” aspects of the project such as problem definition or algorithm/technique selection, application, and interpretation of the results.
Data Preparation for Data Mining in Health Care using SAS
Unfortunately, most time – up to 80% often cited – is spent “in the trenches” acquiring the data (i.e., most business data today is stored in transactional data warehouses where data elements essential for mining are dispersed across multiple tables), getting to know the data (i.e., conducting exploratory data analysis), and preparing the data mining table (i.e., summarizing data to the “unit of analysis” and creating derived variables to be used as targets and inputs in the modeling analysis) in the form required by the data mining algorithm (i.e., most current algorithms require data in the form of a a one-row-per-subject data table).
Contents
This presentation will present two case studies in using SAS to extract and prepare data for data mining. The first case will explore how to extract and prepare transactional (Medicaid claims) data for directed data mining, where the goal is to explain or predict the value(s) of a particular target variable. The second case will explore data extraction and preparation for undirected data mining (cluster analysis) using hospital-level data supplied to Medicare Quality Improvement Organizations (QIOs). ©2007 Arkansas Foundation for Medical Care, Inc.
<< Previous | Next >> AFMC and QI in Medicare and Medicaid . . . . . . . . . 3 AFMC as QIO and EQRO . . . . . . . . . . . . . . .4 Data Preparation for Directed Data Mining. . . . 7 Data Preparation for Undirected Data Mining .17 References . . . . . . . . .29
2
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
1
AFMC and QI in Medicare and Medicaid • Medicare provides health insurance for people age 65 and over, those with permanent kidney failure and certain people with disabilities (more than 400,000 individuals in Arkansas)
Data Preparation for Data Mining in Health Care using SAS << Previous | Next >>
• Medicaid is a jointly-funded, Federal-State medical assistance program for certain low income and needy people (more than 600,000 individuals in Arkansas).
3 ©2007 Arkansas Foundation for Medical Care, Inc.
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
AFMC as QIO and EQRO • Centers for Medicare & Medicaid Services (CMS) designated Quality Improvement Organization (QIO) for the state of Arkansas. - Assist providers (Hospitals, Physicians, Nursing Homes, etc.) with measuring and reporting quality measures and redesigning care processes; provide statistical support and assistance in interpreting data results.
Data Preparation for Data Mining in Health Care using SAS << Previous | Next >>
• External Quality Review Organization (EQRO-like) & Review Agent for the Arkansas Medicaid Program. - Prior Authorization Reviews, Retrospective Reviews, HEDIS measures, Patient Satisfaction Surveys, and Data Mining. • Multi-disciplinary team of clinicians, statisticians, and consultants • Goal: To ensure that everyone receives the right care, at the right place, at the right time – every time.
4 ©2007 Arkansas Foundation for Medical Care, Inc.
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
2
AFMC’s Data Mining Team Subject Matter/Policy Experts
Data Experts
Data Preparation for Data Mining in Health Care using SAS
William E. Golden, MD, FACP Vice President of Quality Improvement; UAMS Professor of Medicine and Public Health
Clayton Wells, MS
<< Previous | Next >>
Medicaid Data Analysis Manager S. Greg Potts, MBA
Nena Sanchez, MS Vice President, Medicare & Medicaid Operations
Data Mining Team Leader
AFMC’s Data Mining Team combines clinical, policy, and data experts.
Donna West, RPh, PhD Consultant UAMS Associate Professor of Pharmacy
5 SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
©2007 Arkansas Foundation for Medical Care, Inc.
Today’s Health Care Environment Data Preparation for Data Mining in Health Care using SAS
EMPLOYER/CONSUMER DEMANDS FOR ACCOUNTABILITY & TRANSPARENCY
RISING COSTS $$$$$$$$$
EXPLOSION IN CLINICAL KNOWLEDGE
QUALITY
WIDE VARIATIONS IN QUALITY
<< Previous | Next >>
Many challenges exist in today’s dynamic health care environment.
IMPROVEMENT
INDUSTRY FRAGMENTATION
PROVIDER/PAYER DATA SILOS
6 ©2007 Arkansas Foundation for Medical Care, Inc.
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
3
Data Preparation for Data Mining in Health Care using SAS
Data Preparation for Directed Data Mining
<< Previous | Next >>
7 ©2007 Arkansas Foundation for Medical Care, Inc.
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
Directed Data Mining in Medicaid • AFMC Medicaid Data Mining Projects are often conducted with an eye toward identifying high cost drivers for utilization review and/or cost containment or to identify care coordination inefficiencies, which can be opportunities for quality improvement.
Data Preparation for Data Mining in Health Care using SAS << Previous | Next >>
• Projects are client and/or literature-based. • In directed data mining, the goal is to explain or predict the value(s) of a target variable. • Recipient/Member is the often the “Unit of Analysis”. Target is (usually) total costs ($) per member while inputs are (usually) binary and represent diagnosis, procedure, drug, and provider type code classes. • Techniques used include decision trees and (infrequently) regression.
8 ©2007 Arkansas Foundation for Medical Care, Inc.
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
4
Medicaid Data Source • Arkansas Medicaid Decision Support System (DSS) Data Warehouse - contains over 6 years of historical medical claims data among 500+ data tables (some containing millions of rows of data) at granular levels of detail along with eligibility and demographic data.
Data Preparation for Data Mining in Health Care using SAS << Previous | Next >>
Claims Analysis Columns
Drug Detail Columns
NDC Code Drug Nam Drug Class ……..
Clm Num Dtl Num Recip ID Amt Paid NDC Code
Eligibility Columns
Recip ID Plan Code Elig Curr Elig Beg Elig End …….
Enrollment Columns
Recip ID County DOB Gender Race …… 9
©2007 Arkansas Foundation for Medical Care, Inc.
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
Data Acquisition: Medicaid Data Queries • Access Oracle® Data Warehouse through BusinessObjects® to learn table relationships and build Oracle® SQL queries when necessary. • AFMC licenses SAS/ACCESS® Interface to ODBC. • Established a Windows® ODBC connection with Oracle® Data Warehouse. • Copy and Paste Oracle® SQL code into SAS program and execute.
• Advantages: 1) Pulls data back in SAS Data set and, 2) Bypasses BusinessObjects’® query size limitations. ©2007 Arkansas Foundation for Medical Care, Inc.
Data Preparation for Data Mining in Health Care using SAS << Previous | Next >>
Data is acquired via SQL queries.
10 SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
5
Medicaid Transactional Claims Data Example Recipient ID
Clm Num
DTL Num
Prov Type
Dx
Proc
Amt_ Paid
001
050600407
1
Phys
250.00
99212
$25.00
…
001
050600407
2
Phys
640.01
81000
$43.38
…
002
050600408
1
Pharm
002
050600409
1
Phys
250.10
99212
$45.13
…
003
050600410
1
Phys
427.0
A0426
$240.46
…
003
050600411
2
Phys
427.0
A0390
$225.72
…
$106.45
NDC Code
Data Preparation for Data Mining in Health Care using SAS
00406035705
Drug Class
280808
…
<< Previous | Next >>
…
• Multiple rows of claim detail data make up one claim per recipient/member. • Challenge is to summarize data to the recipient level and to create target and input variables to be used in modeling analysis. 11 ©2007 Arkansas Foundation for Medical Care, Inc.
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
SAS Procedures for Transactional Data Preparation – Summing Costs • Use PROC SQL to summarize paid amounts (AMT_PAID) to recipientlevel and to break claim costs out by medical vs. pharmacy.
Data Preparation for Data Mining in Health Care using SAS << Previous | Next >>
• Total Costs (target) are calculated from the variables created in these tables using a DATA step later in the data prep program.
12 ©2007 Arkansas Foundation for Medical Care, Inc.
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
6
SAS Procedures for Transactional Data Preparation – Creating Binary Inputs • Use PROC SQL to create table of recipient IDs with non-missing Dx codes.
Data Preparation for Data Mining in Health Care using SAS << Previous | Next >>
As shown, a number of SAS procedures and language statements are needed to transform transactional data into a one-row-per unit-of-analysis format suitable for directed data mining techniques.
• Use FIRST. by-processing to create an enumeration variable.
• Sort Data Set • Convert sorted data from rows to columns using PROC TRANSPOSE
• Array to create binary input variables (0,1) for code classes.
• Repeat process for Procedure Class, Drug Class, and Provider Types
13
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
©2007 Arkansas Foundation for Medical Care, Inc.
Sample Directed Data Mining Modeling Data Set
Data Preparation for Data Mining in Health Care using SAS
Total Cost = f (Clinical Diagnoses, Procedures, Prescription Drugs, Providers Seen, Age, Gender) per Recipient
<< Previous | Next >> Input Variables
Target Variable
Total SFY 2004 Costs
Dx Group1
Dx Group2
Recipient ID
(001-139 Infectious & Parasitic Diseases)
(140-239 Neoplasms)
…
001
$15,232.84
1
1
…
1
…
1
…
002
$2,006.72
1
0
…
1
…
0
…
003
$8,354.89
0
1
…
1
…
1
…
.
.
.
.
…
.
…
.
…
.
.
.
.
…
.
…
.
…
Procedure Code Group1
…
(00100-01999 Anesthesiology)
Thera Class1
…
(04:00.00 Antihistamines)
Mining data set contains both Interval (i.e., Total Costs) and binary variables (i.e., DX_GROUP_1)
14 ©2007 Arkansas Foundation for Medical Care, Inc.
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
7
Why use Decision Trees?
Data Preparation for Data Mining in Health Care using SAS
• Excellent Tool for Data Mining - Profiling • Allow you to easily see patterns in data with respect to a target variable (i.e., Interval – total cost $ per recipient).
<< Previous | Next >>
• Decision Tree algorithm “reads” the data and determines the best variable on which to “split” the data. • Splits continue as long as they are statistically significant.
15 ©2007 Arkansas Foundation for Medical Care, Inc.
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
Sample Decision Tree Results SFY 2004 Continuously-Enrolled Medicaid Recipients with a Diabetes Dx (250.0X-250.9X) (Partial Tree Results)
Data Preparation for Data Mining in Health Care using SAS << Previous | Next >>
This group of recipients with a Diabetes Dx accrued average total costs more than three times other recipients with the same diagnosis. Why? Child nodes from decision tree may require more drill-down analysis or model tuning in the form of variable reduction and reapplication. ©2007 Arkansas Foundation for Medical Care, Inc.
16
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
8
Data Preparation for Data Mining in Health Care using SAS
Data Preparation for Undirected Data Mining
<< Previous | Next >>
17 ©2007 Arkansas Foundation for Medical Care, Inc.
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
Undirected Data Mining in Medicare • AFMC Medicare Data Mining Projects are often conducted with an eye toward identifying opportunities for quality improvement and safeguarding Medicare program funds.
Data Preparation for Data Mining in Health Care using SAS << Previous | Next >>
• In undirected data mining, the goal is to uncover the hidden structure in data without respect to a target variable. • Cluster analysis (PROC FASTCLUS) is technique often used.
18 ©2007 Arkansas Foundation for Medical Care, Inc.
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
9
Hospital Payment Monitoring Program (HPMP) • The Centers for Medicare and Medicaid Services (CMS) developed the Hospital Payment Monitoring Program (HPMP) primarily to calculate and monitor the Medicare fee-for-service paid claims error rate for inpatient acutecare hospital services.
Data Preparation for Data Mining in Health Care using SAS << Previous | Next >>
HPMP is a QIO Statement of Work (SOW) priority.
• Under contracts with CMS, several companies – including QIOs like AFMC – are responsible for operating the HPMP.
19 ©2007 Arkansas Foundation for Medical Care, Inc.
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
HPMP, QIOs, and Reporting • Each quarter, QIOs receive state hospital-level data containing the calculated paid claims rates and other summary measures for 14 different target areas identified by CMS as prone to payment errors. This quarterly report is known as the First-Look Analysis Tool for Hospital Outlier Monitoring (FATHOM).
Data Preparation for Data Mining in Health Care using SAS << Previous | Next >>
Example target area paid claims rate: Target 12: Three Day Transfer to SNF Numerator: count of discharges to a SNF with a three-day length of stay Denominator: count of all discharges to a SNF or swing bed • From the FATHOM data, QIOs can generate and distribute the Program for Evaluating Payment Patterns Electronic Report (PEPPER). The PEPPERs are hospital-specific reports that allow Inpatient Prospective Payment System (IPPS) hospitals to compare their own billing practices in the 14 target areas with other IPPS hospitals within the state. • Each QIO uses these data tools to work with hospitals in their state to reduce improper admissions and Diagnosis-Related Group (DRG) payment errors. ©2007 Arkansas Foundation for Medical Care, Inc.
20
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
10
Resource Challenge and Answer • With 50+ IPPS Arkansas hospitals and limited staff, AFMC desired to find a way to identify hospitals that have been “extreme outliers” (95th percentile or above on two measures) over time, thus indicating a need for close monitoring and possible notification.
Data Preparation for Data Mining in Health Care using SAS << Previous | Next >>
• Answer: Cluster Analysis to segment hospitals into 3 “like” groups based on 5 of the 11 hospital/target-level calculated measures in the data (see right), then graphing the results of two of the 5 measures to determine “extreme outliers”.
21 ©2007 Arkansas Foundation for Medical Care, Inc.
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
Data Acquisition: Medicare Hospital-Level Data
Data Preparation for Data Mining in Health Care using SAS
• The FATHOM hospital-level data is provided to each QIO from another CMS/HPMP contractor in a Microsoft Access® database (*.mdb).
<< Previous | Next >>
• PROC IMPORT is used to import data table from the *.mdb file.
22 ©2007 Arkansas Foundation for Medical Care, Inc.
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
11
SAS Procedures for Data Preparation at UA-level – Standardization • Per SAS/STAT 9.1.3 Online Documentation regarding PROC FASTCLUS, Pg. 1382-1383: “Variables with larger variances exert a larger influence in calculating the clusters...Therefore it is necessary to standardize the variables before performing the cluster analysis.”
Data Preparation for Data Mining in Health Care using SAS << Previous | Next >>
• PROC STANDARD is used to standardize all variables used in the cluster analysis to mean=0, std=1 prior to invoking PROC FASTCLUS.
23 ©2007 Arkansas Foundation for Medical Care, Inc.
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
Cluster Analysis • Run PROC FASTCLUS against output data set containing standardized variables.
Data Preparation for Data Mining in Health Care using SAS << Previous | Next >>
24 ©2007 Arkansas Foundation for Medical Care, Inc.
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
12
Plot to Produce Visual Representation • Run PROC GPLOT to plot two of the five variables used in cluster analysis to determine outliers (note: highlighted syntax produces red vertical and horizontal references to the 95th percentile value for each variable being plotted):
Data Preparation for Data Mining in Health Care using SAS << Previous | Next >>
25 ©2007 Arkansas Foundation for Medical Care, Inc.
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
Plot Results
Data Preparation for Data Mining in Health Care using SAS << Previous | Next >>
• (Standardized) Variables used in cluster analysis being plotted are: - Outlier Value: A single number from -10 to 10, describing how unusual a hospital is compared to all IPPS hospitals in the state. - Outlier Times Count: Outlier value weighted by number of discharges. This measure captures both the unusualness of the hospitals’ target outlier value and the volume of target discharges. • Outliers facilities are those facilities that consistently fall in the upper-right-most quadrant (>95th percentile on BOTH measures). 26 ©2007 Arkansas Foundation for Medical Care, Inc.
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
13
Sample Letter to Hospitals Regarding Outlier Status
Data Preparation for Data Mining in Health Care using SAS << Previous | Next >>
27 ©2007 Arkansas Foundation for Medical Care, Inc.
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
Conclusions • Most time spent in a Data Mining project is spent acquiring and preparing data, not on algorithm/technique selection and application.
Data Preparation for Data Mining in Health Care using SAS << Previous | Next >>
• SAS has a vast arsenal of tools to help you acquire, prepare/transform, and mine your data.
28 ©2007 Arkansas Foundation for Medical Care, Inc.
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
14
References Data Preparation for Data Mining Using SAS Mamdouh Refaat (Morgan Kaufmann/Elsevier) http://books.elsevier.com/us/mk/us/subindex.asp?isbn=97 80123735775&country=United+States&community=mk&r ef=&mscssid=KHFKVKDF6HNU8HJND9V1RB81QR712R 2E or http://www.amazon.com/Preparation-Mining-KaufmannManagementSystems/dp/0123735777/ref=pd_bbs_sr_1/102-55832127763329?ie=UTF8&s=books&qid=1179937157&sr=1-1 Base SAS Procedures and syntax (PROC IMPORT, PROC SQL, PROC TRANSPOSE) http://support.sas.com/documentation/onlinedoc/91pdf/sas doc_913/base_proc_8977_new.pdf SAS Language Reference (Arrays) http://support.sas.com/documentation/onlinedoc/91pdf/sas doc_913/base_lrconcept_9196.pdf
Decision Trees in Enterprise Miner http://support.sas.com/documentation/onlin edoc/91pdf/sasdoc_91/em_gs_7281.pdf SAS/STAT Procedures (PROC FASTCLUS) http://support.sas.com/documentation/onlin edoc/91pdf/sasdoc_91/stat_ug_7313.pdf or http://www2.sas.com/proceedings/sugi24/ Stats/p270-24.pdf
Data Preparation for Data Mining in Health Care using SAS << Previous | Next >>
Hospital Payment Monitoring Program (HPMP) http://oig.hhs.gov/oas/reports/region3/3050 0007.pdf
29 ©2007 Arkansas Foundation for Medical Care, Inc.
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
Acknowledgements
Data Preparation for Data Mining in Health Care using SAS
• M2007 Co-Chairs: •Jerry Oglesby, SAS Institute
<< Previous | Next >>
•Goutam Chakraborty, Oklahoma State University • Rona Bellinger, AFMC Manager of Web & Graphic Services • Karen Gabel and Tori Gammill, AFMC HPMP Team Members • AFMC’s Data Mining Team
30 ©2007 Arkansas Foundation for Medical Care, Inc.
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
15
Questions??? Contact: S. Greg Potts, MBA Data Mining Team Leader Arkansas Foundation for Medical Care Office of Projects and Analysis 401 West Capitol, Suite 410 Little Rock, AR 72201 (501) 212-8734 Phone (501) 375-1201 Fax E-mail:
[email protected]
31 ©2007 Arkansas Foundation for Medical Care, Inc.
SAS M2007 Data Mining Conference | October 1-2, 2007 | Las Vegas
16