Data Management & Warehousing
Data Quality: Common Problems & Checks Date: 24 April 2009 Location: Zagreb, Croatia David M. Walker
[email protected] +44 (0) 7050 028 911 - http://www.datamgmt.com
Agenda • • • • •
Introduction Common Problems Automated Checking Profiling Checks Conclusions
24 April 2009
© 2009 Data Management & Warehousing
Page 2
Introduction • Data Quality problems are a SOURCE SYSTEM issue and an ETL issue – They just manifest themselves in the data warehouse
• Prevention is better than cure – Fixing the source system or the ETL is ALWAYS cheaper and more effective than cleaning the data in the ETL or in the Data Warehouse itself
• Data Quality is a continuous process – It is never finished and always needs to be monitored 24 April 2009
© 2009 Data Management & Warehousing
Page 3
The Impact of Poor Data Quality • Devalues the data warehouse – Discourages people from trusting or using the system and therefore curtailing the life of the data warehouse
• Highlights failings in the source system and/or the business process – Businesses would rather fix at any cost in the data warehouse and pretend that there isn’t a source system problem 24 April 2009
© 2009 Data Management & Warehousing
Page 4
Common Problems • 11 types of problem that account for the most common problems • They usually reflect poor design and/or implementation of systems • Most can be fixed or monitored and managed to limit the impact
24 April 2009
© 2009 Data Management & Warehousing
Page 5
Referential Issues • Keys that are not unique – Systems that do not enforce the unique primary key or flat files or spreadsheets – Also generated by ETL that creates a surrogate key incorrectly
• Referential Integrity Failures – Where referentially integrity is not enforced values in the child table are created that are not in the parent table 24 April 2009
© 2009 Data Management & Warehousing
Page 6
Data Type Issues • Format Errors – Typically in Date/Time type fields – 02/04/2009 2nd April (UK) or 4th Feb (US)
• Inappropriate Data Types – Storing Dates in Character Strings 20090624 as YYYYMMDD format string – But what about 20090230?
24 April 2009
© 2009 Data Management & Warehousing
Page 7
Data Model Issues • De-normalised tables – Commonly created for performance reasons – Inherently duplicates data – Often gets out of sync
• Data/Column Retirement – Upgrade to system retires a column – ETL continues to use the old column
• Poor Table/Column Naming – Don’t assume that a column does what it says – Don’t assume that a column is still being used for it’s original purpose 24 April 2009
© 2009 Data Management & Warehousing
Page 8
Data Content Issues • Null Values – Systems that have many optional fields will often have missing values – Null values allow rows to be silently omitted from queries
• Inappropriate Values – Databases allow special characters and/or leading/trailing white space – “DataspaceQuality” != “DatatabQuality” 24 April 2009
© 2009 Data Management & Warehousing
Page 9
Data Feed Issues • Missing Data – Where a stream of files are loaded by ETL if one is dropped it can go un-noticed – Common with CDR type loads in Telcos
• Late Data – A short term data quality issue – Leaves users believing there is a problem – Produces inconsistent reporting over time 24 April 2009
© 2009 Data Management & Warehousing
Page 10
Automated Checking • Regularly run checks • Broad Coverage across systems – 100s and 1000s not 10s of queries – Run in a low priority loop in the background
• Used against: – Sources – Data Staging – Data Warehouse
• No Product Required – We often implement this as a controlling shell script and lots of small scripts, one for each check 24 April 2009
© 2009 Data Management & Warehousing
Page 11
Trending • Absolute Trending – Track an expected value over time • e.g. Returned Mail is usually less than 500 items per day • If the value is <500 status is green, 501 to 1000 amber and >1000 red
• Statistical Process Control (SPC) Trending – Track an expected value where the value changes over time • • • •
24 April 2009
e.g. Telco CDRs – expect more as the company grows Don’t want to be continuously changing the threshold Compare current load to historical means If current load within 2 Standard Deviations – Green, 3 Standard Deviations – Amber, 4 or more Standard Deviations - Red © 2009 Data Management & Warehousing
Page 12
Flow Control • Flow Control – ETL manipulates data • Joins, De-duplicates, Filters, Aggregates, etc
– Use the formula: Source Count - Filtered Count – DeDup Count – Target Count = 0
• Trusted Source – Compare the result with a third system – e.g. Does the Count of Switch CDRs = Count of those processed by the billing system Count of those processed in the DWH 24 April 2009
© 2009 Data Management & Warehousing
Page 13
Business Rules Based • Specific rules to match known business rules – Account holders > 18 (Sys Date – DoB) – Account holders < 115 – Credit Card numbers are 16 digits long – Number of accounts without a status
• Result should yield Zero
24 April 2009
© 2009 Data Management & Warehousing
Page 14
Automated Checking - Ops • Managed by exception – Red given priority – Amber are always followed up
• Massive number of checks – 100’s are good – 1000’s are better
• Presentation – Alerts, RAG, Graphical, Numerical, etc. 24 April 2009
© 2009 Data Management & Warehousing
Page 15
Data Profiling Checks • Run manually because they need to be interpreted by a human • Leads to new business rules being added to the automated checks • Can be done with simple reporting tools or commercial data profiling tools
24 April 2009
© 2009 Data Management & Warehousing
Page 16
Frequency Outliers • Count discreet values in a table and check items with many more or less than normal – e.g. DoB 01-01-01 many times more common than any other value indicates source default and something that needs work – e.g. Count of SMS messages significantly lower on a given day may equate to a genuine system failure and therefore not a DQ problem 24 April 2009
© 2009 Data Management & Warehousing
Page 17
Maximum & Minimum • Determine what a valid range for any value should be – e.g. age between 18 and 115 – Immediately finds individual data quality issues that can be resolved – Allows an analyst to create new business rules to prevent future problems
24 April 2009
© 2009 Data Management & Warehousing
Page 18
Sequential Keys • If a system has a sequential key: Max Value – Min Value – Count = 0 • If this is true – is it too perfect for an operational system and therefore test data • If this is false – what has caused the gaps, are the deletions intentional? 24 April 2009
© 2009 Data Management & Warehousing
Page 19
Data Types • Validation of mis-used data types before loading – e.g. Dates in Character fields – Format: YYYYMMDD – Check MM between 01 and 12 – Check DD between 01 and 31 – Check MMDD does not include 0230, 0231 – etc. 24 April 2009
© 2009 Data Management & Warehousing
Page 20
Skewed Pattern Profiling • Looking for specific patterns in data – e.g. UK National Insurance Numbers (?) have the format AA 99 99 99 A – Pattern match all values looking for exceptions
• Number Lengths are a special case – e.g Credit Card Numbers are 16 digits long
24 April 2009
© 2009 Data Management & Warehousing
Page 21
Content Checking • Content Checking is the manual review of character strings • Needs a good understanding of the nature of the data • Often determines the need to do analysis of other types
24 April 2009
© 2009 Data Management & Warehousing
Page 22
Nulls & White Space • Nulls – Fields that have large proportion of nulls are usually not useful – Also common is default status of null (e.g. Account us either closed or null)
• White Space – Not Null fields with a single space – Tab instead of space – Leading/Trailing white space – Double White Space: “DavidSpaceSpaceWalker” 24 April 2009
© 2009 Data Management & Warehousing
Page 23
Punctuation & Control Chars • Punctuation – CSV files that are not properly quoted perform field shifts – Address lines with extra commas
• Control Characters – Data fields that contain ASCII character codes 0 to 31 and 127 to 159 are often ‘invisible’ when viewed in queries but cause failures – Also be aware of ‘code-page’ specifics 24 April 2009
© 2009 Data Management & Warehousing
Page 24
Problem Management Matrix
24 April 2009
© 2009 Data Management & Warehousing
Page 25
Continuous DQ Process
24 April 2009
© 2009 Data Management & Warehousing
Page 26
Quality is FREE … … as long as you are prepared to INVEST HEAVILY in it Philip Crosby 1980 Especially true of Data Quality 24 April 2009
© 2009 Data Management & Warehousing
Page 27
Data Management & Warehousing
Data Quality: Common Problems & Checks
Thank You David M. Walker
[email protected] +44 (0) 7050 028 911 - http://www.datamgmt.com