cohesion institute
Data Warehouse Architecture Best Practices December 5, 2005
Speaker: R. Michael Pickering President, Cohesion Systems Consulting Inc.
cohesion institute
Agenda
Introductions Business Intelligence Background Architecture Best Practices Questions & Answers
March 21, 2009
DW Architecture Best Practices
2
cohesion institute
Data Warehouse Architecture Best Practices Introductions
cohesion institute
Presenter Biography
R. Michael Pickering
President and Chief Architect, Cohesion Systems Consulting Inc.
over 8 years DW experience
previously, Managing Consultant, BI&W, Oracle Consulting (Canada) before that, Red Brick Systems, Inc. Manulife Reinsurance, Bell Canada, USDA, Kraft Foods, LCBO, Telecom Argentina, Nortel Networks, Procter & Gamble, Bayer, Syncrude, OMoHLTC…
Mr. Pickering has had DW articles published in The Handbook of Data Management
March 21, 2009
DW Architecture Best Practices
4
cohesion institute
Cohesion Systems Consulting
Provides DW and BI services, specializing in:
Architecture & Implementation Consulting Project Management Databases, Appliances & Emerging Technology Training & Mentoring
Since inception in 2000, clients have included Enbridge, CIBC, The Bank of New York, Loyalty Management Group, Canada Post Borderfree, Katz Group March 21, 2009
DW Architecture Best Practices
5
cohesion institute
Audience Survey
By a show of hands, please indicate your experience with:
normalization dimensional modeling operational data store data consolidation Extract Transform Load (ETL) metadata architecture DW appliances
March 21, 2009
DW Architecture Best Practices
6
cohesion institute
Data Warehouse Architecture Best Practices Business Intelligence Background
cohesion institute
What is Business Intelligence?
A Data Warehouse is usually one component of an overall business intelligence solution IT people may be tempted to think in terms of products and technologies BUT...
March 21, 2009
DW Architecture Best Practices
8
cohesion institute
Overarching Goal
March 21, 2009
The overarching goal of business intelligence is to provide the information necessary to MANAGE a business This means providing information in support of management decision making, which is why BI is also called “Decision Support” DW Architecture Best Practices
9
cohesion institute
BI is about “Data Abstraction” Stages (4)
wisdom knowledge information data
audience for a data warehouse typically considers higher slices of data abstraction pyramid lowest level of pyramid is too detailed & unwieldy March 21, 2009
DW Architecture Best Practices
10
cohesion institute
It’s Not Technology
Business Intelligence is about delivering business value
provide tangible benefit by answering important questions that can help the business to achieve its strategic focus
Improving profitability
Reducing cost
Who are our five most profitable clients? What are our least profitable products? Who are our lowest cost suppliers? Which materials incur highest spoilage costs?
Improving customer satisfaction
What factors may lead to lost customers?
March 21, 2009
DW Architecture Best Practices
11
cohesion institute
Business of BI
In some cases, legislation such as SarbanesOxley or Basel II makes some kind of BI fundamental to doing business Many leading companies use BI to achieve competitive advantage
E.g. Walmart, Dell, Amazon.com, Kraft, American Express, etc…
March 21, 2009
DW Architecture Best Practices
12
cohesion institute
Data Warehouse Architecture
architecture is about delivering an elegant solution that meets the solution requirements
this means really understanding the problem
DW architecture is part art, part science March 21, 2009
DW Architecture Best Practices
13
cohesion institute
Good Architecture
‘It’s not easy to describe a good design, but I’ll know it when I see it’
March 21, 2009
DW Architecture Best Practices
14
cohesion institute
BI Architecture Requirements
must recognize change as a constant take incremental development approach existing applications must continue to work need to allow more data and new types of data to be added March 21, 2009
DW Architecture Best Practices
15
cohesion institute
End User Acceptance
understandability
understandability is in the eyes of the beholder want to hide the complexity try to make it:
March 21, 2009
intuitive, obvious visible, memorable
DW Architecture Best Practices
16
cohesion institute
End User Acceptance
performance
March 21, 2009
don’t want to interrupt the thinking process provide one click, instantaneous access warehouse must be available, “production” system DW Architecture Best Practices
17
cohesion institute
Data Warehouse Architecture Best Practices Architecture Best Practices
cohesion institute
High Level Architecture
remember the different “worlds”
on-line transaction processing (OLTP) business intelligence systems (BIS)
users are different data content is different data structures are different architecture & methodology must be different March 21, 2009
DW Architecture Best Practices
19
cohesion institute
Two Different Worlds
On-Line Transaction Processing
Entity Relational Data Model
created in 1960’s to address performance issues with relational database implementations normalized to most efficiently get data in divides the data into many discrete entities many relationships between these entities this approach was documented by C.J. Date in An Introduction to Database Systems
March 21, 2009
DW Architecture Best Practices
20
cohesion institute
Two Different Worlds
Business Intelligence Systems
Dimensional Data Model
also called star schema designed to easily get information out fewer relationships than ERD, the only table with multiple joins connecting to other tables is the central table developed in 1960’s by data service providers, formalized by Ralph Kimball in The Data Warehouse Toolkit
March 21, 2009
DW Architecture Best Practices
21
cohesion institute
Entity Relation Disadvantages
all tables look the same people can’t visualize/remember diagrams software can’t navigate as schema becomes too complex business processes mixed together many artificial keys created March 21, 2009
DW Architecture Best Practices
22
cohesion institute
Dimensional Model Advantages
simplicity humans can navigate and remember software can navigate deterministically business process explicitly separated (Data Mart) not so many keys (keys = # of attendant tables) March 21, 2009
DW Architecture Best Practices
23
cohesion institute
Best Practice #1
Use a data model that is optimized for information retrieval
dimensional model denormalized hybrid approach
March 21, 2009
DW Architecture Best Practices
24
cohesion institute
Data Acquisition Processes
Extract Transform Load (ETL)
the process of unloading or copying data from the source systems, transforming it into the format and data model required in the BI environment, and loading it to the DW also, a software development tool for building ETL processes (an ETL tool) many production DWs use COBOL or other general-purpose programming languages to implement ETL
March 21, 2009
DW Architecture Best Practices
25
cohesion institute
Data Quality Assurance
data cleansing
the process of validating and enriching the data as it is published to the DW also, a software development tool for building data cleansing processes (a data cleansing tool) many production DWs have only very rudimentary data quality assurance processes
March 21, 2009
DW Architecture Best Practices
26
cohesion institute
Data Acquisition & Cleansing
getting data loaded efficiently and correctly is critical to the success of your DW
implementation of data acquisition & cleansing processes represents from 50 to 80% of effort on typical DW projects inaccurate data content can be ‘the kiss of death’ for user acceptance
March 21, 2009
DW Architecture Best Practices
27
cohesion institute
Best Practice #2
Carefully design the data acquisition and cleansing processes for your DW
Ensure the data is processed efficiently and accurately Consider acquiring ETL and Data Cleansing tools Use them well!
March 21, 2009
DW Architecture Best Practices
28
cohesion institute
Data Model
Already discussed the benefits of a dimensional model No matter whether dimensional modeling or any other design approach is used, the data model must be documented
March 21, 2009
DW Architecture Best Practices
29
cohesion institute
Documenting the Data Model
The best practice is to use some kind of data modeling tool
CA ERwin Sybase PowerDesigner Oracle Designer IBM Rational Rose Etc.
Different tools support different modeling notations, but they are more or less equivalent anyway Most tools allow sharing of their metadata with an ETL tool March 21, 2009
DW Architecture Best Practices
30
cohesion institute
Data Model Standards
data model standards appropriate for the environment and tools chosen in your data warehouse should be adopted considerations should be given to data access tool(s) and integration with overall enterprise standards standards must be documented and enforced within the DW team
someone must ‘own’ the data model
to ensure a quality data model, all changes should be reviewed thru some formal process March 21, 2009
DW Architecture Best Practices
31
cohesion institute
Data Model Metadata
Business definitions should be recorded for every field (unless they are technical fields only) Domain of data should be recorded Sample values should be included As more metadata is populated into the modeling tool it becomes increasingly important to be able to share this data across ETL and Data Access tools March 21, 2009
DW Architecture Best Practices
32
cohesion institute
Metadata Architecture
The strategy for sharing data model and other metadata should be formalized and documented Metadata management tools should be considered & the overall metadata architecture should be carefully planned March 21, 2009
DW Architecture Best Practices
33
cohesion institute
Best Practice #3
Design a metadata architecture that allows sharing of metadata between components of your DW
consider metadata standards such as OMG’s Common Warehouse Metamodel (CWM)
March 21, 2009
DW Architecture Best Practices
34
cohesion institute
Alternative Architecture Approaches
Bill Inmon: “Corporate Information Factory” Hub and Spoke philosophy “JBOC” – just a bunch of cubes Let it evolve naturally
March 21, 2009
DW Architecture Best Practices
35
cohesion institute
What We Want (Architectural Principal)
In most cases, business and IT agree that the data warehouse should provide a ‘single version of the truth’ Any approach that can result in disparate data marts or cubes is undesireable This is known as data silos or… March 21, 2009
DW Architecture Best Practices
36
cohesion institute
Enterprise DW Architecture
how to design an enterprise data warehouse and ensure a ‘single version of the truth’? according to Kimball:
start with an overall data architecture phase use “Data Warehouse Bus” design to integrate multiple data marts use incremental approach by building one data mart at a time
March 21, 2009
DW Architecture Best Practices
37
cohesion institute
Data Warehouse Bus Architecture
named for the bus in a computer
standard interface that allows you to plug in cdrom, disk drive, etc. these peripherals work together smoothly
provides framework for data marts to fit together allows separate data marts to be implemented by different groups, even at different times
March 21, 2009
DW Architecture Best Practices
38
cohesion institute
Data Mart Definition
data mart is a complete subset of the overall data warehouse
a single business process OR a group of related business processes
think of a data mart as a collection of related fact tables sharing conformed dimensions, aka a ‘fact constellation’
March 21, 2009
DW Architecture Best Practices
39
cohesion institute
Designing The DW Bus
determine which dimensions will be shared across multiple data marts conform the shared dimensions
produce a master suite of shared dimensions
determine which facts will be shared across data marts conform the facts
standardize the definitions of facts
March 21, 2009
DW Architecture Best Practices
40
cohesion institute
Dimension Granularity
conformed dimensions will usually be granular
makes it easy to integrate with various base level fact tables easy to extend fact table by adding new facts no need to drop or reload fact tables, and no keys have to be changed
March 21, 2009
DW Architecture Best Practices
41
cohesion institute
Conforming Dimensions
by adhering to standards, the separate data marts can be plugged together e.g. customer, product, time they can even share data usefully, for example in a drill across report ensures reports or queries from different data marts share the same context
March 21, 2009
DW Architecture Best Practices
42
cohesion institute
Conforming Dimensions (cont’d)
accomplish this by adding any dimension attribute(s) needed in any data mart(s) to the standard dimension definition
attributes not needed everywhere can always be ignored
typically harder to determine how to load conformed dimensions than to design them initially
need a single integrated ETL process what is the SOR for each attribute? how do we deal with attributes for which there is more than one possible SOR?
March 21, 2009
DW Architecture Best Practices
43
cohesion institute
Conforming Facts
in an enterprise, some metrics may not have the same generally accepted definition across all business units conforming facts is generally a bigger design challenge than conforming dimensions
why?
March 21, 2009
DW Architecture Best Practices
44
cohesion institute
Conforming Facts - Benefits
ensures the constituent data marts can as clearly as possible represent fact data expressed on the same basis using consistent definitions ensures reports or queries from different data marts share consistent content success of an Enterprise DW hinges on successfully conformed facts
any perceived inconsistencies in fact definitions across data marts will generally be considered to be a DW bug or data problem by users if users don’t have full confidence in data quality they may stop using the DW
March 21, 2009
DW Architecture Best Practices
45
cohesion institute
Data Consolidation
a current trend in BI/DW is ‘data consolidation’ from a software vendor perspective, it is tempting to simplify this:
‘we can keep all the tables for all your disparate applications in one physical database’
March 21, 2009
DW Architecture Best Practices
46
cohesion institute
Data Integration
To truly achieve ‘a single version of the truth’, must do more than simply consolidating application databases Must integrate data models and establish common terms of reference
March 21, 2009
DW Architecture Best Practices
47
cohesion institute
Best Practice #4
Take an approach that consolidates data into ‘a single version of the truth’
Data Warehouse Bus
conformed dimensions & facts
OR?
March 21, 2009
DW Architecture Best Practices
48
cohesion institute
Operational Data Store (ODS)
a single point of integration for disparate operational systems contains integrated data at the most detailed level (transactional) may be loaded in ‘near real time’ or periodically can be used for centralized operational reporting March 21, 2009
DW Architecture Best Practices
49
cohesion institute
Role of an ODS in DW Architecture
In the case where an ODS is a necessary component of the overall DW, it should be carefully integrated into the overall architecture Can also be used for:
Staging area Master/reference data management Etc…
March 21, 2009
DW Architecture Best Practices
50
cohesion institute
ODS Data Model
Not clear if any design approach for an ODS data model has emerged as a best practice
normalized dimensional denormalized/hybrid any suggestions?
March 21, 2009
DW Architecture Best Practices
51
cohesion institute
Best Practice #5
Consider implementing an ODS only when information retrieval requirements are near the bottom of the data abstraction pyramid and/or when there are multiple operational sources that need to be accessed
Must ensure that the data model is integrated, not just consolidated May consider 3NF data model Avoid at all costs a ‘data dumping ground’
March 21, 2009
DW Architecture Best Practices
52
cohesion institute
Capacity Planning
DW workloads are typically very demanding, especially for I/O capacity Successful implementations tend to grow very quickly, both in number of users and data volume Rules of thumb do exist for sizing the hardware platform to provide adequate initial performance
typically based on estimated ‘raw’ data size of proposed database e.g. 100-150 Gb per modern CPU
March 21, 2009
DW Architecture Best Practices
53
cohesion institute
SMP Server Scale Up
Scaling performance within a single SMP server is referred to as ‘scale up’ Database benchmarks suggest Windows scalability is near that of Linux IBM claims near-linear scalability for Linux (on commodity hardware) up to about 4 processors
Probably not cost effective to scale up Linux much beyond 4 processors
IBM claims near-linear scalability for AIX on POWER5 up to about 8 processors March 21, 2009
DW Architecture Best Practices
54
cohesion institute
Scale Out
There is an increasing trend in IT to ‘scale out’ processing capacity by deploying many small, commodity servers rather than a single large SMP system This strategy tends to work well for relatively simple applications such as network or web servers For very complex workloads such as a data warehouse, this strategy is much more difficult to effectively implement
Especially so for the database server itself
March 21, 2009
DW Architecture Best Practices
55
cohesion institute
Scale Up vs. Scale Out
To obtain the total number of processors required for the estimated DW workload, must plan either to scale up or scale out Both options are viable but, all other things being equal, scaling up is less disruptive to end users and requires less work to implement
scaling up can offer lower hardware investment, if practical however, network bandwidth or latency issues can limit effectiveness of parallelism
March 21, 2009
DW Architecture Best Practices
56
cohesion institute
Best Practice #6
Create a capacity plan for your BI application & monitor it carefully Consider future additional performance demands
Establish standard performance benchmark queries and regularly run them Implement capacity monitoring tools Build scalability into your architecture May need to allow for scaling both up and out!
March 21, 2009
DW Architecture Best Practices
57
cohesion institute
Open Source Affordability
Another emerging trend in IT generally is to utilize Open Source software running on commodity hardware
this is expected to offer lower total cost of ownership certainly, GNU/Linux and other Open Source initiatives do provide very good functionality and quality for minimal cost
This trend also applies to BI & DW:
most traditional rdbms’s are now supported on Linux however, open source rdbms’s lag behind on providing good performance for DW queries
March 21, 2009
DW Architecture Best Practices
58
cohesion institute
DW Appliances
DW appliances, consisting of packaged solutions providing all required software and hardware, are beginning to offer very promising price/performance production experience is limited so far, so this is not yet a ‘best practice’
March 21, 2009
DW Architecture Best Practices
59
cohesion institute
Data Warehouse Architecture Best Practices Q&A
cohesion institute
cohesion systems consulting inc
the modern art of data abstraction