Allan Peda
Enterprise Application Integration Y52.3250 Prof. Aldrich Wright
Allan Peda Week 3, Kimball vs Inmon Data Warehouse Designs
Due Date: February 17, 2007, 10PM
Page 1 of 7
Allan Peda
Assignment Describe a Business Problem or set of business problems that the your company is trying to solve through a data warehouse or data mart. Compare and contrast the Kimball and Inmon approaches to data warehouse. Pick the one that best solves your business problem and why.
Problem Statement The chosen business problem for this data warehouse is the determination of enrollment trends and subject interest within a university. This information derived from these data is of tremendous value as it will be used to develop long term growth plans for the university including campus expansion, departmental funding levels and potential areas for enhanced research and development funding. The initial focal point of this from a business unit perspective will be student registration data and library collection access records. This will define initial data feeds for ETL processing.
Problem Resolution The chosen design is the relational (Inmon) database design as it is considered to be quicker to develop, supports gradual expansion over time, is robust, flexible and scalable. These attributes are considered more important than speed of implementation or raw performance. Kimballs multidimensional system (also known as BUS for reasons discussed below) is considered more rigid because well defined conforming dimension and fact tables must be selected at the outset of system development. The fact that the BUS architecture is modeled on the analytical requirements of the end users creates difficulties when these requirements change.
Page 2 of 7
Allan Peda Relational
Multidimensional
High Performance
Considered Quicker To Implement
Gradual Development Supported
Robust
Flexible
Scalable Performance
A more extensive discussion of each system is developed below.
Common Attributes among all data warehouse designs No matter which design paradigm is used to approach the development of a data warehouse design, the basic definition of a data warehouse remains the same. It is a subject oriented, integrated, nonvolatile and time variant collection of data in support of management's decisions. As such it can be viewed as one type of read intensive online analytical processing (OLAP) system which must integrate a large amount of data derived from multiple business units. These data must be consistent and summarized at the appropriate level of granularity, consequently an Extract, Transform and Load (ETL) step is necessary in drawing data from the operational (OLTP) systems.1 Both designs emphasize the importance of the ETL step in developing cleansed data which is loaded from a common Integrated Data Store. There must be no direct dependence on legacy or operational data.
The Multidimensional (Kimball) Approach As noted earlier, the Kimball design is developed along the requirements of the end user. As such the data must be denormalized and refactored using a well defined set of end user requirements. The underlying logical architecture is not 1 This step is sometimes altered slightly with data transformation done primarily within the database. This is referred to as Extract, Load and Transform (ELT). http://www.b-eye-network.com/blogs/linstedt/archives/2005/05/elt_and_etl_can.php
Page 3 of 7
Allan Peda
relational, instead it is multidimensional (MOLAP, often referred to as a multidimensional hypercube). It should be noted that star schemas can be developed using tables within an RDBMS. In such cases it is possible to reduce data redundancy and lower storage requirements by reintroducing some level of normalization to the data. Such controlled reintroduction of normalization is known as snowflaking of a star schema. Physical MOLAP database systems (as opposed to the logical schema) are typically proprietary commercial off the shelf (COTS) products, and as such there is little publicly information regarding the actual implementation of such systems. Kimball data warehouses are comprised of multiple subject oriented data marts which each center on a few core fact tables. These tables contain data which are easily aggregated. Surrounding these fact tables are multiple dimension tables, which contain the associated characteristics of these facts (time being an almost universal attribute for example). These dimension tables are considered to be the entry points used to access and aggregate the associated facts. They are used for filtering, grouping and labeling. In order for queries to determine what correlations between facts and attributes, it is desirable for dimension and fact tables to be shared among several subject areas. The careful design and selection of this set of conforming fact and dimension tables is referred to by Kimball as the BUS architecture.
Page 4 of 7
Allan Peda
Illustration 1: Data flow in the Kimball Data Warehouse Architecture (from The Data Warehouse Toolkit, Kimball 2nd edition,2003)
The Relational (Inmon) Approach In contrast with the architecture developed by Ralph Kimball, Bill Inmon's design is based on a more flexible relational design, which is shaped by the enterprise data model. The relational design is the more abstract and more flexible one, however this flexibility is achieved at a cost, and it is generally accepted that relational model is not optimal in terms of reporting performance requirements. Data which are spread among several parent-child relationships will require multiple lookups in order to fetch each record. In order to obtain acceptable performance levels, the relational data warehouse is often systematically denormalized in places. The following techniques are recommended by Inmon: 1. Tables may be physically merged (possibly by creating materialized views during load step). 2. Data may be refactored so they are no longer atomic; this may be accomplished via embedded arrays.
Page 5 of 7
Allan Peda
3. Data which has a low probability of access may be placed in a separate tables. 4. Redundant data may be deliberately introduced to reduce the need to lookup other tables. 5. Derived data may be placed within the database. This is a special instance of data redundancy. This may be accomplished via regularly scheduled batch processes or by use of a creative index during the data loading step. 6. Most noteworthy perhaps is the fact that Inmon states that it is “patently incorrect” to enforce referential integrity within a data warehouse. As such any parent-child integrity within a data warehouse is referred to by Inmon as a relationship “artifact”. Inmon's design also relies on star (and snowflake) joins, however these design techniques are restricted to the summarized data presented at the data warehouse level. In Ralph Inmon's data warehouse designs, data marts are not linked together via a data bus of conforming fact and dimension tables, instead the RDBMS system upstream of the data marts is considered the warehouse proper, and it is there where relations are derived allowing queries across subject areas (contrast his to the use of conforming dimensions and facts within Kimball's architectures). As noted earlier Kimball's logical architectures naturally map directly to a physical multidimensional online analytical processing (OLAP) database system known as MOLAP, while Bill Inmon's designs are inherently relational OLAP (ROLAP). Combinations, or hybrids of both systems are known as HOLAP.
Page 6 of 7
Allan Peda
Illustration 2: Data flow for the Inmon Relational Data Warehouse
References 1. Inmon, W.H., Building The Data Warehouse, 4th edition. Indianapolis, Indiana: Wiley Publications, 2005. 2. Kimball, R and Ross, M. The Data Warehouse Toolkit, 2nd edition. Indianapolis, Indiana: Wiley Publications, 2003. 3. Website for The Kimball Group http://www.kimballgroup.com/ 4. Website for Bill Inmons Corporate Information Factory (CIF) http://www.inmoncif.com/home/
Page 7 of 7