DATA WAREHOUSING
Data Warehousing Concepts
What is a Data Warehouse?
A Data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data in support of the management's decision –making process (OR) Data warehouse is a relational database for Query and Analysis Rather than for transaction processing.
Subject-oriented(customer,products,sales,etc) Non-volatile Time-varying Integrated
(William.Inmon –1993)
Subject-oriented
A Data Warehouse is organized around major subjects, such as customer, supplier, product and sales. Rather than concentrating on day-to-day operations and transaction processing of an organization.
Integrated
A Data Warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and on-line transaction records. data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures and so on.
Time Variant
Data are stored to provide information from a historical perspective (example: the past 5 -10 years). Every key structure in the data warehouse contains, either implicitly or explicitly an element of time
Non-Volatile
A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. due to this separation, a data warehouse does not require transaction processing,recovery,and concurrency control mechanisms. it usually requires only two operations in data accessing :initial loading loading of data and access of data.
Comparison of OLTP systems and data warehousing systems OLTP Systems
Data warehousing systems
Holds Current data
Holds Historical data
Stores detailed data
stores detailed,lightly,and highly summarized data
Data is dynamic
Data is largely static
Transaction-driven
Analysis driven
Application -oriented
Subject -oriented
Supports day-to-day decisions
Supports strategic decisions
Serves large number of clerical/operational users
Serves relatively low number of managerial users
Dimension Modeling Dimension: Dimension is a structure which consists of levels, and hierarchies is defined on each level. Example:
SEX
MALE
FEMALE
Dimension Modeling Example: Profession
Secretary
Engineer
Chemical
Civil
Executive
Junior
Level 0
Teacher
Elementary
Level 1
High School
Level 2
FACTS
Fact: Fact consists of whole data with primary key, foreign key relation ship with dimensions and also consists of measures. There are Three types of facts 1.ADDITIVE FACTS 2.SEMI ADDITIVE FACTS 3.NON ADDITIVE FACTS
Fact less Fact & Conform dim
Fact less fact is a fact it does not contain Measures. A Dimension which can share more than one Fact is called Conform Dimension Collection of Star Schemas and Snowflake Schemas is called Galaxy.
Star schema & Snow flake Schema
Star schema: A Centralized fact table surrounded with dimension tables having Primary, Foreign key relation ship between them is called star schema. Snow flake Schema: A normalized star schema Is called Snow flake Schema Galaxy: Collection of Star schemas and snow flake Schema is called Galaxy.
Star Schema Sex-Dim
Date-Dim
Sex key Sex
Fact Table
Profession-Dim Profession-key Profession-class Title Level discipline
Profession Key ----------------------------Sex key ---------------------------Address key ----------------------------Date key --------------------------------------Measures (Numeric)
Date key Current year Current month Current week
Conform Dim
Address Dim Address key Country State City
Snow Flake Schema Item Dim Item key Item name Type Supplier key
Supplier Dim Supplier key Supplier name Supplier address Supplier type
Fact Table Time key Item key Location key ------------------------------------Rupees sold Units sold
Location Dim Location key Street City key
Time Dim Time key Year Quarter Month day
City Dim City key City name State Country Pin code
TYPES OF MAPPINGS
History Simple pass through (None) Slowly growing target (Full) Slowly changing dimension (depends)
Types of SCD’S
4. 5. 6.
Slowly changing dimension—1 Slowly changing dimension—2 Time stamping versioning Flaging Slowly changing dimension—3
Slowly changing Dimension-1
SCD-1:When you does not want History use this kind of mapping (Only insert else Update takes place) it inserts the new row or Update the existing dimensions.
Slowly changing Dimension-2
SCD -2 (Time stamp): When you want
maintain full history use this kind of mapping. Inserts new and changed dimensions. Creates an effective date range to track changes.
SCD -2 (Versioning): Inserts new and
changed dimensions. Creates a version number and increments the primary key to track changes.
SCD -2 (Flaging): Inserts new and changed dimensions. Flags the current version and increments the primary key to track changes.
Slowly changing Dimension-3 SCD -3 : when you want partial history use this kind of mapping. It Inserts new dimensions. Updates changed values in existing dimensions. Optionally uses the load date to track changes.
Data Warehouse Execution Architecture Source Systems
ETL
DATA WAREHOUSE
DB2/400
Oracle ODBC SQL Server
Native FTP
Flat files
Informatica (ETL) UNIX-HP
ODBC Native
NCR Tera Data Warehouse
Reporting( DSS) Servers
Architecture
ODBC