DATA WAREHOUSING
Agenda Introduction Process DSS
Information Dimensions
processing
OLAP
Architecture Best
Practise Case
Types
Shilpa Surve
Introduction
Definition Data Warehouse is
• • • •
a
Subject-Oriented Integrated Time-Variant Non-volatile
12/04/08
4
What are Data Warehouses? Data warehouses store large volumes of data which are frequently used by DSS It is maintained separately from the organization’s operational databases Data warehouses are relatively static with only infrequent updates A data warehouse is a stand-alone repository of information, integrated from several, possibly heterogeneous operational databases
12/04/08
5
Steps in Building a Warehouse Identify
key business drivers, sponsorship, risks, ROI Survey information needs and identify desired functionality and define functional requirements for initial subject area. Architect long-term, data warehousing architecture Evaluate and Finalize DW tool & technology Conduct Proof-of-Concept 12/04/08
6
Steps in building Data Warehouse Design target data base schema Build data mapping, extract, transformation,
cleansing and aggregation/summarization rules Build initial data mart, using exact subset of enterprise data warehousing architecture and expand to enterprise architecture over subsequent phases Maintain and administer data warehouse 12/04/08
7
The Three Views of Data Warehousing
Strategic or Business view • Define key business drivers of data warehouse • How can business-driven approach achieve high ROI?
Architectural or Technology view
• Alternative data warehousing architectures • How can the right architecture achieve a high ROI?
Methodology or Implementation view
• Development and implementation methodology • How can the right methodology achieve a rapid ROI? 12/04/08
8
Swathi Velisetty
Process
DW Components Metadata Layer Extraction FS1 FS2
. . .
Transmission N
FSn
E T W O R K
Data Mart Population
Cleansing
S T A G I N G A R E A
Aggregation Summarization Transformation
ODS
DM1
DW
DM2
DMn
OLAP ANALYSIS
Legacy System 12/04/08
10
Cleansing process Process Metadata Raw data (Staging Area)
Cleansing Process
Good Control Metadata
•Clean the Raw Data •Mark it Good/Bad •Generate the cleansing Reports and mail to the DWA and Feed System representatives 12/04/08
Clean data
Bad Cleansing Reports
11
Transformation Process
Clean Operationa l Data
Process Metadata •Mapping Detail •Transformation Rule
Operational Data Store
Transformatio n Process Control Metadata
•Transform the cleaned Operational Data into DSS Data •Load the DSS data into ODS •ODS contains the current DSS data at the lowest level of granularity 12/04/08
12
Summarization Process DW
ODS Summarizatio n Process
Weekly
Monthly
Yearly
Control Metadata •Summarize and aggregate ODS data and Populate to the Warehouse • Periodicity of Summarization Process depends upon the level of summarization at Warehouse ( weekly, monthly, daily ) 12/04/08
13
Enterprise Data Warehouse Legacy
Select
Extract
Clien t/ Serve
Transform
OLTP
Integrate
External Operational Systems Enterprise wide 12/04/08
Metadata Repositor y DATA WAREHOUSE
A P I
U S E R S
Maintain Data Preparation
14
Distributed Data Marts Legacy
Select
Extract
Clien t/ Serve
Transform
OLTP
Integrate
External Operational Systems Data 12/04/08
Data Mart
Maintain Data Preparation
Data Mart
A P I
U S E R S
Data Mart
15
Multi-tiered Data Warehouse Legac y
Clien t/ Serve
Select
Extract
Transform O LTP
Extern al
Integrate Maintain
Operational Systems Enterprise wide 12/04/08
Data Mart
Metadat a Reposito
DATA WAREHOUSE
Data Mart
A P I
U S E R S
Data Mart
16
Example Monthly sales by region for 1991Weekly sales by region for
Monthly Sales by Product for 1991-94
Weekly sales by product/subproduct for 1991-94 Sales Detail for 1991-
Metadata
Sales Detail for 1985-90 12/04/08
17
Atul zade
Decision support system
What is DSS? Decision Support Systems (DSS) are interactive computer-
based systems intended to help decision makers utilize data and models to identify and solve problems and make decisions. Data Warehouse is the foundation of DSS process. It is a Strategy and a Process for Staging Corporate Data. ➭ Enable users to get a “Business View” of the data ➭ Facilitate Data based Decision Making that would drive and improve the Business ➭ Discover “Hidden Trends”
12/04/08
19
Driving Forces for DSS ➭ Changes in the Business Environment Business Speed
N O I T I ET
Customer s
RESULT:
P M O C 12/04/08
Refor m
Technology
20
How to answer these Business Queries? What is the sales distribution region wise?
How did my revenue improve in the past 5 years?
What are the slow movers in my product line?
Which channel costs me more and pays less?
Which of my Sales Agents are doing better?
Strategic Planning / Budgeting Currency Risk, Interest Rate Risk, Liquidity Risk
What is Defaulter’s Profile? Who are my profitable customers? 12/04/08
21
OLTP v/s DSS Environment OLTP
Environment
• get data IN • large volumes of simple transaction queries • continuous data changes • low processing time • mode of processing • transaction details • data inconsistency • mostly current data • high concurrent usage • highly normalized data structure • static applications • automates routines
DSS Environment
• get information OUT • small number of diverse queries • periodic updates only • high processing time • mode of discovery • subject oriented - summaries • data consistency • historical data is relevant • low concurrent usage • fewer tables, but more columns per table • dynamic applications • facilitates creativity
Benefits for Business User • • • • • •
Flexible Information Access High Availability Ease of Use Quality & Completeness of Data Focus on Information Processing Information Base for Knowledge Discovery
12/04/08
23
Classification of Business Users Executives/Managers
• •
Multi-dimensional analysis, reporting tools
Knowledge Worker
• •
Ad hoc queries, detail & summary data, application focus
Power-Analyst
• •
Ad hoc queries, Data Analysis & Data Mining
Customer Contacts
• • 12/04/08
Detail Data at specific levels 24
Prem Sequera
Information processing
Data Processing to Information Processing Business Objectives & Goals Application Domains and Business Functions B U SI N E S S T
T R A C E
Query Processin g Report Generatio
R A
E L E M E N T S
Operationa l Data Store (ODS)
C E
OLAP/ Query Tools
Enterprise Data Warehouse
OLAP Appl.
Data Mart A
Appl. Spec. Analysis
Data Mart B
Appl. Spec. Analysis
Data Mart N
Appl. Spec. Analysis
D A T A E L E M E N T S Feed Systems and External Sources
KNOWLEDGE DISCOVERY Data Mining Applications KNOWLEDGE MANAGEMENT
Management Decision: Value Chain
Data Processing
Heterogeneous Data Sources Information Processing Knowledge Processing 12/04/08
26
Subject Oriented Analysis Process Oriented
Subject Oriented
Entry Sales Rep Quantity Sold Part Number Date Customer Name Product Description Unit Price Mail Address Transactional Storage 12/04/08
Sales
Customers Products
Data Warehouse Storage 27
Integration of Data Appl. A - M, F Appl. B - 1, 0 Appl. C - X, Y
M, F
Unit of Attributes
Appl. A - pipeline cm. Appl. B - pipeline inches Appl. C - pipeline mcf
pipeline cm
Physical Attributes
Appl. A - balance dec(13,2) Appl. B - balance PIC 9(9)V99 Appl. C - balance float Appl. A - bal-on-hand
Naming Appl. B - current_balance ConventionsAppl. C - balance Data Consistenc y
Appl. A - date (Julian) Appl. B - date (yymmdd) Appl. C - date (absolute)
Transactional Storage 12/04/08
Integration
Encodin g
balance dec(13, 2)
balance
date (Julian)
Data Warehouse Storage 28
Volatility of Data Volatil e Inse rt
NonVolatile Chan ge
Acces s
Delet e Chan ge
Inse rt
Acces s Record-by-Record Data Manipulation
Transactional Storage 12/04/08
L oad
Mass Load / Access of Data
Data Warehouse Storage 29
Time Variant Data Analysis Current Data
Transactional Storage 12/04/08
Historical Data
Data Warehouse Storage 30
Kairav Parikh
Dimension
What is a Dimension? Data Warehouse is •Subject-Oriented • •Integrated •Time-Variant •Non-volatile collection of data in support of management’s decision.
Subject
12/04/08
Dimension
32
Dimensional Hierarchy
Re
Europe
Asia
nt
America
Country Level
Pa
re
Continent Level
la tio
n
World Level
Geography Dimension World
USA
State Level
FL
City Level Attributes:
Miami
Canada GA
Population, Tourist’s Place
12/04/08
VA
Tamp a
CA Orland o
Argenti na WA Dimension Naple s Member /
Business Entity 33
Types of Dimensions • Simple Dimensions (e.g. Time) • Related Dimensions (e.g. Gender of a Customer) • Spool Dimensions (e.g. Account as an interaction between Customer and Product) • Bucket Dimensions (e.g. Income Ranges of a Customer) • Slowly Changing Dimensions (e.g. changes in Organization) • Fast Varying Dimensions (e.g. changes Retail Customers attributes) • Unused Dimensions (e.g. Order No., Invoice No.)
12/04/08
34
Dimensional Modeling STEP 1 •Identify Subjects (Dimensions) •Identify Hierarchies of a Dimension •Identify Attributes of levels in Hierarchies •Define Grain
Industry Segment Industry Type
Countr y State City
Fin. Class
Customer
12/04/08
35
Dimensional Modeling STEP 2 •Use KPIs to identify the Facts •Group the Facts in a logical set
12/04/08
Financial Transactions
Non-Financial Transactions
Trans. Amount
No. of Cheques Cleared
No. of Bonds
No. of Visits to a Branch
No. of Transactions Service Cost ...
No. of DEMAT Transactions
...
36
Dimensional Modeling STEP 3 •Link the Group of Facts to the Dimensions that participate in the Facts
Customer Time
Product Financial Transactions
Organization
Channel
12/04/08
37
Dimensional Modeling STEP 4 •Define Granularity for each Group of Facts
Customer (Customer) Time (Day-Hour)
Product (Scheme) Financial Transactions
Organization (Branch)
Channel (Channel) 12/04/08
38
Data Warehouse Schemas Star Schema •A Group of Facts connected to Multiple Dimensions
Channel
Financial Transactions
Time
Customer
12/04/08
Organization
Product
39
Data Warehouse Schemas Snow-flake Schema (= Extended Star Schema) •A Group of Facts connected to Dimensions, which are split across multiple hierarchies and attributes
Time
Channel
Product Financial Transactions
Organization
Customer Segment 12/04/08
Geograp hy 40
Data Warehouse Schemas Galaxy Schema •Multiple Groups of Facts links by few common dimensions
Dimension
Dimension Fact1
Dimension
Dimension Fact2
Dimension 12/04/08
Dimension
Fact3
Dimension 41
Akshay Shiveshwarkar
OLAP
On-Line Analytical Processing OLAP can
be defined as a technology which allows the users to view the aggregate data across measurements (like Maturity Amount, Interest Rate etc.) along with a set of related parameters called dimensions (like Product, Organization, Customer, etc.)
12/04/08
43
What is MDDB? A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is • •
12/04/08
intimately related and stored, viewed and analyzed from different perspectives (Dimensions).
44
RDBMS v/s MDDB MODEL
COLOR SALES VOL.
MINI VAN BLUE 6 MINI VAN RED 5 MINI VAN WHITE 4 SPORTS COUPE BLUE 3 SPORTS COUPE RED 5 SPORTS COUPE WHITE 5 SEDAN BLUE 4 SEDAN RED 3 SEDAN WHITE 2
9 x 3 = 27 cells
12/04/08
3 x 3 = 9 cells
45
Benefits of MDDB over RDBMS
Ease of Data Presentation & Navigation Intuitive, Spreadsheet / Crosstab like data views
Storage Space
Very low Space Consumption compared to Relational DB
Performance
Gives much better performance. Relational DB may give comparable results only through database tuning (indexing, keys etc), which may not be possible for ad-hoc queries.
Ease of Maintenance
No overhead as data is stored in the same way it is viewed. In Relational DB, indexes, sophisticated joins etc. are used which require considerable storage and maintenance 12/04/08
46
Issues with MDDB •
•
•
Sparsity
– Controlled Sparsity – Random Sparsity
Data Explosion
– Due to Sparsity – Due to Summarization
Performance
– Doesn’t perform better than RDBMS at high data volumes (>20-30 GB)
12/04/08
47
OLAP Features Subject oriented approach to Decision Support Calculations applied across dimensions, through hierarchies and/or across members Trend analysis over sequential time periods, What-if scenarios. Slicing / Dicing subsets for on-screen viewing Rotation to new dimensional comparisons in the viewing area Drill-down/up along the hierarchy Reach-through / Drill-through to underlying detail data
12/04/08
48
Features of OLAP - Drill Down / Up Sales at region/District/Dealership Level
•
Moving Up and moving down in a hierarchy is referred to as “drill-up” / “roll-up” and “drill-down” 12/04/08
49
Ritesh Raushan
Architecture Types
Implementation Techniques OLAP Architectures • MOLAP - Multidimensional OLAP • Multidimensional Databases for database and application logic layer
• ROLAP - Relational OLAP
• Access Data stored in relational Data Warehouse for OLAP Analysis. • Database and Application logic provided as separate layers
• HOLAP - Hybrid OLAP
• OLAP Server routes queries first to MDDB, then to RDBMS and result processed on-the-fly in Server
• DOLAP - Desk OLAP
• Personal MDDB Server and application on the desktop
12/04/08
51
MOLAP - MDDB storage Web Browser
OLAP
Cube OLAP Calculation Engine
OLAP Tools
OLAP Applications
12/04/08
52
ROLAP - Standard SQL storage Relational DW
MDDB - Relational Mapping
Web Browser
SQL
OLAP Calculation Engine
OLAP Tools
OLAP Applications 12/04/08
53
HOLAP - Combination of RDBMS and MDDB OLAP Cube
Relational DW
Any Client
Web Browser
SQL
OLAP Calculation Engine
OLAP Tools
OLAP Applications 12/04/08
54
Architecture Comparison
12/04/08
55
Kiran Naik
Case
Thank you