Review
Architecture
Characteristics Relational
ROLAP
OLAP
Multidimensional
Today Star Schema
OLAP
VS. MOLAP
Implementation techniques for OLAP
10/15/08
Sudarshan
Fact table Dimensions Drilling Down & Roll up Slicing & Dicing
Bit map indexes Join indexes File org. 1
What is Star Schema?
Star Schema is a relational database schema for representing multidimensional data. It is the simplest form of data warehouse schema that contains one or more dimensions and fact tables. It is called a star schema because the entityrelationship diagram between dimensions and fact tables resembles a star where one fact table is connected to multiple dimensions. The center of the star schema consists of a large fact table and it points towards the dimension tables. The advantage of star schema are slicing down, performance increase and easy understanding of2 10/15/08 Sudarshan
Steps in designing Star Schema
Identify a business process for analysis(like sales). Identify measures or facts (sales dollar). Identify dimensions for facts(product dimension, location dimension, time dimension, organization dimension). List the columns that describe each dimension.(region name, branch name, region name). Determine the lowest level of summary in a fact table(sales dollar). 10/15/08
Sudarshan
3
Important aspects of Star Schema & Snow Flake Schema
In a star schema every dimension will have a primary key. In a star schema, a dimension table will not have any parent table. Whereas in a snow flake schema, a dimension table will have one or more parent tables. Hierarchies for the dimensions are stored in the dimensional table itself in star schema. Whereas hierarchies are broken into separate tables in snow flake schema. These hierarchies helps to drill down the data from topmost hierarchies to the lowermost hierarchies. 4 10/15/08 Sudarshan
Fact
Facts are numeric measurements (values) that represent a specific business activity. Example, sales figures are numeric measurements that represent product and/or service sales. Facts are used in business data analysis, are units, cost, prices and revenues. Facts are stored in a FACT table I.e. the center of the star schema. 10/15/08
Sudarshan
5
Fact Table
The centralized table in a star schema is called as FACT table, that contains facts and connected to dimensions. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key that is made up of all of its foreign keys. A fact table might contain either detail level facts or facts that have been aggregated (fact tables that contain aggregated facts are often instead called summary tables). A fact table usually contains facts with the same level of aggregation. 10/15/08
Sudarshan
6
Many OLAP applications are based on a fact table
For example, a supermarket application might be based on a table Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)
The table can be viewed as multidimensional
Market_Id, Product_Id, Time_Id are the dimensions that represent specific supermarkets, products, and time intervals Sales_Amt is a function of the other three 10/15/08
Sudarshan
7
Fact Table (Conclusion)
Central table
10/15/08
mostly raw numeric items narrow rows, a few columns at most large number of rows (millions to a billion) Access via dimensions
Sudarshan
8
Dimension
Qualifying characteristics that provide additional perspective to a given fact. Example, sales might be compared by product from region to region and from one time period to the next. Here sales have product, location and time dimensions. Such dimensions are stored in DIMENSIONAL TABLE. 10/15/08
Sudarshan
9
Dimension Tables
The dimensions of the fact table are further described with dimension tables Fact table: Sales (Market_id, Product_Id, Time_Id, Sales_Amt) Dimension Tables: Market (Market_Id, City, State, Region) Product (Product_Id, Name, Category, Price) Time (Time_Id, Week, Month, Quarter)
10/15/08
Sudarshan
10
Attributes
Each dimension table contain attributes. Used to search, filter and classify facts. Example, Sales, we can identify some attributes for each dimension:
Product Dimension: product ID, description, product type Location Dimension: region, state, city. Time Dimension: year quarter, month, week and date.
10/15/08
Sudarshan
11
Attributes hierarchy
AH provides a top-down data organization Used for aggregation and drilldown/roll-up data analysis. Example, location dimension attributes can be organized in a hierarchy by region, state and city. AH provides the capability to perform drill-down and roll-up searches. Allows the DW and OLAP systems to to have defined path. 10/15/08
Sudarshan
12
(location) all
all Europe
region country city
Germany
Frankfurt
...
...
...
Spain
10/15/08
Canada
Vancouver ... L. Chan
office
North_America
Sudarshan
...
...
Mexico
Toronto
M. Wind 13
Multidimensional Data Sales volume as a function of product, month, and region Dimensions: Product, Location, Time Hierarchical summarization paths
gi o
n
Re
Industry Region
Year
Product
Category Country Quarter Product
City Office
Month
Week
Day
Month 10/15/08
Sudarshan
14
2Qtr
3Qtr
4Qtr
sum
U.S.A
Pr o
du
TV PC VCR sum
1Qtr
Date
Total annual sales of TV in U.S.A.
Canada Mexico
Country
ct
A Sample Data Cube
sum
10/15/08
Sudarshan
15
Star Schema A single fact table and for each dimension one dimension table Does not capture hierarchies directly
T i m
e c u s t 10/15/08
date, custno, prodno, cityname, ...
f a c t
Sudarshan
p r o d c i t y
16
Example of Star Schema: Figure 1.6
10/15/08
Sudarshan
17
In the example, sales fact table is connected to dimensions location, product, time and organization. It shows that data can be sliced across all dimensions and again it is possible for the data to be aggregated across multiple dimensions. "Sales dollar" in sales fact table can be calculated across all dimensions independently or in a combined manner which is explained below. Sales dollar value for a particular product Sales dollar value for a product in a location Sales dollar value for a product in a year within a 10/15/08
Sudarshan
18
Example of Star Schema time
item
time_key day day_of_the_week month quarter year
Sales Fact Table time_key item_key branch_key
branch branch_key branch_name branch_type
location_key units_sold dollars_sold avg_sales
item_key item_name brand type supplier_type
location location_key street city province_or_street country
Measures 10/15/08
Sudarshan
19
Aggregation
Many OLAP queries involve aggregation of the data in the fact table
For example, to find the total sales (over time) of each product in each market, we might use SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt) FROM Sales S GROUP BY S.Market_Id, S.Product_Id
The aggregation is over the entire time dimension and thus produces a twodimensional view of the data 10/15/08
Sudarshan
20
Time
The output of the previous query Market_Id M2 M3
Product_Id
M1 SUM(Sales_Amt) P1 P2 P3 P4 P5
10/15/08
3003
1503
6003
2402
4503
3
7503
7000
…
…
Sudarshan
M4
… … … … …
21
Typical OLAP Operations
Roll up (drill-up): summarize data
Drill down (roll down): reverse of roll-up
project and select
Pivot (rotate):
from higher level summary to lower level summary or detailed data, or introducing new dimensions
Slice and dice:
by climbing up hierarchy or by dimension reduction
reorient the cube, visualization, 3D to series of 2D planes.
Other operations
drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
10/15/08
Sudarshan
22
Drilling Down and Rolling Up
Some dimension tables form an aggregation hierarchy Market_Id →
Executing a series of queries that moves down a hierarchy (e.g., from aggregation over regions to that over states) is called drilling down
City → State → Region
Requires the use of the fact table or information more specific than the requested aggregation (e.g., cities)
Executing a series of queries that moves up the hierarchy (e.g., from states to regions) is called 23 10/15/08rolling up Sudarshan
Drilling Down
Drilling down on market: from Region to State Sales (Market_Id, Product_Id, Time_Id, Sales_Amt) Market (Market_Id, City, State, Region)
–
SELECT
S.Product_Id, M.Region, SUM
(S.Sales_Amt) FROM Sales S, Market M WHERE M.Market_Id = S.Market_Id GROUP BY S.Product_Id, M.Region –
SELECT S.Product_Id, M.State, SUM (S.Sales_Amt) FROM Sales S, Market M
10/15/08
Sudarshan
24
Rolling Up
Rolling up on market, from State to Region
If we have already created a table, State_Sales, State_Sales using
1.
SELECT S.Product_Id, M.State, SUM (S.Sales_Amt) FROM Sales S, Market M WHERE M.Market_Id = S.Market_Id GROUP BY S.Product_Id, M.State then we can roll up from there to:
2.
SELECT T.Product_Id, M.Region, SUM (T.Sales_Amt) FROM State_Sales T, Market M WHERE M.State = T.State GROUP BY T.Product_Id, M.Region
10/15/08
Sudarshan
25
Roll-up and Drill Down Higher Level of Aggregation
Roll Up
Drill-Down
Sales Channel Region Country State Location Address Sales RepresentativeLow-level Details
10/15/08
Sudarshan
26
“Slicing and Dicing” The Telecomm Slice
Product
Household Telecomm Video Audio
i g e
R
s n o Europe
Far East India Retail Direct Special
10/15/08
Sales Channel Sudarshan
27
Snowflake Schema
A snowflake schema is a term that describes a star schema structure normalized through the use of outrigger tables. i.e dimension table hierarchies are broken into simpler tables. In star schema example we had 4 dimensions like location, product, time, organization and a fact table (sales)
10/15/08
Sudarshan
28
Snowflake schema Represent dimensional hierarchy directly by normalizing tables. Easy to maintain and saves storage
T i m
e c u s t 10/15/08
p r o d
date, custno, prodno, cityname, ...
f a c t
Sudarshan
c i t y
r e g i o n29
Example of Snowflake Schema
10/15/08
Sudarshan
30
Schema time time_key day day_of_the_week month quarter year
item Sales Fact Table time_key item_key branch_key
branch
location_key
branch_key branch_name branch_type
units_sold dollars_sold avg_sales
Measures 10/15/08
Sudarshan
item_key item_name brand type supplier_key
supplier
supplier_key supplier_type
location location_key street city_key
city
city_key city province_or_street country
31
Indexing Techniques
Exploiting indexes to reduce scanning of data is of crucial importance Bitmap Indexes Join Indexes Other Issues
10/15/08
Text indexing Parallelizing and sequencing of index builds and incremental updates Sudarshan
32
Indexing Techniques Bitmap index:
Index on a particular column
Each value in the column has a bit vector: bit-op is fast
The length of the bit vector: # of records in the base table
The i-th bit is set if the i-th row of the base table has the value for the indexed column
not suitable for high cardinality domains
10/15/08
Sudarshan
33
BitMap Indexes
Example: the attribute sex has values M and F. A table of 100 million people needs 2 lists of 100 million bits
10/15/08
Sudarshan
34
Bitmap Index
M
Y
F
Y
F
N
M
N
F
Y
F
N
Customer
10/15/08
0
1
0
1
1
1
1
0
0
0
0
0
1
1
1
1
0
0
Query : select * from customer where gender = ‘F’ and vote = ‘Y’ Sudarshan
35
Bit Map Index
Base Table
Cust C1 C2 C3 C4 C5 C6 C7
Region Rating N H S M W L W H S L W L N H
Customers where
10/15/08
Region Index
Row ID N S E W 1 1 0 0 0 2 0 1 0 0 3 0 0 0 1 4 0 0 0 1 5 0 1 0 0 6 0 0 0 1 7 1 0 0 0 Region = W
Sudarshan
And
Rating Index
Row ID H M L 1 1 0 0 2 0 1 0 3 0 0 0 4 0 0 0 5 0 1 0 6 0 0 0 7 1 0 0 Rating = M
36
BitMap Indexes
Comparison, join and aggregation operations are reduced to bit arithmetic with dramatic improvement in processing time Significant reduction in space and I/O (30:1) Adapted for higher cardinality domains as well. Compression (e.g., run-length encoding) exploited Products that support bitmaps: Model 204, TargetIndex (Redbrick), IQ (Sybase), Oracle 7.3
10/15/08
Sudarshan
37
Join Indexes Pre-computed joins A join index between a fact table and a dimension table correlates a dimension tuple with the fact tuples that have the same value on the common dimensional attribute
e.g., a join index on city dimension of calls fact table correlates for each city the calls (in the calls table) from that city
10/15/08
Sudarshan
38
Join Indexes
Join indexes can also span multiple dimension tables
10/15/08
e.g., a join index on city and time dimension of calls fact table
Sudarshan
39
Star Join Processing
Use join indexes to join dimension and fact table
Calls C+T Time
C+T+L Location Plan
10/15/08
Sudarshan
C+T+L +P
40
Bitmapped Join Processing Bitmaps Time Location Plan
10/15/08
Calls
1 0 1
Calls
0 0 1
AND
1 1 0
Calls
Sudarshan
41
OLAP Is FASMI
Fast Analysis Shared Multidimensional Information
Nigel Pendse, Richard Creath - The OLAP Report 10/15/08
Sudarshan
42
Warehouse Products
Computer Associates -- CA-Ingres Hewlett-Packard -- Allbase/SQL Informix -- Informix, Informix XPS Microsoft -- SQL Server Oracle -- Oracle7, Oracle Parallel Server Red Brick -- Red Brick Warehouse SAS Institute -- SAS Software AG -- ADABAS
Sybase -- SQL Server, IQ, MPP
10/15/08
Sudarshan
43
Products
Oracle 8 Informix
Online Dynamic Server XPS --Extended Parallel Server Universal Server for object relational applications
Sybase
Adaptive Server 11.5 Sybase MPP Sybase IQ
10/15/08
Sudarshan
44
Warehouse Server Products
Red Brick Warehouse Tandem Nonstop IBM
DB2 MVS Universal Server DB2 400
Teradata
10/15/08
Sudarshan
45