DATA WAREHOUSING AND DATA MINING By: Raghav Agrawal Btech( E&T ) III yr – A A1607107107
Overview ❚ Introduction ❚ Data Warehousing ❚ Data Warehousing vs. OLTP ❚ Data Mining
Motivation: “Necessity is the Mother of Invention”
Data explosion problem
Solution: Data warehousing and data mining
Data warehousing and on-line analytical processing
Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
What is a Data Warehouse? A single, complete and consistent store of data obtained from a variety of different sources and made available to end users in a way they can understand and use in a business context.
Warehouses are Very Large Databases
[Figure: bar chart of the share of respondents (0% to 35%) by warehouse size, in bins from 5GB to 500GB-1TB, comparing initial size with size projected for 2Q96. Source: META Group, Inc.]
Very Large Data Bases
❚ Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
❚ Petabytes -- 10^15 bytes: Geographic Information Systems
❚ Exabytes -- 10^18 bytes: National Medical Records
❚ Zettabytes -- 10^21 bytes: Weather images
❚ Yottabytes -- 10^24 bytes: Intelligence Agency Videos
Data Warehousing: It Is a Process ❚ A technique for assembling and managing data from various sources for the purpose of answering business questions, enabling decisions that were not previously possible ❚ A decision support database maintained separately from the organization’s operational database
Characteristics of Data Warehouse ❚ A data warehouse is a ❙ subject-oriented ❙ integrated ❙ time-varying ❙ non-volatile
collection of data that is used primarily in organizational decision making.
Data Warehouse Architecture
[Figure: architecture diagram. Labels: Relational Databases, ERP Systems, Extraction, Cleansing, Optimized Loader, Data Warehouse Engine, Metadata Repository, Analyze, Query.]
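The extraction, cleansing, loading and query stages named in the architecture slide can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the source rows, the field names and the `clean` and `load` helpers are hypothetical, and an in-memory SQLite database stands in for the warehouse engine.

```python
import sqlite3

# Hypothetical rows extracted from two operational sources.
crm_rows = [{"customer": " Alice ", "amount": "120.50"},
            {"customer": "BOB", "amount": "80"}]
erp_rows = [{"customer": "alice", "amount": "30.00"},
            {"customer": "Carol", "amount": None}]  # missing amount

def clean(row):
    """Cleansing: normalise customer names and drop rows with no amount."""
    if row["amount"] is None:
        return None
    return (row["customer"].strip().title(), float(row["amount"]))

def load(conn, rows):
    """Loading: bulk-insert cleansed rows into the warehouse table."""
    conn.executemany("INSERT INTO sales(customer, amount) VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for the warehouse
conn.execute("CREATE TABLE sales(customer TEXT, amount REAL)")
cleaned = [r for r in map(clean, crm_rows + erp_rows) if r is not None]
load(conn, cleaned)

# Query/analyze: total sales per customer across both sources.
totals = dict(conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer"))
print(totals)
```

Resolving the inconsistent spellings of the same customer before loading is what lets the final query treat both sources as one integrated data set.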
Application Areas
Industry                  Application
Finance                   Credit Card Analysis
Insurance                 Claims, Fraud Analysis
Telecommunication         Call record analysis
Transport                 Logistics management
Consumer goods            Promotion analysis
Data Service providers    Value added data
Utilities                 Power usage analysis
What makes data mining possible? ❚ Advances in the following areas are making data mining deployable: ❙ data warehousing ❙ better and more data (i.e., operational, behavioral, and demographic) ❙ the emergence of easily deployed data mining tools ❙ the advent of new data mining techniques.
Benefits of Data Warehouse ❙ A data warehouse provides a common data model for all data of interest regardless of the data's source. ❙ Prior to loading data into the data warehouse, inconsistencies are identified and resolved. ❙ Because they are separate from operational systems, data warehouses provide retrieval of data without slowing down operational systems. ❙ Data warehouses can work in conjunction with and, hence, enhance the value of operational business applications, notably customer relationship management (CRM) systems.
Disadvantages of Data Warehouses ❙ Data warehouses are not the optimal environment for unstructured data. ❙ There is an element of latency in data warehouse data. ❙ Data warehouses can have high costs. Maintenance costs are high. ❙ Data warehouses can get outdated relatively quickly.
So, what’s different b/w OLTP & DW?
OLTP vs Data Warehouse ❚ OLTP ❙ Application Oriented ❙ Used to run business ❙ Detailed data ❙ Current up to date ❙ Isolated Data
❚ Warehouse ❙ Subject Oriented ❙ Used to analyze business ❙ Summarized and refined ❙ Snapshot data ❙ Integrated Data
OLTP V/S Data Warehouse ❚ OLTP ❙ Performance Sensitive ❙ Few Records accessed at a time (tens) ❙ Read/Update Access ❙ No data redundancy ❙ Database Size 100MB -100 GB
❚ Data Warehouse ❙ Performance relaxed ❙ Large volumes accessed at a time(millions) ❙ Mostly Read (Batch Update) ❙ Redundancy present ❙ Database Size 100 GB - few terabytes
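The difference in access pattern can be made concrete with a short Python sketch. The `orders` table and its values are hypothetical, and a single in-memory SQLite database is used for both styles purely for illustration; in practice the two workloads would run on separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders(id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders(region, amount) VALUES (?, ?)",
    [("north", 10.0), ("south", 25.0), ("north", 15.0), ("south", 5.0)])

# OLTP-style access: read and update a single record by its key.
conn.execute("UPDATE orders SET amount = amount + 1.0 WHERE id = ?", (2,))
one_order = conn.execute(
    "SELECT amount FROM orders WHERE id = ?", (2,)).fetchone()[0]

# Warehouse-style access: a read-only scan that summarises many records.
by_region = dict(conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region"))

print(one_order, by_region)
```

The first query touches one row and must be fast; the second reads every row and returns a summary, which is why warehouses tolerate relaxed performance and batch updates.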
To summarize DW & OLTP... ❚ OLTP Systems are used to “run” a business
❚ The Data Warehouse helps to “optimize” the business
What Is Data Mining? ❚ The objective of data mining is to extract valuable information from your data, to discover the “hidden gold.” ❚ Note that you do not need a data warehouse to successfully use data mining; all you need is data. ❚ On-Line Analytical Processing (OLAP): a DM tool.
Data Mining Models (according to IBM) ❚ Verification Model
❚ Discovery Model
Steps for Data Mining ❚ Identify ❙ Find sales relationships between specific products or services ❙ Identify specific purchasing patterns over time ❙ Identify potential types of customers ❙ Find product sales trends.
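As one illustration of "sales relationships between specific products", the sketch below counts how often pairs of products are bought together. The baskets are made-up data, and pair support (the share of baskets containing both items) is only one possible measure of a relationship, chosen here as an assumption.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of products per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
]

# Count every unordered product pair that appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support: fraction of all baskets that contain both products of the pair.
support = {pair: n / len(baskets) for pair, n in pair_counts.items()}
strongest = max(support, key=support.get)
print(strongest, support[strongest])
```

Sorting each basket before forming pairs ensures that ("bread", "milk") and ("milk", "bread") are counted as the same relationship.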
❚ Select ❙ Are the data adequate to describe the phenomena the data mining analysis is attempting to model? ❙ Can you enhance internal customer records with external lifestyle and demographic data? ❙ Are the data stable, that is, will the mined attributes be the same after the analysis? ❙ If you are merging databases, can you find a common field for linking them? ❙ How current and relevant are the data to the business goal?
❚ Prepare ❙ Establish strategies for handling missing data, extraneous noise, and outliers ❙ Identify redundant variables in the dataset and decide which fields to exclude ❙ Decide on a log or square transformation, if necessary ❙ Visually inspect the dataset to get a feel for the database ❙ Determine the distribution frequencies of the data
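The preparation steps above can be sketched with the Python standard library. Everything specific here is an assumption made for illustration: the income values are invented, missing values are filled with the median, outliers are flagged with the common 1.5 x IQR rule, and a base-10 log is used as the transformation.

```python
import math
import statistics
from collections import Counter

# Hypothetical attribute with one missing value and one extreme value.
incomes = [32000.0, 41000.0, None, 38000.0, 35000.0, 900000.0, 36000.0]

# Missing data: one simple strategy is to fill gaps with the observed median.
observed = [x for x in incomes if x is not None]
median = statistics.median(observed)
filled = [median if x is None else x for x in incomes]

# Outliers: flag values more than 1.5 * IQR outside the quartiles.
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
outliers = [x for x in filled if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

# Log transformation: compresses the long right tail of the distribution.
logged = [math.log10(x) for x in filled]

# Distribution frequencies: count values per order of magnitude.
freq = Counter(int(v) for v in logged)

print(median, outliers, dict(freq))
```

Whether to drop, cap or keep a flagged outlier is a business decision; the sketch only identifies it.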
❚ Audit the data ❙ What is the ratio of categorical/binary attributes in the database? ❙ What is the nature and structure of the database? ❙ What is the overall condition of the dataset?
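The first audit question, the ratio of categorical/binary attributes, can be answered mechanically. In this sketch the records and the `attribute_kind` helper are hypothetical, and the classification rule (booleans are binary, strings are categorical, everything else is numeric) is a simplifying assumption.

```python
# Hypothetical customer records to audit.
rows = [
    {"age": 34, "gender": "F", "owns_car": True, "region": "north", "spend": 120.5},
    {"age": 51, "gender": "M", "owns_car": False, "region": "south", "spend": 80.0},
    {"age": 27, "gender": "F", "owns_car": True, "region": "east", "spend": 45.25},
]

def attribute_kind(values):
    """Classify a column by the Python type of its values (simplified rule)."""
    if all(isinstance(v, bool) for v in values):
        return "binary"
    if all(isinstance(v, str) for v in values):
        return "categorical"
    return "numeric"

kinds = {col: attribute_kind([r[col] for r in rows]) for col in rows[0]}
non_numeric = sum(kind != "numeric" for kind in kinds.values())
ratio = non_numeric / len(kinds)

print(kinds)
print(f"categorical/binary share: {ratio:.0%}")
```

A high share suggests the data set is "heavily categorical", which in turn influences the choice of tool.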
❚ Select the Tool ❙ Is the data set heavily categorical? ❙ What platforms do your candidate tools support? ❙ Are the candidate tools ODBC-compliant? ❙ What data format can the tools import?
❚ Format the Solution ❙ What is the optimum format of the solution: decision tree, rules, C code, SQL syntax? ❙ What are the available format options? ❙ What is the goal of the solution?
❚ Construct the Model ❙ Are error rates at acceptable levels? Can you improve them? ❙ What extraneous attributes did you find? Can you purge them? ❙ Is additional data or a different methodology necessary? ❙ Will you have to train and test a new data set?
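Checking error rates usually means comparing the error on the data used to build the model with the error on held-out data. The sketch below does this for a deliberately simple model, a single spending threshold; the data, the threshold model and the `error_rate` helper are all hypothetical stand-ins for whatever the chosen tool produces.

```python
# Hypothetical labelled data: (monthly_spend, responded_to_offer).
train = [(10, 0), (20, 0), (30, 0), (60, 1), (70, 1), (80, 1)]
test = [(15, 0), (25, 0), (55, 1), (65, 1), (40, 1)]

def error_rate(threshold, data):
    """Share of records where 'spend >= threshold' disagrees with the label."""
    wrong = sum((spend >= threshold) != bool(label) for spend, label in data)
    return wrong / len(data)

# "Training": pick the candidate threshold with the lowest training error.
candidates = sorted(spend for spend, _ in train)
best = min(candidates, key=lambda t: error_rate(t, train))

train_err = error_rate(best, train)
test_err = error_rate(best, test)
print(best, train_err, test_err)
```

A model that is perfect on the training set but noticeably worse on the test set, as here, is a signal that more data or a different methodology may be needed.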
❚ Validate the Findings ❙ Do the findings make sense? ❙ Do you have to return to any prior steps to improve results? ❙ Can you use other data mining tools to replicate the findings?
❚ Deliver The Findings ❙ Will additional data improve the analysis? ❙ What strategic insight did you discover and how is it applicable? ❙ What proposals can result from the data mining analysis? ❙ Do the findings meet the business objective?
❚ Integrate The Solution ❙ SQL syntax for distribution to end-users ❙ C code incorporated into a production system ❙ Rules integrated into a decision support system.
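Delivering a mined rule as SQL syntax might look like the following sketch. The rule structure, the `rule_to_sql` helper and the `customers` table are hypothetical; the string formatting is for illustration with trusted numeric values only and is not a safe way to build SQL from user input.

```python
import sqlite3

# Hypothetical mined rule: conditions are (column, operator, value) triples.
rule = {"label": "likely_buyer",
        "conditions": [("age", "<", 30), ("spend", ">=", 100)]}

def rule_to_sql(rule, table="customers"):
    """Render a rule as a SELECT statement (illustrative, numeric values only)."""
    where = " AND ".join(
        f"{col} {op} {val}" for col, op, val in rule["conditions"])
    return f"SELECT id, '{rule['label']}' AS segment FROM {table} WHERE {where}"

sql = rule_to_sql(rule)
print(sql)

# Check that the generated statement runs against a small example table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers(id INTEGER, age INTEGER, spend REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, 25, 150.0), (2, 45, 300.0), (3, 28, 40.0)])
matches = conn.execute(sql).fetchall()
print(matches)
```

End users can then run the generated statement directly, without needing the data mining tool that produced the rule.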
Data Mining Algorithms Some of the DM algorithms are ❚ Neural Networks ❚ Decision Trees
Neural Network
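The slide gives no details, so the following is only a minimal sketch of the smallest possible neural network, a single perceptron, trained with the classic perceptron learning rule. The AND-gate training data, the learning rate and the number of epochs are assumptions chosen to keep the example tiny; real data mining tools use multi-layer networks.

```python
def step(z):
    """Threshold activation: the neuron fires only for positive input."""
    return 1 if z > 0 else 0

def train_perceptron(samples, epochs=20, lr=1):
    """Perceptron rule: nudge weights by the prediction error on each sample."""
    w = [0, 0]
    b = 0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            error = target - step(w[0] * x1 + w[1] * x2 + b)
            w[0] += lr * error * x1
            w[1] += lr * error * x2
            b += lr * error
    return w, b

# Hypothetical training set: the logical AND of two binary inputs.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

w, b = train_perceptron(and_gate)
preds = [step(w[0] * x1 + w[1] * x2 + b) for (x1, x2), _ in and_gate]
print(w, b, preds)
```

Because AND is linearly separable, the rule settles on weights that classify all four inputs correctly well within the 20 epochs.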
Decision Trees
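The slide gives no details, so this is only a sketch of one core step of decision tree induction: choosing the attribute to split on. It uses entropy-based information gain, which is one common criterion and an assumption here; the records and attribute names are invented.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in target entropy obtained by splitting on one attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical customer records; "buys" is the class to predict.
rows = [
    {"member": "yes", "region": "north", "buys": "yes"},
    {"member": "yes", "region": "south", "buys": "yes"},
    {"member": "no", "region": "north", "buys": "no"},
    {"member": "no", "region": "south", "buys": "no"},
]

gains = {a: information_gain(rows, a, "buys") for a in ("member", "region")}
best = max(gains, key=gains.get)
print(gains, best)
```

A full tree builder would apply the same choice recursively to each subset until the subsets are pure or no attributes remain.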
Thank you !!!