Dwdm_unit-2.pdf (uploaded October 2019)
UNIT-2 BCA VI SEM

Ms. Sushma Malik, Asst. Prof., IINTM
Ms. Deepti Gupta, Asst. Prof., IINTM

Contents
• Data Warehouse (DW)
• Features of DW
• OLTP and OLAP: Difference
• Data Warehouse: Architecture, Models, Applications, Advantages and Disadvantages
• ETL
• Metadata
• Starnet Query Model
• Data Cube: Multidimensional Model — Introduction; Schemas (Star Schema, Snowflake Schema, Fact Constellation)
• Data Warehouse Usage
• Data Warehouse Implementation
• Data Cube Materialization
• OLAP Operations: Roll-up, Drill-down, Slice, Dice, Pivot
• OLAP Models
• OLAM
• Data Mining Architecture
• Mining Frequent Patterns
• Association Rule Mining: Apriori Algorithm, FP-Growth

2/16/2019

What is a Data Warehouse?
• The term "data warehouse" was first coined by Bill Inmon in 1990.
• According to Inmon, a data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data.
• A data warehouse is a data repository that is maintained separately from an organization's operational databases. Data warehouse systems allow for the integration of a variety of application systems. They support information processing by providing a solid platform of consolidated historic data for analysis.

What is a Data Warehouse?
• Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions.
• The data warehouse is the core of the BI system, which is built for data analysis and reporting.
• Data warehouses generalize and consolidate data in multidimensional space. The construction of data warehouses involves data cleaning, data integration, and data transformation, and can be viewed as an important preprocessing step for data mining.

• Moreover, data warehouses provide online analytical processing (OLAP) tools for the interactive analysis of multidimensional data at varied granularities, which facilitates effective data generalization and data mining.
• Many other data mining functions, such as association, classification, prediction, and clustering, can be integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction. Hence, the data warehouse has become an increasingly important platform for data analysis and OLAP, and will provide an effective platform for data mining.

Data Warehouse Features (previous year paper)
• Subject-Oriented − A data warehouse is organized around major subjects such as customer, supplier, product, and sales. Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers. Hence, data warehouses typically provide a simple and concise view of particular subject issues by excluding data that are not useful in the decision support process.
• Integrated − A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and online transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on.

Data Warehouse Features
• Time-Variant − Data are stored to provide information from a historic perspective (e.g., the past 5–10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, a time element.
• Non-volatile − Non-volatile means that previous data are not erased when new data are added. A data warehouse is kept separate from the operational database, and because of this separation it does not require transaction processing, recovery, or concurrency control mechanisms. It usually requires only two operations in data accessing: initial loading of data and access of data.


How are organizations using the information from data warehouses?
Many organizations use this information to support business decision-making activities, including:
(1) increasing customer focus, which includes the analysis of customer buying patterns (such as buying preference, buying time, budget cycles, and appetites for spending);
(2) repositioning products and managing product portfolios by comparing the performance of sales by quarter, by year, and by geographic region, in order to fine-tune production strategies;
(3) analyzing operations and looking for sources of profit;
(4) managing customer relationships, making environmental corrections, and managing the cost of corporate assets.

Data Warehouse − An Update-Driven Approach
• Data warehousing employs an update-driven approach in which information from multiple heterogeneous sources is integrated in advance and stored in a warehouse for direct querying and analysis.
• Unlike online transaction processing databases, data warehouses do not contain the most current information. However, a data warehouse brings high performance to the integrated heterogeneous database system because data are copied, preprocessed, integrated, annotated, summarized, and restructured into one semantic data store.
• Furthermore, query processing in data warehouses does not interfere with the processing at local sources. Moreover, data warehouses can store and integrate historic information and support complex multidimensional queries.

OLTP and OLAP
• Because most people are familiar with commercial relational database systems, it is easiest to understand what a data warehouse is by comparing these two kinds of systems.
• The major task of online operational database systems is to perform online transaction and query processing. These systems are called online transaction processing (OLTP) systems. They cover most of the day-to-day operations of an organization, such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting.
• Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data analysis and decision making. Such systems can organize and present data in various formats in order to accommodate the diverse needs of different users. These systems are known as online analytical processing (OLAP) systems.

Major Distinguishing Features Between OLTP and OLAP (previous year paper)
The major distinguishing features of OLTP and OLAP are summarized as follows:
• Users and system orientation: An OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients, and information technology professionals. An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives, and analysts.
• Data contents: An OLTP system manages current data that, typically, are too detailed to be easily used for decision making. An OLAP system manages large amounts of historic data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity. These features make the data easier to use for informed decision making.

• Database design: An OLTP system usually adopts an entity-relationship (ER) data model and an application-oriented database design. An OLAP system typically adopts either a star or a snowflake model and a subject-oriented database design.
• View: An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historic data or data in different organizations. In contrast, an OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. OLAP systems also deal with information that originates from different organizations, integrating information from many data stores. Because of their huge volume, OLAP data are stored on multiple storage media.

• Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms. Accesses to OLAP systems, however, are mostly read-only operations (because most data warehouses store historic rather than up-to-date information), although many could be complex queries.


Data Warehousing: A Multitiered Architecture (previous year paper) (figure)

Data warehouses often adopt a three-tier architecture:
1. The bottom tier is a warehouse database server, almost always a relational database system. Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources (e.g., customer profile information provided by external consultants). These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to update the data warehouse. The data are extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server. Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding Database) by Microsoft, and JDBC (Java Database Connectivity). This tier also contains a metadata repository, which stores information about the data warehouse and its contents.
2. The middle tier is an OLAP server that is typically implemented using either (1) a relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations); or (2) a multidimensional OLAP (MOLAP) model (i.e., a special-purpose server that directly implements multidimensional data and operations).
3. The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).

Data Warehouse Models
• Enterprise warehouse: An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one or more operational systems or external information providers, and is cross-functional in scope. It typically contains detailed as well as summarized data, and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond. An enterprise data warehouse may be implemented on traditional mainframes, computer superservers, or parallel architecture platforms. It requires extensive business modeling and may take years to design and build.

• Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific selected subjects. For example, a marketing data mart may confine its subjects to customer, item, and sales. The data contained in data marts tend to be summarized. Data marts are usually implemented on low-cost departmental servers that are Unix/Linux or Windows based. The implementation cycle of a data mart is more likely to be measured in weeks than in months or years. However, it may involve complex integration in the long run if its design and planning were not enterprise-wide.
• Depending on the source of data, data marts can be categorized as independent or dependent.
  • Independent data marts are sourced from data captured from one or more operational systems or external information providers, or from data generated locally within a particular department or geographic area.
  • Dependent data marts are sourced directly from enterprise data warehouses.

• Virtual warehouse: A virtual warehouse is a set of views over operational databases.
• For efficient query processing, only some of the possible summary views may be materialized.
• A virtual warehouse is easy to build but requires excess capacity on operational database servers.

Difference Between Data Warehouse and Data Mart (comparison table figure)

Data Warehouse Applications
A data warehouse helps business executives to organize, analyze, and use their data for decision making. It serves as a sole part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in the following fields:
• Financial services
• Banking services
• Consumer goods
• Retail sectors
• Controlled manufacturing

How Does a Data Warehouse Work?
• By merging all of this information in one place, an organization can analyze its customers more holistically.
• This helps to ensure that it has considered all the information available.
• Data warehousing makes data mining possible. Data mining is looking for patterns in the data that may lead to higher sales and profits.

Advantages of a Data Warehouse
• A data warehouse allows business users to quickly access critical data from several sources in one place.
• It provides consistent information on various cross-functional activities.
• It supports ad hoc reporting and querying.
• It helps to integrate many sources of data, reducing stress on the production system.
• It helps to reduce total turnaround time for analysis and reporting.
• Restructuring and integration make the data easier for the user to use for reporting and analysis.
• Because critical data from a number of sources is available in a single place, it saves the user's time in retrieving data from multiple sources.
• A data warehouse stores a large amount of historical data, which helps users analyze different time periods and trends to make future predictions.

Disadvantages of a Data Warehouse
• Not an ideal option for unstructured data.
• Creation and implementation of a data warehouse is surely a time-consuming affair.
• A data warehouse can become outdated relatively quickly.
• It is difficult to make changes in data types and ranges, data source schema, indexes, and queries.
• The data warehouse may seem easy, but actually it is too complex for average users.
• Sometimes warehouse users will develop different business rules.
• Organizations need to spend a lot of resources on training and implementation.

ETL − Extraction, Transformation and Loading (previous year paper)
Data warehouse systems use back-end tools and utilities to populate and refresh their data. These tools and utilities include the following functions:
• Data extraction, which typically gathers data from multiple, heterogeneous, and external sources.
• Data cleaning, which detects errors in the data and rectifies them when possible.
• Data transformation, which converts data from legacy or host format to warehouse format.
• Load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions.
• Refresh, which propagates the updates from the data sources to the warehouse.
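The ETL functions listed above can be sketched in a few lines of Python. The source records, field names, and cleaning rules below are illustrative assumptions, not taken from the slides:

```python
# Minimal ETL sketch (illustrative only): extract rows from two
# hypothetical sources, clean and transform them into a unified
# format, then "load" them into an in-memory warehouse summary.

def extract(sources):
    """Extraction: gather raw records from multiple heterogeneous sources."""
    for source in sources:
        yield from source

def clean(record):
    """Cleaning/transformation: strip whitespace, unify case, drop bad rows."""
    name = record.get("item", "").strip()
    if not name or record.get("amount") is None:
        return None  # unrecoverable error -> discard the record
    return {"item": name.lower(), "amount": float(record["amount"])}

def load(records, warehouse):
    """Load: consolidate and summarize amounts per item (a simple view)."""
    for rec in records:
        warehouse[rec["item"]] = warehouse.get(rec["item"], 0.0) + rec["amount"]
    return warehouse

source_a = [{"item": " TV ", "amount": "300"}, {"item": "Radio", "amount": "50"}]
source_b = [{"item": "tv", "amount": "200"}, {"item": "", "amount": "10"}]  # bad row

warehouse = {}
cleaned = (c for c in (clean(r) for r in extract([source_a, source_b])) if c)
load(cleaned, warehouse)
print(warehouse)  # {'tv': 500.0, 'radio': 50.0}
```

A production refresh step would rerun this pipeline incrementally against the changed source rows rather than reloading everything.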

Metadata Repository (previous year paper)
• Metadata are data about data. When used in a data warehouse, metadata are the data that define warehouse objects. The metadata repository lies within the bottom tier of the data warehousing architecture. Metadata are created for the data names and definitions of the given warehouse.

A metadata repository should contain the following:
• A description of the data warehouse structure, which includes the warehouse schema, views, dimensions, hierarchies, and derived data definitions, as well as data mart locations and contents.
• Operational metadata, which include data lineage (the history of migrated data and the sequence of transformations applied to it), currency of data (active, archived, or purged), and monitoring information (warehouse usage statistics, error reports, and audit trails).
• The algorithms used for summarization, which include measure and dimension definition algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and predefined queries and reports.

• Mapping from the operational environment to the data warehouse, which includes source databases and their contents, gateway descriptions, data partitions, data extraction, cleaning, and transformation rules and defaults, data refresh and purging rules, and security (user authorization and access control).
• Data related to system performance, which include indices and profiles that improve data access and retrieval performance, in addition to rules for the timing and scheduling of refresh, update, and replication cycles.
• Business metadata, which include business terms and definitions, data ownership information, and charging policies.

Role of Metadata
• Metadata play a very different role than other data warehouse data and are important for many reasons.
• For example, metadata are used as a directory to help the decision support system analyst locate the contents of the data warehouse, and as a guide to the data mapping when data are transformed from the operational environment to the data warehouse environment.
• Metadata also serve as a guide to the algorithms used for summarization between the current detailed data and the lightly summarized data, and between the lightly summarized data and the highly summarized data.
• Metadata should be stored and managed persistently (i.e., on disk).

Data Cube: A Multidimensional Data Model
• "What is a data cube?" A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
• In general terms, dimensions are the perspectives or entities with respect to which an organization wants to keep records.
• For example, AllElectronics may create a sales data warehouse in order to keep records of the store's sales with respect to the dimensions time, item, branch, and location. These dimensions allow the store to keep track of things like the monthly sales of items and the branches and locations at which the items were sold.
• Each dimension may have a table associated with it, called a dimension table, which further describes the dimension. For example, a dimension table for item may contain the attributes item_name, brand, and type. Dimension tables can be specified by users or experts, or automatically generated and adjusted based on data distributions.

A Multidimensional Data Model
• A multidimensional data model is typically organized around a central theme, such as sales. This theme is represented by a fact table.
• Facts are numeric measures. Think of them as the quantities by which we want to analyze relationships between dimensions.
• Examples of facts for a sales data warehouse include dollars_sold (sales amount in dollars), units_sold (number of units sold), and amount_budgeted.
• The fact table contains the names of the facts, or measures, as well as keys to each of the related dimension tables.

A Multidimensional Data Model
• The multidimensional data model stores data in the form of a data cube. Mostly, data warehousing supports two- or three-dimensional cubes.
• A data cube allows data to be viewed in multiple dimensions. Dimensions are entities with respect to which an organization wants to keep records. For example, in a store sales record, dimensions allow the store to keep track of things like monthly sales of items by branch and location.
• A multidimensional database helps to provide data-related answers to complex business queries quickly and accurately.

A Multidimensional Data Model (example figures)

The schemas for the multidimensional data model are:
• Star Schema
• Snowflake Schema
• Fact Constellation Schema

Star Schema for the Multidimensional Model
• The simplest data warehouse schema is the star schema, because its structure resembles a star.
• A star schema consists of data in the form of facts and dimensions: the fact table sits at the center of the star, and the points of the star are the dimension tables.
• In a star schema, the fact table contains a large amount of data, and the denormalized dimension tables contain redundancy.
• Each dimension table is joined with the fact table using a primary key/foreign key relationship.
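A star schema of this kind can be sketched with SQLite via Python's built-in sqlite3 module. All table names, columns, and figures below are illustrative assumptions, not the slides' example:

```python
# Star schema sketch: one central fact table joined to two dimension
# tables by foreign keys (names and data are illustrative).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE fact_sales (
    item_key     INTEGER REFERENCES dim_item(item_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
INSERT INTO dim_item VALUES (1, 'TV', 'Acme'), (2, 'Radio', 'Acme');
INSERT INTO dim_location VALUES (1, 'New York', 'USA'), (2, 'Chicago', 'USA');
INSERT INTO fact_sales VALUES (1, 1, 3, 900.0), (2, 1, 5, 250.0), (1, 2, 2, 600.0);
""")

# A typical star-join query: total dollars sold per city.
rows = con.execute("""
SELECT l.city, SUM(f.dollars_sold)
FROM fact_sales f JOIN dim_location l ON f.location_key = l.location_key
GROUP BY l.city ORDER BY l.city
""").fetchall()
print(rows)  # [('Chicago', 600.0), ('New York', 1150.0)]
```

Note that every fact row reaches any dimension attribute through a single join, which is what makes star-schema queries simple and fast.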

An Example: E-Commerce Website (star schema figure)

Dimension Tables (figures)

As can be observed from the previous slides, the dimension tables contain a lot of redundant data, so we need a schema in which the tables can be normalized to reduce redundancy.

Another Example of a Star Schema (figure)

Snowflake Schema for the Multidimensional Model
• The snowflake schema is more complex than the star schema because the dimension tables of the snowflake are normalized.
• The snowflake schema is represented by a centralized fact table connected to multiple dimension tables, and these dimension tables can be normalized into additional dimension tables.
• The major difference between the snowflake and star schema models is that the dimension tables of the snowflake model are normalized to reduce redundancy.

Snowflake Schema for the Same Example (figure)

Normalized Dimension Tables (figures)

Another Example of a Snowflake Schema (figure)

Difference Between the Snowflake and Star Schema Models
• The major difference between the snowflake and star schema models is that the dimension tables of the snowflake model may be kept in normalized form to reduce redundancy.
• Such tables are easy to maintain and save storage space. However, this space saving is negligible in comparison with the typical magnitude of the fact table.
• Furthermore, the snowflake structure can reduce the effectiveness of browsing, since more joins are needed to execute a query.

Fact Constellation Schema for the Multidimensional Model
• A fact constellation can have multiple fact tables that share many dimension tables.
• This type of schema can be viewed as a collection of stars (or snowflakes), and hence is called a galaxy schema or a fact constellation.
• The main disadvantage of the fact constellation schema is its more complicated design.

Fact Constellation (figures)

Data Warehouse Usage for Information Processing
• Data warehouses and data marts are used in a wide range of applications. Business executives use the data in data warehouses and data marts to perform data analysis and make strategic decisions.
• Data warehouses are used extensively in banking and financial services, consumer goods and retail distribution sectors, and controlled manufacturing, such as demand-based production.

There are three kinds of data warehouse applications: information processing, analytical processing, and data mining.
• Information processing supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs. A current trend in data warehouse information processing is to construct low-cost web-based accessing tools that are then integrated with web browsers.
• Analytical processing supports basic OLAP operations, including slice-and-dice, drill-down, roll-up, and pivoting. It generally operates on historic data in both summarized and detailed forms. The major strength of online analytical processing over information processing is the multidimensional analysis of data warehouse data.

• Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.

OLAP vs. Data Mining
• The functionalities of OLAP and data mining can be viewed as disjoint:
• OLAP is a data summarization/aggregation tool that helps simplify data analysis, while data mining allows the automated discovery of implicit patterns and interesting knowledge hidden in large amounts of data.
• OLAP tools are targeted toward simplifying and supporting interactive data analysis, whereas the goal of data mining tools is to automate as much of the process as possible, while still allowing users to guide the process.

Data Warehouse Implementation
• A data cube is a lattice of cuboids. Suppose that you want to create a data cube for AllElectronics sales that contains the following: city, item, year, and sales_in_dollars. You want to be able to analyze the data with queries such as the following:
  • "Compute the sum of sales, grouping by city and item."
  • "Compute the sum of sales, grouping by city."
  • "Compute the sum of sales, grouping by item."
• What is the total number of cuboids, or group-by's, that can be computed for this data cube? Taking the three attributes city, item, and year as the dimensions and sales_in_dollars as the measure, the total number of cuboids is 2^3 = 8, including the apex cuboid, in which the group-by is empty.
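The cuboid count can be checked by enumerating the group-by's directly; a minimal sketch:

```python
# Enumerating the lattice of cuboids for the AllElectronics example:
# every subset of the dimensions {city, item, year} defines one
# group-by, so an n-dimensional cube (with no concept hierarchies)
# has 2**n cuboids, including the apex cuboid (empty group-by).
from itertools import combinations

dimensions = ("city", "item", "year")

cuboids = [subset
           for r in range(len(dimensions) + 1)
           for subset in combinations(dimensions, r)]

for c in cuboids:
    print(c if c else "() -- the apex cuboid (empty group-by)")
print("total cuboids:", len(cuboids))  # total cuboids: 8
```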

The compute cube Operator and the Curse of Dimensionality
One approach to cube computation extends SQL to include a compute cube operator. First the cube is defined using the SQL statement define cube, and then the compute cube operator computes aggregates over all subsets of the dimensions specified in the operation.

Curse of Dimensionality
• Online analytical processing may need to access different cuboids for different queries. Therefore, it may seem like a good idea to compute in advance all or at least some of the cuboids in a data cube. Precomputation leads to fast response time and avoids some redundant computation.
• A major challenge related to this precomputation, however, is that the required storage space may explode if all the cuboids in a data cube are precomputed, especially when the cube has many dimensions. The storage requirements are even more excessive when many of the dimensions have associated concept hierarchies, each with multiple levels: for an n-dimensional cube in which dimension i has L_i levels (excluding the top level "all"), the total number of cuboids that can be generated is T = (L_1 + 1) × (L_2 + 1) × … × (L_n + 1).
• This problem is referred to as the curse of dimensionality.

Data Cube Materialization
There are three choices for data cube materialization given a base cuboid:
1. No materialization: Do not precompute any of the "nonbase" cuboids. This leads to computing expensive multidimensional aggregates on the fly, which can be extremely slow.
2. Full materialization: Precompute all of the cuboids. The resulting lattice of computed cuboids is referred to as the full cube. This choice typically requires huge amounts of memory space in order to store all of the precomputed cuboids.
3. Partial materialization: Selectively compute a proper subset of the whole set of possible cuboids. Alternatively, we may compute a subset of the cube that contains only those cells satisfying some user-specified criterion, such as where the tuple count of each cell is above some threshold. We use the term subcube for this latter case, where only some of the cells may be precomputed for various cuboids. Partial materialization represents an interesting trade-off between storage space and response time.

Data Warehouse Applications/Usage
The most common sectors where data warehouses are used:
• Airline: In the airline system, it is used for operational purposes such as crew assignment, analysis of route profitability, frequent-flyer program promotions, etc.
• Banking: It is widely used in the banking sector to manage the resources available on desk effectively. A few banks also use it for market research and for performance analysis of products and operations.
• Healthcare: The healthcare sector also uses data warehouses to strategize and predict outcomes, generate patient treatment reports, and share data with tie-in insurance companies, medical aid services, etc.
• Public sector: In the public sector, the data warehouse is used for intelligence gathering. It helps government agencies to maintain and analyze tax records and health policy records for every individual.
• Investment and insurance sector: In this sector, warehouses are primarily used to analyze data patterns and customer trends, and to track market movements.
• Retail chain: In retail chains, the data warehouse is widely used for distribution and marketing. It also helps to track items, customer buying patterns, and promotions, and is used for determining pricing policy.
• Telecommunication: A data warehouse is used in this sector for product promotions, sales decisions, and distribution decisions.
• Hospitality industry: This industry utilizes warehouse services to design and estimate advertising and promotion campaigns targeting clients based on their feedback and travel patterns.

Online Analytical Processing (OLAP) − Definition
• On-Line Analytical Processing (OLAP) is a category of software technology that enables analysts, managers, and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.

• Analysts frequently need to group, aggregate, and join data. These operations in relational databases are resource intensive. With OLAP, data can be pre-calculated and pre-aggregated, making analysis faster.
• OLAP databases are divided into one or more cubes. The cubes are designed in such a way that creating and viewing reports becomes easy.


• At the core of the OLAP concept is the OLAP cube, a data structure optimized for very quick data analysis.
• The OLAP cube consists of numeric facts, called measures, which are categorized by dimensions. The OLAP cube is also called a hypercube.
• Usually, data operations and analysis are performed using a simple spreadsheet, where data values are arranged in row and column format. This is ideal for two-dimensional data. However, OLAP involves multidimensional data, usually obtained from different and unrelated sources, so a spreadsheet is not an optimal option. The cube can store and analyze multidimensional data in a logical and orderly manner.

How Does OLAP Work?
• A data warehouse extracts information from multiple data sources and formats, such as text files, Excel sheets, multimedia files, etc.
• The extracted data is cleaned and transformed, then loaded into an OLAP server (or OLAP cube), where information is pre-calculated in advance for further analysis.

Basic Analytical Operations of OLAP
The four types of analytical operations in OLAP are:
• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)

Roll-up
• Roll-up is also known as "consolidation" or "aggregation." The roll-up operation can be performed in two ways:
  • reducing dimensions;
  • climbing up a concept hierarchy. A concept hierarchy is a system of grouping things based on their order or level.

Roll-up
• In this example, the cities New York and Chicago are rolled up into the country USA.
• The sales figures of New York and Chicago are 440 and 1560 respectively; they become 2000 after roll-up.
• In this aggregation process, the data in the location hierarchy move up from city to country.
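The roll-up step can be sketched in plain Python. The New York and Chicago figures follow the example above; the Toronto/Canada row is an added assumption to show a second group:

```python
# Roll-up: climbing the location concept hierarchy from city to country.
city_sales = {"New York": 440, "Chicago": 1560, "Toronto": 300}
country_of = {"New York": "USA", "Chicago": "USA", "Toronto": "Canada"}

def roll_up(sales, hierarchy):
    """Aggregate measures one level up the concept hierarchy."""
    rolled = {}
    for city, amount in sales.items():
        parent = hierarchy[city]
        rolled[parent] = rolled.get(parent, 0) + amount
    return rolled

print(roll_up(city_sales, country_of))  # {'USA': 2000, 'Canada': 300}
```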

Drill-down  In drill-down data is fragmented into smaller parts. It

is the opposite of the rollup process.  It can be done via moving down the concept hierarchy  Increasing a dimension


Drill Down  Quater Q1 is drilled down to months January,

February, and March. Corresponding sales are also registered.  In this example, dimension months are added.
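The same expansion can be sketched in Python. Only the month-to-quarter mapping mirrors the example; the month-level sales figures are hypothetical.

```python
# Sales stored at the finer month granularity (hypothetical figures);
# the month -> quarter mapping defines the time concept hierarchy.
month_sales = {"January": 120, "February": 100, "March": 180, "April": 150}
quarter_of = {"January": "Q1", "February": "Q1", "March": "Q1", "April": "Q2"}

def drill_down(quarter):
    """Expand one quarter cell into its constituent month cells."""
    return {m: s for m, s in month_sales.items() if quarter_of[m] == quarter}

print(drill_down("Q1"))  # {'January': 120, 'February': 100, 'March': 180}
```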


Slice  Here, one dimension is selected, and a new sub-cube is

created.  Dimension Time is Sliced with Q1 as the filter.  A new cube is created altogether.


Dice  This operation is similar to a slice. The difference in

dice is you select 2 or more dimensions that result in the creation of a sub-cube.
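A dice can be sketched the same way as the slice, but selecting on two dimensions at once (the cube values are hypothetical):

```python
# Same style of toy cube: (time, location, item) -> sales.
cube = {
    ("Q1", "USA", "Mobile"): 600,
    ("Q1", "India", "Modem"): 450,
    ("Q2", "USA", "Mobile"): 700,
    ("Q2", "India", "Modem"): 350,
}

def dice_cube(cube, times, locations):
    """Select on two dimensions at once, producing a smaller sub-cube."""
    return {k: v for k, v in cube.items()
            if k[0] in times and k[1] in locations}

print(dice_cube(cube, times={"Q1", "Q2"}, locations={"USA"}))
```

Unlike slice, all three dimensions survive in the result; the sub-cube is just restricted to the chosen member values.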


Pivot  In Pivot, you rotate the data axes to provide a

substitute presentation of data.
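Pivoting a 2-D view amounts to swapping the row and column axes. A minimal sketch with hypothetical figures:

```python
# A 2-D view stored as {(row_label, column_label): value}.
view = {("Q1", "Mobile"): 600, ("Q1", "Modem"): 260,
        ("Q2", "Mobile"): 700, ("Q2", "Modem"): 250}

def pivot(view):
    """Rotate the data axes: rows become columns and columns become rows."""
    return {(col, row): v for (row, col), v in view.items()}

print(pivot(view)[("Mobile", "Q1")])  # 600
```

The values are unchanged; only the presentation (which dimension runs along which axis) rotates.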


Types of OLAP Servers/Models
 MOLAP – Multidimensional OLAP
 ROLAP – Relational OLAP
 HOLAP – Hybrid OLAP
 DOLAP – Desktop OLAP
 Web OLAP


MOLAP  In the MOLAP model, data for analysis is stored in

specialized multidimensional databases. Large multidimensional arrays form the storage structures.  Precalculated and prefabricated multidimensional data cubes are stored in multidimensional databases.  The MOLAP engine in the application layer pushes a multidimensional view of the data from the MDDBs to the users.  The users who need summarized data enjoy fast response times from the preconsolidated data. 2/16/2019


ROLAP  ROLAP works with data that exist in a relational

database. Facts and dimension tables are stored as relational tables. It also allows multidimensional analysis of data and is the fastest growing OLAP.  True ROLAP has three distinct characteristics:  Supports all the basic OLAP features and functions

discussed earlier  Stores data in a relational form  Supports some form of aggregation


 Local hypercubing is a variation of ROLAP provided by vendors. This is how it works:
 1. The user issues a query.
 2. The results of the query get stored in a small, local, multidimensional database.
 3. The user performs analysis against this local database.
 4. If additional data is required to continue the analysis, the user issues another query and the analysis continues.


ROLAP
Drawbacks of the ROLAP model:
 Demand for higher resources: ROLAP needs high utilization of manpower, software, and hardware resources.
 Aggregate data limitations: ROLAP tools use SQL for all calculation of aggregate data. However, SQL is not well suited to handling complex computations.
 Slow query performance: Query performance in this model is slow when compared with MOLAP.


 HOLAP – Refers to Hybrid OLAP. This model attempts to combine the strengths and features of ROLAP and MOLAP.
 DOLAP – Refers to Desktop OLAP. It is meant to provide portability to OLAP users. In this methodology, multidimensional datasets are created and transferred to the desktop machine, requiring only the DOLAP software to exist on that machine.
 WEB OLAP – Refers to OLAP where OLAP data is accessible from a web browser.


Advantages of OLAP
 OLAP is a platform for all types of business needs, including planning, budgeting, reporting, and analysis.
 Information and calculations are consistent in an OLAP cube. This is a crucial benefit.
 Quickly create and analyze "what if" scenarios.
 Easily search the OLAP database for broad or specific terms.
 OLAP provides the building blocks for business modeling tools, data mining tools, and performance reporting tools.
 Allows users to slice and dice cube data by various dimensions, measures, and filters.
 It is good for analyzing time series.
 Finding clusters and outliers is easy with OLAP.
 It is a powerful online analytical processing and visualization system that provides fast response times.

Disadvantages of OLAP
 OLAP requires organizing data into a star or snowflake schema. These schemas are complicated to implement and administer.
 You cannot have a large number of dimensions in a single OLAP cube.
 Transactional data cannot be accessed with an OLAP system.
 Any modification in an OLAP cube needs a full update of the cube, which is a time-consuming process.

Selection Criteria for OLAP Tools
 Multidimensional representation of data.
 Aggregation, summarization, precalculation, and derivations.
 Formulas and complex calculations in an extensive library.
 Cross-dimensional calculations.
 Time intelligence such as year-to-date, current and past fiscal periods, moving averages, and moving totals.
 Pivoting, cross-tabs, drill-down, and roll-up along single or multiple dimensions.


OLAP Tools
 IBM Cognos
 SAP NetWeaver BW
 Essbase
 icCube
 Oracle Database OLAP option


Starnet Query Model (previous year paper)
 The querying of multidimensional databases can be based on a starnet model, which consists of radial lines emanating from a central point, where each line represents a concept hierarchy for a dimension. Each abstraction level in the hierarchy is called a footprint. These represent the granularities available for use by OLAP operations such as drill-down and roll-up.

 A starnet query model for the AllElectronics data warehouse is shown. This starnet consists of four radial lines, representing concept hierarchies for the dimensions location, customer, item, and time, respectively. Each line consists of footprints representing abstraction levels of the dimension. For example, the time line has four footprints: “day,” “month,” “quarter,” and “year.”

 A concept hierarchy may involve a single attribute (e.g., date for the time hierarchy) or several attributes (e.g., the concept hierarchy for location involves the attributes street, city, province or state, and country).
 In order to examine the item sales at AllElectronics, users can roll up along the time dimension from month to quarter, or, say, drill down along the location dimension from country to city.


Data Mining  Data Mining is defined as extracting information from   





huge sets of data. In other words, we can say that data mining is the procedure of mining knowledge from data. Data mining is the process of extracting the useful information, which is stored in the large database. It is a powerful tool, which is useful for organizations to retrieve the useful information from available data warehouses. Data mining can be applied to relational databases, objectoriented databases, data warehouses, structuredunstructured databases, etc. Data mining is used in numerous areas like banking, insurance companies, pharmaceutical companies etc.


Data Mining Architecture (previous year paper)


Components of Data Mining Architecture
 Data Sources
 Operational databases, the World Wide Web (WWW), text files, and other documents are the actual sources of data. You need large volumes of historical data for data mining to be successful. Organizations usually store data in databases or data warehouses. Data warehouses may contain one or more databases, text files, spreadsheets, or other kinds of information repositories.
 Different Processes
 The data needs to be cleaned, integrated, and selected before passing it to the database or data warehouse server. As the data comes from different sources and in different formats, it cannot be used directly for the data mining process, because it might not be complete and reliable. So first the data needs to be cleaned and integrated. Again, more data than required will be collected from different data sources, and only the data of interest needs to be selected and passed to the server. These processes are not as simple as they sound; a number of techniques may be performed on the data as part of cleaning, integration, and selection.


 Data Warehouse:
 A data warehouse is a place that stores information collected from multiple sources under a unified schema. Information stored in a data warehouse is critical to organizations for the process of decision-making.
 Data Mining Engine:
 The data mining engine is the core component of the data mining process; it consists of various modules that are used to perform tasks like clustering, classification, prediction, and correlation analysis.
 Pattern Evaluation:
 The patterns generated by the data mining engine are evaluated by the pattern evaluation module, which measures the interestingness of each pattern using a threshold value. It interacts with the data mining engine to focus the search towards interesting patterns. For example, support and confidence are used to judge whether the association rules derived from market basket analysis are interesting or not.

 Graphical User Interface
 The graphical user interface module communicates between the user and the data mining system. This module helps the user use the system easily and efficiently without knowing the real complexity behind the process. When the user specifies a query or a task, this module interacts with the data mining system and displays the result in an easily understandable manner.
 Knowledge Base
 The knowledge base is helpful throughout the whole data mining process. It might be useful for guiding the search or evaluating the interestingness of the result patterns. The knowledge base might even contain user beliefs and data from user experiences that can be useful in the process of data mining. The data mining engine might get inputs from the knowledge base to make the results more accurate and reliable. The pattern evaluation module interacts with the knowledge base on a regular basis to get inputs and also to update it.

OLAM = OLAP + DATA MINING
 Online analytical mining (OLAM) integrates online analytical processing with data mining, mining knowledge in multidimensional databases.
 Since data mining needs to work on preprocessed data stored in data warehouses, and OLAP cubes are computed over data warehouses, applying data mining techniques on OLAP cubes can make the best use of the available infrastructure rather than building everything from scratch.


 Effective data mining needs exploratory data analysis. A user will often want to traverse through a database, select portions of relevant data, analyze them at different granularities, and present knowledge/results in different forms.
 Multidimensional data mining provides facilities for mining on different subsets of data and at varying levels of abstraction—by drilling, pivoting, filtering, dicing, and slicing on a data cube and/or intermediate data mining results.
 This, together with data/knowledge visualization tools, greatly enhances the power and flexibility of data mining.


Mining frequent patterns
 Frequent patterns are patterns (e.g., itemsets, subsequences, or substructures) that appear frequently in a data set.
 For example, a set of items, such as milk and bread, that appear frequently together in a transaction data set is a frequent itemset.
 A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it occurs frequently in a shopping history database, is a (frequent) sequential pattern.
 A substructure can refer to different structural forms, such as subgraphs, subtrees, or sublattices, which may be combined with itemsets or subsequences.


 If a substructure occurs frequently, it is called a (frequent) structured pattern. Finding frequent patterns plays an essential role in mining associations, correlations, and many other interesting relationships among data.
 Moreover, it helps in data classification, clustering, and other data mining tasks. Thus, frequent pattern mining has become an important data mining task and a focused theme in data mining research.
 Frequent pattern mining searches for recurring relationships in a given data set.

Support and Confidence (previous year paper)
 The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B.
 The rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B.
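These two measures can be computed directly over a small transaction set. A minimal sketch in Python; the transactions below are hypothetical, chosen only to make the arithmetic easy to check.

```python
# Toy transaction set D: each transaction is a set of items.
D = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions in D that contain every item of itemset."""
    return sum(itemset <= t for t in D) / len(D)

def confidence(A, B):
    """confidence(A => B) = support(A u B) / support(A)."""
    return support(A | B) / support(A)

print(support({"milk", "bread"}))       # 0.5 (2 of 4 transactions)
print(confidence({"milk"}, {"bread"}))  # 0.666... (2 of the 3 milk baskets)
```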


Market Basket Analysis  Frequent itemset mining leads to the discovery of

associations and correlations among items in large transactional or relational data sets. With massive amounts of data continuously being collected and stored, many industries are becoming interested in mining such patterns from their databases.  The discovery of interesting correlation relationships

among huge amounts of business transaction records can help in many business decision-making processes such as catalog design, cross-marketing, and customer shopping behavior analysis. 2/16/2019

119

 A typical example of frequent itemset mining is market basket analysis. This process analyzes customer buying habits by finding associations between the different items that customers place in their “shopping baskets”.
 The discovery of these associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers. For instance, if customers are buying milk, how likely are they to also buy bread (and what kind of bread) on the same trip?

Market Basket Analysis  “Which groups or sets of items are customers likely to purchase on a given trip to the store?”  To answer your question, market basket analysis may be performed on the retail data of customer transactions at your store. You can then use the results to plan marketing or advertising strategies, or in the design of a new catalog.  For instance, market basket analysis may help you design different store layouts. In one strategy, items that are frequently purchased together can be placed in proximity to further encourage the combined sale of such items. If customers who purchase computers also tend to buy antivirus software at the same time, then placing the hardware display close to the software display may help increase the sales of both items. 2/16/2019

121


 Market basket analysis can also help retailers plan which items to put on sale at reduced prices. If customers tend to purchase computers and printers together, then having a sale on printers may encourage the sale of printers as well as computers.
 Buying patterns reflect items that are frequently associated or purchased together. These patterns can be represented in the form of association rules.
 Association rules are if-then statements that help to show the probability of relationships between data items within large data sets in various types of databases.

Association Rule Mining (previous year paper)
 Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness.
 Market basket analysis is one of the applications of association rule mining; it can be used to find the frequent itemsets, that is, items that are frequently purchased together.

Apriori Algorithm: Finding Frequent Itemsets by Confined Candidate Generation
 Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules.
 The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties.
 Apriori employs an iterative approach known as a level-wise search, where frequent k-itemsets are used to explore (k + 1)-itemsets.
 First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item, and collecting those items that satisfy minimum support. The resulting set is denoted by L1.
 Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database.
 To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space.
 Apriori property: All nonempty subsets of a frequent itemset must also be frequent.

In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count, min_sup.
2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence.
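Step 1, the level-wise search, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a full Apriori implementation: it uses an absolute support count, generates k-candidates by joining pairs of frequent (k-1)-itemsets, and prunes simply by re-counting support; the transactions are hypothetical.

```python
def apriori(transactions, min_sup):
    """Level-wise frequent itemset search; min_sup is an absolute count."""
    def frequent(candidates):
        # Keep candidates whose support count meets the threshold.
        return {c for c in candidates
                if sum(c <= t for t in transactions) >= min_sup}

    items = {i for t in transactions for i in t}
    Lk = frequent({frozenset([i]) for i in items})  # L1
    result = set(Lk)
    k = 2
    while Lk:
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        # The Apriori property justifies generating candidates from Lk only.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        Lk = frequent(candidates)  # one scan per level in this sketch
        result |= Lk
        k += 1
    return result

D = [{"milk", "bread"}, {"milk", "bread", "butter"}, {"bread"}, {"milk"}]
print(sorted(map(sorted, apriori(D, min_sup=2))))
# [['bread'], ['bread', 'milk'], ['milk']]
```

With min_sup = 2, {butter} is infrequent (count 1), so by the Apriori property no superset containing butter is ever generated as a surviving candidate.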


Association Rules for Example in slide 116
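Step 2, generating rules from a frequent itemset, can also be sketched: for each nonempty proper subset A of a frequent itemset F, emit A ⇒ (F − A) when its confidence clears the threshold. The transactions and thresholds here are hypothetical, for illustration only.

```python
from itertools import combinations

# Toy transaction set and one frequent itemset found in step 1.
D = [{"milk", "bread"}, {"milk", "bread", "butter"}, {"bread"}, {"milk"}]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in D) / len(D)

def rules_from(freq_itemset, min_conf):
    """Emit A => (F - A) for every nonempty proper subset A of F
    whose confidence = support(F) / support(A) meets min_conf."""
    F = frozenset(freq_itemset)
    rules = []
    for r in range(1, len(F)):
        for A in map(frozenset, combinations(F, r)):
            conf = support(F) / support(A)
            if conf >= min_conf:
                rules.append((set(A), set(F - A), conf))
    return rules

for a, b, conf in rules_from({"milk", "bread"}, min_conf=0.6):
    print(f"{sorted(a)} => {sorted(b)} (confidence {conf:.2f})")
```

For {milk, bread} both directions survive here, since each has confidence 0.5 / 0.75 ≈ 0.67 ≥ 0.6.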


Refer following pages for Apriori Algorithm and FP-Growth Algorithm


From Association Mining to Correlation Analysis
• After generating association rules using the Apriori or FP-Growth algorithm based on the minimum support and minimum confidence interestingness measures, it can be concluded which rules are strong and which are not.
• But not all strong rules (those with support and confidence greater than the minimum support and confidence) are interesting and impactful for business.

Correlation Analysis
• To decide whether two items in a rule A ⇒ B are positively correlated, negatively correlated, or not correlated, another interestingness measure used apart from support and confidence is lift, which can be computed as:

  lift(A, B) = P(A ∪ B) / (P(A) × P(B))

• If the resulting value of lift is less than 1, then the occurrence of A is negatively correlated with the occurrence of B, meaning that the occurrence of one likely leads to the absence of the other one.

• If the resulting value is greater than 1, then A and B are positively correlated, meaning that the occurrence of one implies the occurrence of the other.
• If the resulting value is equal to 1, then A and B are independent and there is no correlation between them.
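The lift computation is a one-liner over the supports. A minimal sketch; the support values below are hypothetical, chosen to land in the negatively correlated case.

```python
def lift(sup_a, sup_b, sup_ab):
    """lift(A => B) = P(A u B) / (P(A) * P(B))."""
    return sup_ab / (sup_a * sup_b)

# Hypothetical supports: P(A) = 0.6, P(B) = 0.75, P(A and B) = 0.4
value = lift(sup_a=0.6, sup_b=0.75, sup_ab=0.4)
print(round(value, 3))  # 0.889 -> less than 1, so A and B are negatively correlated
```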

