Dw Architecture & Best Practices

  • Uploaded by: api-19730613
  • 0
  • 0
  • July 2020
  • PDF

This document was uploaded by user and they confirmed that they have the permission to share it. If you are author or own the copyright of this book, please report to us by using this DMCA report form. Report DMCA


Overview

Download & View Dw Architecture & Best Practices as PDF for free.

More details

  • Words: 2,977
  • Pages: 67
Data Warehouse Architecture

Objectives Define Data Warehouse Architecture  Define Data Warehouse and Data Mart  Present a Data Warehouse Architectural Framework  Demo – Data Enterprise Integration Server 

Information Systems Architecture  Information Systems Architecture is the

process of making the key choices that are essential to the development of an information system. Architecture includes: ◦ ◦ ◦ ◦ ◦

Guiding Principles: Approaches/philosophies “Logical” representations of a system Hardware/Operating System Computing model: client/server vs traditional vs Web-based ◦ Tools and technologies



It is key, when making these choices that they are: ◦ Requirements driven ◦ Take into consideration operational, technical and financial feasibility ◦ Made within an architectural framework ◦

Architecture Drivers 

There are a lot of Drivers of Architecture

Corporat Corporat ee Politics Politics Business Business Plan Plan

System System Qualities Qualities Architecture Architecture

Current Current Systems Systems End EndUser User Require Require ments ments

Emergin Emergin gg Technolo Technolo gies gies

How is Architecture Different from Design? Its not – Architecture can be considered ‘high-level’ design  Architecture includes those aspects of the design that are essential to the information system  Architecture Example: 

◦ Users must be able to self-serve (guiding principle) ◦ “We will use a “hub and spoke” design where data will be placed in a central data warehouse, then be propagated to one or more data marts. (approach) ◦ We will normalize data in the central warehouse and use a dimensional design in the data marts (approach) ◦ We will use Oracle 8i as our DBMS (technical architecture)

Architecture vs Design 

Not Architecture: ◦ The Order subject area will be composed of the following tables: order_fact, customer_dim, product_dim and time_dim ◦ The customer_dim table will have the following attributes…….



The Value of Architecture  Communication: ◦ To business sponsors, and business users ◦ Between members of the project team



Planning:

◦ Cross Check for Project Plan ◦ Ensure that all important components of the data warehouse are accounted for



Flexibility and Growth

◦ Thinking about overall architecture will reduce risk associated with the ‘success’ of the data warehouse

Learning  Productivity and Reuse 

What’s different about DW Architecture? Transaction processing systems – growth is (relatively) predictable  Example: 

◦ A company uses SAP for order processing ◦ They are opening a new retail store ◦ They predict (based on experience) 2000 transactions per week ◦ To process this volume, we need 3 workstations to capture the transactions ◦ Peak time each day is 11-2 when 50% of transactions occur ◦

What’s Different About Data Warehouse Architecture?  Success

drives explosive growth ◦ More users ◦ More (complex) queries ◦ More data

Data Warehouse

Growth

 Performance is SAP R/3 Siebel

unpredictable ◦



Time

Unpredictable queries Unpredictable use patterns

The Great Data Warehouse Architecture Debate 

Bill Inmon: “The enterprise data warehouse”

 

Ralph Kimball: “data marts”

 

The compromise: “Hub and Spoke” or “Federated” models

If you build it, They will come

What is a Data Mart?  A data mart is a collection of subject areas

organized for decision support based on the specific needs of a given user group.  Each mart may widely different from others (as we will see)  Typically, data marts are built on the dimensional data model: ◦ Facts – things that the organization wants to measure: revenue, orders, shipments, purchases, etc. ◦ Dimensions – the means by which the organization wants to analyze the measures (facts) – by customer, by time, by product – BY ANY COMBINATION!!



What is a Data Mart? There are two kinds of data marts--dependent 









and independent. A dependent data mart is one whose source is a data warehouse. An independent data mart is one whose source is the legacy applications environment. All dependent data marts are fed by the same source--the data warehouse. Each independent data mart is fed uniquely and separately by the legacy applications environment. Dependent data marts are architecturally and structurally sound. Independent data marts have a number of significant issues

Data Warehouse vs. Data Marts What comes first

From the Data Warehouse to Data Marts Information Less

Individually Structured

History Normalized Detailed

Departmentally Structured

Organizationally Structured Data

Data Warehouse

More

Data Warehouse and Data Marts OLAP Data Mart Lightly summarized Departmentally structured

Organizationally structured Atomic Detailed Data Warehouse Data

Data Mart Centric Data Sources

Data Marts

Data Warehouse

Problems with Data Mart Centric Solution

If you end up creating multiple warehouses, integrating them is a problem

True Warehouse Data Sources

Data Warehouse

Data Marts

Data Warehouse Architectures

Generic Two-Level Architecture  Independent Data Mart  Dependent Data Mart and Operational Data Store  Logical Data Mart and Real-Time Data Warehouse  Three-Layer architecture 

All involve some form of extraction, transformation and loading (ETL) ETL

19

Generic two-level data warehousing architecture

L T

One, companywide warehouse

E Periodic extraction  data is not completely current in warehouse 20

Independent data mart data warehousing architecture

Data marts:

Mini-warehouses, limited in scope

L

T E Separate ETL for each independent data mart

Data access complexity due to multiple data marts 21

Dependent data mart with operational data store: a three-level architecture

ODS provides option for obtaining current data

L

T E Single ETL for enterprise data warehouse (EDW)

Simpler data access Dependent data marts loaded from EDW 22

Logical data mart and real time warehouse architecture

ODS and data warehouse are one and the same

L

T E Near real-time ETL for Data Warehouse

Data marts are NOT separate databases, but logical views of the data warehouse  Easier to create new data marts 23

Three-layer data architecture for a data warehouse

24

The Major Data Warehouse Architectures Independent data marts  Hub and spoke architecture  Data mart bus architecture  Federated data warehouse 

 

Independent data mart architecture End user access/ applications

Data staging Independent data marts Data sources

•Developed independently. •No conformed dimensions (i.e., does not have the same categories and labels for data elements in data marts which would allowdata across data marts to be combined). •Built to a business unit or functional area.

Federated architecture Data mart Data staging

End user access/ applications Federated data store

Data stores

Data warehouse

•Key spokesperson: Doug Hackney (2000, 2002). •Combines data in an organization’s existing data warehousing environment. •Characterized by combing key metrics and measures in existing data marts, data warehouses and legacy systems.

Data Warehouse Architecture Selection

Architecture selection Architecture selection factors Information interdependence Upper management’s information needs Urgency of need View of the data warehouse Compatibility with existing systems Nature of end user tasks Resource constraints Perceived ability of the IT staff Source of sponsorship Expert influence

Data warehouse architectures Independent data mart Data mart bus architecture Hub and spoke architecture Federated

Best Practice #1 ØUse a data model that is optimized for information retrieval ◦ dimensional model ◦ denormalized ◦ hybrid approach

DW Architecture Best Practices12/04/09

32

Data Acquisition Processes 

Extract Transform Load (ETL) ◦ the process of unloading or copying data from the source systems, transforming it into the format and data model required in the BI environment, and loading it to the DW ◦ also, a software development tool for building ETL processes (an ETL tool) ◦ many production DWs use COBOL or other general-purpose programming languages to implement ETL DW Architecture Best Practices12/04/09

33

The ETL Process  Capture/Extract  Scrub

or data cleansing  Transform  Load and Index ETL = Extract, transform, and load

34

Capture/Extract…obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse

Steps in data reconciliation

Static extract = capturing a snapshot of the source data at a point in time

Incremental extract = capturing changes that have occurred since the last static extract 35

Scrub/Cleanse…uses pattern recognition and AI techniques to upgrade data quality

Steps in data reconciliation (cont.)

Fixing errors: misspellings,

erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies

Also: decoding, reformatting,

time stamping, conversion, key generation, merging, error detection/logging, locating missing data 36

Transform = convert data from format of operational system to format of data warehouse

Steps in data reconciliation (cont.)

Record-level:

Selection–data partitioning Joining–data combining Aggregation–data summarization

Field-level:

single-field–from one field to one field multi-field–from many fields to one, or one field to many 37

Load/Index= place transformed data into the warehouse and create indexes

Steps in data reconciliation (cont.)

Refresh mode: bulk rewriting

of target data at periodic intervals

Update mode: only changes in source data are written to data warehouse

38

Data Quality Assurance 

data cleansing ◦ the process of validating and enriching the data as it is published to the DW ◦ also, a software development tool for building data cleansing processes (a data cleansing tool) ◦ many production DWs have only very rudimentary data quality assurance processes

DW Architecture Best Practices12/04/09

39

Data Acquisition & Cleansing 

getting data loaded efficiently and correctly is critical to the success of your DW ◦ implementation of data acquisition & cleansing processes represents from 50 to 80% of effort on typical DW projects ◦ inaccurate data content can be ‘the kiss of death’ for user acceptance

DW Architecture Best Practices12/04/09

40

Best Practice #2 ØCarefully design the data acquisition and cleansing processes for your DW ◦ Ensure the data is processed efficiently and accurately ◦ Consider acquiring ETL and Data Cleansing tools ◦ Use them well!

DW Architecture Best Practices12/04/09

41

Data Model Already discussed the benefits of a dimensional model  No matter whether dimensional modeling or any other design approach is used, the data model must be documented 

DW Architecture Best Practices12/04/09

42

Documenting the Data Model 

The best practice is to use some kind of data modeling tool ◦ ◦ ◦ ◦ ◦





CA ERwin Sybase PowerDesigner Oracle Designer IBM Rational Rose Etc.

Different tools support different modeling notations, but they are more or less equivalent anyway Most tools allow sharing of their metadata with an ETL tool

DW Architecture Best Practices12/04/09

43

Data Model Standards 





data model standards appropriate for the environment and tools chosen in your data warehouse should be adopted considerations should be given to data access tool(s) and integration with overall enterprise standards standards must be documented and enforced within the DW team ◦ someone must ‘own’ the data model



to ensure a quality data model, all changes should be reviewed thru some formal process

DW Architecture Best Practices12/04/09

44

Data Model Metadata Business definitions should be recorded for every field (unless they are technical fields only)  Domain of data should be recorded  Sample values should be included  As more metadata is populated into the modeling tool it becomes increasingly important to be able to share this data across ETL and Data Access tools 

DW Architecture Best Practices12/04/09

45

Metadata Architecture The strategy for sharing data model and other metadata should be formalized and documented  Metadata management tools should be considered & the overall metadata architecture should be carefully planned 

DW Architecture Best Practices12/04/09

46

Best Practice #3 ØDesign a metadata architecture that allows sharing of metadata between components of your DW

DW Architecture Best Practices12/04/09

47

Alternative Architecture Approaches Bill Inmon: “Corporate Information Factory”  Hub and Spoke philosophy  “JBOC” – just a bunch of cubes  Let it evolve naturally 

DW Architecture Best Practices12/04/09

48

What We Want (Architectural Principal) In most cases, business and IT agree that the data warehouse should provide a ‘single version of the truth’  Any approach that can result in disparate data marts or cubes is undesireable  This is known as data silos or… 

DW Architecture Best Practices12/04/09

49

Enterprise DW Architecture how to design an enterprise data warehouse and ensure a ‘single version of the truth’?  according to Kimball: 

◦ start with an overall data architecture phase ◦ use “Data Warehouse Bus” design to integrate multiple data marts ◦ use incremental approach by building one data mart at a time

DW Architecture Best Practices12/04/09

50

Data Warehouse Bus Architecture 

named for the bus in a computer ◦ standard interface that allows you to plug in cdrom, disk drive, etc. ◦ these peripherals work together smoothly

provides framework for data marts to fit together  allows separate data marts to be implemented by different groups, even at different times 

DW Architecture Best Practices12/04/09

51

Data Mart Definition 

data mart is a complete subset of the overall data warehouse ◦ a single business process OR ◦ a group of related business processes



think of a data mart as a collection of related fact tables sharing conformed dimensions, aka a ‘fact constellation’ DW Architecture Best Practices12/04/09

52

Designing The DW Bus determine which dimensions will be shared across multiple data marts  conform the shared dimensions 

 produce a master suite of shared dimensions

determine which facts will be shared across data marts  conform the facts 

standardize the definitions of facts

DW Architecture Best Practices12/04/09

53

Dimension Granularity 

conformed dimensions will usually be granular ◦ makes it easy to integrate with various base level fact tables ◦ easy to extend fact table by adding new facts ◦ no need to drop or reload fact tables, and no keys have to be changed

DW Architecture Best Practices12/04/09

54

Conforming Dimensions 

by adhering to standards, the separate data marts can be plugged together ◦ e.g. customer, product, time

they can even share data usefully, for example in a drill across report  ensures reports or queries from different data marts share the same context 

DW Architecture Best Practices12/04/09

55

Data Consolidation a current trend in BI/DW is ‘data consolidation’  from a software vendor perspective, it is tempting to simplify this: 

◦ ‘we can keep all the tables for all your disparate applications in one physical database’

DW Architecture Best Practices12/04/09

56

Data Integration To truly achieve ‘a single version of the truth’, must do more than simply consolidating application databases  Must integrate data models and establish common terms of reference 

DW Architecture Best Practices12/04/09

57

Best Practice #4 ØTake an approach that consolidates data into ‘a single version of the truth’ ◦ Data Warehouse Bus conformed dimensions & facts

◦ OR?

DW Architecture Best Practices12/04/09

58

Operational Data Store (ODS) a single point of integration for disparate operational systems  contains integrated data at the most detailed level (transactional)  may be loaded in ‘near real time’ or periodically  can be used for centralized operational reporting 

DW Architecture Best Practices12/04/09

59

Best Practice #5 ØConsider implementing an ODS only when information retrieval requirements are near the bottom of the data abstraction pyramid and/or when there are multiple operational sources that need to be accessed

◦ Must ensure that the data model is integrated, not just consolidated ◦ May consider 3NF data model ◦ Avoid at all costs a ‘data dumping ground’

DW Architecture Best Practices12/04/09

60

Capacity Planning DW workloads are typically very demanding, especially for I/O capacity  Successful implementations tend to grow very quickly, both in number of users and data volume  Rules of thumb do exist for sizing the hardware platform to provide adequate initial performance 

◦ typically based on estimated ‘raw’ data size of proposed database e.g. 100-150 Gb per modern CPU

DW Architecture Best Practices12/04/09

61

SMP Server Scale Up Scaling performance within a single SMP server is referred to as ‘scale up’  Database benchmarks suggest Windows scalability is near that of Linux  IBM claims near-linear scalability for Linux (on commodity hardware) up to about 4 processors 

◦ Probably not cost effective to scale up Linux much beyond 4 processors



IBM claims near-linear scalability for AIX on POWER5 up to about 8 processors DW Architecture Best Practices12/04/09

62

Scale Up vs. Scale Out To obtain the total number of processors required for the estimated DW workload, must plan either to scale up or scale out  Both options are viable but, all other things being equal, scaling up is less disruptive to end users and requires less work to implement 

◦ scaling up can offer lower hardware investment, if practical ◦ however, network bandwidth or latency issues can limit effectiveness of parallelism DW Architecture Best Practices12/04/09

63

Best Practice #6 ØCreate a capacity plan for your BI application & monitor it carefully ØConsider future additional performance demands

◦ Establish standard performance benchmark queries and regularly run them ◦ Implement capacity monitoring tools ◦ Build scalability into your architecture ◦ May need to allow for scaling both up and out!

DW Architecture Best Practices12/04/09

64

Open Source Affordability 

Another emerging trend in IT generally is to utilize Open Source software running on commodity hardware ◦ this is expected to offer lower total cost of ownership ◦ certainly, GNU/Linux and other Open Source initiatives do provide very good functionality and quality for minimal cost



This trend also applies to BI & DW: ◦ most traditional rdbms’s are now supported on Linux ◦ however, open source rdbms’s lag behind on providing good performance for DW queries

DW Architecture Best Practices12/04/09

65

DW Appliances DW appliances, consisting of packaged solutions providing all required software and hardware, are beginning to offer very promising price/performance  production experience is limited so far, so this is not yet a ‘best practice’ 

DW Architecture Best Practices12/04/09

66

Role of an ODS in DW Architecture In the case where an ODS is a necessary component of the overall DW, it should be carefully integrated into the overall architecture  Can also be used for: 

◦ Staging area ◦ Master/reference data management ◦ Etc…

DW Architecture Best Practices12/04/09

67

Related Documents