Data warehousing The basic concepts & its applications
What is a Data warehouse?
It means…
collection of databases with information extracted from other databases. a repository of an organization's electronically stored data model of corporation's data, put together to answers the corporation's business questions. A data warehouse is designed to support strategic decision making. designed to facilitate reporting and analysis.
Comparing a Data Warehouse and an Operational Database Data Warehouse,
Operational Database
subject oriented
application oriented
integrated
multiple diverse sources
time –variant, current
real-time
nonvolatile
updateable
History of DATA WAREHOUSING
concept dates back to the late 1980s when IBM researchers Barry Devlin and Paul Murphy developed "business data warehouse". In essence, data warehousing concept was intended to provide an architectural model for the flow of data from operational systems to decision support environments
In absence of a data warehousing architecture, an enormous amount of redundancy of information was required to support the multiple decision support environments that usually existed.
In larger corporations it was typical for multiple decision support environments to operate independently. Each environment served different users but often required much of the same data.
The process of gathering, cleaning and integrating data from various sources, usually long existing operational systems was replicated for each environment.
the operational systems were frequently reexamined as new decision support requirements emerged.
Based on analogies with real-life warehouses, data warehouses were intended as large-scale collection/storage/staging areas for corporate data.
Data could be retrieved from one central point or data could be distributed to "retail stores" or "data marts" which were tailored for ready access by users.
Types of Data Warehousing Applications Data
warehousing systems target three different types of applications:
personal query
productivity
and reporting
planning
and analysis
"Personal productivity applications spreadsheets,
statistical packages and graphics tools, are useful for manipulating and presenting data on individual PCs. Developed for a standalone environment, these tools address applications requiring only small volumes of warehouse data.
"Data query and reporting applications deliver
warehouse-wide data access through simple, list-oriented queries, the generation of basic reports. These reports provide a view of historical data they do not address the enterprise need for in-depth analysis and planning.
"Planning and analysis applications address essential business requirements as
budgeting, forecasting, product line and customer profitability, sales analysis, financial consolidations manufacturing mix analysis --applications that use historical, projected and derived data.
Benefits of Data Warehousing
Has a subject area orientation
Data is arranged by subject rather than by application, and is more intuitive for users to navigate. closer in concept to the way decision makers think about their business.
Integrates data from multiple, diverse sources
the integration provided by a data warehouse was seen as very desirable. Allows grouping multiple, diverse sources of data & integrating them into single administrative suite
Allows for analysis of data over time data snapshots taken at times that are significant to the decision making process make it possible to analyze trends over time. This can readily indicate areas where further analysis is warranted.
Adds ad hoc reporting and enquiry In addition to the points highlighted in the definition, a data warehouse is designed to be accessible with end-user tools, and this allows ad hoc reporting and analysis by end-users.
Relieves the development burden on IT With data warehouse and users trained in the use of appropriate desktop tools, users can find answers to their own questions. Provides improved performance for complex analytical queries
provides improved performance for complex analytical queries Using denormalization and aggregation. Frequently used aggregates are often precomputed and materialized in views known as summary tables These materialized views provide fast access to integrated data, regardless of the original data sources.
Relieves processing burden on transaction oriented databases Provides analysis capabilities to decision makers Allows for a continuous planning process Converts Corporate Data into Strategic Information
Drawbacks of data warehousing
1. Time spent in careful analysis of measurable needs 2. Design and implementation effort 3. Hardware costs 4. Software costs 5. On-going support and maintenance 6. Resulting re-engineering effort
Design process of data warehouse
The actual design process for developing data warehouse talk to the users determine their needs in terms that can be measured design a database to support those needs document the data descriptions and other attributes design the logic for translating data from various sources into an integrated data store Writing code for extracting data from various sources & transforming it into data warehouse, with updates to the metadata Finally package the procedures to handle scheduling, management and maintenance
Data Warehouse Architecture Architecture,
in the context of an organization's data warehousing efforts, is a conceptualization of how the data warehouse is built.
conceptualization
of data warehouse architecture consists of the following interconnected layers:
OPERATIONAL DATABASE LAYER source data for the data warehouse - An organization's EIS systems fall into this layer. INFORMATIONAL ACCESS LAYER data accessed for reporting and analyzing & tools for reporting and analyzing data - Business intelligence tools fall into this layer. DATA ACCESS LAYER The interface between the operational and informational access layer Tools to extract, transform, load data into the warehouse fall into this layer. METADATA LAYER The data directory -This is often usually more detailed than an operational system data directory. There are dictionaries for the entire warehouse and sometimes dictionaries for the data that can be accessed by a particular reporting and analysis tool.
Data Warehouse Components
LOAD MANAGEMENT
WAREHOUSE MANAGEMENT
relates to the collection of info. from disparate internal or external sources. loading process includes summarizing, manipulating and changing data structures into format that lends itself to analytical processing. Actual raw data kept alongside or within data warehouse enabling construction of new and different representations.
relates to day-to-day management of data warehouse. management tasks associated with warehouse includes ensuring its availability, the effective backup of its contents & security.
QUERY MANAGEMENT
relates to provision of access to contents of warehouse includes partitioning of info. into different areas with different privileges to different users. Access may be provided through custom-built applications, or ad hoc query tools.
Technologies Involved In Data Warehousing
source system identification: data warehouse design and creation: changed data capture: data acquisition: data cleansing: data aggregation multi-dimensional analysis tools: business intelligence (bi) metadata management data mining tools: data visualization tools: query tools:
APPROACHES TO STORING DATA
leading approaches to storing data in a data warehouse –
DIMENSIONAL
APPROACH NORMALIZED APPROACH.
DIMENSIONAL APPROACH
transaction data are partitioned into either "facts" (numeric transaction data), or "dimensions", (reference information giving context to facts).
Key advantage : easier for the user to understand and to use. the retrieval of data from the data warehouse tends to operate very quickly.
disadvantages
to maintain integrity of facts and dimensions, loading the data warehouse with data from different operational systems is complicated difficult to modify data warehouse structure if organization adopting the dimensional approach changes the way in which it does business.
NORMALIZED APPROACH
the data in the data warehouse are stored following the Codd normalization rule.
Tables grouped by subject areas that reflect general data categories data on customers, products, finance, etc.)
Key advantage it is straightforward to add information into the database.
disadvantage because of no. of tables involved, difficult for users to 1) Join data from different sources into meaningful information 2) Access the information without a precise understanding of sources of data & data structure of the data warehouse.