What Is Data Warehousing

  • July 2020
  • PDF

This document was uploaded by user and they confirmed that they have the permission to share it. If you are author or own the copyright of this book, please report to us by using this DMCA report form. Report DMCA


Overview

Download & View What Is Data Warehousing as PDF for free.

More details

  • Words: 5,089
  • Pages: 13
What is Data warehousing? Answer A data warehouse can be considered as a storage area where interest specific or relevant data is stored irrespective of the source. What actually is required to create a data warehouse can be considered as Data Warehousing. Data warehousing merges data from multiple sources into an easy and complete form. What are fact tables and dimension tables? Answer As mentioned, data in a warehouse comes from the transactions. Fact table in a data warehouse consists of facts and/or measures. The nature of data in a fact table is usually numerical. On the other hand, dimension table in a data warehouse contains fields used to describe the data in fact tables. A dimension table can provide additional and descriptive information (dimension) of the field of a fact table. e.g. If I want to know the number of resources used for a task, my fact table will store the actual measure (of resources) while my Dimension table will store the task and resource details. Hence, the relation between a fact and dimension table is one to many. What is ETL process in data warehousing? Answer ETL is Extract Transform Load. It is a process of fetching data from different sources, converting the data into a consistent and clean form and load into the data warehouse. Different tools are available in the market to perform ETL jobs. Explain the difference between data mining and data warehousing. Answer Data warehousing is merely extracting data from different sources, cleaning the data and storing it in the warehouse. Where as data mining aims to examine or explore the data using queries. These queries can be fired on the data warehouse. Explore the data in data mining helps in reporting, planning strategies, finding meaningful patterns etc. E.g. a data warehouse of a company stores all the relevant information of projects and employees. Using Data mining, one can use this data to generate different reports like profits generated etc. What is an OLTP system and OLAP system? Answer OLTP: Online Transaction and Processing helps and manages applications based on transactions involving high volume of data. Typical example of a transaction is commonly observed in Banks, Air tickets etc. Because OLTP uses client server architecture, it supports transactions to run cross a network.

OLAP: Online analytical processing performs analysis of business data and provides the ability to perform complex calculations on usually low volumes of data. OLAP helps the user gain an insight on the data coming from different sources (multi dimensional). What is PDAP? Answer A data cube stores data in a summarized version which helps in a faster analysis of data. The data is stored in such a way that it allows reporting easily. E.g. using a data cube A user may want to analyze weekly, monthly performance of an employee. Here, month and week could be considered as the dimensions of the cube. What is snow flake scheme design in database? Answer A snowflake Schema in its simplest form is an arrangement of fact tables and dimension tables. The fact table is usually at the center surrounded by the dimension table. Normally in a snow flake schema the dimension tables are further broken down into more dimension table. E.g. Dimension tables include employee, projects and status. Status table can be further broken into status_weekly, status_monthly. What is analysis service? Answer Analysis service provides a combined view of the data used in OLAP or Data mining. Services here refer to OLAP, Data mining. Explain sequence clustering algorithm. Answer Sequence clustering algorithm collects similar or related paths, sequences of data containing events. E.g. Sequence clustering algorithm may help finding the path to store a product of “similar” nature in a retail ware house. Explain discrete and continuous data in data mining. Answer Discreet data can be considered as defined or finite data. E.g. Mobile numbers, gender. Continuous data can be considered as data which changes continuously and in an ordered fashion. E.g. age Explain time series algorithm in data mining.

Answer Time series algorithm can be used to predict continuous values of data. Once the algorithm is skilled to predict a series of data, it can predict the outcome of other series. E.g. Performance one employee can influence or forecast the profit What is XMLA? Answer XMLA is XML for Analysis which can be considered as a standard for accessing data in OLAP, data mining or data sources on the internet. It is Simple Object Access Protocol. XMLA uses discover and Execute methods. Discover fetched information from the internet while Execute allows the applications to execute against the data sources. Explain the difference between Data warehousing and Business Intelligence. Answer Data Warehousing helps you store the data while business intelligence helps you to control the data for decision making, forecasting etc. Data warehousing using ETL jobs, will store data in a meaningful form. However, in order to query the data for reporting, forecasting, business intelligence tools were born. What is Dimensional Modeling? Answer Dimensional modeling is often used in Data warehousing. In simpler words it is a rational or consistent design technique used to build a data warehouse. DM uses facts and dimensions of a warehouse for its design. A snow and star flake schema represent data modeling. What is surrogate key? Explain it with an example. Answer Data warehouses commonly use a surrogate key to uniquely identify an entity. A surrogate is not generated by the user but by the system. A primary difference between a primary key and surrogate key in few databases is that PK uniquely identifies a record while a SK uniquely identifies an entity. E.g. an employee may be recruited before the year 2000 while another employee with the same name may be recruited after the year 2000. Here, the primary key will uniquely identify the record while the surrogate key will be generated by the system (say a serial number) since the SK is NOT derived from the data. What is the purpose of Factless Fact Table? Answer Fact less tables are so called because they simply contain keys which refer to the dimension tables. Hence, they don’t really have facts or any information but are more commonly used for tracking some information of an event. Eg. To find the number of leaves taken by an employee in a month.

What is a level of Granularity of a fact table? Answer A fact table is usually designed at a low level of Granularity. This means that we need to find the lowest level of information that can store in a fact table. E.g. Employee performance is a very high level of granularity. Employee_performance_daily, employee_perfomance_weekly can be considered lower levels of granularity. Explain the difference between star and snowflake schemas. Answer A snow flake schema design is usually more complex than a start schema. In a start schema a fact table is surrounded by multiple fact tables. This is also how the Snow flake schema is designed. However, in a snow flake schema, the dimension tables can be further broken down to sub dimensions. Hence, data in a snow flake schema is more stable and standard as compared to a Start schema. E.g. Star Schema: Performance report is a fact table. Its dimension tables include performance_report_employee, performance_report_manager Snow Flake Schema: the dimension performance_report_employee_weekly, monthly etc.

tables

can

be

broken

to

What is the difference between view and materialized view? Answer A view is created by combining data from different tables. Hence, a view does not have data of itself. On the other hand, Materialized view usually used in data warehousing has data. This data helps in decision making, performing calculations etc. The data stored by calculating it before hand using queries. When a view is created, the data is not stored in the database. The data is created when a query is fired on the view. Whereas, data of a materialized view is stored. What is a Cube and Linked Cube with reference to data warehouse? Answer A data cube stores data in a summarized version which helps in a faster analysis of data. Where as linked cubes use the data cube and are stored on another analysis server. Linking different data cubes reduces the possibility of sparse data. E.g. A data cube may store the Employee performance. However in order to know the hours which calculated this performance, one can create another cube by linking it to the root cube (in this case employee performance). What is junk dimension?

Answer In scenarios where certain data may not be appropriate to store in the schema, this data (or attributes) can be stored in a junk dimension. The nature of data of junk dimension is usually Boolean or flag values. E.g. whether the performance of employee was up to the mark? , Comments on performance. What are fundamental stages of Data Warehousing? Answer Stages of a data warehouse helps to find and understand how the data in the warehouse changes. At an initial stage of data warehousing data of the transactions is merely copied to another server. Here, even if the copied data is processed for reporting, the source data’s performance won’t be affected. In the next evolving stage, the data in the warehouse is updated regularly using the source data. In Real time Data warehouse stage data in the warehouse is updated for every transaction performed on the source data (E.g. booking a ticket) When the warehouse is at integrated stage, It not only updates data as and when a transaction is performed but also generates transactions which are passed back to the source online data. What is active data warehousing? The term "Active Data Warehousing (ADW)" was coined in 1999 when a Teradata Magazinearticle described how a few customers were using data warehouse insights in operational processes. Since then, dozens of Teradata clients have deployed highly competitive applications using ADW techniques.

The trend started by Teradata customers and Teradata Engineering now goes by the names "pervasive BI," "operational BI," "real-time BI" and Active Data Warehousing (ADW). Active Data Warehousing describes the technical techniques needed to achieve those goals.

An enterprise data warehouse is "just another database" but with the distinction that it contains the most trusted and comprehensive data owned by the enterprise. It can be used in all the ways databases can be used — in eCommerce web sites, employee portals, web services, Java and .NET applications, and embedded in batch or workflow processes. The key to ADW applications is a robust, mature mixed-workload management subsystem. Teradata can help you concurrently run dozens of complex reports, multiple data load jobs, data mining and hundreds of fast tactical queries while meeting service level agreements for the front line user. Active Data Warehousing Advantages In the broadest sense, Active Data Warehousing extends the traditional data warehouse by enabling you to: • • • •

Provide access to analytic insights by customers, partners and suppliers Deliver up-to-the-minute fresh data in operational reports Develop new, highly competitive applications using web services, SOA and Web 2.0 technologies Simplify the IT architecture and reduce costs by consolidating ODS repositories

Vital Active Elements Your Active Data Warehouse requires attention to six critical success factors called "active elements": Active Access Frontline users access the data warehouse for operational decision making with a service level agreement of five seconds or less (also known as "web speed"). Active Load Near-real time data enter the data warehouse via mini-batch loading, replication services or continuous streams of data from message queuing systems. Active Events Event-driven architectures and business activity monitoring detect significant business events and issue alerts for timely, informed decisions. Active Workload Management Dynamic priority management inside the data warehouse ensures service levels are achieved across multiple user communities and workload types. Active Enterprise Integration Integration tools and designs connect the data warehouse to web sites, portals, SOA, web services, enterprise service busses, workflow and batch systems. Active Availability Policies, procedures and redundant hardware ensure the entire information supply chain is protected when subsystem failure occurs.

What is difference between dependent data warehouse and independent data wareshouse ? Data Mart Does Not Equal Data Warehouse • Bill Inmon • DM Direct, November 1999 Editor's Note: We are proud to introduce a new DM Direct feature. Starting with this issue of DM Direct, once a month we will be including a "classic" article from DM Review's archives. These classic articles provide valuable business intelligence/data warehousing insights and have been selected by the editorial department of DM Review. "...The

data

warehouse

is

nothing

more

than

the

union

of

all

the

data

marts...," Ralph Kimball, December 29, 1997. "You can catch all the minnows in the ocean and stack them together and they still do not make a whale," Bill Inmon, January 8, 1998. The single most important issue facing the information technology manager this year is whether to build the data warehouse first or the data mart first. The data mart vendors have said that data warehouses are difficult and expensive to build, take a long time to design and develop, require thought and investment, and mandate that the corporation face difficult issues such as integration of legacy data, managing massive volumes of data and cost justifying the entire DSS/data warehouse effort to the management committee. The picture painted by the data mart advocates for building the data warehouse is gloomy. It is also self-serving and incorrect. The data mart vendors look upon the data warehouse as an obstacle between themselves and the revenue that comes from making sales. Of course, they want to shun the data warehouse. The data warehouse lengthens their sales cycle, regardless of the long-term effect of building a bunch of data marts with no data warehouse. The data mart vendors are selling a very short-term perspective at the expense of long-term architectural success. The data mart advocates suggest that there may be alternate, much easier paths to DSS success than building a data warehouse. One of those paths is to build several data marts and when they grow big enough, call them a data warehouse, rather than build an actual data warehouse. The data mart advocates argue that the data mart can be built much more quickly and cheaply than a warehouse. When you build the data mart there is no need for a great amount of organizational hassle or discipline and no concern for the long-term architecture that is created by the data marts. Unfortunately, by avoiding the visceral organizational and design issues

of

warehousing, the data mart advocates miss much of the point of warehousing. By building an architecture consisting entirely of data marts, the data mart advocates lead the organization into an even larger mess. Instead of messy legacy operational

systems, now we have messy legacy operational systems AND messy data marts. Stovepipe data marts and stovepipe DSS applications are what result from building nothing but data marts. There is no integration when all that you build is data marts. And a DSS environment without integration is like a man without a skeletal system-hardly a useful, viable entity. A Change of Approaches In the early days of the data warehouse marketplace, the data mart vendors tried to jump on the warehouse gravy train by proclaiming that a data warehouse was the same thing as a data mart. In trade show after trade show, the data mart vendors confused people with what a data warehouse is and what a data mart is. The data mart vendors spread half truths and misinformation about data warehousing. The result was confusion. The obfuscation sowed by the data mart vendors caused a few confused customers to build data marts with no actual warehouse. After about the third data mart, the customer discovered something was rotten in Denmark. The architectural deficiency of building nothing but data marts was unmasked. The customer discovered that when you don't build a data warehouse, there is: • massive redundancy of detailed and historical data from one data mart to another, • inconsistent and irreconcilable results from one data mart to the next, • an unmanageable interface between the data marts and the legacy application environment, etc. In short order, the world discovered that a DSS environment without a data warehouse was an extremely unsatisfactory thing. Now that the world has found that building data marts is not the proper way to proceed in DSS, the data mart vendors and their spokesmen are back again and are sowing a different brand of confusion. This time they have altered their original words a little and have promised a new and improved path to easy success. In a slight twist of concept from the first time around, the notion now being spread is that a data warehouse is merely a collection of integrated data marts (whatever that is). The notion that multiple data marts can be integrated is oxymoronic. The whole essence of data marts is that mart users do their own thing so that they don't have to integrate with other marts. Simply stated, for a variety of very powerful reasons, you cannot build data marts, watch them grow and magically turn them a data warehouse when they reach a certain size. And by the same token, integrating data across data marts is equally unthinkable because each department that owns its own data mart has its own unique specifications.

In order to understand why one or more data marts cannot be transformed into a data warehouse, you must first understand what a data mart is and a data warehouse is. Different Architectural Structures A data mart and a data warehouse are essentially different architectural structures, even though when viewed from afar and superficially, they look to be very similar. What is a Data Mart? A data mart is a collection of subject areas organized for decision support based on the needs of a given department. Finance has their data mart, marketing has theirs, sales has theirs and so on. And the data mart for marketing only faintly resembles anyone else's data mart. Perhaps most importantly, the individual departments OWN the hardware, software, data and programs that constitute the data mart. The rights of ownership allow the departments to bypass any means of control or discipline that might coordinate the data found in the different departments. Each department has its own interpretation of what a data mart should look like and each department's data mart is peculiar to and specific to its own needs. Typically, the database design for a data mart is built around a star-join structure that is optimal for the needs of the users found in the department. In order to shape the star join, the requirements of the users for the department must be gathered. The data mart contains only a modicum of historical information and is granular only to the point that it suits the needs of the department. The data mart is typically housed in multidimensional technology which is great for flexibility of analysis but is not optimal for large amounts of data. Data found in data marts is highly indexed. There are two kinds of data marts--dependent and independent. A dependent data mart is one whose source is a data warehouse. An independent data mart is one whose source is the legacy applications environment. All dependent data marts are fed by the same source--the data warehouse. Each independent data mart is fed uniquely and separately by the legacy applications environment. Dependent data marts are architecturally and structurally sound. Independent data marts are unstable and architecturally unsound, at least for the long haul. The problem with independent data marts is that their deficiencies do not make themselves manifest until the organization has built multiple independent data marts. What is a Data Warehouse? Data warehouses are significantly different from data marts. Data warehouses are arranged around the corporate subject areas found in the corporate data model. Usually

the

data

warehouse

is

built

and

owned

by

centrally

coordinated

organizations, such as the classic IT organization. The data warehouse represents a truly corporate effort.

There may or may not be a relationship between any department's subject areas and the corporation's subject areas. The data warehouse contains the most granular data the corporation has. Data mart data is usually much less granular than data warehouse data (i.e., data warehouses contain more detail information while most data marts contain more summarized or aggregated data). The data warehouse data structure is an essentially normalized structure. The structure and the content of the data in the data warehouse do not reflect the bias of any particular department, but represent the corporation's needs for data. The volume of data found in the data warehouse is significantly different from the data found in the data mart. Because of the volume of data found in the data warehouse, the data warehouse is indexed very lightly. The data warehouse contains a robust amount of historical data. The technology housing the data warehouse is optimized on handling an industrial strength amount of data. The data warehouse data is integrated from the many legacy sources. In short, there are very significant differences between the structure and content of data that resides in a data warehouse and the structure and content of data that resides in a data mart. Figure 1 shows some of the differences between a data mart and a data warehouse. Because data that is granular, integrated and historical resides in a data warehouse, the data warehouse attracts a significant volume of data. Because the warehouse attracts a significant amount of data, it is advisable to build the warehouse iteratively. If you don't build the warehouse iteratively, you will spend years building the warehouse. From the very first literature that was ever written on data warehousing, it has been recognized that there was a need to get concrete, tangible results in front of the end user as quickly as possible. The best advice of the writers and consultants of the data warehousing industry has consistently been to build the warehouse quickly and to avoid large, lengthy efforts. Interestingly, the data mart advocates and their spokesmen claim that data warehouses take a long time to build. It is only in the hype issued by the data mart advocates that it has ever been suggested that the warehouse be built in "galactic" proportions. Figure 2 shows the recommended construction path for data warehouses. The most recent theory of the data mart advocates is that you can build one or more data marts, integrate them (although no one is very clear as to what that means) and then when they grow to a certain size, they can be (magically!) turned into a warehouse. This suggestion is sadly mistaken for a variety of reasons: * The data mart is designed to suit the needs of a department. Many departments with very different objectives must be satisfied. That is why there are many different data marts in the corporation, each with its own distinctive look and feel. The data

warehouse is designed to suit the collective needs of the entire corporation. A given design can be optimal for a single department or the corporation, but not both. The design objectives for the corporation are very different from the design objectives for a given department. * The granularity of data in the data mart is very different from the granularity of data in the data warehouse. The data mart contains aggregated or summarized data. The data warehouse contains the most detailed data that is found in the corporation. Since the data mart granularity is much higher than that found in the data warehouse, you cannot easily decompose the data mart granularity into data warehouse granularity. But you can always go the other direction and summarize detailed units of data into summarizations. • The structure of the data in the data mart (commonly a star join structure) is only faintly compatible with the structure of the data in the warehouse (a normalized structure). • The amount of historical data found in the data mart is very different from the history of the data found in the warehouse. Data warehouses contain robust amounts of history. Data marts contain only modest amounts of history. • The subject areas found in the data mart are only faintly related to the subject areas found in the data warehouse. • The relationships found in the data mart are not those relationships that are found in the data warehouse. • The types of queries satisfied in the data mart are quite different from those queries found in the data warehouse. • The kind of users (farmers) that are found in the marts are quite different from the type of users (explorers) that are found in the data warehouse. • The key structures found in the data mart are significantly different from those key structures found in the data warehouse, and so forth. Reality There are simply MAJOR, MAJOR significant differences between the data mart and the data warehouse environment. The assertion that a data mart can be turned into a data warehouse when it reaches a certain size or that data marts can be integrated together is no more valid than saying that when a tumbleweed grows large enough that it can be turned into an oak tree. Reality and genetics being what they are, it is true that a tumbleweed and an oak tree are, at one point in their life, green living organisms planted in the soil and are approximately the same size. But just because those two plants share a few basic characteristics at some moment in time does not mean that a tumbleweed can be turned into an oak tree. Only a misinformed person would mistake a tumbleweed for an oak tree at any stage in the life of the plants.

Data Warehousing vs. Business Intelligence The differentials are: DW - is a way of storing data and creating information through leveraging data marts. DM's are segments or

categories of information and/or data that are grouped together to provide 'information' into that segment or category. DW does not require BI to work. Reporting tools can generate reports from the DW. BI - is the leveraging of DW to help make business decisions and recommendations. Information and data rules engines are leveraged here to help make these decisions along with statistical analysis tools and data mining tools. You will find that BI is much like ERP in that it can be extremely expensive and invasive to your firm and there is a wide range between the offerings - low end to high end - which facilitates the pricing. There is a long list of tools to select from. There are also services that provide this as an outsource. Some of these services will allow you to eventually 'own' the solution and in-source it at a future date. Like anything else, this comes at a price. This scenario works well for those who do not have a high caliber IT staff and would like to get results with a short ramp up time, basically because the system is already built. Your rules and reports just have to be generated. That is a bit oversimplified, but you get the picture. Describe dimensional Modeling SQL Server 2005, and especially Analysis Services 2005, has embraced industry-standard dimensional modeling like a long-lost friend. When you build Analysis Services databases in SQL Server 2000, you have to compromise in creating the dimensional model. Specifically, you have to create multiple OLAP virtual dimensions and use member properties to capture attributes that don't fit into hierarchical dimensions. SQL Server 2005 Analysis Services uses attribute-based dimensions, so every attribute in a dimension is automatically treated as a standalone hierarchy. Now, you can use the metadata layer that defines Analysis Services cubes (the Unified Dimensional Model—UDM) to define one customer dimension instead of a half-dozen artificially separate dimensions. One benefit of this deeper dimensional support is that data warehouse and BI system developers no longer need to convert standard dimensional-modeling techniques to the limited world view of previous SQL Server versions. Instead, you can build dimensions that realistically represent how your business operates and can change as your business changes. In this article, we define dimensional models, describe the basic elements and techniques that support them, and offer a dimensionally based data architecture for your data warehouse and BI system. . . .

What are Dimensions and various types of Dimensions? Dimensions contains Textual attributes. It is Wide not Deep.(whereas FACT is deep not wide) about Dimensions.. The Slowly Changing Dimensions Wizard creates mappings to load slowly changing dimension tables: Type 1 Dimension mapping. Loads a slowly changing dimension table by inserting new dimensions and overwriting existing dimensions. Use this mapping when you do not want a history of previous dimension data. Type 2 Dimension/Version Data mapping. Loads a slowly changing dimension table by inserting new and changed dimensions using a version number and incremented primary key to track changes. Use this mapping when you want to keep a full history of dimension data and to track the progression of changes. Type 2 Dimension/Flag Current mapping. Loads a slowly changing dimension table by inserting new and changed dimensions using a flag to mark current dimension data and an incremented primary key to track changes. Use this mapping when you want to keep a full history of dimension data, tracking the progression of changes while flagging only the current dimension. Type 2 Dimension/Effective Date Range mapping. Loads a

slowly changing dimension table by inserting new and changed dimensions using a date range to define current dimension data. Use this mapping when you want to keep a full history of dimension data, tracking changes with an exact effective date range. Type 3 Dimension mapping. Loads a slowly changing dimension table by inserting new dimensions and updating values in existing dimensions. Use this mapping when you want to keep the current and previous dimension values in your dimension table. apart from these: there are... Confirmed dimensions Junk Dimensions Degenerated dimension What is degenerate dimension table A Degenerate dimension is a Dimension which has only a single attribute. This dimension is typically represented as a single field in a fact table. The data items thar are not facts and data items that do not fit into the existing dimensions are termed as Degenerate Dimensions. Degenerate Dimensions are the fastest way to group similar transactions. Degenerate Dimensions are used when fact tables represent transactional data. They can be used as primary key for the fact table but they cannot act as foreign keys.

Related Documents

Data Warehousing
April 2020 35
Data Warehousing
October 2019 40
Data Warehousing
June 2020 23
Data Warehousing
June 2020 24
Data Warehousing
June 2020 33