Training Program for Data Warehousing
DATA WAREHOUSING LESSONS LEARNT AND BEST PRACTICES
December 2000
TCS Confidential
DATA WAREHOUSING: LESSONS LEARNT AND BEST PRACTICES
Table of Contents 1. OVERVIEW.........................................................................................................................3 ...............................................................................................................................................3 1.1 SIGNIFICANT ISSUES IN DATA WAREHOUSING....................................................3 1.2 WHY DATA WAREHOUSE PROJECTS FAIL..............................................................4 1.3 STEPS FOR A SUCCESSFUL DATA WAREHOUSE....................................................4 1.4 PILOTING THE DATA WAREHOUSE..........................................................................5 1.5 IMPLEMENTING BEST OF BREED SOLUTION........................................................6 1.6 BUDGETING FOR A DATA WAREHOUSE..................................................................6 1.7 DATA WAREHOUSING STAFFING REQUIREMENTS..............................................7 1.8 TESTING.........................................................................................................................7 1.9 PERFORMANCE ISSUES AND TUNING....................................................................7
2
DATA WAREHOUSING: LESSONS LEARNT AND BEST PRACTICES
1. OVERVIEW The key to a successful data warehouse that truly improves business intelligence is a rock-solid design foundation. Only when such a structure is in place can a company leverage their data to empower knowledge workers to make more effective business decisions and be more productive. Many Data Warehousing projects fail due to lack of management sponsorship and wrongly identifying the business problem, amongst other things. This document identifies the common mistakes undertaken in Data Warehouse projects and lists steps for a successful data warehouse implementation.
1.1 SIGNIFICANT ISSUES IN DATA WAREHOUSING In an in-depth analysis report, the META Group described the following emerging issues as most significantly altering the data warehousing approach taken by many organizations:
•
Inadequate Planning Assumptions for Large Data Warehouse Initiatives. Once into the implementation stage, too many designers and project leaders for corporate data warehouse initiatives discover that their assumptions about the impact of organizational, political, and cultural dynamics — as well as their assessment of technology and skills shortfalls — were naïve and, in fact, major obstacles.
•
Technology Limitations. A common stumbling block for corporate data warehousing is that even after the data is scrubbed, rationalized, and loaded, the final database is too large to fit on any known server. In addition, even the most focused data mart solution can grind to a halt as new sources of internal and external data identified as critical to its business intelligence analysis deluge the system. Moreover, the processing power required to support access by the expanding user constituencies (encompassing employees, business partners, and ultimately customers) is beyond what is currently available. Massively parallel processing (MPP) systems and even clusters of inexpensive symmetric multiprocessing (SMP) systems often turn out to be white elephants, because these massive, centralized databases are too difficult to design, manage, and tune.
•
Data Mart Proliferation. Business users are unable to tolerate the long lead times typical of even the best-planned and executed data warehouse. As a result, many organizations focus efforts on data marts rather than corporate/enterprise-wide data warehousing. The uncoordinated architecture(s) that result often compromise the corporate IT infrastructure plans and result in future major diseconomies.
•
Exploding User Constituencies. Organizations continue to state ambitious plans to increase the number of online analytical users for their data warehouses. Yet, year-to-year measurements show that these same organizations consistently fail to deploy at such magnitudes. Meanwhile, the projected number of users is again restated each year for presumed emerging constituencies — including direct consumer access to certain data warehouses.
The META Group report further describes the following obstacles to Data Warehouse success: • • • • • •
Data Quality (41%) Managing End-User Expectations (31%), Legacy Data Transformation (28%), Business Rule Analysis (28%), Business Data Modeling (25%), and Managing Management Expectations (23%).
3
DATA WAREHOUSING: LESSONS LEARNT AND BEST PRACTICES
1.2 WHY DATA WAREHOUSE PROJECTS FAIL The overriding reasons for many decision support projects failure is not that the projects were technically unfeasible. On the contrary, many of the technological challenges of data warehousing have proven answers. The most common cause of failure is that the warehouse did not meet the business objectives of the organization. Measurement of clear business objectives is critical. Because once the data warehouse project is completed, the management team will have to justify the expenditure. The users always dictate the success or failure of the warehouse. They therefore need to be heavily involved throughout the data warehousing project. Any substantial project lacking executive management participation has a high probability of failure. Like any other large information systems project, data warehouse development can get bogged down if the scope is too broad and the number of people involved is too large. A clear purpose and scope are necessary to manage the application of information systems resources, as well as the expectations of potential data warehouse users. Many large warehouse projects have failed because of an inability of the organization to handle the size and scope of the project. It is tempting to think of a single repository where all of the enterprise's data problems can be solved in one fell swoop. For the initial data mart, which usually provides the data warehouse proof-of-concept, the scope must be sufficient to provide real, immediate, and high profile benefits. Blindly copying dirty source data into the data warehouse is extremely dangerous. Dirty data can lead business analysts to make erroneous business decisions and rapidly causes end users to lose confidence in the quality and integrity of the data warehouse. Lack of data integrity in the data warehouse is a common source of failure of many data warehouses and analytical applications. Disaster recovery is a particular challenge, given the large volumes of data found in today's warehouses. Many organizations just do not take disaster recovery seriously until they had some catastrophic experience. Unpredictable or unreliable systems management functions will raise user questions over the quality of the information and the usefulness of the data warehouse. In a worst-case scenario, poorly planned and executed system management functions can easily lead users to discontinue using their existing data warehouse applications and halt the funding of new applications.
1.3 STEPS FOR A SUCCESSFUL DATA WAREHOUSE A successful DSS project will always be driven by sound ‘business’ reasons for its use. Projects can fail due to lack of sufficient business reasons for its use, or due to pursuit of purely ‘technology’ reasons. This is understandable because of the very nature of DSS projects, which are delivered to ‘business’ users. Unless these users play an active role during the project lifecycle, there is a high risk that key requirements will be missed out, leading to user dissatisfaction. Clearly define the set of business issues to be addressed by the data warehouse. Different types of business problems yield different types of technical challenges. For example, customer churn analysis frequently involves overcoming the technical challenge of working with large dimensions. And product profitability analysis often requires the difficult process of allocating costs across multiple dimensions and creating analyses across time for changing dimensions. The technology dimension will of course play a major role in the enterprise strategy. Different strategies will require different technological characteristics and features. Just as the technology must
4
DATA WAREHOUSING: LESSONS LEARNT AND BEST PRACTICES align itself with the business mission, the strategy must also consider the technology. Therefore the introduction of any new technology must be aligned with the business units and processes which it is intended to support. Senior management sponsorship is crucial for data warehousing projects and the level of support must be commensurate with the scope of the project. Scalability refers to the ability of a system to increase in capacity as users demand more, as data stores grow, as more users are added to the system and as more applications are developed against the Warehouse. When considering a system to house a critical warehouse environment, it is important to select one that scales in a number of aspects. As the data on a system doubles, the query response time of that system should scale linearly, so that queries take no longer than twice the original response time to complete. Additionally, when the capacity of a system doubles, the platform should to be able to handle twice the query volume Taking it a step further, users need to have a personal stake in the success of the project. Make it a point to educate users on the fundamentals and processes of data warehousing. Teach them its benefits, as well as its limitations. This will aid significantly in managing their expectations. Data warehouses are best built in an iterative fashion. Do not misunderstand; this is not to recommend that a company should not build a fully functional, multiple Terabyte, multiple subject area, Web-enabled, end-to-end, enterprise data warehouse with a complete metadata interface. It simply means that the highest probability for success comes from implementing a decision support system in a phased approach. Organizations should try to isolate - if appropriate - the first phase of a data warehouse implementation to one or two important business areas. By using the first iteration as an opportunity to train the corporation, it will set the stage for bigger and better future implementations. Data warehousing projects are fundamentally different from operational projects. These fundamental challenges require highly experienced, senior-level individuals. These positions can be filled via inhouse resources or by consultants. If consultants are used to fill these roles, it is imperative that they are highly skilled at knowledge transfer; in-house employees should be assigned to shadow consultants for both of these roles. To identify the obstacles and objections to a data warehouse that must be addressed to ensure successful deployment
1.4 PILOTING THE DATA WAREHOUSE Piloting the system is a crucial stage of a data warehouse and many organizations miss taking advantages of the opportunities that a well run pilot can provide to the implementation of the data warehouse. The pilot demonstration may be scheduled at the beginning of the data warehousing project, often before any commitments to vendors of data warehousing components have been made. The function of the pilot/Proof of Concept is to demonstrate a working subset of the initial data mart, including the system infrastructure that can be subsequently expanded to become the initial production data mart. There are two basic reasons why an organization embarks on a pilot: Technology Prototype: The technology is new to the organization and there is a need to understand the benefits of the technology to see how it performs. In this case the organization should be able to measure the cost/benefit that the technology may provide. Business Prototype: The organization wants to know if a particular application of this technology to a business problem is the right fit. In other words will this technology solve the business problem. Objectives of piloting/POC include the following:
5
DATA WAREHOUSING: LESSONS LEARNT AND BEST PRACTICES •
Provide a test of the ability of an ETL tool vendor to support the specific requirements of the data warehouse
•
Provide a test of the ability of one or more end-user tool vendors to support the specific requirements of the data warehouse
•
Provide sufficient information to made a final selection of the ETL and end-user tool vendors
•
Prove that the architecture and infrastructure recommended for the data warehouse meet the functional requirements of the organization
•
Demonstrate to managers and end users a working subset of the functionality to be delivered in the first data mart
1.5 IMPLEMENTING BEST OF BREED SOLUTION Due to the availability of so many products from different vendors, each promising better functionality, organizations seeking Data Warehousing solutions typically go for best-of-the-breed solutions to derive maximum benefit. Therefore, integrating a best of breed set of technologies into your data warehousing infrastructure that can flexibly accommodate changing requirements can be extremely challenging. The objective is to develop a workable set of technologies, processes, and standards that help guide your decisions for both the initial building and deployment of the warehouse and its subsequent set of BI applications. However, while implementing a best-of-breed solution the following considerations should be taken: •
Pull together a multi-disciplinary team.
•
Confirm the enterprise business vision.
•
Establish enterprise data model.
•
Select vendor(s) for technology and integration.
•
Identify the scope of any interoperability standards, as well as the proprietary vendor extensions to the standard.
•
Design and develop extract, transform and load routines able to accommodate a wide range of disparate source and target platforms.
•
Design and develop the enterprise metadata management system.
•
Design and develop the warehouse management and scheduling approach test to prove and ensure that data that is moved through the architecture arrives at target systems in a timely and predictable fashion while meeting integrity, consistency, completeness, and granularity requirements.
•
Use performance modeling and testing to ensure that load times meet availability requirements.
•
Roll out the solution to users.
•
Put in place the on-going procedures to keep the system up and running.
1.6 BUDGETING FOR A DATA WAREHOUSE A data warehouse must be compiled using a variety of components glued together and is not something that can be bought off the shelf as a unitized whole. This effort reduces product cohesiveness and makes quantification of costs more challenging. Moreover, new products are
6
DATA WAREHOUSING: LESSONS LEARNT AND BEST PRACTICES entering the market at an ever-faster pace and major vendors are putting new emphasis on data warehousing. Data Warehousing budgets should thus include plans for the addition or substitution of new products in the latter stages of implementation. Additionally, hardware upgrades must be planned to account for the data warehouse's broadening deployment throughout the business units. Other events affecting budgeting will occur, including: •
Historical data will quickly accumulate in the warehouse
•
Subject areas will be added
•
To maintain performance levels achieved during the initial rollout, memory, disk, and processors will have to be added to the hardware platform that supports the data warehouse's database management system (DBMS)
1.7 DATA WAREHOUSING STAFFING REQUIREMENTS Including business analysts from the business units as key members of the data warehouse implementation is essential. Most organizations would struggle if they tried to undertake the task with IS staff resources alone. Business analysts understand what data the warehouse needs, and they can assist with the deployment tasks that face the IS staff and help train end users in using the data warehouse.
1.8 TESTING Testing is an important process in the project and it spans through all the phases. The purpose of performing testing is to assure that the work product is defect free. For this, it is required to build the test cases that will satisfy all user requirements. These test cases need to be tracked/traced throughout the development phases (requirements, design, and construction).
1.9 PERFORMANCE ISSUES AND TUNING From end-user perspective, the main objective of the DW program is to ensure end-user requirements in terms of publishing functional reports and providing the ad hoc query facility with accepted time period for publishing. To successfully achieve this and ensuring that the user expectations are realistic, it is necessary to define "performance". Once "performance" is defined for the project, performance issues have to be identified. Upon successful identification performance issues, benchmarking and projection of key performance indicators need to be performed. Throughout the development process, performance issues have to be monitored and tuned to ensure the desired performance.
7