Data Warehouse Project Management Listen, Listen, Listen to the Stakeholders By Hari Mailvaganam A paramount determining factor in the success of data warehousing is the input of stakeholders. Data warehousing is very unique to an organization, its business processes, systems architecture and decision support needs. Project management for data warehousing allows for large amounts of user input and at all phases of the project. There are commercial software products tailored for data warehouse project management. A good project plan lists the critical tasks that must be performed and when each task should be started and completed. It identifies who is to perform the tasks, describes deliverables to be created and identifies milestones for measuring progress. There are a number of publications on data warehousing project management. The standard bearer publication is Ralph Kimball's Data Warehouse Lifecycle Toolkit. Over the years I have worked with different project management methodologies and processes for data warehousing. In this article I have listed a summary of a project management methodology that I have developed that also ties in with the Rational Unified Process (RUP). This is especially useful if the data warehousing project is part of a greater development effort which follows RUP. The program management framework for data warehousing follows tested methodologies. The methodology described below follows the project management workflow of the Unified Process.
Figure 1. Rational Unified Process; Copyright IBM
The project plan should include presentations at regular intervals, say monthly, to management and stakeholders. The presentation will include: • • • •
Discussion on any challenges and determine if project is on schedule; Review of activities and priorities to be achieved before next meeting; Contingency plans to make up time and address problems; An open question-and-answer period followed by a summary.
Phases of the Unified Process The phases in project management following UP, Figure 1, are: 1. Inception 2. Elaboration 3. Construction 4. Transition The project lifecycle is composed over time into four sequential phases, each concluded by a major milestone, Figure 2. The RFP is produces at the end of the inception phase.
Figure 2. Phases in the Unified Process; © IBM 1. Inception Phase • • • • • • • • • • • • • • •
Project planning and evaluation. Requirements gathering. Define features the system must support. Identify the stakeholders who oversee the system use. Users and other applications (i.e. actors) that will interface with the system. Define framework to identify business rules. List the events that the system must be aware of and respond to. List the constraints and risks place on the project. Define product selection criteria. Define project management environment. Prepare and obtain approval of sponsors for project mandate and approach. Define data warehouse objectives. Define query library/location matrix. Define query libraries. Establish meta-data.
• • • • •
Prepare migration specifications. Develop query usage monitoring. Technical infrastructure assessment. Current technology assessment. Release of RFP.
The project plan must also place importance to: • • • • • •
Change control. Risk assessment. Training. Scope agreements. Resources. Communication.
2. Elaboration Phase • • • • • • • • •
Review of RFPs and selecting winning candidate. Determine size of the data warehouse. Determine complexity and cleanliness of the data. Identify number of source databases and their characteristics. Select DSS tools. Determine network requirements. Conduct training. Integration of data. Plan quality assurance testing procedures.
3. Construction Phase • • • • •
Implement components. Systems integration. Conduct testing. Plan deployment. Develop support material.
4. Transitions Phase • • • • • • • • • •
Roll-out of modules. Managing change-control requests. Conduct beta testing to validate the new system against user expectations. Perform converting operational databases. Evaluate performance and benchmarking. Evaluate that deployment baselines are complete. Obtain stakeholder feedback on deployment baselines consistency with evaluation criteria. Fine-tune product based on feedback. Release product to users. Product release milestones.
Please contact us if you have any questions or suggestions.
What is Data Cleansing? The elevator pitch: "Data cleansing ensures that undecipherable data does not enter the data warehouse. Undecipherable data will affect reports generated from the data warehouse via OLAP, Data Mining and KPI's." A very simple example of where data cleansing would be utilized is how dates are stored in separate applications. Example: 11th March 2007 can be stored as '03/11/07' or '11/03/07' among other formats. A data warehousing project would require the different date formats to be transformed to a uniform standard before being entered in the data warehouse.
Why Extract, Transform and Load (ETL)? Extract, Transform and Load (ETL) refers to a category of tools that can assist in ensuring that data is cleansed, i.e. conforms to a standard, before being entered into the data warehouse. Vendor supplied ETL tools are considerably more easy to utilized for managing data cleansing on an ongoing basis. ETL sits in front of the data warehouse, listening for incoming data. If it comes across data that it has been programmed to transform, it will make the change before loading the data into the data warehouse. ETL tools can also be utilized to extract data from remote databases either through automatically scheduled events or via manual intervention. There are alternatives to purchasing ETL tools and that will depend on the complexity and budget for your project. Database Administrators (DBAs) can write scripts to perform ETL functionality which can usually suffice for smaller projects. Microsoft's SQL Server comes with a free ETL tool called Data Transforming Service (DTS). DTS is pretty good for a free tool but it does has limitations especially in the ongoing administration of data cleansing. Example of ETL vendors are Data Mirror, Oracle, IBM, Cognos and SAS. As with all product selections, list what you think you would require from an ETL tool before approaching a vendor. It may be worthwhile to obtain the services of consultants that can assist with the requirements analysis for product selection.
Figure 1. ETL sits in front of Data Warehouses
How important is Data Cleansing and ETL to the success of Data Warehousing Projects? ETL is often out-of-sight and out-of-mind if the data warehouse is producing the results that match stakeholders expectations. As a results ETL has been dubbed the silent killer of data warehousing projects. Most data warehousing projects experience delays and budget overruns due to unforeseen circumstances relating to data cleansing.
How to Plan for Data Cleansing? It is important is start mapping out the data that will be entered into the data warehouse as early as possible. This may change as the project matures but the documentation trail will come in extremely valuable as you will need to obtain commitments from data owners that they will not change data formats without prior notice. Create a list of data that will require Extracting, Transforming and Loading. Create a separate list for data that has a higher likelihood of changing formats. Decide on whether you need to purchase ETL tools and set aside an overall budget. Obtain advice from experts in the field and evaluate if the product fits into the overall technical hierarchy of your organization.
Planning for a Data Warehouse Data Warehousing have been voted the most complex and resource intensive software projects by CIOs of Fortune 1000 companies. Enterprise Resource Planning (ERP) implementations top the list. Data Warehousing comes in all shapes and sizes, which bear a direct relationship to cost and time involved. The approach to starting a Data Warehousing project will vary and the steps listed below are summary of some of the points to consider. 1) Get Professional Advice Data Warehousing makes a great difference is the lead has been there and done that. It will save a huge bundle to get professional advice upfront. Endless meeting times can be saved and the risk of an abandon data warehousing project can be reduced. 2) Plan the Data Know what metrics you want to measure in the Data Warehouse and ensure that there is the appropriate data to provide for the analysis. If you wish to obtain periodic Key Performance Index (KPI) data for shipping logistics, make sure that the appropriate data is piped into the data warehouse. 3) Who will use the Data Warehouse The power Data Warehouse consumers are business and financial managers. Data Warehouses are meant to deliver clear indications on how the business is performing. Plot out the expected users for the Data Warehouse in the enterprise. Gauge that they will have the appropriate reports in a format which is quickly understandable. Ensure that planning exercises are conducted in advance to accumulate scenarios on how the Data Warehouse will be used. Always remember that data has to be presented attractively and in a format business managers will feel comfortable. Text files with lines of numbers will not suffice! 4) Intergration to External Applications Most Data Warehousing projects sink or swim by their ability to extract data from external applications. Enterprises have a slew of applications either developed inhouse or obtain from a vendor. Conceptually, your Data Warehouse will act as the heart to diverse applications running in the enterprise. All important data will flow in or out of the Data Warehouse. 5) Technology, What Technology? At the bits and bytes level, a Data Warehouse is a souped up database. It will be built from one of the major Relational Database Management System (DBMS) vendors like Oracle, IBM, Microsoft. Open source databases, like mySQL, can also support Data Warehousing with the right support in place.
Data Warehousing RFP Much of a data warehousing project's success is determined at the onset. A data warehousing request for proposal (RFP) is a powerful tool in dictating the project's tone. This article gives an overview of some of the components that are covered in an RFP. Data warehousing projects are unique to an organization's business requirements, computer resources and budget. Please contact us if you have any questions. 1. GENERAL INFORMATION FOR VENDORS Lists the Project Vision Project timelines Nature of contract Scope Reasoning: Give an overview of project and format for the RFP process. 2. SYSTEM BACKGROUND Description of the Technical Environment a) Description on the Client and Server specifications at the Ministry. b) Breakdown of user types for the data warehouse: User Types (Examples Only)
c)
Description
Number of Users
Novice User
Uses “canned” reports and queries.
x
Power User
Frequent user, edits reports, queries directly and builds cubes for multidimensional analysis.
x
Software tools for the data warehouse: • ETL Tools • Report and Query Tools • Data Analysis • Data Visualization Tools Reasoning: Give an overview of the expected installation. Also specify points of particular interest. 3. SYSTEM REQUIREMENTS List of General Requirements → Give an overview of the data warehouse environment. Reasoning: An overview of the rational for the data warehouse implementation. Mandatory Requirements a) List if the software is Client/Server based or web-based. Described what functions are available via the Internet browser. Describe what plug-ins (if any) are necessary for the client machines. b) State that the vendor must communicate with the Ministry’s database and applications. Ask the vendor how they propose to do this. c) The vendor must be committed to its products remaining compatible with future versions of Oracle. Ask the vendor to offer documents as evidence of commitment to this environment. d) The vendor’s system must enable the overall optimization of the Ministry’s environment:
Ask for how the vendor’s system minimizes user response time while maximizing the use of system resources. ii) Ask the vendor to explain the system’s architecture for data selection, processing, formatting. iii) Ask for evidence that network traffic is minimized and how the need to process large data sets on the client is also minimized. Reasoning: Based on user interviews and technical review, these are the must have features for the data warehouse installation. I)
Strongly Desired Features Examples: → The vendor’s system must allow users to schedule reports. → Does your system permit users to schedule at a certain time and date? Daily? Weekly? Based on an event? Describe how scheduling is accomplished and the options for scheduling reports. Reasoning: Based on the requirements analysis, the items listed here are additional features that will be good to have but not mandatory.
Desired Features Examples: → Does your system possess a redundant query control? This feature returns a saved version of a report whenever an identical report has been executed earlier – the report, output format, and raw data all remain constant in the interim. Describe how this works in your system. What are the limits in terms of how far back reports are archived? Reasoning: Following the rational of “Strongly Desired Features” above, this section lists a lower desired set of requirements. 4. INFORMATION REQUIRED FROM VENDORS Product Profile o Product name o Product description o Current release level o Date current release level was available o Projected date for the next release candidate o Current products install base o Industry verticals of installation o Number of installations o Breakdown of installations/users Reasoning: Gauge vendor’s product specifications and maturity. Consulting o Describe the level of involvement the consultants will have at each stage of the project. o What is the experience of the consultants? o Location of Consultants. o Please forward resumes of consultants. Reasoning: Evaluate experience and competency of the vendor’s team. Also enquire on the consultants’ geographic location, this will help evaluate possible response rate. Business Partners o Ask the vendor to state which products are fully integrated to your solution. o Describe the interconnectivity and interface requirements.
Reasoning: Enquire what other products can integrate well with the solution, e.g. Crystal Reports, Cognos. Also, the vendor will state the ease at which integration can occur. Full feature of the business intelligence solution and products o Ask what makes the product stand out. Reasoning: This will give the vendor a chance to distinguish their products. Further business intelligence capabilities such as OLAP, Data mining can be evaluated. Application areas o Describe the areas the solution supports i.e. budgeting, financial planning etc. Reasoning: Evaluate which areas the data warehousing solution can serve. The “Mandatory Features” section above will inform the vendor the area which are necessary for the Ministry. This section gives further information on the vendors solution and possible areas which the Ministry can use in the future. Web-based and standalone clients o Query if the solution supports both web-based and standalone clients. Ask which functionality is not supported in both versions (i.e. what works in a standalone client and not web-based and vice versa) Reasoning: The architecture is of a fundamental importance. This section will enquire if the vendor’s solution will match the Ministry requirements. Pricing and Licensing Model o How much does the product cost, including initial training, support and consulting? o
Reasoning: Gauge the cost of the solution and other indirect costs. System Installation and Tool Administration Examples: o Please describe the platform and system requirements of your product. o For web-based clients, are there specific requirements on the type and version of browser that may be used (e. g. Netscape vs. Internet Explorer)? o Please describe the overall architecture of your product (e. g. Client/ Server, two-tier/ three-tier architecture, etc.), including how your product functions in a geographically distributed environment. o Please describe the steps involved in installing your product. o Please list the administrative functions that are included in your product (e. g. defining roles, backup/ recovery, etc.). o How do tool administrators detect when your product requires additional hardware or requires human intervention (e. g. to restart a process, etc.). o How often are major/ minor software upgrades available? How would we be notified of a new software release? o Please describe how standalone (I. e. non-web-based) clients are distributed to end-user desktops. How are software upgrades distributed to end-users? o Are there ways to make the upgrade process transparent to the end-users? Reasoning: The above questions and statements will give further information on the vendor’s solution regarding the installation and administration of your product. Security o How are end-users and administrators authenticated to your product? o How do end-users and administrators gain access to reports, data sets, etc. (e. g. via roles)?
o o o o
Please fully describe how reports and data are protected from unauthorized users. Please describe how row-level access to the data is attained. What level of encryption is used on passwords, data, reports, etc.? How does the existence of a firewall change the security architecture?
How does the web-based client differ in terms of security from the standalone client? Reasoning: Questions on security must also pass the security criteria set by regulators or industry standards . Performance and Scalability o Does the product supports a substantial increase in data size, frequency and complexity of end-user queries? o How is performance measured within your product? Reasoning: Evaluation of how the product can scale to future expectations. o
Data Warehouse Design Data warehousing is more an art-form than cookie cutter science. The business variables and technical risks are very unique to each installation. The business users have different goals and expectations. Data warehousing is more often successful than not if there is a reservoir of data warehousing expertise in-house. This article will focus on the data warehousing design methodologies most commonly proposed. These designs are in an evolving flux as business needs and technical cost change. Quite often the design chosen will be a combination of the methodologies below and additional requirements - the data warehouse design third way. I am a proponent of the third way data warehousing design. Third way takes into account the business specifics and needs of the installing company and technical resources available. It uses the best design patterns of both methodologies plus additional requirements unique to the business. The two major design methodologies of data warehousing are from Ralph Kimball and Bill Inmon. The design methodologies developed by Kimball and Inmon have lines drawn in the sand. Both Kimball and Inmon view data warehousing as separate from OLTP and Legacy applications. Kimball views data warehousing as a constituency of data marts. Data marts are focused on delivering business objectives for departments in the organization. And the data warehouse is a conformed dimension of the data marts. Hence a unified view of the enterprise can be obtain from the dimension modeling on a local departmental level.
Figure 1. Kimball's Data Warehousing Design Methodology Inmon beliefs in creating a data warehouse on a subject-by-subject area basis. Hence the development of the data warehouse can start with data from the online store. Other subject areas can be added to the data warehouse as their needs arise. Point-of-sale (POS) data can be added later if management decides it is necessary. The data mart is the creation of a data warehouse's subject area.
Figure 2. Inmon's Data Warehouse Design Methodology There are pros and cons to both approaches. And there are third ways that can be unique to an enterprise's needs.
Data Life Cycle Management Data becomes active as soon as it is of interest to an organization. Data life cycle begins with a business need for acquiring data. Active data are referenced on a regular basis during day-to-day business operations. Over time, this data loses its importance and is accessed less often, gradually losing its business value, and ending with its archival or disposal.
Figure 1. Data Life Cycle in Enterprises Active Data Active data is of business use to an organization. The ease of access for business users to active data is an absolute necessity in order to run an efficient business. The simple, but critical principle, that all data moves through life-cycle stages is key to improving data management. By understanding how data is used and how long it must be retained, companies can develop a strategy to map usage patterns to the optimal storage media, thereby minimizing the total cost of storing data over its life cycle. The same principles apply when data is stored in a relational database, although the challenge of managing and storing relational data is compounded by complexities inherent in data relationships. Relational databases are a major consumer of storage and are also among the most difficult to manage because they are accessed on a regular basis. Without the ability to manage relational data effectively, relative to its use and storage requirements, runaway database growth will result in increased operational costs, poor performance, and limited availability for the applications that rely on these databases. The ideal solution is to manage data stored in relational databases as part of an overall enterprise data management solution.
Inactive Data Data are put out to pasture once they are no longer active. i.e. there are no longer needed for critical business tasks or analysis. Prior to the mid-nineties, most enterprises achieved data in Microfilms and tape backups. There are now technologies for data archival such as Storage Area Networks (SAN), Network Attached Storage (NAS) and Hierarchical Storage Management. These storage systems can maintain referential integrity and business context.
Data Warehouse Project Management Listen, Listen, Listen to the Stakeholders By Hari Mailvaganam A paramount determining factor in the success of data warehousing is the input of stakeholders. Data warehousing is very unique to an organization, its business processes, systems architecture and decision support needs. Project management for data warehousing allows for large amounts of user input and at all phases of the project. There are commercial software products tailored for data warehouse project management. A good project plan lists the critical tasks that must be performed and when each task should be started and completed. It identifies who is to perform the tasks, describes deliverables to be created and identifies milestones for measuring progress. There are a number of publications on data warehousing project management. The standard bearer publication is Ralph Kimball's Data Warehouse Lifecycle Toolkit. Over the years I have worked with different project management methodologies and processes for data warehousing. In this article I have listed a summary of a project management methodology that I have developed that also ties in with the Rational Unified Process (RUP). This is especially useful if the data warehousing project is part of a greater development effort which follows RUP. The program management framework for data warehousing follows tested methodologies. The methodology described below follows the project management workflow of the Unified Process.
Figure 1. Rational Unified Process; Copyright IBM The project plan should include presentations at regular intervals, say monthly, to management and stakeholders. The presentation will include: • • • •
Discussion on any challenges and determine if project is on schedule; Review of activities and priorities to be achieved before next meeting; Contingency plans to make up time and address problems; An open question-and-answer period followed by a summary.
Phases of the Unified Process The phases in project management following UP, Figure 1, are: 1. Inception 2. Elaboration 3. Construction 4. Transition The project lifecycle is composed over time into four sequential phases, each concluded by a major milestone, Figure 2. The RFP is produces at the end of the inception phase.
Figure 2. Phases in the Unified Process; © IBM
1. Inception Phase • • • • • • • • • • • • • • • • • • • •
Project planning and evaluation. Requirements gathering. Define features the system must support. Identify the stakeholders who oversee the system use. Users and other applications (i.e. actors) that will interface with the system. Define framework to identify business rules. List the events that the system must be aware of and respond to. List the constraints and risks place on the project. Define product selection criteria. Define project management environment. Prepare and obtain approval of sponsors for project mandate and approach. Define data warehouse objectives. Define query library/location matrix. Define query libraries. Establish meta-data. Prepare migration specifications. Develop query usage monitoring. Technical infrastructure assessment. Current technology assessment. Release of RFP.
The project plan must also place importance to: • • • • • •
Change control. Risk assessment. Training. Scope agreements. Resources. Communication.
2. Elaboration Phase • • • • • • • • •
Review of RFPs and selecting winning candidate. Determine size of the data warehouse. Determine complexity and cleanliness of the data. Identify number of source databases and their characteristics. Select DSS tools. Determine network requirements. Conduct training. Integration of data. Plan quality assurance testing procedures.
3. Construction Phase • •
Implement components. Systems integration.
• • •
Conduct testing. Plan deployment. Develop support material.
4. Transitions Phase • • • • • • • • • •
Roll-out of modules. Managing change-control requests. Conduct beta testing to validate the new system against user expectations. Perform converting operational databases. Evaluate performance and benchmarking. Evaluate that deployment baselines are complete. Obtain stakeholder feedback on deployment baselines consistency with evaluation criteria. Fine-tune product based on feedback. Release product to users. Product release milestones.
Data Modeling and Mining Modeling for data mining By Hari Mailvaganam Data mining is conducted against data accumulated in OLTP repositories, data warehouses, data marts and archived data. The steps for data mining follows the following pattern: • • • • • •
data extraction data cleansing modeling data applying data mining algorithm pattern discovery data visualization
Data extraction and data cleansing can be eased with good data lifecycle management policies. Very often a data warehousing project will ensure that data extraction and metadata standards are pre-defined in an organization. Data models for operational and archived data are different from data mining models. Data stored referentially in operational systems are designed for transactional speed.
Figure 1. Data Extraction for Data Mining In data mining a unified table view is created where data of interest is stored. Most data mining vendors offer the ability to extract data from repositories and transfer to the data mining database. The table view below shows an example for an retailer's data mining database.
Table 1. Table View of Data Mining Database Not all of the data found in the data mining table view will have relevance. An example is the first column in Table 1, which has identifier values. Other data may hold hidden patterns that can be discovered after relevancy is captured, often with external data sources. An example from Table 1 is the telephone number column. At first glance it would seem this data set would be insignificant to the data mining process. However, useful information can be obtained from telephone numbers, such as telephone exchange location or cell phone usage. This can be obtained from external data sources which will grade and score the telephone number data set in Table 1.
Figure 2. The Steps taken in Data Mining