P137-fang

  • Uploaded by: Uma devi
  • 0
  • 0
  • October 2019
  • PDF

This document was uploaded by user and they confirmed that they have the permission to share it. If you are author or own the copyright of this book, please report to us by using this DMCA report form. Report DMCA


Overview

Download & View P137-fang as PDF for free.

More details

  • Words: 2,582
  • Pages: 8
TEACHING DATA WAREHOUSING AND DATA MINING IN A GRADUATE PROGRAM OF INFORMATION TECHNOLOGY* Roger Fang, Sama Tuladhar Department of Computer and Information Science Arkansas Tech University Russellville, AR 72801 {roger.fang, stu12460}@atu.edu ABSTRACT Data warehousing and data mining are technologies that deliver critical and optimally useful information to facilitate performance analysis of business organizations. These technologies are not only an emerging trend in information technology but also a booming market in a range of industries. In light of this continuously growing demand, schools are accelerating to prepare students with these technologies. This paper describes the key components that comprise a course which would introduce both data warehousing and data mining technologies to a graduate program of information technology. 1. INTRODUCTION Although it been more than three decades since the data warehousing and data mining technologies were recognized as one of the key essentials for large corporations to manage their business in an efficient manner, the implementation only began in early nineties. Expertise in data mining and warehousing was scarce in the University setting with very few Universities offering courses on the subject. Meanwhile, in the corporate world, these technologies have been embraced by various vendors and organizations of all kinds. In the last decade, the industry has constructed thousands of data warehouses and attracted more and more attention to adopting these technologies in establishing business intelligence and decision support systems. According to one of the popular IT job-search websites, www.computerjobs.com, there are about 900 data warehousing job openings out of a total of 11,600 computer jobs, which currently amounts to about 8% of all computer jobs in the US market. Moreover, Monster.com lists more than one thousand data-warehouse-related job openings. In fact, these numbers have steadily increased in the past few years. To meet ___________________________________________ *

Copyright © 2006 by the Consortium for Computing Sciences in Colleges. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the CCSC copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Consortium for Computing Sciences in Colleges. To copy otherwise, or to republish, requires a fee and/or specific permission. 137

JCSC 21, 5 (May 2006) the growing demand of the data warehousing and mining workforce, academic support through teaching and research in this emerging area becomes a critical factor. This paper presents the key components of a successful data warehousing and mining course that we implemented in spring 2005 for the Master of Science in Information Technology program. 2. THE INTENDED LEARNING OUTCOMES Although this is an introductory course, the contents are far beyond those of a traditional database concept course. Students must already have the knowledge of relational data modeling, normalization theory, query design techniques, and some basic DBMS operations from a fundamental database course. To build on to this foundation, this course provides students an overall experience of understanding the data warehousing and mining technologies, which includes concepts, design methods, physical implementation and applications. In particular, the following learning outcomes are defined to enable students to: 1) Define and describe a data warehouse and its applications; 2) Distinguish online analytical processes (OLAP) and online transactional processes (OLTP); 3) Generate pivot tables by using Excel, Access, and Analysis Manager of SQL Server; 4) Design a multidimensional model for a data warehouse; 5) Extract, transform, and load data from an operational data source to a data warehouse; 6) Perform OLAP querying using multidimensional expressions (MDX); and 7) Use the classification and clustering algorithms to perform data mining activities.

3. KEY COMPONENTS OF THE COURSE This course uses Anahory and Murray’s “Data Warehousing in the Real World” [1] as the main text and the Analysis Services of Microsoft SQL Server 2000 as the tool platform. The Books Online of SQL Server and the accompanied tutorial of the Analysis Services are found to be very useful as auxiliary learning guides to this course. Along with these teaching materials seven key components are identified and implemented throughout this course to help students achieve the above learning outcomes. 3.1 Data Warehouse Concepts And Definition The most frequently asked questions by students could be as simple as “Is data warehouse a database?” Providing a clear definition of what a data warehouse is and how it is different from a general database is critical to motivate students into studying this subject. The list below shows some examples we studied in class that define or describe a data warehouse. • It is a copy of transaction data specifically structured for querying, analysis, and reporting. [3] • It is a database designed to support decision making in an organization. Data from the production databases are copied to the data warehouse so that queries can be performed without disturbing the performance or the stability of the production systems. [4]

138

CCSC:Mid-South Conference • It is a database designed for optimizing the storage and delivery of massive quantities of data, oftentimes in a different format than the transactional data. [5] • It is a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management’s decisions. [2] 3.2 OLAP vs. OLTP To further distinguish data warehouse systems from traditional database systems, it is necessary to compare the online analytic process (OLAP) with online transactional process (OLTP). An OLTP system supports the day-to-day operation of an organization’s primary business such as the ATM of banks, POS (point-of-sale) of chained retailers, flight ticketing of travel agencies, online auction like ebay.com, and many back-office functions such as financial, payroll, general ledger, etc. Users of an OLTP system almost always deal with one record at a time and perform the same predefined routine tasks repeatedly. An OLAP system is different from an OLTP system in its capability of handling large amount of integrated data to process ad hoc queries required to discover trends and analyze critical factors for a business. Examples of OLAP queries that are discussed in class include: • Find the sales of each category of drinks by regions each year for the past 15 years. • Sorts store city by store sales and returns the top n cities whose cumulative total of store sales of the last six years is at least 25% of all stores. • What books are customers likely to buy together? • Which companies are likely to file bankruptcy? • Do the locations for our stores have elements in common? 3.3 Multidimensional Views & Pivot Tables While traditional OLTP queries select and display data in a two-dimensional columnand-row structure, a typical OLAP query manipulates its input and output data in multiple dimensions. Taking the first query above as an example, the best way to present its output is through the use of a multidimensional view as illustrated in Figure 1 below. Figure 1. Multidimensional view of an OLAP query result.

Category

Carbonated beverages

Beer & Wine Juice Hot beverages

$352,609.35

Water 1990

Year

2003

...

Milk

Asia

the 2003 sales amount of beer and wine in North America

Europe

N. America Region

139

JCSC 21, 5 (May 2006) The real power of a multidimensional view relies on its slicing (i.e., viewing by fixing a value on one axis, see Figure 2), dicing (i.e., viewing by fixing values on all but one dimension), and drill-down (e.g., viewing the sales amount of beer and wine of different brands, prices, or manufacturers, etc.) analysis capability to support virtually any type of OLAP information request along these dimensions. C a te g o r y

C a te g o r y

C a te g o r y

R e g io n Y ear (a ) ‘R e g io n v s . Y e a r ’ a n a ly s is fo r a g iv e n p r o d u c t t

R e g io n Year (b ) ‘C a t e g o r y v s . Y e a r ’ a n a ly s is fo r a g iv e n re g io n .

R e g io n Y ear ( c ) ‘C a te g o r y v s . R e g io n ’ a n a ly s is fo r a g iv e n y e a r.

Figure 2. Slicing analysis on a multidimensional view. The PivotTables of Microsoft Excel, Access, or Analysis Services of SQL Server is a perfect tool for students to exercise the OLAP multidimensional view technique. For instance, Figure 3(a) is an Excel pivot table that presents a slice (time = 1998) of store-sales analysis of Product Family versus Store Country. Using Analysis Services of SQL Server, a similar but more complex analysis can be performed. The pivot table in Figure 3(b) is an N-dimensional view of 1998 (slice 1) sales of Product Family versus Store Country of female customers (slice 2) with a high school degree (slice 3) and a yearly income between $50K and $70K (slice 4). Using Excel or Analysis Services, students can also rill down on rows or columns or both on a pivot table to a finer level for more detailed analysis. This is demonstrated in Figure 3(c), where the store sales data of ood product family in Figure 3(b) is drilled down to the lowest level, Brand Name. Similar drill-downs can also be performed along the Store Country dimension in order to view the sales amount of each individual state/province, county, city, or even every single store of a desired country.

(a(

140

CCSC:Mid-South Conference

(b)

(c) Figure 3. (a) An Excel pivot table shows a 1998 slice of store sales in a Product vs. Store Country view. (b) An Analysis Services pivot table presents an N-dimensional Product vs. Store Country sales analysis with slices on the four circled values. (c) "Drilling-Down" on the Product dimension for a more detailed analysis. For novice users of pivot tables, Collins [6] has a well written introductory article on one of Microsoft Business Solution web pages. Students can also take an additional twohour Microsoft online tutorial [7], to get familiar with this powerful OLAP tool.

141

JCSC 21, 5 (May 2006) 3.4 Dimensional Modeling The golden rule of designing an OLTP database is that tables must be normalized to ensure storage efficiency. However, data in data warehouses are mostly unnormalized in a multidimensional structure to give better performance for OLAP queries. This also necessitates a new data modeling technique called dimensional modeling [8] for data warehouse design. A dimensional model uses fact tables and dimension tables to form a star schema as shown in Figure 4. A fact table is normally joined to several dimension tables to store numerical performance measurements of a business such as sales price, sales units, costs, profits, number of employees, average salary, etc. Dimension tables contain textual descriptors of the business entities like product, store, time, customer, employee, supplier, shipper, order, etc., to name a few. Store Dimension Store key (PK) Region Country State Type

1

*

Time Dimension Time key (PK) Year Quarter Month Week of year Day of week Day

1

*

Store Sales Facts Time key (FK) Product key (FK) Store key (FK) Sales price Sales cost Sales profit

1

* *

1

Product Dimension Product key (PK) Product Family Department Category Subcategory Promotion Dimension Promotion ID (PK) Promotion Name Media Type Cost Start_date End_date

Figure 4. A star schema to support the multidimensional view shown in Figure 1.

3.5 The ETL Process The TL stands for extract, transform, and load. This is the process through which students learn how to populate data into a dimensional model. In this course, we designed a typical ETL process exercise that includes the following activities: • Source data are given in text files, Excel, and an Access database. Students extract and integrate specified data from these sources into a taging area in the SQL Server data warehouse. • In the staging area, students must scrub, transform, and aggregate data in order to fix data errors, convert data types, merge or split data values, calculate and summarize data, etc. • Load data into the dimension and fact tables from the staging area. Some dimensions like Time table may not be populated directly from any source. In this case scripts or stored procedures must be prepared and used. 142

CCSC:Mid-South Conference

3.6 Querying a Multidimensional Cube Using MDX Once data is in place, views (also called cubes) can be defined to start querying the data. Students can use pivot tables, which are described previously, or write Multidimensional Expressions (MDX) in Analysis Services to create OLAP queries. Writing MDX is a very challenging but more controllable approach than direct manipulation of pivot tables. Though it is derived from the standard SQL, its complex syntax with a rich set of add-on functions requires students to have a good SQL programming skill plus a complete MDX reference that can be found in Books Online of SQL Server to use this language. The SELECT statement below is a typical MDX query which generates the multidimensional pivot table shown in Figure 3(b). SELECT {([Store].[Store Country].members)} on columns, {([Product].[Product Family].members)} on rows FROM Sales --Sales is a defined cube (view) of the data warehouse WHERE ( [measures].[Store Sales], --list only Store Sales measurement (fact) [Time].[year].[1998], [Education Level].[High School Degree], [Gender].[F], [Yearly Income].[$50K - $70K])

3.7 Data Mining Data mining could be the most exciting part of an OLAP project because students can find many new or unexpected data relationships and patterns that are totally hidden in the data warehouse. The last three examples listed in section 3.2 are representative queries that require data mining technique because their answer can only be discovered, not searched out directly, from the data warehouse. Without very intensive study of mining algorithms, students are given opportunities to explore the data mining agic by using the classification and clustering data mining model and tools of the Analysis Services. 4. CONCLUSIONS The overall student feedback from the class of spring 2005 indicates that this was a successful course. However, the challenge of teaching such a course remains. For the future work, we plan to improve and enhance the curriculum in the following directions: • Extending the course to cover topics like physical storage, security implementation, data warehouses tuning and maintenance, real-time data warehousing, web-mining, etc. • Introducing more tools like Oracle Warehouse Builder and Oracle Data Miner or Darwin, IBM Intelligent Miner, Enterprise Miner of SAS, etc. • Applying real world case studies to illustrate the data warehousing and mining model design principles and methodology.

143

JCSC 21, 5 (May 2006) ACKNOWLEDGEMENTS The authors would like to thank Ezra Weisner and Akshat Saharia for their valuable time in reviewing and proofreading this paper. REFERENCES [1] S. Anahory, D. Murray, Data Warehousing in the Real World, p.4, 1997, Addison Wesley. [2] W. Inmon, Building the Data Warehouse, p.31-35, 3rd ed, 2002, John Wiley & Sons. [3] G. Marakas, Modern Data Warehousing, Mining, and Visualization, p.5, 2002, Prentice Hall [4] Computer Desktop Encyclopedia, http://www.computerlanguage.com/webexamples.htm [5] C. Seidman, Data Mining with Microsoft SQL Server 2000, p.351, 2001, Microsoft Press. [6] J. C. Collins, Microsoft Excel Pivot Tables, June 3, 2005, http://www.microsoft.com/businesssolutions/excel_pivot_tables_collins.mspx [7] PivotTable I, II, and III of Excel Training Courses, Microsoft Office Online, http://office.microsoft.com/training/training.aspx?AssetID=RC010136191033 [8] R. Kimball, M. Ross, The Data Warehouse Toolkit, Chapter 1, 2nd ed, 2002, John Wiley.

144

More Documents from "Uma devi"

P137-fang
October 2019 6
Disaster Management.docx
October 2019 15
Poems
May 2020 13
Story Books
May 2020 6
October 2019 14