Can't find your data? Data Mining is here!
Compiled by:
Patel Dvipal (422) Shah Pooja(433) 4th SEM, C.E
GOVERNMENT ENGINEERING COLLEGE, GANDHINAGAR
DATA MINING-The Knowledge Discovery in Database
PREFACE

“You have no choice but to operate in a world shaped by globalization and the information revolution. There are two options: adapt or die.” - Andy Grove, Chairman, Intel

First of all, we want to thank CONVERGENCE for giving us the opportunity to show our ability in front of the students and experts at the paper presentation.

The last few years have seen a growing recognition of information as a key business tool. Those who successfully gather, analyze, understand, and act upon information are among the winners in this new “information age”. In any business, just gathering information is not sufficient; it must also be stored for future use. Database Management Systems are great tools for defining data and responding to questions or queries. Still, some of the answers or data we want are not directly accessible from the database. Just by looking at the database you cannot make some decisions, and you may not be able to predict the market or your customers.

This is where Data Mining becomes important for the user. Though Data Mining is not a magic wand, it can find the “hidden” information in your database, and it can predict the market as well as the customer up to a certain level of accuracy. Data Mining can also draw results from multiple databases, which may be in different DBMSs or belong to different companies.

We have tried to give a detailed introduction to Data Mining and its main features, and also to give details of how Data Mining is developed. We hope that after reading this report one can better understand Data Mining and can develop an application, that is, Data Mining software, which provides facilities for mining a database in the real sense. In the real sense, such software should have Artificial Intelligence to detect the models or methods to use for Data Mining.
INDEX CONTENTS
ABSTRACT
INTRODUCTION TO DATA MINING
LEARNING FROM PAST MISTAKES?
INTRODUCTION TO DATA WAREHOUSES
DATA MINING AND DATA WAREHOUSING
FOUNDATION OF DATA MINING
ARCHITECTURE OF DATA MINING
PHYSICAL STRUCTURE OF DATA WAREHOUSING
CHARACTERISTICS OF DATA WAREHOUSING
TYPES OF DATA MINING
HOW DATA MINING WORKS?
GOALS OF DATA MINING
INTEGRATED DATA MINING AND CAMPAIGN MANAGEMENT
THE INTEGRATED DATA MINING AND CAMPAIGN MANAGEMENT PROCESS
BENEFITS OF INTEGRATING DATA MINING AND CAMPAIGN MANAGEMENT
TEN STEPS OF DATA MINING
EVALUATING BENEFITS OF DATA MINING MODEL
DATA MINING SUITE
SCOPE OF DATA MINING
PROFITABLE APPLICATION OF DATA MINING
TYPICAL FUNCTIONALITY OF DATA WAREHOUSES
WHAT DATA MINING CAN’T DO?
DIFFICULTIES IN WORKING WITH DATA WAREHOUSING
GLOSSARY
CONCLUSION
BIBLIOGRAPHY
ABSTRACT

Data Mining gains its name, and to some degree its popularity, by playing off the idea that the data you have stored is much like a “mountain”, and that buried within the mountain (just as buried within your data) are certain “gems” of great value. The problem is that there are also lots of worthless rocks and rubble in the mountain that need to be mined through and discarded in order to get to what is valuable. The trick is that, for mountains of rock and mountains of data alike, you need some power tools to unearth the value. For rock, this means earthmovers and dynamite; for data, this means powerful computers and data mining software.

Data Mining is a process by which organizations uncover patterns hidden in their data that can be used to predict the behavior of customers, products and processes. The database can be global, or there may be more than one database on different DBMSs, but Data Mining can extract from all of them and give you the results you want. The process gives you information from the database that may not be directly visible. Data Mining can yield results, combinations or specific characteristics of customers, products or processes that are useful for further work.

It can be said that there is some Artificial Intelligence in Data Mining. Data Mining is a tool that can give your data the intelligence needed for a particular model or task. Building Data Mining software is very easy if you go through the proper steps.

Data Mining is often referred to as K.D.D., Knowledge Discovery in Databases, because in the process of Data Mining we mine the data, that is, we initiate the process of knowledge discovery in the database.
Introduction to Data Mining

Discovering hidden value in your data warehouse

Databases today can range in size into the terabytes - more than 1,000,000,000,000 bytes of data. Within these masses of data lies hidden information of strategic importance. But when there are so many trees, how do you draw meaningful conclusions about the forest? The newest answer is Data Mining, which is being used both to increase revenues and to reduce costs. The potential returns are enormous. Innovative organizations worldwide are already using data mining to locate and appeal to higher-value customers, to reconfigure their product offerings to increase sales, and to minimize losses due to error or fraud.

Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

Most companies already collect and refine massive quantities of data. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line. When implemented on high performance client/server or parallel processing computers, data mining tools can analyze massive databases to deliver answers to questions such as, "Which clients are most likely to respond to my next promotional mailing, and why?"

Data Mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions. The first and simplest analytical step in data mining is to describe the data: summarize its statistical attributes (such as means and standard deviations), visually review it using charts and graphs, and look for potentially meaningful links among variables (such as values that often occur together). Collecting, exploring and selecting the right data are critically important.
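The descriptive step can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the customer records, the field names and the helper functions `describe` and `co_occurrence` are all invented for this example and do not come from any particular data mining product.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, stdev

# Hypothetical customer records, invented purely for illustration.
customers = [
    {"age": 25, "spend": 120.0, "items": {"bread", "milk"}},
    {"age": 35, "spend": 180.0, "items": {"bread", "milk", "eggs"}},
    {"age": 45, "spend": 240.0, "items": {"milk", "eggs"}},
    {"age": 55, "spend": 300.0, "items": {"bread", "milk"}},
]


def describe(records, field):
    """Summarize one numeric attribute by its mean and sample standard deviation."""
    values = [r[field] for r in records]
    return {"mean": mean(values), "stdev": stdev(values)}


def co_occurrence(records):
    """Count how often each pair of items appears in the same record."""
    counts = Counter()
    for r in records:
        for pair in combinations(sorted(r["items"]), 2):
            counts[pair] += 1
    return counts


age_summary = describe(customers, "age")
pairs = co_occurrence(customers)
print(age_summary)
print(pairs.most_common(1))
```

Pairs that co-occur often are only candidates for "meaningful links"; whether they matter is a question for the modeling and verification steps.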
But data description alone cannot provide an action plan. You must build a predictive model based on patterns determined from known results, then test the model on results outside the original sample. A good model should never be confused with reality, but it can be a useful guide to understanding your business. The final step is to empirically verify the model. For example, from a database of customers who have already responded to a particular offer, you have built a model predicting which prospects are likeliest to respond to the same offer. Can you rely on this prediction?

Data mining is often referred to as K.D.D., Knowledge Discovery in Databases, because in the process of data mining we mine the data, that is, we initiate the process of knowledge discovery in the database. The knowledge discovery process comprises six phases:
• Data Selection
• Data Cleansing
• Enrichment
• Data Transformation or Data Encoding
• Data Mining
• Reporting and display of discovered information

As an example, consider a transaction database maintained by a specialty consumer goods retailer. Suppose the client data includes a customer name, zip code, phone number, date of purchase, item codes, quantity, and total amount. A variety of new knowledge can be discovered by K.D.D. processing on this client database. Data mining must be preceded by significant data preparation before it can yield useful information that can directly influence business decisions. The results of data mining may be reported in a variety of formats such as listings, graphic outputs, summary tables, or visualizations.
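The six phases can be sketched as a small pipeline over retailer transactions of the kind described above. Everything here is an assumption made for illustration: the records, the six-digit zip rule, the income lookup and the function names are invented, and the "mining" phase is only a trivial per-region total standing in for a real mining algorithm.

```python
from collections import defaultdict

# Hypothetical raw transactions; every value is invented for illustration.
raw = [
    {"name": "A. Rao", "zip": "382028", "date": "2004-01-05", "item": "I10", "qty": 2, "total": 40.0},
    {"name": "A. Rao", "zip": "382028", "date": "2004-02-07", "item": "I22", "qty": 1, "total": 15.0},
    {"name": "B. Nair", "zip": "38202", "date": "2004-01-09", "item": "I10", "qty": 1, "total": 20.0},
    {"name": "C. Mehta", "zip": "380001", "date": "2004-03-01", "item": "I10", "qty": 3, "total": 60.0},
]


def select(rows, item_code):
    """Phase 1, data selection: keep only transactions for one item."""
    return [r for r in rows if r["item"] == item_code]


def cleanse(rows):
    """Phase 2, data cleansing: drop rows whose zip code is not six digits (assumed rule)."""
    return [r for r in rows if len(r["zip"]) == 6 and r["zip"].isdigit()]


def enrich(rows, income_by_name):
    """Phase 3, enrichment: attach an attribute taken from an external source."""
    return [{**r, "income": income_by_name.get(r["name"], "unknown")} for r in rows]


def encode(rows):
    """Phase 4, transformation or encoding: collapse zip codes into coarse regions."""
    return [{**r, "region": r["zip"][:3]} for r in rows]


def mine(rows):
    """Phase 5, data mining: here only a total per region, as a placeholder."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["region"]] += r["total"]
    return dict(totals)


def report(totals):
    """Phase 6, reporting: render the findings as a sorted listing."""
    return [f"region {region}: {amount:.2f}" for region, amount in sorted(totals.items())]


income = {"A. Rao": "high", "C. Mehta": "medium"}  # hypothetical external data
prepared = encode(enrich(cleanse(select(raw, "I10")), income))
listing = report(mine(prepared))
print("\n".join(listing))
```

Most of the code is preparation rather than mining, which matches the point that data mining must be preceded by significant data preparation.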
Learning from Past Mistakes?

“Those who cannot remember the past are condemned to repeat it.” - G. Santayana

Data Mining works the same way as a human being does. It uses historical information (experience) to learn from the past. However, in order for the data mining technology to pull the “gold” out of your database, you do have to tell it what the gold looks like (what business problem you would like to solve). It then uses the description of that “gold” to look for similar examples in the database, and uses these pieces of information from the past to develop a predictive model of what will happen in the future.

Does Data Mining replace skilled people?

Data mining does not replace skilled business analysts or managers, but rather gives them a powerful new tool to improve the job they are doing. Any company that knows its business and its customers is already aware of many important, high-payoff patterns that its employees have observed over the years. What data mining can do is confirm such empirical observations and find new, subtle patterns that yield steady incremental improvement.
Introduction to data warehousing

Data warehousing is the integration of information to boost an organization's decision support system. A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management’s decisions. It provides an architecture for building an environment in which users can access every piece of the organization's information. It is a way to design a very large database with historical and summarized data. This non-volatile data is collected from heterogeneous sources and analyzed by the data warehouse components, so the data stored in a warehouse is generally read-only or not modified frequently. Data warehouses support high-performance demands on an organization's data and information.

Several types of applications are supported: OLAP, DSS, and data mining applications. OLAP (On-Line Analytical Processing) is a term used to describe the analysis of complex data from the data warehouse. In the hands of skilled knowledge workers, OLAP tools use distributed computing capabilities for analyses that require more storage and processing power than can be economically and efficiently located on an individual desktop. DSS (Decision Support Systems), also known as EIS (Executive Information Systems), support an organization's leading decision makers with higher-level data for complex and important decisions.

Traditional databases support On-Line Transaction Processing (OLTP), which includes insertions, updates and deletions, while also supporting information query requirements. Traditional relational databases are optimized to process queries that may touch a small part of the database and transactions that deal with insertions or updates of a few tuples per relation. Thus, they cannot be optimized for OLAP, DSS or data mining. By contrast, data warehouses are designed precisely to support efficient extraction, processing and presentation for analytic and decision-making purposes.

Data warehouses generally contain large amounts of data from multiple sources, which may include databases built on different data models and sometimes files acquired from independent systems and platforms.
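The difference between the two workloads can be seen in miniature with Python's built-in sqlite3 module. This is only a sketch: the `sales` table, its columns and its rows are invented for illustration, and an in-memory SQLite database is used purely for convenience, not as an example of a data warehouse engine.

```python
import sqlite3

# In-memory database with a hypothetical sales table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (region, year, amount) VALUES (?, ?, ?)",
    [
        ("North", 2003, 100.0),
        ("North", 2004, 150.0),
        ("South", 2003, 80.0),
        ("South", 2004, 120.0),
    ],
)

# OLTP-style work: a transaction that touches a single tuple.
conn.execute("UPDATE sales SET amount = ? WHERE id = ?", (110.0, 1))
conn.commit()

# Analytical work: scan the whole table and summarize it by region.
summary = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(summary)
```

The update reads and writes one row, while the summary has to read every row. At warehouse scale that difference is what drives the separate design.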
Data mining and Data warehousing

In modern organizations, users of data are often completely removed from the data sources. Many people need only read access to data, but still need very rapid access to a larger volume of data than can conveniently be downloaded to the desktop. Often such data comes from multiple databases. Because many of the analyses performed are recurrent and predictable, software vendors and systems support staff have begun to design systems to support these functions. At present there is a need to provide decision makers from middle management upward with information at the correct level of detail to support decision making. Data warehousing, OLAP (On-Line Analytical Processing), and data mining provide this functionality.

The data to be mined is first extracted from the enterprise data warehouse into a data mining database or data mart. There is some real benefit if your data is already part of a data warehouse. The problems of cleansing data for a data warehouse and for data mining are very similar. If the data has already been cleansed for a data warehouse, then it most likely will not need further cleaning in order to be mined. The data mining database may be a logical rather than a physical subset of your data warehouse, provided that the data warehouse DBMS can support the additional resource demands of data mining. If it cannot, then you will be better off with a separate data mining database.

A data warehouse is not a requirement for data mining. Setting up a large data warehouse that consolidates data from multiple sources, resolves data integrity problems and loads the data into a query database can be an enormous task, sometimes taking years and costing millions of dollars. You could, however, mine data from one or more operational or transactional databases by simply extracting it into a read-only database.

The goal of a data warehouse is to support decision making with data. Data mining can be used in conjunction with a data warehouse to help with certain types of decisions. Data mining can be applied to operational databases with individual transactions. To make data mining more efficient, the data warehouse should hold an aggregated or summarized collection of data. Data mining helps in extracting meaningful new patterns that cannot necessarily be found by merely querying or processing data in the data warehouse. Data mining applications should therefore be strongly considered early, during the design of a data warehouse. Also, data mining tools should be designed to facilitate their use in conjunction with data warehouses. In fact, for very large databases running into terabytes of data, successful use of data mining applications will depend first on the construction of a data warehouse.
Data mining and OLAP

One of the most common questions from data processing professionals is about the difference between data mining and OLAP (On-Line Analytical Processing). As we shall see, they are very different tools that can complement each other.

OLAP is part of the spectrum of decision support tools. Traditional query and report tools describe what is in a database. OLAP goes further; it is used to answer why certain things are true. The user forms a hypothesis about a relationship and verifies it with a series of queries against the data. For example, an analyst might want to determine the factors that lead to loan defaults. He or she might initially hypothesize that people with low incomes are bad credit risks and analyze the database with OLAP to verify (or disprove) this assumption. If that hypothesis were not borne out by the data, the analyst might then look at high debt as the determinant of risk. If the data did not support this guess either, he or she might then try debt and income together as the best predictor of bad credit risks.

In other words, the OLAP analyst generates a series of hypothetical patterns and relationships and uses queries against the database to verify or disprove them. OLAP analysis is essentially a deductive process. But what happens when the number of variables being analyzed is in the dozens or even hundreds? It becomes much more difficult and time-consuming to find a good hypothesis (let alone be confident that there is not a better explanation than the one found), and to analyze the database with OLAP to verify or disprove it.

Data mining is different from OLAP because rather than verify hypothetical patterns, it uses the data itself to uncover such patterns. It is essentially an inductive process. For example, suppose the analyst who wanted to identify the risk factors for loan default were to use a data mining tool. The data mining tool might discover that people with high debt and low incomes were bad credit risks (as above), but it might go further and also discover a pattern the analyst did not think to try, such as that age is also a determinant of risk.

Here is where data mining and OLAP can complement each other. Before acting on the pattern, the analyst needs to know what the financial implications would be of using the discovered pattern to govern who gets credit. The OLAP tool can allow the analyst to answer those kinds of questions. Furthermore, OLAP is also complementary in the early stages of the knowledge discovery process because it can help you explore your data, for instance by focusing attention on important variables, identifying exceptions, or finding interactions. This is important because the better you understand your data, the more effective the knowledge discovery process will be.
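The contrast between the deductive and the inductive approach can be made concrete with a toy version of the loan example. This is a sketch under loud assumptions: the eight loan records, the three yes/no attributes and the exhaustive search are invented for illustration. Real data mining tools use far more capable algorithms, and eight rows prove nothing statistically.

```python
from itertools import combinations

# Hypothetical loan records; every value is invented for illustration.
loans = [
    {"low_income": True,  "high_debt": True,  "young": True,  "default": True},
    {"low_income": True,  "high_debt": True,  "young": False, "default": True},
    {"low_income": True,  "high_debt": False, "young": True,  "default": False},
    {"low_income": True,  "high_debt": False, "young": False, "default": False},
    {"low_income": False, "high_debt": True,  "young": True,  "default": True},
    {"low_income": False, "high_debt": True,  "young": False, "default": False},
    {"low_income": False, "high_debt": False, "young": True,  "default": False},
    {"low_income": False, "high_debt": False, "young": False, "default": False},
]
ATTRIBUTES = ["low_income", "high_debt", "young"]


def matching(rows, attributes):
    """Rows for which every listed attribute is true."""
    return [r for r in rows if all(r[a] for a in attributes)]


def default_rate(rows):
    """Fraction of the given rows that defaulted."""
    return sum(r["default"] for r in rows) / len(rows)


# Deductive, OLAP-style: the analyst proposes one hypothesis and checks it.
hypothesis_rate = default_rate(matching(loans, ["low_income"]))


def search_patterns(rows, attributes, min_rows=2):
    """Inductive: try every attribute combination and rank by default rate."""
    found = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            subset = matching(rows, combo)
            if len(subset) >= min_rows:  # ignore patterns with too little evidence
                found.append((combo, default_rate(subset)))
    return sorted(found, key=lambda pattern: (-pattern[1], pattern[0]))


ranked = search_patterns(loans, ATTRIBUTES)
print(hypothesis_rate)
print(ranked[:2])
```

In this toy data the search surfaces the combination of high debt and youth alongside high debt and low income, a pattern the analyst's single hypothesis about income would not have revealed.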
The Foundations of Data Mining

Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently, generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:
• Massive data collection
• Powerful multiprocessor computers
• Data mining algorithms

Commercial databases are growing at unprecedented rates. A recent META Group survey of data warehouse projects found that 19% of respondents are beyond the 50 gigabyte level, while 59% expect to be there by second quarter of 1996. In some industries, such as retail, these numbers can be much larger. The accompanying need for improved computational engines can now be met in a cost-effective manner with parallel multiprocessor computer technology. Data mining algorithms embody techniques that have existed for at least 10 years, but have only recently been implemented as mature, reliable, understandable tools that consistently outperform older statistical methods.

In the evolution from business data to business information, each new step has built upon the previous one. For example, dynamic data access is critical for drill-through in data navigation applications, and the ability to store large databases is critical to data mining. From the user’s point of view, the four steps listed in Table 1 were revolutionary because they allowed new business questions to be answered accurately and quickly.
| Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics |
|---|---|---|---|---|
| Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery |
| Data Access (1980s) | "What were unit sales in New England last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Retrospective, dynamic data delivery at record level |
| Data Warehousing & Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses | Pilot, Comshare, Arbor, Cognos, Microstrategy | Retrospective, dynamic data delivery at multiple levels |
| Data Mining (Emerging Today) | "What’s likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases | Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry) | Prospective, proactive information delivery |

Table 1. Steps in the Evolution of Data Mining.
The core components of data mining technology have been under development for decades, in research areas such as statistics, artificial intelligence, and machine learning. Today, the maturity of these techniques, coupled with high-performance relational database engines and broad data integration efforts, makes these technologies practical for current data warehouse environments.
An Architecture for Data Mining

To best apply these advanced techniques, they must be fully integrated with a data warehouse as well as flexible interactive business analysis tools. Many data mining tools currently operate outside of the warehouse, requiring extra steps for extracting, importing, and analyzing the data. Furthermore, when new insights require operational implementation, integration with the warehouse simplifies the application of results from data mining. The resulting analytic data warehouse can be applied to improve business processes throughout the organization, in areas such as promotional campaign management, fraud detection, new product rollout, and so on. Figure 1 illustrates an architecture for advanced analysis in a large data warehouse.

Figure 1 - Integrated Data Mining Architecture

The ideal starting point is a data warehouse containing a combination of internal data tracking all customer contact coupled with external market data about competitor activity. Background information on potential customers also provides an excellent basis for prospecting. This warehouse can be implemented in a variety of relational database systems: Sybase, Oracle, Redbrick, and so on, and should be optimized for flexible and fast data access.

An OLAP (On-Line Analytical Processing) server enables a more sophisticated end-user business model to be applied when navigating the data warehouse. The multidimensional structures allow users to analyze the data as they want to view their business, summarizing by product line, region, and other key perspectives of their business. The Data Mining Server must be integrated with the data warehouse and the OLAP server to embed ROI-focused business analysis directly into this infrastructure. An advanced, process-centric metadata template defines the data mining objectives for specific business issues like campaign management, prospecting, and promotion optimization. Integration with the data warehouse enables operational decisions to be directly implemented and tracked. As the warehouse grows with new decisions and results, the organization can continually mine the best practices and apply them to future decisions.

This design represents a fundamental shift from conventional decision support systems. Rather than simply delivering data to the end user through query and reporting software, the Advanced Analysis Server applies users’ business models directly to the warehouse and returns a proactive analysis of the most relevant information. These results enhance the metadata in the OLAP Server by providing a dynamic metadata layer that represents a distilled view of the data. Reporting, visualization, and other analysis tools can then be applied to plan future actions and confirm the impact of those plans.
Physical structure of data warehouse:

A data warehouse is a central repository for data. There are three basic architectures for constructing a data warehouse.

In the first type there is a single central location for storing data, which we call the data warehouse's physical storage media. In this type of construction, data is gathered from heterogeneous data sources, such as different types of files, local database systems and other external sources. Because the data is stored in a central place, access to it is easy and simple. The disadvantage of this construction is a loss of performance.

In the second type of construction the data is decentralized. The data is not stored physically together, but it is logically consolidated in the data warehouse environment. In this construction, department-wise and site-wise data is stored at its local site. Local application data and other generated data are stored in a local database, but information about the data, called metadata (data about data), is stored at the central site. The local databases can also maintain their metadata locally for their own work as well as at the central site. Such a local database with its metadata is called a "data mart". An advantage of this architecture is that the logical data warehouse is only virtual. The central data warehouse stores no actual data, only information about the data, so any user who wants to access data can send a query to the central site, and the central site prepares the resulting data for the user. The entire process of collecting data from the physical databases is transparent to the user.

The third and last type of construction creates a hierarchical view of data. Here the central data warehouse also stores actual data, and the data marts on the next level store a copy or summary of the physical central data warehouse. Local data marts store only the data related to their local site.

The advantages of the distributed and hierarchical constructions are that (1) retrieval time from the data warehouse is lower and (2) the volume of data is also reduced. The data is still integrated through metadata, so anyone can access it from anywhere, and processing is divided among different physical machines. For a better data retrieval response, a scalable data warehouse architecture is very important. Data warehouse response also depends on metadata, so the design of the metadata is very important for every data warehouse.
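The second, decentralized construction can be sketched as a central catalog that holds only metadata and forwards queries to the data marts. The class names `DataMart` and `VirtualWarehouse`, the site names and the rows below are hypothetical, chosen only to illustrate the idea; a real warehouse would also handle schemas, security and query optimization.

```python
class DataMart:
    """A local database that holds the actual rows for one site."""

    def __init__(self, site, tables):
        self.site = site
        self.tables = tables  # table name -> list of row dicts

    def fetch(self, table):
        return list(self.tables.get(table, []))


class VirtualWarehouse:
    """Central site that stores only metadata: which marts hold which table."""

    def __init__(self):
        self.catalog = {}  # table name -> list of marts holding that table

    def register(self, mart):
        for table in mart.tables:
            self.catalog.setdefault(table, []).append(mart)

    def query(self, table):
        # The caller never sees which sites were contacted.
        rows = []
        for mart in self.catalog.get(table, []):
            rows.extend(mart.fetch(table))
        return rows


warehouse = VirtualWarehouse()
warehouse.register(DataMart("site_a", {"sales": [{"item": "I10", "qty": 2}]}))
warehouse.register(
    DataMart("site_b", {"sales": [{"item": "I22", "qty": 1}], "staff": [{"name": "R. Desai"}]})
)

all_sales = warehouse.query("sales")
print(all_sales)
```

The central object never copies rows at registration time; it consults its catalog and gathers data only when a query arrives, which is what makes the warehouse "virtual".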
Issues in Integration of data in data warehouse:

As discussed above, you can physically design your data warehouse using any of the three construction types. But integrating data into a data warehouse requires procedures such as data extraction and data migration, data cleansing (data scrubbing) and data integration.

Data extraction and data migration:

To extract data from operational databases, files and other external sources, extraction tools are required. This process should be detailed and documented correctly. If it is not properly documented, it will create problems when integrating with other data and difficulties at later stages. Data extraction should therefore provide a high level of integration and produce efficient metadata for the data warehouse.

Data migration is the task of converting data from one system to another. It should provide type checking and checking of integrity constraints in the data warehouse. It should also detect inconsistencies and missing values during conversion, and record metadata for the entire process so that problems in the migration process can be identified easily.
Data cleansing / Data scrubbing:

A data warehouse collects data from heterogeneous sources in the organization. This data is integrated in such a manner that any end user can access it easily. To facilitate the end user, the DWA (Data Warehouse Administration) must be aware of the right approach to the warehouse. Data must be collected from different operating systems, different networks, files from applications written in languages such as C, COBOL and FORTRAN, and different operational databases. So our first step is to design a platform on which we can access data from every system and put it together in a warehouse. Before data is transferred from one system to another, it must be standardized. This standard always relates to the format of the data, the structure of the data and the way information is collected.
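Standardization before loading can be sketched as a set of per-source rules that map each system's field names and formats onto one common record layout. The source names, field names, date formats and sample records below are all assumptions invented for this illustration.

```python
from datetime import datetime

# Hypothetical per-source rules: how to rename fields and how dates are written.
SOURCES = {
    "billing": {
        "fields": {"cust_nm": "name", "dt": "date", "amt": "amount"},
        "date_format": "%d/%m/%Y",
    },
    "web": {
        "fields": {"customer": "name", "ordered_on": "date", "total": "amount"},
        "date_format": "%Y-%m-%d",
    },
}


def standardize(record, source):
    """Convert one source record into the common warehouse layout."""
    rules = SOURCES[source]
    out = {target: record[src] for src, target in rules["fields"].items()}
    # Format: collapse stray whitespace and use one capitalization style.
    out["name"] = " ".join(out["name"].split()).title()
    # Format: store every date as ISO 8601 text.
    out["date"] = datetime.strptime(out["date"], rules["date_format"]).date().isoformat()
    # Structure: amounts are always floats rounded to two decimals.
    out["amount"] = round(float(out["amount"]), 2)
    return out


a = standardize({"cust_nm": "  asha   rao ", "dt": "05/01/2004", "amt": "40"}, "billing")
b = standardize({"customer": "ASHA RAO", "ordered_on": "2004-01-05", "total": 40.0}, "web")
print(a)
print(a == b)
```

Once both records share one layout, the warehouse can recognize them as the same purchase, which would not be possible while the sources disagree on names and date formats.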
Characteristics of Data warehousing:
• multidimensional conceptual view
• generic dimensionality
• unlimited dimensions and aggregation levels
• unrestricted cross-dimensional operations
• dynamic sparse matrix handling
• client-server architecture
• multi-user support
• accessibility
• transparency
• intuitive data manipulation
• consistent reporting performance
• flexible reporting

Because they encompass large volumes of data, data warehouses are generally an order of magnitude larger than the source databases. The sheer volume of data is an issue that has been dealt with through enterprise-wide data warehouses, virtual data warehouses, and data marts:
• Enterprise-wide data warehouses are huge projects requiring massive investment of time and resources.
• Virtual data warehouses provide views of operational databases that are materialized for efficient access.
• Data marts are targeted to a subset of the organization, such as a department, and are more tightly focused.
Types of Data Mining:

The term “knowledge” is very broadly interpreted as involving some degree of intelligence. Knowledge is often classified as inductive and deductive. Knowledge can be represented in many forms: in an unstructured sense, it can be represented by rules or propositional logic. In a structured form, it may be represented in decision trees, semantic networks, neural networks, or hierarchies of classes or frames. The knowledge discovered during data mining can be described in five ways, as follows.

1. Association rules - These rules correlate the presence of a set of items with another range of values for another set of variables. Examples: (1) When a female retail shopper buys a handbag, she is likely to buy shoes. (2) An X-ray image containing characteristics a and b is likely to also exhibit characteristic c.

2. Classification hierarchies - The goal is to work from an existing set of events or transactions to create a hierarchy of classes. Examples: (1) A population may be divided into five ranges of creditworthiness based on a history of previous credit transactions. (2) A model may be developed for the factors that determine the desirability of a store location on a 1-10 scale. (3) Mutual funds may be classified based on performance data using characteristics such as growth, income, and stability.

3. Sequential patterns - A sequence of actions or events is sought. Example: If a patient underwent cardiac bypass surgery for blocked arteries and an aneurysm and later developed high blood urea within a year of surgery, he or she is likely to suffer from kidney failure within the next 18 months. Detection of sequential patterns is equivalent to detecting associations among events with certain temporal relationships.

4. Patterns within time series - Similarities can be detected within positions of a time series. Three examples follow, the first using stock market price data as a time series: (1) Stocks of a utility company, ABC Power, and a financial company, XYZ Securities, show the same pattern during 1998 in terms of closing stock price. (2) Two products show the same selling pattern in summer but a different one in winter. (3) A pattern in solar magnetic wind may be used to predict changes in earth atmospheric conditions.

5. Categorization and segmentation - A given population of events or items can be partitioned (segmented) into sets of “similar” elements. Examples: (1) An entire population of treatment data on a disease may be divided into groups based on similarities of side effects produced. (2) The adult population may be categorized into five groups from “most likely to buy” to “least likely to buy” a new product. (3) The web accesses made by a collection of users against a set of documents may be analyzed in terms of the keywords of the documents to reveal clusters or categories of users.
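The handbag and shoes example from the association rules item can be made concrete with the two measures commonly used to judge such rules, support and confidence. These measure names are standard in the field but are not introduced in this report, and the five shopping baskets below are invented for illustration.

```python
# Hypothetical shopping baskets, invented for illustration.
baskets = [
    {"handbag", "shoes"},
    {"handbag", "shoes", "scarf"},
    {"handbag"},
    {"shoes"},
    {"scarf"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the rule never applies, so report no confidence
    return support(antecedent | consequent, transactions) / base


rule_support = support({"handbag", "shoes"}, baskets)
rule_confidence = confidence({"handbag"}, {"shoes"}, baskets)
print(f"handbag -> shoes: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule is usually reported only when both numbers pass thresholds chosen by the analyst: support guards against patterns seen in too few baskets, and confidence measures how reliably the rule holds when it applies.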
How Data Mining Works

How exactly is data mining able to tell you important things that you didn't know or what is going to happen next? The technique that is used to perform these feats in data mining is called modeling. Modeling is simply the act of building a model in one situation where you know the answer and then applying it to another situation where you don't. For instance, if you were looking for a sunken Spanish galleon on the high seas, the first thing you might do is research the times when Spanish treasure had been found by others in the past. You might note that these ships often tend to be found off the coast of Bermuda, that there are certain characteristics to the ocean currents, and that certain routes were likely taken by the ships' captains in that era. You note these similarities and build a model that includes the characteristics that are common to the locations of these sunken treasures. With these models in hand, you sail off looking for treasure where your model indicates it is most likely to be, given a similar situation in the past. Hopefully, if you've got a good model, you find your treasure.

This act of model building is thus something that people have been doing for a long time, certainly before the advent of computers or data mining technology. What happens on computers, however, is not much different from the way people build models. Computers are loaded up with lots of information about a variety of situations where an answer is known, and then the data mining software on the computer must run through that data and distill the characteristics of the data that should go into the model. Once the model is built, it can be used in similar situations where you don't know the answer.

For example, say that you are the director of marketing for a telecommunications company and you'd like to acquire some new long distance phone customers. You could just randomly go out and mail coupons to the general population, just as you could randomly sail the seas looking for sunken treasure. In neither case would you achieve the results you desired, and of course you have the opportunity to do much better than random: you could use your business experience stored in your database to build a model.

As the marketing director you have access to a lot of information about all of your customers: their age, sex, credit history and long distance calling usage. The good news is that you also have a lot of information about your prospective customers: their age, sex, credit history, etc. Your problem is that you don't know the long distance calling usage of these prospects (since they are most likely now customers of your competition). You'd like to concentrate on those prospects who have large amounts of long distance usage. You can accomplish this by building a model. Table 2 illustrates the data used for building a model for new customer prospecting in a data warehouse.
                                                      Customers   Prospects
General information (e.g. demographic data)           Known       Known
Proprietary information (e.g. customer transactions)  Known       Target
Table 2 - Data Mining for Prospecting

The goal in prospecting is to make some calculated guesses about the information in the lower right hand quadrant, based on the model that we build going from Customer General Information to Customer Proprietary Information. For instance, a simple model for a telecommunications company might be:

98% of my customers who make more than $60,000/year spend more than $80/month on long distance

This model could then be applied to the prospect data to try to tell something about the proprietary information that this telecommunications company does not currently have access to. With this model in hand, new customers can be selectively targeted.

Test marketing is an excellent source of data for this kind of modeling. Mining the results of a test market representing a broad but relatively small sample of prospects can provide a foundation for identifying good prospects in the overall market. Table 3 shows another common scenario for building models: predicting what is going to happen in the future.

                                                      Yesterday   Today   Tomorrow
Static information and current plans
(e.g. demographic data, marketing plans)              Known       Known   Known
Dynamic information (e.g. customer transactions)      Known       Known   Target

Table 3 - Data Mining for Predictions

If someone told you that he had a model that could predict customer usage, how would you know if he really had a good model? The first thing you might try would be to ask him to apply his model to your customer base, where you already know the answer. With data mining, the best way to accomplish this is by setting aside some of your data in a vault to isolate it from the mining process. Once the mining is complete, the results can be tested against the data held in the vault to confirm the model's validity. If the model works, its observations should hold for the vaulted data.
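The "vault" idea above can be sketched in a few lines. This is an illustrative outline only: the field names `income` and `monthly_long_distance`, the 25% vault size, and the synthetic customers are assumptions, and the "model" is simply the income rule quoted earlier.

```python
import random
from typing import Dict, List, Optional, Tuple

Row = Dict[str, float]


def split_vault(rows: List[Row], vault_fraction: float = 0.25,
                seed: int = 42) -> Tuple[List[Row], List[Row]]:
    """Shuffle reproducibly and set aside a vault that mining never sees."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * vault_fraction)
    return shuffled[cut:], shuffled[:cut]  # (mining set, vault)


def rule_hit_rate(rows: List[Row], income_floor: float = 60_000,
                  spend_floor: float = 80) -> Optional[float]:
    """Share of customers above income_floor who also spend above spend_floor."""
    matching = [r for r in rows if r["income"] > income_floor]
    if not matching:
        return None  # the rule cannot be checked on this sample
    hits = sum(1 for r in matching if r["monthly_long_distance"] > spend_floor)
    return hits / len(matching)


# Synthetic customers, invented for illustration only.
customers = [
    {"income": 30_000 + 5_000 * i, "monthly_long_distance": 20 + 10 * i}
    for i in range(20)
]

mining_set, vault = split_vault(customers)
mined_rate = rule_hit_rate(mining_set)  # what the model claims
vault_rate = rule_hit_rate(vault)       # independent check on held-out data
```

If the rate measured on the vault is far below the rate found during mining, the model has probably memorized quirks of the mining set rather than a pattern that generalizes.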
Goals of Data Mining

• Prediction: Data mining can show how certain attributes within the data will behave in the future. Examples of predictive data mining include the analysis of buying transactions to predict what consumers will buy under certain discounts, how much sales volume a store would generate in a given period, and whether deleting a product line would yield more profits. In such applications, business logic is used together with data mining. In a scientific context, certain seismic wave patterns may predict an earthquake with high probability.

• Identification: Data patterns can be used to identify the existence of an item, an event, or an activity. For example, intruders trying to break into a system may be identified by the programs executed, the files accessed, and the CPU time per session. In biological applications, the existence of a gene may be identified by certain sequences of nucleotide symbols in the DNA sequence. The area known as authentication is a form of identification. It ascertains whether a user is indeed a specific user or one from an authorized class; it involves a comparison of parameters, images, or signals against a database.

• Classification: Data mining can partition data so that different classes or categories can be identified based on combinations of parameters. For example, customers in a supermarket can be categorized into discount-seeking shoppers, shoppers in a rush, loyal regular shoppers, and infrequent shoppers. This classification is used in the analysis of customer buying transactions as a post-mining activity. Classification based on common domain knowledge is also used as an input to decompose the mining problem and make it simpler. For instance, health foods, party foods, and school lunch foods are distinct categories in the supermarket business. It makes sense to analyze relationships within and across categories as separate problems. Such categorization may be used to encode data appropriately before subjecting it to further data mining.

• Optimization: One eventual goal of data mining may be to optimize the use of limited resources such as time, space, money, or materials, and to maximize output variables such as sales or profits under a given set of constraints. This goal of data mining resembles the objective function used in operations research problems that deal with optimization under constraints.
Integrating Data Mining and Campaign Management

The closer Data Mining and Campaign Management work together, the better the business results. Today, Campaign Management software uses the scores generated by the Data Mining model to sharpen the focus of targeted customers or prospects, thereby increasing response rates and campaign effectiveness. Unfortunately, the use of a model within Campaign Management today is often a manual, time-intensive process. When someone in marketing wants to run a campaign that uses model scores, he or she usually calls someone in the modeling group to get a file containing the database scores. With the file in hand, the marketer must then solicit the help of someone in the information technology group to merge the scores with the marketing database.

This disjointed process is fraught with problems:

• The large numbers of campaigns that run on a daily or weekly basis can be difficult to schedule and can swamp the available resources.
• The process is error prone; it is easy to score the wrong database or the wrong fields in a database.
• Scoring is typically very inefficient. Entire databases are usually scored, not just the segments defined for the campaign. Not only is effort wasted, but the manual process may also be too slow to keep up with campaigns run weekly or daily.

The solution to these problems is the tight integration of Data Mining and Campaign Management technologies. Integration is crucial in two areas.

First, the Campaign Management software must share the definition of the defined campaign segment with the Data Mining application to avoid modeling the entire database. For example, a marketer may define a campaign segment of high-income males between the ages of 25 and 35 living in the northeast. Through the integration of the two applications, the Data Mining application can automatically restrict its analysis to database records containing just those characteristics.

Second, selected scores from the resulting predictive model must flow seamlessly into the campaign segment in order to form targets with the highest profit potential.
The integrated Data Mining and Campaign Management process

This section examines how to apply the integration of Data Mining and Campaign Management to benefit the organization. The first step creates a model using a Data Mining tool. The second step takes this model and puts it to use in the production environment of an automated database marketing campaign.

Step 1: Creating the model

An analyst or user with a background in modeling creates a predictive model using the Data Mining application. This modeling is usually completely separate from campaign creation. The complexity of the model creation typically depends on many factors, including database size, the number of variables known about each customer, the kind of Data Mining algorithms used and the modeler's experience.

Interaction with the Campaign Management software begins when a model of sufficient quality has been found. At this point, the Data Mining user exports his or her model to a Campaign Management application, which can be as simple as dragging and dropping the data from one application to the other. This process of exporting a model tells the Campaign Management software that the model exists and is available for later use.

Step 2: Dynamically scoring the data

Dynamic scoring allows you to score an already-defined customer segment within your Campaign Management tool rather than in the Data Mining tool. Dynamic scoring both avoids mundane, repetitive manual chores and eliminates the need to score an entire database. Instead, dynamic scoring marks only relevant customer subsets and only when needed. Scoring only the relevant customer subset and eliminating the manual process shrinks cycle times. Scoring data only when needed assures "fresh," up-to-date results.

Once a model is in the Campaign Management system, a user (usually someone other than the person who created the model) can start to build marketing campaigns using the predictive models. Models are invoked by the Campaign Management system. When a marketing campaign invokes a specific predictive model to perform dynamic scoring, the output is usually stored as a temporary score table. When the score table is available in the data warehouse, the Data Mining engine notifies the Campaign Management system and the marketing campaign execution continues.
Here is how a dynamically scored customer segment might be defined:

    Where Length_of_service = 9
    and Average_balance > 150
    and In_Model(promo9).score > 0.80

In this example:

• Length_of_service = 9 limits the application of the model to those customers in the ninth month of their 12-month contracts, thus targeting customers only at the most vulnerable time. (In reality, there is likely a variety of contract lengths to consider when formulating the selection criteria.)
• Average_balance > 150 selects only customers spending, on average, more than $150 each month. The marketer deemed that it would be unprofitable to send the offer to less valuable customers.
• promo9 is the name of a logged predictive model that was created with a Data Mining application. This criterion includes a threshold score, 0.80, which a customer must surpass to be considered "in the model." This third criterion limits the campaign to just those customers in the model, i.e. those customers most likely to require an inducement to prevent them from switching to a competitor.
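The selection criteria above can be sketched as a filter that scores only the customers already inside the segment. This is a hedged illustration rather than the behavior of any particular Campaign Management product: the `Customer` class, the `promo9_score` stand-in, and the sample records are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Customer:
    customer_id: int
    length_of_service: int   # months into the contract
    average_balance: float   # average monthly spend in dollars


scored_ids: List[int] = []   # records which customers were actually scored


def promo9_score(c: Customer) -> float:
    """Hypothetical stand-in for the logged model 'promo9'."""
    scored_ids.append(c.customer_id)
    return min(1.0, c.average_balance / 250.0)


def select_targets(customers: List[Customer],
                   model: Callable[[Customer], float],
                   threshold: float = 0.80) -> List[Customer]:
    """Apply the segment filter first, then score only that segment."""
    segment = [c for c in customers
               if c.length_of_service == 9 and c.average_balance > 150]
    return [c for c in segment if model(c) > threshold]


customers = [
    Customer(1, 9, 240.0),   # in segment, high score
    Customer(2, 9, 160.0),   # in segment, score below threshold
    Customer(3, 3, 300.0),   # wrong contract month, never scored
    Customer(4, 9, 120.0),   # balance too low, never scored
]

targets = select_targets(customers, promo9_score)
target_ids = [c.customer_id for c in targets]
```

Filtering before scoring is the point of the design: in this sketch only two of the four customers are ever passed to the model, which mirrors the claim that dynamic scoring avoids scoring an entire database.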
Data Mining and Campaign Management in the real world
Ideally, marketers who build campaigns should be able to apply any model logged in the Campaign Management system to a defined target segment. For example, a marketing manager at a cellular telephone company might be interested in high-value customers likely to switch to another carrier. This segment might be defined as customers who are nine months into a twelve-month contract, and whose average monthly balance is more than $150. The easiest way to retain these customers is to offer all of them a new high-tech telephone. However, this is expensive and wasteful, since many customers would remain loyal without any incentive.
The Benefits of integrating Data Mining and Campaign Management

For marketers:

• Improved campaign results through the use of model scores that further refine customer and prospect segments. Records can be scored when campaigns are ready to run, allowing the use of the most recent data. "Fresh" data and the selection of "high" scores within defined market segments improve direct marketing results.

• Accelerated marketing cycle times that reduce costs and increase the likelihood of reaching customers and prospects before competitors. Scoring takes place only for records defined by the customer segment, eliminating the need to score an entire database. This is important to keep pace with continuously running marketing campaigns with tight cycle times. Accelerated marketing "velocity" also increases the number of opportunities used to refine and improve campaigns. The end of each campaign cycle presents another chance to assess results and improve future campaigns.

• Increased accuracy through the elimination of manually induced errors. The Campaign Management software determines which records to score and when.

For statisticians:

• Less time spent on the mundane tasks of extracting and importing files, leaving more time for creative work: building and interpreting models. Statisticians have a greater impact on the corporate bottom line.
As a database marketer, you understand that some customers present much greater profit potential than others. But how will you find those high-potential customers in a database that contains hundreds of data items for each of millions of customers? Data Mining software can help find the "high-profit" gems buried in mountains of information. However, merely identifying your best prospects is not enough.

Rather than sending an offer to every high-value customer, the marketer could reduce costs and improve results by using a predictive model to select only those valuable customers who would likely defect to a competitor unless they receive the offer.
The Ten Steps of Data Mining

Here is a process for extracting hidden knowledge from your data warehouse, your customer information file, or any other company database.

1. Identify The Objective
Before you begin, be clear on what you hope to accomplish with your analysis. Know in advance the business goal of the data mining. Establish whether or not the goal is measurable. Some possible goals are to:
- Find sales relationships between specific products or services
- Identify specific purchasing patterns over time
- Identify potential types of customers
- Find product sales trends

2. Select The Data
Once you have defined your goal, your next step is to select the data to meet this goal. This may be a subset of your data warehouse or a data mart that contains specific product information. It may be your customer information file. Narrow the scope of the data to be mined as much as possible. Here are some key issues:
- Are the data adequate to describe the phenomena the data mining analysis is attempting to model?
- Can you enhance internal customer records with external lifestyle and demographic data?
- Are the data stable, that is, will the mined attributes be the same after the analysis?
- If you are merging databases, can you find a common field for linking them?
- How current and relevant are the data to the business goal?
3. Prepare The Data
Once you've assembled the data, you must decide which attributes to convert into usable formats. Consider the input of domain experts, the creators and users of the data.
- Establish strategies for handling missing data, extraneous noise, and outliers
- Identify redundant variables in the dataset and decide which fields to exclude
- Decide on a log or square transformation, if necessary
- Visually inspect the dataset to get a feel for the database
- Determine the distribution frequencies of the data
You can postpone some of these decisions until you select a data mining tool. For example, if you need a neural network or polynomial network, you may have to transform some of your fields.

4. Audit The Data
Evaluate the structure of your data in order to determine the appropriate tools.
- What is the ratio of categorical/binary attributes in the database?
- What is the nature and structure of the database?
- What is the overall condition of the dataset?
- What is the distribution of the dataset?
Balance the objective assessment of the structure of your data against your users' need to understand the findings. Neural nets, for example, don't explain their results.

5. Select The Tools
Two concerns drive the selection of the appropriate data mining tool: your business objectives and your data structure. Both should guide you to the same tool. Consider these questions when evaluating a set of potential tools:
- Is the data set heavily categorical?
- What platforms do your candidate tools support?
- Are the candidate tools ODBC-compliant?
- What data formats can the tools import?
No single tool is likely to provide the answer to your data mining project. Some tools integrate several technologies into a suite of statistical analysis programs, a neural network, and a symbolic classifier.
6. Format The Solution
In conjunction with your data audit, your business objective and the selection of your tool determine the format of your solution. The key questions are:
- What is the optimum format of the solution: decision tree, rules, C code, or SQL syntax?
- What are the available format options?
- What is the goal of the solution?
- What do the end-users need: graphs, reports, code?

7. Construct The Model
At this point the data mining processing begins. Usually the first step is to use a random number seed to split the data into a training set and a test set, and then to construct and evaluate a model. The generation of the classification rules, decision trees, clustering sub-groups, scores, code, weights and evaluation data/error rates takes place at this stage. Resolve these issues:
- Are error rates at an acceptable level? Can you improve them?
- What extraneous attributes did you find? Can you purge them?
- Is additional data or a different methodology necessary?
- Will you have to train and test a new data set?

8. Validate The Findings
Share and discuss the results of the analysis with the business client or domain expert. Ensure that the findings are correct and appropriate to the business objectives.
- Do the findings make sense?
- Do you have to return to any prior steps to improve results?
- Can you use other data mining tools to replicate the findings?

9. Deliver The Findings
Provide a final report to the business unit or client. The report should include source code and rules. Some of the issues are:
- Will additional data improve the analysis?
- What strategic insight did you discover and how is it applicable?
- What proposals can result from the data mining analysis?
- Do your findings meet the business objective?

10. Integrate The Solution
Share the findings with all interested end-users in the appropriate business units. You might wind up incorporating the results of the analysis into the company's business procedures. Some of the data mining solutions may involve:
- SQL syntax for distribution to end-users
- C code incorporated into a production system
- Rules integrated into a decision support system

Although data mining tools automate database analysis, they can lead to faulty findings and erroneous conclusions if you're not careful. Bear in mind that data mining is a business process with a specific goal: to extract a competitive insight from historical records in a database.
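Step 3 of the list above (missing data, outliers, and a log transformation) can be illustrated with the standard library alone. The sketch below is one possible set of choices, not a recommendation from the report: median imputation, Tukey fences for outliers, and `log(1 + x)` are assumptions, and the `monthly_spend` values are invented.

```python
import math
import statistics
from typing import List, Optional


def fill_missing(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the median of the known values."""
    median = statistics.median(v for v in values if v is not None)
    return [median if v is None else v for v in values]


def cap_outliers(values: List[float], k: float = 1.5) -> List[float]:
    """Clip values to the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [min(max(v, low), high) for v in values]


def log_transform(values: List[float]) -> List[float]:
    """Compress a right-skewed, non-negative field with log(1 + x)."""
    return [math.log1p(v) for v in values]


# Hypothetical field with two gaps and one extreme value.
monthly_spend = [20, 25, None, 30, 22, 28, 900, None, 26]

filled = fill_missing(monthly_spend)
capped = cap_outliers(filled)
prepared = log_transform(capped)
```

Each choice here changes what the mining tool will see, which is why the step asks for the input of domain experts before values are imputed or clipped.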
Evaluating the Benefits of a Data Mining Model

Other representations of the model often incorporate expected costs and expected revenues to provide the most important measure of model quality: profitability. A profitability graph can help determine how many prospects to include in a campaign. In the example considered here, contacting all customers results in a net loss, while selecting a threshold score of approximately 0.8 maximizes profitability.

For a closer look at how the use of model scores can improve profitability, consider an example campaign with the following assumptions:

• Database size: 2,000,000
• Maximum possible response: 40,000
• Cost to reach one customer: $1.00
• Profit margin from a positive response: $40.00

As the table below shows, a random sampling of the full customer/prospect database produces a loss regardless of the campaign target size. However, by targeting customers using a Data Mining model, the marketer can select a smaller target that includes a higher percentage of good prospects. This more focused approach generates a profit until the target becomes too large and includes too many poor prospects.
                            Random Selection                    Targeted Selection
Campaign Size  Cost         Response  Revenue     Net           Response  Revenue     Net
100,000        $100,000     2,000     $80,000     ($20,000)     4,000     $160,000    $60,000
400,000        $400,000     8,000     $320,000    ($80,000)     30,000    $1,200,000  $800,000
1,000,000      $1,000,000   20,000    $800,000    ($200,000)    35,000    $1,400,000  $400,000
2,000,000      $2,000,000   40,000    $1,600,000  ($400,000)    40,000    $1,600,000  ($400,000)
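The figures in the table follow directly from the stated assumptions: net equals responses times the $40 margin, minus campaign size times the $1 contact cost. The short sketch below recomputes them; the response counts are copied from the table, while the function and variable names are illustrative.

```python
from typing import List, Tuple

COST_PER_CONTACT = 1.00      # dollars, from the assumptions above
MARGIN_PER_RESPONSE = 40.00  # dollars, from the assumptions above


def campaign_net(campaign_size: int, responses: int) -> float:
    """Net result of a campaign: response revenue minus contact cost."""
    revenue = responses * MARGIN_PER_RESPONSE
    cost = campaign_size * COST_PER_CONTACT
    return revenue - cost


# (campaign size, random-selection responses, targeted-selection responses)
scenarios: List[Tuple[int, int, int]] = [
    (100_000, 2_000, 4_000),
    (400_000, 8_000, 30_000),
    (1_000_000, 20_000, 35_000),
    (2_000_000, 40_000, 40_000),
]

nets = [
    (size, campaign_net(size, rnd), campaign_net(size, tgt))
    for size, rnd, tgt in scenarios
]

# Campaign size with the highest net under targeted selection.
best_size = max(nets, key=lambda row: row[2])[0]
```

Random selection loses money at every size because a 2% response rate earns only $0.80 per $1.00 contact, whereas the targeted campaign peaks at 400,000 contacts and then declines as poorer prospects are added.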
The Data Mining Suite

The Data Mining Suite is truly unique, providing the most powerful, complete and comprehensive solution for enterprise-wide, large scale decision support. It leads the world of discovery with the exceptional ability to directly mine large multi-table SQL databases.

The Data Mining Suite works directly on large SQL repositories with no need for sampling or extract files. It accesses large volumes of multi-table relational data on the server, incrementally discovers powerful patterns and delivers automatically generated English text and graphs as explainable documents on the intranet. The Data Mining Suite is based on a solid foundation with a total vision for decision support. The three-tiered, server-based implementation provides highly scalable discovery on huge SQL databases with well over 90% of the computations performed directly on the server, in parallel if desired.
The Data Mining Suite relies on the genuinely unique mathematical foundation we pioneered to usher in a new level of functionality for decision support. This mathematical foundation has given rise to novel algorithms that work directly on very large datasets, delivering unprecedented power and functionality. The power of these algorithms allows us to discover rich patterns of knowledge in huge databases that could never have been found before.

With server-based discovery, the Data Mining Suite performs over 90% of the analyses on the server, with SQL, C programs and Java. Discovery takes place simultaneously along multiple dimensions on the server, and is not limited by the power of the client. The system analyzes both relational and multi-dimensional data, discovering highly refined patterns that reveal the real nature of the dataset. Using built-in advanced mathematical techniques, these findings are carefully merged by the system and the results are delivered to the user in plain English, accompanied by tables and graphs that highlight the key patterns.

The Data Mining Suite pioneered multi-dimensional data mining. Before this, OLAP had usually been a multi-dimensional manual endeavor, while data mining had been a single-dimensional automated activity. The Rule-based Influence Discovery System bridged the gap between OLAP and data mining. This dramatic new approach forever changed the way corporations use decision support. OLAP and data mining are no longer viewed as separate activities, but are fused to deliver maximum benefit. The patterns discovered by the system include multi-dimensional influences and contributions, OLAP affinities and associations, comparisons, trends and variations. The richness of these patterns delivers unparalleled business benefits to users, allowing them to make better decisions than ever before.

The Data Mining Suite also pioneered the use of incremental pattern-base population. With incremental data mining, the system automatically discovers changes in patterns as well as the patterns of change. For instance, each month sales data is mined, and the changes in the sales trends as well as the trends of change in how products sell together are added to the pattern-base. Over time, this knowledge becomes a key strategic asset to the corporation.
The Data Mining Suite currently consists of these modules:

• Rule-based Influence Discovery
• Dimensional Affinity Discovery
• Trend Discovery Module
• Incremental Pattern Discovery
• Forensic Discovery
• The Predictive Modeler

These truly unique products are all designed to work together, and in concert with the Knowledge Access Suite.
Rule-based Influence Discovery

The Rule-based Influence Discovery System is aware of both influences and contributions along multiple dimensions and merges them in an intelligent manner to produce very rich and powerful patterns that cannot be obtained by either OLAP or data mining alone. The system performs multi-table, dimensional data mining at the server level, providing the best possible results. The Rule-based Influence Discovery System is not a multi-dimensional repository, but a data mining system. It accesses granular data in a large database via standard SQL and reaches for multi-dimensional data via a ROLAP approach of the user's choosing.

Dimensional Affinity Discovery

The Affinity Discovery System automatically analyzes large datasets and finds association patterns that describe how various items "group together" or "happen together". Flat affinity just tells us how items group together, without providing logical conditions for the association. Dimensional (OLAP) affinity is more powerful and describes the dimensional conditions under which stronger item groupings take place. The Affinity Discovery System includes a number of useful features that make it a unique industrial-strength product. These features include hierarchy and cluster definitions, exclusion lists, and unknown-value management, among others.
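The difference between flat and dimensional affinity can be shown with a toy calculation. The sketch below is not the Affinity Discovery System's algorithm; it only contrasts an overall "also bought" rate with the same rate restricted to one value of a dimension. The transactions and the `day` dimension are invented.

```python
from typing import List, Optional, Set, Tuple

Transaction = Tuple[str, Set[str]]  # (day of week, items purchased)


def affinity(transactions: List[Transaction], item: str, also: str,
             day: Optional[str] = None) -> Optional[float]:
    """Share of baskets containing `item` that also contain `also`.

    With day=None this is a flat affinity; with a day given it is the
    affinity under that dimensional condition.
    """
    baskets = [items for d, items in transactions
               if item in items and (day is None or d == day)]
    if not baskets:
        return None
    return sum(1 for items in baskets if also in items) / len(baskets)


# Hypothetical hardware-store transactions.
transactions: List[Transaction] = [
    ("Sat", {"paintbrush", "paint"}),
    ("Sat", {"paintbrush", "paint"}),
    ("Sat", {"paintbrush", "paint", "tape"}),
    ("Mon", {"paintbrush"}),
    ("Tue", {"paintbrush", "tape"}),
    ("Wed", {"paintbrush", "paint"}),
    ("Mon", {"paint"}),
]

flat = affinity(transactions, "paintbrush", "paint")
saturday = affinity(transactions, "paintbrush", "paint", day="Sat")
```

In this invented data the grouping is noticeably stronger on Saturdays than overall, which is the kind of dimensional condition the text describes.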
The OLAP Discovery System

The OLAP Discovery System is aware of both influences and contributions along multiple dimensions and merges them in an intelligent manner to produce very rich and powerful patterns that cannot be obtained by either OLAP or data mining alone. The system merges OLAP and data mining at the server level, providing the best possible results. The OLAP Discovery System is not an OLAP engine or a multi-dimensional repository, but a data mining system. It accesses granular data in a large database via standard SQL and reaches for multi-dimensional data via an OLAP/ROLAP engine of the user's choosing.

Incremental Pattern Discovery

Incremental Pattern Discovery deals with temporal data segments that gradually become available over time, e.g. once a week or once a month. Data is periodically supplied to the Incremental Discovery System in terms of a "data snap-shot" which corresponds to a given time-segment, e.g. monthly sales figures. Patterns in the data snap-shot are found on a monthly basis and are added to the pattern-base. As new data becomes available (say once a month), the system automatically finds new patterns, merges them with the previous patterns, stores them in the pattern-base and notes the differences from the previous time-periods.

Trend Discovery

Trend Discovery with the Data Mining Suite uncovers time-related patterns that deal with change and variation of quantities and measures. The system expresses trends in terms of time-grains, time-windows, slopes and shapes. The time-grain defines the smallest grain of time to be considered, e.g. a day, a week or a month. Time-windows define how time-grains are grouped together, e.g. we may look at daily trends with weekly windows, or we may look at weekly grains with monthly windows. Slopes define how quickly a measure is increasing or decreasing, while shapes give us various categories of trend behavior, e.g. smoothly increasing vs. erratically changing.
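The notions of time-grain, time-window and slope can be made concrete with a small sketch. It assumes daily grains grouped into non-overlapping weekly windows and uses an ordinary least-squares slope per window; these choices, the `describe` labels, and the sales figures are illustrative assumptions rather than the product's actual method.

```python
from typing import List, Sequence


def slope(values: Sequence[float]) -> float:
    """Least-squares slope of values against their position 0, 1, 2, ..."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den if den else 0.0


def window_slopes(grains: Sequence[float], window: int) -> List[float]:
    """Group consecutive time-grains into non-overlapping windows and fit each."""
    return [slope(grains[i:i + window])
            for i in range(0, len(grains) - window + 1, window)]


def describe(s: float, tolerance: float = 0.05) -> str:
    """Very coarse label for a window's slope."""
    if s > tolerance:
        return "increasing"
    if s < -tolerance:
        return "decreasing"
    return "flat"


# Hypothetical daily sales: one rising week followed by one flat week.
daily_sales = [10, 12, 14, 16, 18, 20, 22] + [22] * 7

weekly = window_slopes(daily_sales, window=7)
labels = [describe(s) for s in weekly]
```

Changing `window` to 30 with the same daily grains would give monthly windows; describing "shapes" such as erratic change would need a further measure, for example the spread of residuals around the fitted line.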
Forensic Discovery

Forensic Discovery with the Data Mining Suite relies on automatic anomaly detection. The system first identifies what is usual and establishes a set of norms through pattern discovery. The transactions or activities that deviate from the norm are then identified as unusual. Business users can discover where unusual activities may be originating, and the proper steps can be taken to remedy and control the problems. The automatic discovery of anomalies is essential because the ingenious tactics used to spread activities across multiple transactions usually cannot be guessed beforehand.

Predictive Modeler

The Data Mining Suite Predictive Modeler makes predictions and forecasts by using the rules and patterns which the data mining process generates. While induction performs pattern discovery to generate rules, the Predictive Modeler performs pattern matching to make predictions based on the application of these rules. The predictive models produced by the system have higher accuracy because the discovery process works on the entire dataset and need not rely on sampling.

The output from the seven component products of the Data Mining Suite is stored within the pattern-base and is accessible with PQL, the Pattern Query Language. Readable English text and graphs are automatically generated in ASCII and HTML formats for delivery on the inter/intranet.
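The forensic idea of "establish a norm, then flag deviations" can be sketched with a robust statistic. The example below is an assumption-laden illustration, not the Forensic Discovery module: it takes the median as the norm, measures spread with the median absolute deviation (MAD), and flags values more than `limit` MADs away. The amounts are invented.

```python
import statistics
from typing import List


def find_anomalies(amounts: List[float], limit: float = 5.0) -> List[int]:
    """Return indices whose distance from the median exceeds limit * MAD."""
    norm = statistics.median(amounts)             # what is "usual"
    deviations = [abs(a - norm) for a in amounts]
    mad = statistics.median(deviations)           # typical deviation
    if mad == 0:
        # All typical values are identical: anything different is unusual.
        return [i for i, d in enumerate(deviations) if d > 0]
    return [i for i, d in enumerate(deviations) if d / mad > limit]


# Hypothetical transaction amounts with one unusual entry.
amounts = [100, 102, 98, 101, 99, 103, 97, 500]

unusual = find_anomalies(amounts)
```

The median and MAD are used instead of the mean and standard deviation so that the anomaly itself does not distort the norm it is compared against.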
The Data Mining Suite is Unique

The Reasons for the Multi-faceted Power

The products in the Data Mining Suite deliver the most advanced and scalable technologies within a user-friendly environment. The specific reasons draw on the solid mathematical foundation which Information Discovery, Inc. pioneered, and on a highly scalable implementation.
The Data Mining Suite is distinguished by the following capabilities:
Direct Access to Very Large SQL Databases
The Data Mining Suite works directly on very large SQL databases and does not require samples, extracts or flat files. This avoids the problems associated with flat files, which lose the SQL engine's power (e.g. parallel execution) and provide only marginal results. Another advantage of working on an SQL database is that the Data Mining Suite can deal with numeric and non-numeric data uniformly. It does not fix the ranges in numerical data beforehand, but finds ranges in the data dynamically.
Multi-Table Discovery
The Data Mining Suite discovers patterns in multi-table SQL databases without having to join the tables and build an extract file. This is a key issue in mining large databases. The world is full of multi-table databases that cannot be joined and meshed into a single view. In fact, the theory of normalization came about because data needs to be in more than one table. Using single tables is an affront to all the work of E.F. Codd on database design. If you challenge the DBA of a really large database to put everything in a single table, you will get either a laugh or a blank stare; in many cases the database size would balloon beyond control. There are also many cases where no single view can correctly represent the semantics of influence, because the ratios will always be off regardless of how you join. The Data Mining Suite has the unique ability to mine large multi-table databases.
No Sampling or Extracts
Sampling theory was invented because one could not have access to the underlying population being analyzed. But a warehouse is there to provide such access.
General and Powerful Patterns
The format of the patterns discovered by the Data Mining Suite is very general and goes far beyond decision trees or simple affinities. The advantage is that the general rules discovered are far more powerful than decision trees. Decision trees are very limited in that they cannot find all the information in the database. Being rule-based keeps the Data Mining Suite from being constrained to one part of the search space and ensures that many more clusters and patterns are found, allowing it to provide more information and better predictions.
Language of Expression
The Data Mining Suite has a powerful language of expression, going well beyond what most other systems can handle. For instance, it can express logical statements such as "IF Destination State = Departure State THEN ..." or "IF State is not Arizona THEN ...". Surprisingly, most other data mining systems cannot express these simple patterns. The Data Mining Suite also pioneered dimensional affinities such as "IF Day = Saturday WHEN PaintBrush is purchased ALSO Paint is purchased". Again, most other systems cannot handle this obvious logic.
Uniform Treatment of Numeric and Non-numeric Data
The Data Mining Suite is unique in its ability to deal with various data types in a uniform manner. It can smoothly deal with a large number of non-numeric values and also automatically discovers ranges within numeric data. Moreover, the Data Mining Suite does not fix the ranges in numerical data but discovers interesting ranges by itself. For example, given the field Age, the Data Mining Suite does not expect this to be broken into 3 segments of (1-30), (31-60), (61 and above). Instead it may find two ranges such as (27-34) and (48-61) as important in the data set and will use these in addition to the other ranges.
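To make the Age example concrete, here is a minimal sketch of discovering ranges from the data instead of fixing them beforehand. It marks an age as interesting when its response rate is well above the overall rate, then merges neighboring interesting ages into ranges. The function name, the `min_lift` parameter and the synthetic records are assumptions made for this illustration; the actual range-finding method of the Data Mining Suite is not described in the source.

```python
from collections import defaultdict

def discover_ranges(records, min_lift=1.2):
    """Find runs of neighboring values whose outcome rate is at least min_lift times the overall rate.

    records: iterable of (numeric_value, outcome) pairs, outcome being True or False.
    Returns a list of (low, high) ranges over the distinct values present in the data.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for value, outcome in records:
        totals[value] += 1
        hits[value] += 1 if outcome else 0
    baseline = sum(hits.values()) / sum(totals.values())

    ranges, start, prev = [], None, None
    for value in sorted(totals):
        hot = hits[value] / totals[value] >= min_lift * baseline
        if hot and start is None:
            start = value                  # a new range opens here
        elif not hot and start is not None:
            ranges.append((start, prev))   # the range closed at the previous value
            start = None
        prev = value
    if start is not None:
        ranges.append((start, prev))
    return ranges

# Synthetic data: four customers per age; three of four buy inside the two
# "interesting" age bands, one of four buys elsewhere.
def in_band(age):
    return 27 <= age <= 34 or 48 <= age <= 61

records = []
for age in range(20, 66):
    buyers = 3 if in_band(age) else 1
    records += [(age, i < buyers) for i in range(4)]

print(discover_ranges(records))
```

With real data the per-value rates are noisy, so a practical version would smooth them or test each candidate range for statistical significance before reporting it.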
Use of Data Dependencies
Should a data mining system be aware of the functional (and other) dependencies that exist in a database? Yes, very much so. The use of these dependencies can significantly enhance the power of a discovery system; in fact, ignoring them can lead to confusion. The Data Mining Suite takes advantage of data dependencies.
Server-based Architectures
The Data Mining Suite has a three-level client-server architecture in which the user interface runs on a thin intranet client and the back-end analysis is done on a Unix server. The majority of the processing time is spent on the server, and these computations use both parallel SQL and non-SQL calls managed by the Data Mining Suite itself. Only about 50% of the computations on the server are SQL-based; the other, statistical computations are managed by the Data Mining Suite program itself, at times by starting separate processes on different nodes of the server.
System Initiative
The Data Mining Suite uses system initiative in the data mining process. It forms hypotheses automatically based on the character of the data and converts them into SQL statements that are forwarded to the RDBMS for execution. The Data Mining Suite then selects the significant patterns and filters out the unimportant trends.
Transparent Discovery and Predictions
The Data Mining Suite provides explanations as to how the patterns are being derived. This is unlike neural nets and other opaque techniques in which the mining process is a mystery. The results are also transparent when performing predictions. Many business users insist on understandable and transparent results.
Not Noise Sensitive
The Data Mining Suite is not sensitive to noise because internally it uses fuzzy logic analysis. As the data gathers noise, the Data Mining Suite will only reduce the level of confidence associated with the results provided. However, it will still produce the most significant findings from the data set.
Analysis of Large Databases
The Data Mining Suite has been specifically tuned to work on databases with an extremely large number of rows. It can deal with data sets of 50 to 100 million records on parallel machines. It derives its capabilities from the fact that it does not need to write extracts and uses SQL statements to perform its process. Generally the analyses performed in the Data Mining Suite are performed on about 50 to 120 variables and 30 to 100 million records directly. The number of records can, however, be increased further by using the specific optimization options the Data Mining Suite provides for very large databases.
These features and benefits make the Data Mining Suite well suited to large-scale data mining in business and industry.
What Data Mining Can’t Do
Data mining is a tool, not a magic wand. It won’t sit in your database watching what happens and send you e-mails to get your attention when it sees an interesting pattern. It doesn’t eliminate the need to know your business, to understand your data, or to understand analytical methods. Data mining assists business analysts with finding patterns and relationships in the data; it does not tell you the value of the patterns to the organization. Furthermore, the patterns uncovered by data mining must be verified in the real world.
Remember that the predictive relationships found via data mining are not necessarily causes of an action or behavior. For example, data mining might determine that males with incomes between $50,000 and $65,000 who subscribe to certain magazines are likely purchasers of a product you want to sell. While you can take advantage of this pattern, say by aiming your marketing at people who fit the pattern, you should not assume that any of these factors cause them to buy your product.
To ensure meaningful results, it’s vital that you understand your data. The quality of your output will often be sensitive to outliers (data values that are very different from the typical values in your database), irrelevant columns or columns that vary together (such as age and date of birth), the way you encode your data, and the data you leave in and the data you exclude. Algorithms vary in their sensitivity to such data issues, but it is unwise to depend on a data-mining product to make all the right decisions on its own.
Data mining will not automatically discover solutions without guidance. Rather than setting the vague goal, “Help improve the response to my direct mail solicitation”, you might use data mining to find the characteristics of people who (1) respond to your solicitation, or (2) respond AND make a large purchase.
The patterns data mining finds for those two goals may be very different.
Although a good data mining tool shelters you from the intricacies of statistical techniques, it requires you to understand the working of the tools you choose and the algorithms on which they are based. The choices you make in setting up your data mining tool and the optimizations you choose will affect the accuracy and speed of your models.
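One of the data issues mentioned above, columns that vary together such as age and date of birth, can be checked before mining. The sketch below flags column pairs whose Pearson correlation is close to +1 or -1. The column names, the sample values and the 0.95 threshold are assumptions chosen for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)   # undefined (division by zero) for a constant column

def redundant_pairs(columns, threshold=0.95):
    """Return pairs of column names whose absolute correlation reaches the threshold."""
    names = list(columns)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if abs(pearson(columns[a], columns[b])) >= threshold]

# Made-up customer table: birth_year is just a re-encoding of age.
columns = {
    "age":        [25, 32, 47, 51, 38],
    "birth_year": [1999, 1992, 1977, 1973, 1986],
    "income":     [40, 38, 75, 52, 90],
}
print(redundant_pairs(columns))
```

When a pair is flagged, the analyst would usually keep only one of the two columns, since the second adds no information and can distort some algorithms.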
Difficulties of Implementing Data Warehouses
Some significant operational issues arise with data warehousing: construction, administration, and quality control. Project management (the design, construction, and implementation of the warehouse) is an important and challenging consideration that should not be underestimated. The building of an enterprise-wide warehouse in a large organization is a major undertaking, potentially taking years from conceptualization to implementation. Because of the difficulty and amount of lead time required for such an undertaking, the widespread development and deployment of data marts may provide an attractive alternative, especially to those organizations with urgent needs for OLAP, DSS, and/or data mining support.
The administration of a data warehouse is an intensive enterprise, proportional to the size and complexity of the warehouse. An organization that attempts to administer a data warehouse must realistically understand the complex nature of its administration. Although designed for read access, a data warehouse is no more a static structure than any of its information sources. Source databases can be expected to evolve. The warehouse schema and acquisition component must be expected to be updated to handle these evolutions.
A significant issue in data warehousing is the quality control of data. Both quality and consistency of data are major concerns. Although the data passes through a cleaning function during acquisition, quality and consistency remain significant issues for the database administrator. Melding data from heterogeneous and disparate sources is a major challenge given differences in naming, domain definitions, identification numbers, and the like. Every time a source database changes, the warehouse administrator must consider the possible interactions with other elements of the warehouse.
Usage projections should be estimated conservatively prior to construction of the data warehouse and should be revised continually to reflect current requirements. As utilization patterns become clear and change over time, storage and access paths can be tuned to remain optimized for support of the organization’s use of its warehouse. This activity should continue throughout the life of the warehouse in order to remain ahead of demand. The warehouse should also be designed to accommodate addition and attrition of data sources without major redesign.
Administration of a data warehouse will require far broader skills than are needed for traditional database administration. A team of highly skilled technical experts with overlapping areas of expertise will likely be needed, rather than a single individual. Like database administration, data warehouse administration is only partly technical; a large part of the responsibility requires working efficiently with all the members of the organization with an interest in the data warehouse. However difficult that can be at times for database administrators, it is that much more challenging for data warehouse administrators, as the scope of their responsibilities is considerably broader.
Design of the management function and selection of the management team for a data warehouse are crucial. Managing the data warehouse in a large organization will surely be a major task. Many commercial tools are already available to support management functions. Effective data warehouse management will certainly be a team function, requiring a wide set of technical skills, careful coordination, and effective leadership. Just as we must prepare for the evolution of the warehouse, we must also recognize that the skills of the management team will, of necessity, evolve with it.
The Scope of Data Mining
Data mining derives its name from the similarities between searching for valuable business information in a large database (for example, finding linked products in gigabytes of store scanner data) and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material, or intelligently probing it to find exactly where the value resides. Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:
• Automated prediction of trends and behaviors. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered quickly and directly from the data. A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.
• Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.
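The retail example above, finding products that are often purchased together, can be sketched by counting how often pairs of items share a basket. The code below reports each frequent pair with its support (the share of baskets containing both items) and its confidence (how often the second item appears given the first). The basket contents, the function name and the `min_support` threshold are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.4):
    """Return (if_item, then_item, support, confidence) for every frequent item pair."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in set(basket))
    pair_counts = Counter(pair
                          for basket in baskets
                          for pair in combinations(sorted(set(basket)), 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            # Each frequent pair yields one rule in each direction.
            rules.append((a, b, support, count / item_counts[a]))
            rules.append((b, a, support, count / item_counts[b]))
    return rules

# Made-up store scanner data: one set of items per purchase.
baskets = [
    {"paint", "paintbrush"},
    {"paint", "paintbrush", "tape"},
    {"paint"},
    {"bread", "milk"},
    {"paintbrush", "paint"},
]
for rule in pair_rules(baskets):
    print(rule)
```

Here every basket with a paintbrush also contains paint, so the rule from paintbrush to paint has confidence 1.0, while the reverse rule is weaker because paint is sometimes bought alone.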
Data mining techniques can yield the benefits of automation on existing software and hardware platforms, and can be implemented on new systems as existing platforms are upgraded and new products developed. When data mining tools are implemented on high performance parallel processing systems, they can analyze massive databases in minutes. Faster processing means that users can automatically experiment with more models to understand complex data. High speed makes it practical for users to analyze huge quantities of data. Larger databases, in turn, yield improved predictions.
Databases can be larger in both depth and breadth:
• More columns. Analysts must often limit the number of variables they examine when doing hands-on analysis due to time constraints. Yet variables that are discarded because they seem unimportant may carry information about unknown patterns. High performance data mining allows users to explore the full depth of a database, without preselecting a subset of variables.
• More rows. Larger samples yield lower estimation errors and variance, and allow users to make inferences about small but important segments of a population.
A recent Gartner Group Advanced Technology Research Note listed data mining and artificial intelligence at the top of the five key technology areas that "will clearly have a major impact across a wide range of industries within the next 3 to 5 years." [2] Gartner also listed parallel architectures and data mining as two of the top 10 new technologies in which companies will invest during the next 5 years. According to a recent Gartner HPC Research Note, "With the rapid advance in data capture, transmission and storage, large-systems users will increasingly need to implement new and innovative ways to mine the after-market value of their vast stores of detail data, employing MPP [massively parallel processing] systems to create new sources of business advantage (0.9 probability)." [3]
The most commonly used techniques in data mining are:
• Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
• Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID).
• Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
• Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
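Of the techniques listed above, the nearest neighbor method is the simplest to show in code. The sketch below classifies a new record by majority vote among the k most similar records in a historical dataset, using Euclidean distance as the similarity measure. The two features (age, income in thousands) and the labels are made up for illustration.

```python
from collections import Counter
from math import dist

def knn_classify(history, query, k=3):
    """Classify query by majority vote among its k nearest historical records.

    history: list of (feature_tuple, label) pairs; query: a feature tuple.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    neighbors = sorted(history, key=lambda record: dist(record[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Made-up historical records: (age, income in thousands) -> responded to an offer?
history = [
    ((25, 30), "no"),
    ((28, 32), "no"),
    ((30, 35), "no"),
    ((45, 80), "yes"),
    ((48, 85), "yes"),
    ((50, 90), "yes"),
]
print(knn_classify(history, (47, 82)))
print(knn_classify(history, (27, 31)))
```

In practice the features should be put on comparable scales first; otherwise the one with the largest numeric range dominates the distance.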
Profitable Applications
A wide range of companies have deployed successful applications of data mining. While early adopters of this technology have tended to be in information-intensive industries such as financial services and direct mail marketing, the technology is applicable to any company looking to leverage a large data warehouse to better manage its customer relationships. Two critical factors for success with data mining are a large, well-integrated data warehouse and a well-defined understanding of the business process within which data mining is to be applied (such as customer prospecting, retention, campaign management, and so on).
Some successful application areas include:
• A pharmaceutical company can analyze its recent sales force activity and its results to improve targeting of high-value physicians and determine which marketing activities will have the greatest impact in the next few months. The data needs to include competitor market activity as well as information about the local health care systems. The results can be distributed to the sales force via a wide-area network that enables the representatives to review the recommendations from the perspective of the key attributes in the decision process. The ongoing, dynamic analysis of the data warehouse allows best practices from throughout the organization to be applied in specific sales situations.
• A credit card company can leverage its vast warehouse of customer transaction data to identify customers most likely to be interested in a new credit product. Using a small test mailing, the attributes of customers with an affinity for the product can be identified. Recent projects have indicated more than a 20-fold decrease in costs for targeted mailing campaigns over conventional approaches.
• A diversified transportation company with a large direct sales force can apply data mining to identify the best prospects for its services. Using data mining to analyze its own customer experience, this company can build a unique segmentation identifying the attributes of high-value prospects. Applying this segmentation to a general business database such as those provided by Dun & Bradstreet can yield a prioritized list of prospects by region.
• A large consumer package goods company can apply data mining to improve its sales process to retailers. Data from consumer panels, shipments, and competitor activity can be applied to understand the reasons for brand and store switching. Through this analysis, the manufacturer can select promotional strategies that best reach its target customer segments.
Each of these examples has a clear common ground. They leverage the knowledge about customers implicit in a data warehouse to reduce costs and improve the value of customer relationships. These organizations can now focus their efforts on the most important (profitable) customers and prospects, and design targeted marketing strategies to best reach them.
Functionality of Data Warehouses
Data warehouses exist to facilitate complex, data-intensive, and frequent ad hoc queries. Accordingly, data warehouses must provide far greater and more efficient query support than is demanded of traditional databases. The data warehouse access component supports enhanced spreadsheet functionality, efficient query processing, structured queries, ad hoc queries, data mining, and materialized views. In particular, enhanced spreadsheet functionality includes support for state-of-the-art spreadsheet applications (e.g., MS Excel) as well as for OLAP application programs. These offer preprogrammed functionalities such as the following:
• Roll-up: Data is summarized with increasing generalization (weekly to quarterly to annually).
• Drill-down: Increasing levels of detail are revealed (the complement of roll-up).
• Pivot: Cross tabulation (also referred to as rotation) is performed.
• Slice and dice: Projection operations are performed on the dimensions.
• Sorting: Data is sorted by ordinal value.
• Selection: Data is available by value or range.
• Derived (computed) attributes: Attributes are computed by operations on stored and derived values.
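Roll-up and drill-down from the list above can be illustrated on a tiny fact table. In the sketch below each row carries the time dimensions year, quarter and week plus a sales amount; rolling up keeps fewer leading dimensions and sums the rest, and drilling down is the same call with more dimensions kept. The table layout and figures are hypothetical, and a real OLAP tool would do this inside the warehouse, not in application code.

```python
from collections import defaultdict

# Made-up fact rows: (year, quarter, week, amount)
sales = [
    (2023, 1, 1, 100),
    (2023, 1, 2, 120),
    (2023, 2, 14, 90),
    (2024, 1, 1, 150),
    (2024, 1, 3, 50),
]

def roll_up(rows, level):
    """Sum amounts over the first `level` time dimensions.

    level=3 keeps weekly detail, level=2 gives quarterly totals,
    level=1 gives annual totals.
    """
    totals = defaultdict(int)
    for *dims, amount in rows:
        totals[tuple(dims[:level])] += amount
    return dict(totals)

print(roll_up(sales, 2))   # quarterly totals
print(roll_up(sales, 1))   # annual totals (rolled up further)
print(roll_up(sales, 3))   # weekly detail (drilled back down)
```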
Glossary of Data Mining Terms
Analytical model: A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.
Anomalous data: Data that result from errors (for example, data entry keying errors) or that represent unusual events. Anomalous data should be examined carefully because it may carry important information.
Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
CART: Classification and Regression Trees. A decision tree technique used for classification of a dataset. Provides a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. Segments a dataset by creating 2-way splits. Requires less data preparation than CHAID.
CHAID: Chi Square Automatic Interaction Detection. A decision tree technique used for classification of a dataset. Provides a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. Segments a dataset by using chi square tests to create multi-way splits. Preceded CART and requires more data preparation than CART.
Data cleansing: The process of ensuring that all values in a dataset are consistent and correctly recorded.
Data mining: The extraction of hidden predictive information from large databases.
Data visualization: The visual interpretation of complex relationships in multidimensional data.
Data warehouse: A system for storing and delivering massive quantities of data.
Multidimensional database: A database designed for on-line analytical processing. Structured as a multidimensional hypercube with one axis per dimension.
Nearest neighbor: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called a k-nearest neighbor technique.
OLAP: On-line analytical processing. Refers to array-oriented database applications that allow users to view, navigate through, manipulate, and analyze multidimensional databases.
Predictive model: A structure and process for predicting the values of specified variables in a dataset.
Prospective data analysis: Data analysis that predicts future trends, behaviors, or events based on historical data.
RAID: Redundant Array of Inexpensive Disks. A technology for the efficient parallel storage of data for high-performance computer systems.
Conclusion
Data mining offers great promise in helping organizations uncover patterns hidden in their data that can be used to predict the behavior of customers, products and processes. However, data mining tools need to be guided by users who understand the business, the data, and the general nature of the analytical methods involved. Realistic expectations can yield rewarding results across a wide range of applications, from improving revenues to reducing costs.
Building models is only one step in knowledge discovery. It’s vital to properly collect and prepare the data, and to check the models against the real world. The “best” model is often found after building models of several different types, or by trying different technologies or algorithms.
Choosing the right data mining products means finding a tool with good basic capabilities, an interface that matches the skill level of the people who’ll be using it, and features relevant to your specific business problems. After you have narrowed down the list of potential solutions, get a hands-on trial of the likeliest ones.
Comprehensive data warehouses that integrate operational data with customer, supplier, and market information have resulted in an explosion of information. Competition requires timely and sophisticated analysis on an integrated view of the data. However, there is a growing gap between more powerful storage and retrieval systems and the users’ ability to effectively analyze and act on the information they contain. Both relational and OLAP technologies have tremendous capabilities for navigating massive data warehouses, but brute force navigation of data is not enough. A new technological leap is needed to structure and prioritize information for specific end-user problems. Data mining tools can make this leap.
Quantifiable business benefits have been proven through the integration of data mining with current information systems, and new products are on the horizon that will bring this integration to an even wider audience of users.
BIBLIOGRAPHY
Introduction to Data Mining and Knowledge Discovery, by Two Crows Corporation
Fundamentals of Database Systems, by Ramez Elmasri
www.datamining.com
www.data-mine.com
www.threatling.com
www.research.microsoft.com/datamine
www.dwinfocenter.org