Data Mining
• What is Data Mining?
• What is Data Warehousing?
• What is the need for Data Mining?
• Data Mining Architecture
• Data Mining Algorithms
Data mining
• Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses.
Data Warehousing
• A system for storing and delivering massive quantities of data.
• The critical factor leading to the use of a data warehouse is that a data analyst can perform complex queries and analysis (such as data mining) on the information without slowing down the operational systems.
Data Mining
• Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions.
• Data mining tools can answer business questions that traditionally were too time-consuming to resolve.
Uses
• Retailing
• Weather Forecasting
• Traffic Congestion
Algorithms
• Data mining algorithms traditionally fall into one of four broad categories:
• Classification
• Clustering
• Association
• Sequence discovery
• Classification, or supervised induction, is perhaps the most common of all data mining activities. The objective of classification is to analyze the historical data stored in a database and to automatically generate a model that can predict future behavior.
• This induced model consists of generalizations over the records of a training data set, which help distinguish predefined classes.
• The hope is that this model can then be used to predict the classes of other unclassified records.
• Common tools used for classification are neural networks, decision trees and if-then-else rules that need not have a tree structure.
• Neural networks involve the development of mathematical structures with the ability to learn.
• Decision trees classify data into a finite number of classes, based on the values of the variables. Decision trees consist essentially of a hierarchy of if-then statements and are thus typically faster to train and apply than neural nets.
• Rule induction: the extraction of useful if-then rules from data based on statistical significance. The if-then statements used here need not be hierarchical.
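To make the train-then-predict idea concrete, here is a minimal sketch of a one-rule learner: it picks the single attribute whose "if value then class" rules make the fewest errors on the training records. This is one simple illustration of inducing if-then rules, not the method of any particular tool; the function names (train_one_rule, predict) and the weather-style records are hypothetical.

```python
from collections import Counter, defaultdict

def train_one_rule(records, target):
    """Return (attribute, rules, errors) for the attribute whose
    value -> majority-class rules misclassify the fewest training records."""
    best = None
    attributes = [a for a in records[0] if a != target]
    for attr in attributes:
        # Count how often each class occurs for each value of this attribute.
        counts = defaultdict(Counter)
        for r in records:
            counts[r[attr]][r[target]] += 1
        # One if-then rule per value: predict the most frequent class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(1 for r in records if rules[r[attr]] != r[target])
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

def predict(model, record, default=None):
    """Apply the induced rules to an unclassified record."""
    attr, rules, _ = model
    return rules.get(record[attr], default)

# Hypothetical historical (training) records with a known class "play".
training = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]

model = train_one_rule(training, target="play")
prediction = predict(model, {"outlook": "rainy", "windy": "no"})
print(model[0], model[1], prediction)
```

A value never seen in training falls back to the default, which is one place where a real classifier would need a more careful policy.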
• Clustering partitions the database into segments in which the members of each segment share similar qualities.
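As an illustration of partitioning records into segments of similar members, the sketch below uses k-means, one common clustering technique chosen here as an assumption (the slides do not name a specific algorithm). The points and the starting centroids are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Squared Euclidean distance to every centroid.
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Recompute centroids; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical two-dimensional records (for example, two numeric customer attributes).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Starting centroids are an assumption; real tools usually choose them randomly.
centroids, clusters = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(centroids)
print(clusters)
```

Unlike classification, no predefined classes are supplied: the segments emerge from the distances between records alone.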
• Associations establish relationships between items that occur together in a given record.
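A minimal sketch of association discovery over item pairs follows, using the usual support and confidence measures. The thresholds, the function name pair_rules, and the shopping baskets are all hypothetical, and real tools handle item sets larger than pairs.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.5, min_confidence=0.7):
    """Return (lhs, rhs, support, confidence) for item pairs that occur
    together often enough, read as 'if lhs is in the record, so is rhs'."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of records containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]  # fraction of lhs records that also hold rhs
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules

# Hypothetical retail baskets, one record per purchase.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
]

rules = pair_rules(baskets)
print(rules)
```

With these baskets, only "butter implies bread" passes both thresholds: every basket with butter also contains bread, while bread appears without butter in one basket.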
• Sequence Discovery can be looked at as the identification of associations over time. When appropriate information is available (for instance, the identity of a customer in a retail shop), a temporal analysis can be conducted to identify behavior over time.
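The temporal analysis described above can be sketched as counting, per customer, which item was bought before which other item. The purchase log, the customer names, and the function name sequential_pairs are hypothetical; this only counts ordered pairs and is not a full sequence-mining algorithm.

```python
from collections import Counter

def sequential_pairs(purchases):
    """purchases: iterable of (customer_id, timestamp, item).
    Returns a Counter mapping (earlier_item, later_item) to the number
    of customers who bought the items in that order."""
    by_customer = {}
    # Order each customer's purchases by time.
    for customer, when, item in sorted(purchases, key=lambda p: (p[0], p[1])):
        by_customer.setdefault(customer, []).append(item)
    counts = Counter()
    for items in by_customer.values():
        seen = set()  # count each ordered pair once per customer
        for i, first in enumerate(items):
            for later in items[i + 1:]:
                if first != later:
                    seen.add((first, later))
        counts.update(seen)
    return counts

# Hypothetical purchase log; the customer identity makes the temporal analysis possible.
log = [
    ("alice", 1, "phone"), ("alice", 5, "case"),
    ("bob", 2, "phone"), ("bob", 3, "charger"), ("bob", 9, "case"),
    ("carol", 4, "case"), ("carol", 6, "phone"),
]

counts = sequential_pairs(log)
print(counts.most_common(3))
```

The difference from plain association is that order matters: "phone then case" and "case then phone" are counted separately.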