Data Mining

  • Uploaded by: Chethan.M
  • October 2019



8530521.doc

Created by Chethan.M

Data Mining

Goal of Data Mining
  • Simplification and automation of the overall statistical process, from data source(s) to model application
  • Changed over the years
      • Replace statistician? Better models, less grunge work
      • Many different data mining algorithms / tools available
      • Statistical expertise required to compare different techniques
      • Build intelligence into the software

Data Mining Is…
  • Decision Trees
  • Nearest Neighbor Classification
  • Neural Networks
  • Rule Induction
  • K-means Clustering

Data Mining is Not...
  • Data warehousing
  • SQL / Ad Hoc Queries / Reporting
  • Software Agents
  • Online Analytical Processing (OLAP)
  • Data Visualization

Why data mining now?
Data mining is an increasingly popular topic (if the number of new textbooks is anything to go by). Two main reasons:
  • With computers now mediating most aspects of our lives, there has been a large increase in the accumulation of electronic data.
  • With computers increasingly up to the demands of complex modeling, it is getting easier to process larger datasets.

Why Mine Data? Commercial Viewpoint
  • Data volumes are too large for classical analysis approaches:
      • Large number of records
      • High-dimensional data
  • Leverage the organization's data assets
      • Only a small portion of the collected data is ever analyzed
      • Data that may never be analyzed continues to be collected, at great expense, out of fear that something which may prove important in the future is missing.
  • Lots of data is being collected and warehoused
      • Web data, e-commerce

ISiM


      • Purchases at department/grocery stores
      • Bank/credit card transactions
  • Computers have become cheaper and more powerful
  • Competitive pressure is strong
      • Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Mine Data? Scientific Viewpoint
  • Data collected and stored at enormous speeds (GB/hour)
      • Remote sensors on a satellite
      • Telescopes scanning the skies
      • Microarrays generating gene expression data
      • Scientific simulations generating terabytes of data
  • Traditional techniques are infeasible for raw data
  • Data mining may help scientists
      • in classifying and segmenting data
      • in hypothesis formation

Origins of Data Mining
  • Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
  • Traditional techniques may be unsuitable due to
      • Enormity of data
      • High dimensionality of data
      • Heterogeneous, distributed nature of data

Mining Large Data Sets - Motivation
  • There is often information "hidden" in the data that is not readily evident
  • Human analysts may take weeks to discover useful information
  • Much of the data is never analyzed at all

What is Data Mining? Many Definitions
  • Data processing using sophisticated data search capabilities and statistical algorithms to discover patterns and correlations in large preexisting databases; a way to discover new meaning in data.
  • Non-trivial extraction of implicit, previously unknown and potentially useful information from data.
  • Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.
  • The process of identifying commercially useful patterns or relationships in databases or other computer repositories through the use of advanced statistical tools.
  • The automated extraction of predictive information from (large) databases.
  • A step in the knowledge discovery process consisting of particular algorithms (methods) that, under some acceptable objective, produce a particular enumeration of patterns (models) over the data.
  • Data mining is the process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses, or other information repositories.

What is (not) Data Mining?

What is not Data Mining?
  • Looking up a phone number in a phone directory
  • Querying a Web search engine for information about "Amazon"

What is Data Mining?
  • Discovering that certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly… in the Boston area)
  • Grouping together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)


[Figure: data mining shown at the intersection of statistics/AI, machine learning/pattern recognition, and database systems]

Data Mining Types:
  • Predictive data mining: This produces the model of the system described by the given data. It uses some variables or fields in the data set to predict unknown or future values of other variables of interest.
  • Descriptive data mining: This produces new, nontrivial information based on the available data set. It focuses on finding patterns describing the data that can be interpreted by humans.

Defining 'data'
By 'data', we mean sets of variable values, e.g.,
  • Annual rainfall in Sussex for the last twenty years;
  • Age, salary and IQ for all members of Sussex faculty.

Records
Values are organised in combinations called records. Each record has a particular context, e.g., age, salary and IQ specifically for the Informatics HoD. Combinations may also be called vectors (esp. in neural networks) and data points (esp. in statistics). A single record is a datum.


Tabulation
Data are often presented in a tabulated form, with one datum per row, and one variable per column.

NAME     AGE   SALARY   IQ
smith    42    36K      130
bloggs   29    30K      140
bush     50    60K      120
...

Where data are used for prediction, the to-be-predicted variable normally appears in the final column (and is often called 'class').
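As a rough sketch of the tabulated form described above, the rows can be held as tuples and turned into one record per row. Treating IQ as the to-be-predicted 'class' and reading "36K" as 36,000 are assumptions made only for illustration, and the function names are hypothetical.

```python
# Column names in table order; by convention the last one is the target.
COLUMNS = ["NAME", "AGE", "SALARY", "IQ"]

# One datum per row, one variable per column (salaries assumed to be in units).
ROWS = [
    ("smith", 42, 36_000, 130),
    ("bloggs", 29, 30_000, 140),
    ("bush", 50, 60_000, 120),
]


def to_records(columns, rows):
    """Turn positional rows into one dict (record) per row."""
    return [dict(zip(columns, row)) for row in rows]


def split_features_target(record, columns):
    """Separate the final column (the 'class') from the other variables."""
    target_name = columns[-1]
    features = {k: v for k, v in record.items() if k != target_name}
    return features, record[target_name]


records = to_records(COLUMNS, ROWS)
features, target = split_features_target(records[0], COLUMNS)
print(features, target)
```

Keeping the target in the final column means the same split works for any table laid out this way.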

Basic data types
Data may be classified according to the number and character of variables involved.
  • Univariate/discrete: one variable with integer/symbolic values.
  • Univariate/continuous: one variable with real/continuous values.
  • Multivariate/discrete: more than one variable with integer/symbolic values.
  • Multivariate/continuous: more than one variable with real/continuous values.
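The four categories above can be sketched as a small labelling function. The rule used here (any float value makes the dataset 'continuous', otherwise it is 'discrete') is a simplifying assumption for this sketch, and `describe_dataset` is a hypothetical name; datasets that mix discrete and continuous variables fall outside the four categories and are simply labelled continuous.

```python
def describe_dataset(records):
    """Label a dataset by variable count and value character.

    Each record is a tuple of values. Assumption: floats mean
    real/continuous values; ints and strings mean discrete values.
    """
    if not records:
        raise ValueError("need at least one record")
    n_vars = len(records[0])
    arity = "univariate" if n_vars == 1 else "multivariate"
    has_real = any(isinstance(v, float) for rec in records for v in rec)
    kind = "continuous" if has_real else "discrete"
    return f"{arity}/{kind}"


# Made-up values, used only to exercise the function.
print(describe_dataset([(3,), (7,), (5,)]))
print(describe_dataset([(612.4, 18.2), (580.1, 17.9)]))
```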

Data Mining Tasks...
  • Classification [Predictive]
  • Clustering [Descriptive]
  • Association Rule Discovery [Descriptive]
  • Sequential Pattern Discovery [Descriptive]
  • Regression [Predictive]
  • Deviation Detection [Predictive]


Classification: Definition
  • Given a collection of records (training set)
      • Each record contains a set of attributes; one of the attributes is the class.
  • Find a model for the class attribute as a function of the values of the other attributes.
  • Goal: previously unseen records should be assigned a class as accurately as possible.
      • A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
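The train/validate procedure above can be sketched in plain Python. The records are made up, the 30% test fraction is an arbitrary choice, and the "model" is only a majority-class baseline standing in for a real classifier; all function names are hypothetical.

```python
import random
from collections import Counter


def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and divide it into training and test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_majority_class(training_set):
    """Baseline 'model': always predict the most common training class."""
    counts = Counter(cls for _, cls in training_set)
    majority = counts.most_common(1)[0][0]
    return lambda attributes: majority


def accuracy(model, test_set):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(1 for attrs, cls in test_set if model(attrs) == cls)
    return correct / len(test_set)


# Hypothetical records: (attributes, class) pairs.
records = [({"x": i}, "Yes" if i % 3 == 0 else "No") for i in range(10)]
training_set, test_set = train_test_split(records)
model = fit_majority_class(training_set)
print(f"accuracy on test set: {accuracy(model, test_set):.2f}")
```

The test records are never seen while fitting, so the reported accuracy estimates how the model behaves on previously unseen records.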

Explicit and implicit structure
A dataset is a body of data. Any dataset has explicit structure, i.e., the numbers/values in the records. Generally, there is also implicit structure. Data mining is the task of identifying and modeling implicit structure, either as an end in itself or as a means of obtaining new information.

Example: A-level grades
Dataset containing average A-level grades for the past ten years.
Explicit structure is the mapping between years and average grades. (Explicit structure = 'what you see'.)
There is also implicit structure: a gradual increase in values over time. (Average grades are increasing by approx. 3% per year.)

Classification Example

Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class).

Training Set

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Test Set

Refund   Marital Status   Taxable Income   Cheat
No       Single           75K              ?
Yes      Married          50K              ?
No       Married          150K             ?
Yes      Divorced         90K              ?
No       Single           40K              ?
No       Married          80K              ?

[Figure: the training set is used to learn a classifier, which produces a model; the model is then applied to the test set.]
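To make the example concrete, the sketch below encodes the two tables and checks one candidate model against the training set before applying it to the test set. The rules in `predict` are hand-written for illustration, not learned by an algorithm and not stated in this document; the 80K threshold is an assumption, and record 6 is taken to have a taxable income of 60K.

```python
# Training records: (refund, marital_status, taxable_income_in_K, cheat)
TRAINING = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Test records: the class is unknown.
TEST = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]


def predict(refund, marital_status, income_k):
    """Hand-written candidate model (an assumption, not from the text)."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Remaining cases: no refund, single or divorced.
    return "Yes" if income_k >= 80 else "No"


# How well do the rules reproduce the known classes?
hits = sum(predict(r, m, i) == cheat for r, m, i, cheat in TRAINING)
training_accuracy = hits / len(TRAINING)

# Assign a class to each previously unseen record.
predictions = [predict(r, m, i) for r, m, i in TEST]

print(f"training accuracy: {training_accuracy:.0%}")
for record, label in zip(TEST, predictions):
    print(record, "->", label)
```

These rules reproduce every training label and assign "No" to all six test records. Accuracy on the training set alone says little about unseen data, which is why a separate test set with known classes is normally held back for validation.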

Challenges of Data Mining
  • Scalability
  • Dimensionality
  • Complex and Heterogeneous Data
  • Data Quality
  • Data Ownership and Distribution
  • Privacy Preservation
  • Streaming Data

  • Statistical methods
  • Case-based reasoning
  • Neural networks
  • Decision trees

DM & DW: Data Warehousing + Data Mining = Increased performance of decision making process + Knowledgeable decision makers

Data Mining Applications
  • Data Mining for Financial Data Analysis
  • Data Mining for the Telecommunications Industry
  • Data Mining for the Retail Industry
  • Data Mining in Healthcare and Biomedical Research
  • Data Mining in Science and Engineering

References:
1. Tan, Steinbach, Kumar. Introduction to Data Mining.
2. Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques.
3. Kurt Thearling, Ph.D. An Introduction to Data Mining. www.thearling.com
