INST 766: DATA MINING LESSON 2: DATA PREPROCESSING


Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation


What is Data? Attributes

An attribute is a property or characteristic of an object (e.g., eye color of a person, temperature). An attribute is also known as a variable, field, characteristic, or feature. A collection of attributes describes an object; an object is also known as a record, point, case, sample, entity, or instance.
Concept: the thing to be learned. Instance: an individual example of the concept. Attributes: measured aspects of an instance.

Objects

Collection of data objects and their attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

What is Data(2)? Attribute Values
Attribute values are numbers or symbols assigned to an attribute.
Distinction between attributes and attribute values:
The same attribute can be mapped to different attribute values. Example: height can be measured in feet or meters.
Different attributes can be mapped to the same set of values. Example: attribute values for ID and age are both integers, but the properties of the values differ: ID has no limit, while age has a minimum and a maximum value.


What is Data(3)? Types of Attributes
There are different types of attributes:
Categorical: having discrete classes (e.g., red, blue, green). Categorical attributes can be either nominal (unordered) or ordinal (ordered).
Nominal examples: ID numbers, eye color, zip codes.
Ordinal examples: rankings (e.g., taste of potato chips on a scale from 1 to 10), grades, height in {tall, medium, short}.
Continuous: having any numerical value (e.g., quantity sold). Continuous attributes can be interval or ratio.
Interval examples: calendar dates, temperatures in Celsius or Fahrenheit.
Ratio examples: temperature in Kelvin, length, time, counts.


What is Data(4)? Properties of Attribute Values
The type of an attribute depends on which of the following properties it possesses:
Distinctness: =, ≠
Order: <, >
Addition: +, -
Multiplication: *, /
Nominal attribute: distinctness
Ordinal attribute: distinctness & order
Interval attribute: distinctness, order & addition
Ratio attribute: all four properties


Attribute types, with descriptions, typical examples, and permissible operations:

Nominal: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another (=, ≠). Examples: zip codes, employee ID numbers, eye color, sex {male, female}. Operations: mode, entropy, contingency correlation, χ2 test.

Ordinal: The values of an ordinal attribute provide enough information to order objects (<, >). Examples: hardness of minerals, {good, better, best}, grades, street numbers. Operations: median, percentiles, rank correlation, run tests, sign tests.

Interval: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists (+, -). Examples: calendar dates, temperature in Celsius or Fahrenheit. Operations: mean, standard deviation, Pearson's correlation, t and F tests.

Ratio: For ratio attributes, both differences and ratios are meaningful (*, /). Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current. Operations: geometric mean, harmonic mean, percent variation.

Attribute level, permissible transformations, and comments:

Nominal: any permutation of values. (If all employee ID numbers were reassigned, would it make any difference?)

Ordinal: an order-preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function. (An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.)

Interval: new_value = a * old_value + b, where a and b are constants. (The Fahrenheit and Celsius temperature scales differ in where their zero value is and in the size of a unit, i.e., a degree.)

Ratio: new_value = a * old_value. (Length can be measured in meters or feet.)

What is Data(5)? Discrete and Continuous Attributes
Discrete attribute: has only a finite or countably infinite set of values. Examples: zip codes, counts, or the set of words in a collection of documents. Often represented as integer variables. Note: binary attributes are a special case of discrete attributes.
Continuous attribute: has real numbers as attribute values. Examples: temperature, height, or weight. In practice, real values can only be measured and represented using a finite number of digits, so continuous attributes are typically represented as floating-point variables.


What is Data(6)? Types of Data Sets
Record: data matrix, document data, transaction data
Graph: World Wide Web, molecular structures
Ordered: spatial data, temporal data, sequential data, genetic sequence data


What is Data(7)? Important Characteristics of Structured Data
Dimensionality: curse of dimensionality
Sparsity: only presence counts
Resolution: patterns depend on the scale


What is Data(8)? Record Data
Data that consists of a collection of records, each of which consists of a fixed set of attributes:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

What is Data(9)? Data Matrix
If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute. Such a data set can be represented by an m-by-n matrix, where there are m rows, one for each object, and n columns, one for each attribute.

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1

What is Data(10)? Document Data
Each document becomes a 'term' vector: each term is a component (attribute) of the vector, and the value of each component is the number of times the corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0

What is Data(11)? Transaction Data
A special type of record data, where each record (transaction) involves a set of items. For example, consider a grocery store: the set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

What is Data(12)? Graph Data
Examples: generic graph and HTML links.
[Figure: a small weighted graph, and an HTML page on data mining with links to pages on graph partitioning, parallel solution of sparse linear systems of equations, and N-body computation and dense linear system solvers.]

What is Data(13)? Chemical Data
Benzene molecule: C6H6


What is Data(14)? Ordered Data: Sequences of Transactions
[Figure: a sequence of transactions over time; each element of the sequence is a set of items/events.]


What is Data(15)? Ordered Data: Genomic Sequence Data
GGTTCCGCCTTCAGCCCCGCGCC CGCAGGGCCCGCCCCGCGCCGTC GAGAAGGGCCCGCCTGGCGGGCG GGGGGAGGCGGGGCCGCCCGAGC CCAACCGAGTCCGACCAGGTGCC CCCTCTGCTCGGCCTAGACCTGA GCTCATTAGGCGGCAGCGGACAG GCCAAGTAGAACACGCGAAGCGC TGGGCTGCCTGCTGCGACCAGGG


What is Data(16)? Ordered Data: Spatio-Temporal Data
[Figure: average monthly temperature of land and ocean.]


What is Data(17)? Data Quality
What kinds of data quality problems are there? How can we detect problems with the data? What can we do about these problems?
Examples of data quality problems: noise and outliers, missing values, duplicate data.


What is Data(18)? Noise
Noise refers to modification of original values. Examples: distortion of a person's voice when talking on a poor phone connection, and "snow" on a television screen.
[Figure: two sine waves, and the same two sine waves with noise added.]

What is Data(19)? Outliers
Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set.


Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation


Knowledge Discovery Process Flow, According to CRISP-DM
See www.crisp-dm.org for more information.
[Figure: the CRISP-DM cycle (business understanding, data understanding, data preparation, modeling, evaluation, deployment) with ongoing monitoring.]

Knowledge Discovery Process, in Practice
[Figure: the same process flow with monitoring, with data preparation dominating.]
Data preparation is estimated to take 70-80% of the time and effort.

Why Data Preprocessing?
Data in the real world is dirty (and often huge, i.e., several gigabytes or more):
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
noisy: containing errors or outliers
inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results! Quality decisions must be based on quality data, and a data warehouse needs consistent integration of quality data.


Major Tasks in Data Preprocessing
How can the data be preprocessed so as to improve the efficiency and ease of the mining process?
Data cleaning (applied to remove noise and correct inconsistencies in the data): fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
Data integration (merges data from multiple sources into a coherent data store): integration of multiple databases, data cubes, or files.


Major Tasks in Data Preprocessing (continued)
Data transformation (normalization techniques can be applied, since normalization may improve the accuracy and efficiency of data mining algorithms): normalization and aggregation.
Data reduction (reduces the data size by aggregating and eliminating redundant data): obtains a representation reduced in volume that produces the same or similar analytical results.
Data discretization: part of data reduction, but of particular importance, especially for numerical data.


Figure: Forms of data preprocessing.


Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation


Data Cleaning
Data cleaning tasks:
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data


Reasons for Missing Data
Data is not always available; e.g., many tuples have no recorded value for several attributes, such as customer income in sales data.
Missing data may be due to:
equipment malfunction
data that was inconsistent with other recorded data and was therefore deleted
data not entered due to misunderstanding
certain data not being considered important at the time of entry
failure to register the history or changes of the data
Missing data may need to be inferred.


How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Use a global constant to fill in the missing value: e.g., "unknown", a new class?!
Use the attribute mean to fill in the missing value.
Use the attribute mean of all samples belonging to the same class to fill in the missing value: smarter.
Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree.
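As a minimal sketch of how a few of these strategies look in practice (not part of the original slides), the snippet below uses pandas on a hypothetical table with a numeric income attribute and a class label cls; the column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "income" has missing values, "cls" is the class label.
df = pd.DataFrame({
    "income": [125.0, 100.0, np.nan, 120.0, 95.0, np.nan],
    "cls":    ["no",  "no",  "yes",  "no",  "yes", "no"],
})

# Ignore tuples with a missing value.
dropped = df.dropna(subset=["income"])

# Fill with a global constant (a sentinel standing in for "unknown").
constant_filled = df["income"].fillna(-1)

# Fill with the overall attribute mean.
mean_filled = df["income"].fillna(df["income"].mean())

# Fill with the attribute mean of samples belonging to the same class.
class_mean_filled = df["income"].fillna(
    df.groupby("cls")["income"].transform("mean")
)
print(class_mean_filled)
```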


Noisy Data
Noise: random error or variance in a measured variable.
Incorrect attribute values may be due to: faulty data collection instruments, data entry problems, data transmission problems, technology limitations, or inconsistency in naming conventions.
Other data problems which require data cleaning: duplicate records, incomplete data, inconsistent data.


How to Handle Noisy Data?
Binning method: first sort the data and partition it into (equi-depth) bins; then smooth by bin means, bin medians, bin boundaries, etc.
Clustering: detect and remove outliers.
Combined computer and human inspection: detect suspicious values and have a human check them.
Regression: smooth by fitting the data to regression functions.


Binning: Simple Discretization Methods
Equal-width (distance) partitioning: divides the range into N intervals of equal size (a uniform grid). If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A)/N. The most straightforward method, but outliers may dominate the presentation, and skewed data is not handled well.
Equal-depth (frequency) partitioning: divides the range into N intervals, each containing approximately the same number of samples. Gives good data scaling, but managing categorical attributes can be tricky.
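As a small illustration (an assumption-laden sketch, not from the slides), the NumPy snippet below computes equal-width edges from W = (B - A)/N and uses empirical quantiles as one simple way to obtain equal-depth boundaries for a toy value list.

```python
import numpy as np

values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
N = 3  # number of bins

# Equal-width partitioning: interval width W = (B - A) / N.
A, B = values.min(), values.max()
width_edges = A + (B - A) / N * np.arange(N + 1)       # [ 4., 14., 24., 34.]

# Equal-depth (frequency) partitioning: edges at the empirical quantiles,
# so each bin holds roughly the same number of samples.
depth_edges = np.quantile(values, np.linspace(0, 1, N + 1))

print(width_edges)
print(depth_edges)
```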


Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Smoothing by bin means:
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
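A short sketch reproducing the numbers above (equi-depth bins of four values, then smoothing by bin means and by bin boundaries); it assumes the data is already sorted, as on the slide.

```python
import numpy as np

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted

# Partition into equi-depth bins of 4 values each.
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = [[int(round(np.mean(b)))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value is replaced by the closer of the
# bin's minimum and maximum.
by_boundaries = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
                 for b in bins]

print(by_means)        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_boundaries)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```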


Cluster Analysis
[Figure: data points grouped into clusters; values falling outside the clusters may be treated as outliers.]


Regression
[Figure: data points (x, y) fitted to the regression line y = x + 1; a value y1 is smoothed to the value y1' on the line at x1.]

Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation


Data Integration
Data integration: combines data from multiple sources into a coherent store.
Schema integration: integrate metadata from different sources. The entity identification problem is to identify real-world entities across multiple data sources, e.g., A.cust-id ≡ B.cust-#. For example, how can the computer be sure that customer_id in one database refers to the same thing as customer_No in another database? The metadata kept by databases and data warehouses can be used to avoid errors in schema integration.
Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ. Possible reasons: different representations or different scales, e.g., metric vs. British units. For example, customer weight might be stored in pounds in one database and in kilograms in another; similarly, a price can be stored in dollars, rupees, or birr, and may or may not include different services (such as breakfast) and taxes.


Handling Redundant Data in Data Integration
Redundant data occurs often when integrating multiple databases: the same attribute may have different names in different databases, or one attribute may be a "derived" attribute in another table (e.g., annual revenue).
Redundancy may be detected by correlation analysis: given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data set.
The correlation between attributes A and B can be measured by

r_{A,B} = Σ (v_A - mean(A)) (v_B - mean(B)) / ((n - 1) σ_A σ_B)

where n is the number of tuples, the sum runs over the paired values of A and B, mean(A) and mean(B) are the respective mean values of A and B, and σ_A and σ_B are the respective standard deviations of A and B.

Handling Redundant Data in Data Integration (continued)
If the resulting correlation is greater than 0, then A and B are positively correlated: the values of A increase as the values of B increase. The higher the value, the more strongly each attribute implies the other, so a high value indicates that A (or B) may be redundant and could be removed.
If the resulting value is 0, then A and B are uncorrelated: there is no linear relationship between them.
If the resulting value is less than 0, then A and B are negatively correlated: as the values of one attribute increase, the values of the other decrease.
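A brief sketch of this correlation check (illustrative only; the attribute names are invented): two attributes that encode the same quantity in different units come out with r close to 1, flagging one of them as a candidate for removal.

```python
import numpy as np

def correlation(a, b):
    """r_{A,B} = sum((A - mean(A)) * (B - mean(B))) / ((n - 1) * sd(A) * sd(B))."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n = len(a)
    return (((a - a.mean()) * (b - b.mean())).sum()
            / ((n - 1) * a.std(ddof=1) * b.std(ddof=1)))

# Hypothetical redundant pair: the same weights stored in kilograms and pounds.
weight_kg = np.array([50.0, 61.0, 72.0, 83.0, 95.0])
weight_lb = weight_kg * 2.20462
print(correlation(weight_kg, weight_lb))   # ~1.0 -> one attribute is redundant
```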

Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality.


Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation


Data Transformation
Smoothing: remove noise from the data, e.g., using techniques such as binning, clustering, and regression.
Aggregation: summarization, data cube construction. E.g., daily sales data may be aggregated into weekly, monthly, and annual sales. This technique is typically used when constructing a data cube for analysis of the data at multiple granularities.
Generalization: concept hierarchy climbing; low-level or primitive data are replaced by higher-level concepts through the use of concept hierarchies. E.g., an attribute like street can be generalized to higher-level concepts like city or country; similarly, age can be mapped to higher-level concepts such as young, middle-aged, and old.


Data Transformation (continued)
Normalization: attribute values are scaled to fall within a small, specified range, e.g., -1.0 to 1.0 or 0.0 to 1.0. Methods include min-max normalization, z-score normalization, and normalization by decimal scaling.
Attribute/feature construction (derived attributes): new attributes are constructed from the given ones.


Data Transformation: Normalization
Min-max normalization:
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
z-score normalization:
v' = (v - mean_A) / stand_dev_A
Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
(See the employee salary attribute example for how each method transforms the values; a sketch of all three follows.)
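A minimal NumPy sketch of the three methods, using an invented salary list in place of the employee salary example referenced above:

```python
import numpy as np

def min_max(v, new_min=0.0, new_max=1.0):
    """Min-max normalization into [new_min, new_max]."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score(v):
    """Subtract the mean and divide by the standard deviation."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

def decimal_scaling(v):
    """Divide by 10^j, where j is the smallest integer with max(|v'|) < 1."""
    v = np.asarray(v, dtype=float)
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    return v / 10 ** j

salary = [12_000, 30_000, 54_000, 73_000, 98_000]   # hypothetical salaries
print(min_max(salary))
print(z_score(salary))
print(decimal_scaling(salary))
```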


Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation


Data Reduction Strategies
A warehouse may store terabytes of data, so complex data analysis/mining may take a very long time to run on the complete data set.
Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
Data reduction strategies:
Data cube aggregation
Dimensionality reduction
Numerosity reduction
Discretization and concept hierarchy generation


Data Reduction Strategies: Data Cube Aggregation (1)
The lowest level of a data cube holds the aggregated data for an individual entity of interest, e.g., a customer in a phone-call data warehouse.
Suppose we have collected sales data from an electronics shop, consisting of sales per quarter for the years 2000, 2001, and 2002. If we are interested in annual sales (the total per year) rather than the total per quarter, the data can be aggregated so that the result summarizes the total sales per year instead of per quarter. The resulting data set is smaller in size, without loss of any information necessary for the analysis task.


Figure: Sales data for a given branch of AllElectronics for the years 1997 to 1999. On the left, the sales are shown per quarter. On the right, the data are aggregated to provide the annual sales.


Figure: A data cube for sales at AllElectronics.


Data Reduction Strategies: Data Cube Aggregation (2)
Multiple levels of aggregation in data cubes further reduce the size of the data to deal with. Each cell holds an aggregate value for the corresponding data point in multidimensional space. A concept hierarchy exists for each attribute, allowing data to be viewed at multiple levels of abstraction; e.g., a hierarchy on branch may allow branches to be grouped into regions based on their address.
A data cube provides fast access to precomputed, summarized data, so it helps to:
Reference appropriate levels: use the smallest representation which is enough to solve the task.
Answer queries regarding aggregated information using the data cube, when possible.


Data Reduction Strategies: Dimensionality Reduction (1)
Feature selection (i.e., attribute subset selection): select a minimum set of features such that the probability distribution of the different classes, given the values for those features, is as close as possible to the original distribution given the values of all features.
E.g., a data set may consist of hundreds of attributes, many of which are irrelevant to the mining task or redundant. When deciding whether a customer is likely to purchase a popular CD, attributes from the sales database such as the customer's telephone number are likely to be irrelevant compared to attributes such as age and music taste. Picking out the useful attributes can be difficult and time consuming when the behavior of the data is unknown, and leaving irrelevant attributes in the data set slows down the mining process.
Feature selection also reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand.


Data Reduction Strategies: Dimensionality Reduction (2)
Dimensionality reduction reduces the data size by removing irrelevant attributes; typically, methods of attribute subset selection are applied. The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the distribution obtained using all attributes.
Benefits of a reduced set of attributes: fewer attributes appear in the discovered patterns, which helps make the patterns easier to understand.
How can we find a good subset of the original attributes?
Heuristic methods are commonly used for attribute subset selection, because the number of possible subsets is exponential and an exhaustive search is impractical. These methods explore a reduced search space and are typically greedy: at each step they make what looks like the best choice, in the hope that it leads to a globally optimal solution. They often assume that attributes are independent of one another. In addition, many attribute evaluation measures can be used; for example, the information gain measure is used to build decision trees for classification.


Heuristic Feature Selection Methods
There are 2^d possible feature subsets of d features. Several heuristic feature selection methods exist:
Best single features under the feature independence assumption: choose by significance tests.
Best step-wise feature selection: the best single feature is picked first, then the next best feature conditioned on the first, and so on.
Step-wise feature elimination: repeatedly eliminate the worst feature.
Best combined feature selection and elimination.
Optimal branch and bound: use feature elimination and backtracking.


Techniques of Basic Heuristic Methods of Attribute Subset Selection
Step-wise forward selection
Step-wise backward elimination
Combining forward selection and backward elimination
Decision-tree induction


Stepwise Forward Selection
The best of the original attributes is determined and added to the (initially empty) reduced set; at each subsequent step/iteration, the best of the remaining original attributes is added to the set.
Example of forward selection:
1. Initial attribute set: {A1, A2, A3, A4, A5, A6}
2. Start with an empty set of attributes: {}
3. The best original attribute is determined and added to the set: {A1}, then {A1, A4}
4. Reduced attribute set: {A1, A4, A6}

Stepwise Backward Elimination
1. This procedure starts with the full set of attributes.
2. At each step it removes the worst attribute remaining in the set.
Example: initial attributes {A1, A2, A3, A4, A5, A6}, then {A1, A3, A4, A5, A6}, then {A1, A4, A5, A6}, giving the reduced attribute set {A1, A4, A6}.
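A greedy sketch of stepwise forward selection (not from the slides): score stands for whatever evaluation measure is used, e.g., cross-validated accuracy of a classifier built on the candidate subset or an information-gain-style measure; the toy scorer below is rigged so that the attributes {A1, A4, A6} of the slide example win.

```python
def forward_selection(all_attrs, score, max_attrs=None):
    """Greedy stepwise forward selection over a user-supplied score(subset)."""
    selected, remaining = [], list(all_attrs)
    best_score = float("-inf")
    while remaining and (max_attrs is None or len(selected) < max_attrs):
        # Try adding each remaining attribute and keep the one scoring best.
        cand, cand_score = max(((a, score(selected + [a])) for a in remaining),
                               key=lambda t: t[1])
        if cand_score <= best_score:       # no attribute improves the score: stop
            break
        selected.append(cand)
        remaining.remove(cand)
        best_score = cand_score
    return selected

# Toy evaluation measure: rewards A1, A4, A6 and penalizes everything else.
useful = {"A1": 3.0, "A4": 2.0, "A6": 1.0}
toy_score = lambda subset: sum(useful.get(a, -0.5) for a in subset)

print(forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score))
# -> ['A1', 'A4', 'A6']
```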


Figure: Greedy (heuristic) methods for attribute subset selection.


Data Reduction Strategies: Data Compression
String compression: there are extensive theories and well-tuned algorithms; typically lossless, but only limited manipulation is possible without expansion.
Audio/video compression: typically lossy compression, with progressive refinement; sometimes small fragments of the signal can be reconstructed without reconstructing the whole.


Data Reduction Strategies: Data Compression
[Figure: original data reduced to compressed data; lossless compression recovers the original data exactly, while lossy compression recovers only an approximation of the original data.]

Data Reduction Strategies: Wavelets
The discrete wavelet transform (DWT) is a linear signal processing technique supporting multiresolution analysis. Compressed approximation: store only a small fraction of the strongest wavelet coefficients. It is similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space. The DWT uses a hierarchical pyramid algorithm that halves the data at each iteration, resulting in fast computational speed.


Data Reduction Strategies: Wavelets
Method:
1. The length, L, of the input data vector must be an integer power of 2. This condition can be met by padding the data vector with zeros as necessary.
2. Each transform involves applying two functions. The first applies some data smoothing, such as a sum or weighted average. The second performs a weighted difference, which acts to bring out the detailed features of the data.
3. The two functions are applied to pairs of the input data, resulting in two sets of data of length L/2. In general, these represent a smoothed (low-frequency) version of the input data and its high-frequency content, respectively.
4. The two functions are recursively applied to the data sets obtained in the previous loop, until the resulting data sets are of length 2.
5. Selected values from the data sets obtained in the above iterations are designated the wavelet coefficients of the transformed data.
Note: the transform matrix must be orthonormal, so that its inverse is just its transpose.
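To make the pyramid algorithm concrete, here is a minimal Haar-wavelet sketch (the simplest wavelet family; the averaging and differencing pair plays the role of the smoothing and weighted-difference functions described above). It is an illustration under those assumptions, not the exact transform used in any particular tool.

```python
import numpy as np

def haar_step(x):
    """One level: pairwise smoothed averages (low freq.) and differences (detail)."""
    avg  = (x[0::2] + x[1::2]) / np.sqrt(2)
    diff = (x[0::2] - x[1::2]) / np.sqrt(2)
    return avg, diff

def haar_pyramid(x):
    """Recursively halve the smoothed part, collecting detail coefficients."""
    x = np.asarray(x, dtype=float)
    assert len(x) & (len(x) - 1) == 0, "length must be a power of 2 (pad with zeros)"
    details = []
    while len(x) > 1:
        x, d = haar_step(x)
        details.append(d)
    return x, details          # overall average plus detail coefficients per level

data = [2, 2, 0, 2, 3, 5, 4, 4]            # L = 8 = 2^3
avg, details = haar_pyramid(data)
print(avg, details)
```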


Figure: Examples of wavelet families. The number next to a wavelet name is the number of vanishing moments of the wavelet. This is a set of mathematical relationships that the coefficients must satisfy and is related to the number of coefficients.


Data Reduction Strategies: PCA
Principal Component Analysis (PCA) is also known as the Karhunen-Loève (K-L) method.
Given N data vectors from k dimensions, find c <= k orthogonal vectors that can best be used to represent the data. The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions); each data vector is a linear combination of the c principal component vectors. Works for numeric data only; used when the number of dimensions is large.


Data Reduction Strategies: PCA
PCA procedure:
1. The input data are normalized, so that each attribute falls within the same range. This step helps ensure that attributes with large domains do not dominate attributes with smaller domains.
2. PCA computes c orthonormal vectors that provide a basis for the normalized input data. These are unit vectors, each pointing in a direction perpendicular to the others, and they are referred to as the principal components. The input data are a linear combination of the principal components.
3. The principal components are sorted in order of decreasing "significance" or strength. They essentially serve as a new set of axes for the data: the first axis captures the greatest variance among the data, the second axis the next highest variance, and so on (see the figure on the next page).
4. Since the components are sorted in decreasing order of "significance", the size of the data can be reduced by eliminating the weaker components, that is, those with low variance. Using the strongest principal components, it should be possible to reconstruct a good approximation of the original data.
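A compact sketch of this procedure with NumPy (normalization, principal components obtained from an SVD of the normalized data, projection onto the c strongest components); the toy matrix is invented.

```python
import numpy as np

def pca_reduce(X, c):
    """Project an n x k data matrix X onto its first c principal components."""
    X = np.asarray(X, dtype=float)
    Xn = (X - X.mean(axis=0)) / X.std(axis=0)          # step 1: normalize
    U, s, Vt = np.linalg.svd(Xn, full_matrices=False)  # step 2: orthonormal basis
    # Rows of Vt are the principal components, already sorted by decreasing
    # singular value ("significance"); keep the c strongest and project (steps 3-4).
    return Xn @ Vt[:c].T

X = np.array([[2.0,  4.1, 1.0],
              [3.0,  6.2, 0.5],
              [4.0,  7.9, 1.5],
              [5.0, 10.1, 0.8],
              [6.0, 12.0, 1.2]])
print(pca_reduce(X, 2))      # 5 objects described by 2 derived dimensions
```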


Figure: Principal component analysis. Y1 and Y2 are the first two principal components of the given data, which was originally mapped to the axes X1 and X2.


Data Reduction Strategies: Numerosity Reduction
Parametric methods: assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers). Example: log-linear models, which obtain a value at a point in m-dimensional space as a product over appropriate marginal subspaces.
Non-parametric methods: do not assume models. Major families: histograms, clustering, sampling.


Data Reduction Strategies: Numerosity Reduction
Linear regression: Y = α + β X. The two parameters, α and β, specify the line and are estimated from the data at hand, using the least squares criterion on the known values (X1, Y1), (X2, Y2), ...
Multiple regression: Y = b0 + b1 X1 + b2 X2. Many nonlinear functions can be transformed into the above form.
Log-linear models: the multi-way table of joint probabilities is approximated by a product of lower-order tables, e.g., p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd.
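For the linear case, the two parameters can be computed in closed form; a tiny sketch with invented (X, Y) values follows.

```python
import numpy as np

# Hypothetical attribute X and response Y; only alpha and beta need be stored.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 5.1, 5.8])

# Least-squares estimates for Y = alpha + beta * X.
beta = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
alpha = Y.mean() - beta * X.mean()

print(alpha, beta)           # the reduced representation of the data
print(alpha + beta * X)      # approximate reconstruction of Y
```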


Histograms
A popular data reduction technique: divide the data into buckets and store the average (or sum) for each bucket. Histograms can be constructed optimally in one dimension using dynamic programming, and are related to quantization problems.
[Figure: a histogram with buckets at 10,000, 30,000, 50,000, 70,000, and 90,000 and counts on a 0-40 scale.]

Discretization: Equal-Width
Temperature values: 64 65 68 69 70 71 72 72 75 75 80 81 83 85
Equal-width bins (low <= value < high):

Bin       Count
[64, 67)  2
[67, 70)  2
[70, 73)  4
[73, 76)  2
[76, 79)  0
[79, 82)  2
[82, 85]  2

Discretization: Equal-Width May Produce Clumping
[Figure: salaries in a corporation binned into equal-width ranges from [0 - 200,000) up to [1,800,000 - 2,000,000]; nearly all salaries clump into the lowest bin, with a count of 1 in the highest bin.]

Discretization: Equal-Height
Temperature values: 64 65 68 69 70 71 72 72 75 75 80 81 83 85
Equal-height bins (height = 4, except for the last bin):

Bin         Count
[64 .. 69]  4
[70 .. 72]  4
[73 .. 81]  4
[83 .. 85]  2

Discretization: Equal-Height Advantages
Generally preferred because it avoids clumping. In practice, "almost-equal" height binning is used, which avoids clumping and gives more intuitive breakpoints.
Additional considerations: don't split frequent values across bins; create separate bins for special values (e.g., 0); use readable breakpoints (e.g., round numbers).


Histograms: V-Optimal and MaxDiff
V-Optimal histogram: the one with the least variance, where histogram variance is a weighted sum of the original values that each bucket represents, and the bucket weight is equal to the number of values in the bucket.
MaxDiff histogram: considers the difference between each pair of adjacent values; a bucket boundary is established between each pair having one of the β - 1 largest differences, where β is the user-specified number of buckets.
V-Optimal and MaxDiff histograms tend to be the most accurate and practical.


Clustering
Partition the data set into clusters and store only a representation of each cluster. This can be very effective if the data is clustered, but not if the data is "smeared". Clustering can also be hierarchical and stored in multidimensional index tree structures. There are many choices of clustering definitions and clustering algorithms.


Sampling
Sampling allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data: choose a representative subset of the data.
Simple random sampling may perform very poorly in the presence of skew, so adaptive sampling methods are used. Stratified sampling approximates the percentage of each class (or subpopulation of interest) in the overall database, and is used in conjunction with skewed data.
Note that sampling may not reduce database I/Os (data is read a page at a time).
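A short pandas sketch contrasting simple random sampling (with and without replacement) with a stratified sample on a skewed, invented class distribution; the stratified version keeps the rare class represented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data set: 95% "common" records, 5% "rare" records.
df = pd.DataFrame({
    "value": rng.normal(size=1000),
    "cls":   np.where(np.arange(1000) < 950, "common", "rare"),
})

srswor = df.sample(n=100, replace=False, random_state=0)   # without replacement
srswr  = df.sample(n=100, replace=True,  random_state=0)   # with replacement

# Stratified sample: draw ~10% from every class separately.
stratified = df.groupby("cls", group_keys=False).sample(frac=0.1, random_state=0)

print(srswor["cls"].value_counts())
print(stratified["cls"].value_counts())
```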


Sampling
[Figure: from the raw data, a simple random sample without replacement (SRSWOR) and a simple random sample with replacement (SRSWR) are drawn.]

Sampling
[Figure: raw data and a cluster/stratified sample drawn from it.]

Figure: Sampling can be used for data reduction.


Hierarchical Reduction
Use a multi-resolution structure with different degrees of reduction. Hierarchical clustering is often performed, but tends to define partitions of data sets rather than "clusters"; parametric methods are usually not amenable to hierarchical representation.
Hierarchical aggregation: an index tree hierarchically divides a data set into partitions by the value range of some attributes. Each partition can be considered a bucket, so an index tree with aggregates stored at each node is a hierarchical histogram.


Discretization
Three types of attributes: nominal (values from an unordered set), ordinal (values from an ordered set), and continuous (real numbers).
Discretization divides the range of a continuous attribute into intervals. Some classification algorithms only accept categorical attributes; discretization also reduces data size and prepares the data for further analysis.
Example: the values 10, 56, 68, 12, 45, 41, 23, 43, 13 discretized into ranges:

Range    Freq    Range    Freq
10 - 20  3       40 - 50  3
20 - 30  1       50 - 60  1
30 - 40  0       60 - 70  1


Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation


Discretization and Concept Hierarchy
Discretization reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.

Concept hierarchies reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).


Figure: A concept hierarchy for the attribute price.


Discretization and Concept Hierarchy Generation for Numeric Data
Binning (see earlier sections)
Histogram analysis (see earlier sections)
Clustering analysis (see earlier sections)
Entropy-based discretization
Segmentation by natural partitioning


Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

E(S, T) = (|S1| / |S|) * Ent(S1) + (|S2| / |S|) * Ent(S2)

The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization. The process is applied recursively to the partitions obtained until some stopping criterion is met, e.g.,

Ent(S) - E(T, S) > δ

Experiments show that entropy-based discretization may reduce data size and improve classification accuracy.
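A small sketch of selecting a single (binary) boundary this way; the numeric values and class labels are made up so that the best split is easy to verify by hand.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_boundary(values, labels):
    """Return the boundary T minimizing |S1|/|S|*Ent(S1) + |S2|/|S|*Ent(S2)."""
    values, labels = np.asarray(values, dtype=float), np.asarray(labels)
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    best_t, best_e = None, float("inf")
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue
        t = (values[i] + values[i - 1]) / 2          # candidate boundary
        e = (i * entropy(labels[:i])
             + (len(labels) - i) * entropy(labels[i:])) / len(labels)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

values = [64, 65, 68, 69, 70, 71, 72, 75]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(best_boundary(values, labels))    # -> best boundary 69.5 (resulting entropy 0)
```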


Figure: Automatic generation of a concept hierarchy for profit based on the 3-4-5 rule.


Concept Hierarchy Generation for Categorical Data
Specification of a partial ordering of attributes explicitly at the schema level by users or experts
Specification of a portion of a hierarchy by explicit data grouping
Specification of a set of attributes, but not of their partial ordering
Specification of only a partial set of attributes


Specification of a Set of Attributes
A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy:

country              15 distinct values
province_or_state    65 distinct values
city                 3,567 distinct values
street               674,339 distinct values
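A tiny pandas sketch of the same heuristic, ordering attributes by their number of distinct values (the toy location table is invented; the real counts would come from the actual data):

```python
import pandas as pd

df = pd.DataFrame({
    "country":           ["Canada", "Canada", "Canada", "Canada", "USA", "USA"],
    "province_or_state": ["BC", "BC", "Ontario", "Ontario", "New York", "Illinois"],
    "city":              ["Vancouver", "Vancouver", "Toronto", "Ottawa", "New York", "Chicago"],
    "street":            ["12 Oak St", "88 Pine St", "5 King St", "9 Bank St", "7 5th Ave", "3 Lake Dr"],
})

# Fewer distinct values -> higher (more general) level of the hierarchy.
hierarchy = df.nunique().sort_values().index.tolist()
print(hierarchy)   # ['country', 'province_or_state', 'city', 'street']
```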

