INST 766: Data Mining
Lesson 2: Data Preprocessing
Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation
What is Data?
Attributes
An attribute is a property or characteristic of an object. Examples: eye color of a person, temperature, etc.
An attribute is also known as a variable, field, characteristic, or feature.
A collection of attributes describes an object. An object is also known as a record, point, case, sample, entity, or instance.
Concept: the thing to be learned
Instance: an individual example of a concept
Attributes: measured aspects of an instance
Objects
Data is a collection of data objects and their attributes. In the table below, each row is an object and each column is an attribute.
Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes
What is Data(2)?
Attribute Values:
Attribute values are numbers or symbols assigned to an attribute.
Distinction between attributes and attribute values:
  The same attribute can be mapped to different attribute values. Example: height can be measured in feet or meters.
  Different attributes can be mapped to the same set of values. Example: attribute values for ID and age are integers, but the properties of the values differ: an ID has no limit, whereas age has a maximum and a minimum value.
What is Data(3)?
Types of Attributes:
Categorical: having discrete classes (e.g., red, blue, green). Categorical attributes can be either nominal (unordered) or ordinal (ordered).
  Nominal examples: ID numbers, eye color, zip codes
  Ordinal examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
Continuous: having any numerical value (e.g., quantity sold). Continuous attributes can be interval or ratio.
  Interval examples: calendar dates, temperatures in Celsius or Fahrenheit
  Ratio examples: temperature in Kelvin, length, time, counts
What is Data(4)?
Properties of Attribute Values:
The type of an attribute depends on which of the following properties it possesses:
  Distinctness: =, ≠
  Order: <, >
  Addition: +, −
  Multiplication: *, /
Nominal attribute: distinctness
Ordinal attribute: distinctness & order
Interval attribute: distinctness, order & addition
Ratio attribute: all 4 properties
Attribute Type / Description / Examples / Operations

Nominal: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  Operations: mode, entropy, contingency correlation, χ2 test

Ordinal: The values of an ordinal attribute provide enough information to order objects. (<, >)
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
  Operations: median, percentiles, rank correlation, run tests, sign tests

Interval: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −)
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Operations: mean, standard deviation, Pearson's correlation, t and F tests

Ratio: For ratio variables, both differences and ratios are meaningful. (*, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
  Operations: geometric mean, harmonic mean, percent variation
Attribute Level / Allowed Transformation / Comments

Nominal: any permutation of values
  Comment: if all employee ID numbers were reassigned, would it make any difference?

Ordinal: an order-preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function
  Comment: an attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Interval: new_value = a * old_value + b, where a and b are constants
  Comment: the Fahrenheit and Celsius temperature scales differ in where their zero value is and in the size of a unit (degree).

Ratio: new_value = a * old_value
  Comment: length can be measured in meters or feet.
What is Data(5)?
Discrete and Continuous Attributes:
Discrete attribute: has only a finite or countably infinite set of values. Examples: zip codes, counts, or the set of words in a collection of documents. Often represented as integer variables. Note: binary attributes are a special case of discrete attributes.
Continuous attribute: has real numbers as attribute values. Examples: temperature, height, or weight. Practically, real values can only be measured and represented using a finite number of digits. Continuous attributes are typically represented as floating-point variables.
What is Data(6)?
Types of data sets:
  Record: data matrix, document data, transaction data
  Graph: World Wide Web, molecular structures
  Ordered: spatial data, temporal data, sequential data, genetic sequence data
What is Data(7)?
Important Characteristics of Structured Data:
  Dimensionality: curse of dimensionality
  Sparsity: only presence counts
  Resolution: patterns depend on the scale
What is Data(8)?
Record Data: data that consists of a collection of records, each of which consists of a fixed set of attributes.

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes
What is Data(9)?
Data Matrix: if data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute. Such a data set can be represented by an m by n matrix, with m rows (one for each object) and n columns (one for each attribute).

Projection of x Load   Projection of y Load   Distance   Load   Thickness
10.23                  5.27                   15.22      2.7    1.2
12.65                  6.25                   16.22      2.2    1.1
What is Data(10)?
Document Data: each document becomes a 'term' vector; each term is a component (attribute) of the vector, and the value of each component is the number of times the corresponding term occurs in the document.

              team  coach  play  ball  score  game  win  lost  timeout  season
Document 1      3     0      5     0     2      6     0     2      0       2
Document 2      0     7      0     2     1      0     0     3      0       0
Document 3      0     1      0     0     1      2     2     0      3       0
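To make the term-vector idea concrete, here is a minimal Python sketch (standard library only) that turns a tiny, made-up corpus into term-count vectors over a fixed vocabulary:

from collections import Counter

docs = {
    "Document 1": "team play play score game game lost season",
    "Document 2": "coach ball score lost lost lost",
}

# Fixed vocabulary: one vector component (attribute) per term.
vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

for name, text in docs.items():
    counts = Counter(text.split())                      # term frequencies in this document
    vector = [counts.get(term, 0) for term in vocabulary]
    print(name, vector)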
What is Data(11)?
Transaction Data: a special type of record data, where each record (transaction) involves a set of items. For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk
What is Data(12)? Graph Data: Examples: Generic graph and HTML Links
[Figure: a generic graph with weighted edges, and a set of linked Web pages on topics such as Data Mining, Graph Partitioning, Parallel Solution of Sparse Linear Systems of Equations, and N-Body Computation and Dense Linear System Solvers, whose HTML links form a graph]
What is Data(13)? Chemical Data: Benzene Molecule: C6H6
What is Data(14)? Ordered Data: sequences of transactions
[Figure: a timeline of transactions; each element of the sequence is a set of items/events]
What is Data(15)? Ordered Data: Genomic sequence data GGTTCCGCCTTCAGCCCCGCGCC CGCAGGGCCCGCCCCGCGCCGTC GAGAAGGGCCCGCCTGGCGGGCG GGGGGAGGCGGGGCCGCCCGAGC CCAACCGAGTCCGACCAGGTGCC CCCTCTGCTCGGCCTAGACCTGA GCTCATTAGGCGGCAGCGGACAG GCCAAGTAGAACACGCGAAGCGC TGGGCTGCCTGCTGCGACCAGGG
What is Data(16)? Ordered Data: Spatio-Temporal Data
[Figure: average monthly temperature of land and ocean]
What is Data(17)?
Data Quality:
What kinds of data quality problems are there? How can we detect problems with the data? What can we do about these problems?
Examples of data quality problems: noise and outliers, missing values, duplicate data
What is Data(18)?
Noise:
Noise refers to modification of original values. Examples: distortion of a person's voice when talking on a poor phone connection, and "snow" on a television screen.
[Figures: two sine waves, and the same two sine waves with added noise]
What is Data(19)?
Outliers: outliers are data objects with characteristics that are considerably different from those of most of the other data objects in the data set.
Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation
Knowledge Discovery Process flow, according to CRISP-DM
see www.crisp-dm.org for more information
[Figure: CRISP-DM process flow diagram, with a Monitoring stage]
Knowledge Discovery Process, in practice
[Figure: the knowledge discovery process as carried out in practice, with a Monitoring stage]
Data preparation is estimated to take 70-80% of the time and effort.
Why Data Preprocessing?
Data in the real world is dirty (partly due to its huge size, i.e., several gigabytes or more):
  incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  noisy: containing errors or outliers
  inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!
  Quality decisions must be based on quality data.
  A data warehouse needs consistent integration of quality data.
Major Tasks in Data Preprocessing
HOW CAN THE DATA BE PREPROCESSED SO AS TO IMPROVE THE EFFICIENCY AND EASE OF THE MINING PROCESS?
Data cleaning (applied to remove noise and correct inconsistencies in the data)
  Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration (techniques that help merge data from multiple sources into a coherent data store)
  Integration of multiple databases, data cubes, or files
Major Tasks in Data Preprocessing
Data transformation (normalization techniques can be applied; normalization may improve the accuracy and efficiency of data mining algorithms)
  Normalization and aggregation
Data reduction (helps reduce the data size by aggregating and eliminating redundant data)
  Obtains a reduced representation that is much smaller in volume but produces the same or similar analytical results
Data discretization
  Part of data reduction, but of particular importance, especially for numerical data
Forms of data preprocessing
Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation
Data Cleaning
Data cleaning tasks:
  Fill in missing values
  Identify outliers and smooth out noisy data
  Correct inconsistent data
Reasons for Missing Data
Data is not always available; e.g., many tuples have no recorded value for several attributes, such as customer income in sales data.
Missing data may be due to:
  equipment malfunction
  data deleted because it was inconsistent with other recorded data
  data not entered due to misunderstanding
  certain data not being considered important at the time of entry
  history or changes of the data not being recorded
Missing data may need to be inferred.
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Use a global constant to fill in the missing value: e.g., "unknown", a new class?!
Use the attribute mean to fill in the missing value.
Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter.
Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or decision tree.
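A minimal sketch of several of these options, assuming pandas/NumPy are available; the small table and its missing Income value are made up:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Class":  ["A", "A", "B", "B"],
    "Income": [50.0, np.nan, 80.0, 90.0],
})

# Ignore the tuple: drop rows whose Income is missing
dropped = df.dropna(subset=["Income"])

# Fill with a global constant (a sentinel for "unknown")
constant = df["Income"].fillna(-1)

# Fill with the overall attribute mean
overall_mean = df["Income"].fillna(df["Income"].mean())

# Fill with the mean of the tuple's own class (smarter)
class_mean = df.groupby("Class")["Income"].transform(lambda s: s.fillna(s.mean()))

print(overall_mean.tolist())   # [50.0, 73.33..., 80.0, 90.0]
print(class_mean.tolist())     # [50.0, 50.0, 80.0, 90.0]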
Noisy Data
Noise: random error or variance in a measured variable.
Incorrect attribute values may be due to:
  faulty data collection instruments
  data entry problems
  data transmission problems
  technology limitations
  inconsistency in naming conventions
Other data problems which require data cleaning:
  duplicate records
  incomplete data
  inconsistent data
How to Handle Noisy Data?
Binning method: first sort the data and partition it into (equi-depth) bins; then smooth by bin means, by bin medians, by bin boundaries, etc.
Clustering: detect and remove outliers.
Combined computer and human inspection: detect suspicious values and have them checked by a human.
Regression: smooth by fitting the data to regression functions.
Binning: Simple Discretization Methods
Equal-width (distance) partitioning:
  Divides the range into N intervals of equal size (a uniform grid).
  If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B − A)/N.
  The most straightforward, but outliers may dominate the presentation, and skewed data is not handled well.
Equal-depth (frequency) partitioning:
  Divides the range into N intervals, each containing approximately the same number of samples.
  Good data scaling; managing categorical attributes can be tricky.
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
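The same worked example expressed as a short Python sketch (standard library only), so that the two smoothing rules are unambiguous:

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted

depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]   # equi-depth bins

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value is replaced by the closest bin boundary.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)    # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]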
Cluster Analysis
Regression
[Figure: data points (x, y) fit by the regression line y = x + 1; an observed value Y1 at X1 is smoothed to the value Y1' on the line]
Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation
Data Integration Data integration: combines data from multiple sources into a coherent store
Schema integration: integrate metadata from different sources.
Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#. E.g., how can the computer be sure that customer_id in one database is the same as customer_No in another database? Answer: data warehouses and databases have metadata that can be used to help avoid errors in schema integration.
Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ. Possible reasons: different representations, different scales, e.g., metric vs. British units. E.g., customer weight might be stored in pounds in one database and in kilograms in another; similarly, a price can be stored in dollars, rupees, or birr, and may cover different services (such as breakfast) and taxes.
Handling Redundant Data in Data Integration
Redundant data often occur when multiple databases are integrated:
  The same attribute may have different names in different databases.
  One attribute may be a "derived" attribute in another table, e.g., annual revenue.
Redundant data may be detected by correlation analysis. E.g., given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data set.
The correlation between attributes A and B can be measured by

r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n - 1)\,\sigma_A\,\sigma_B}

where n is the number of tuples, \bar{A} and \bar{B} are the respective mean values of A and B, and \sigma_A and \sigma_B are the respective standard deviations of A and B.
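As a quick sketch (assuming NumPy), the coefficient can be computed directly from this formula or with numpy.corrcoef; the sample values of A and B are made up:

import numpy as np

A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # values of attribute A
B = np.array([1.0, 2.0, 2.5, 4.0, 5.0])    # values of attribute B

n = len(A)
r = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))

print(r)                        # the formula above
print(np.corrcoef(A, B)[0, 1])  # the same value from NumPy's built-in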
Handling Redundant Data in Data Integration
If the resulting correlation is greater than 0, then A and B are positively correlated, i.e., the values of A increase as the values of B increase. The higher the value, the more strongly each attribute implies the other, and a high value indicates that A (or B) may be redundant and could be removed.
If the resulting value is 0 (zero), then A and B are independent and there is no correlation between them.
If the resulting value is less than 0 (negative), then A and B are negatively correlated: as the values of one attribute increase, the values of the other decrease. Each attribute discourages the other.
Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.
Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation
Data Transformation
Smoothing: remove noise from the data, e.g., using techniques such as binning, clustering, and regression.
Aggregation: summarization, data cube construction. E.g., daily sales data may be aggregated into weekly, monthly, and annual sales. This technique is basically used for constructing a data cube for analysis of the data at multiple levels of granularity.
Generalization: concept hierarchy climbing; low-level or primitive data are replaced by higher-level concepts through the use of a concept hierarchy. E.g., an attribute like street can be generalized to higher-level concepts like city or country. Similarly, age can be mapped to higher-level concepts such as young, middle-aged, and old (the data become more general and less detailed).
Data Transformation
Normalization: values are scaled to fall within a small, specified range, e.g., -1.0 to 1.0 or 0.0 to 1.0:
  min-max normalization
  z-score normalization
  normalization by decimal scaling
Attribute/feature construction (derived attributes): new attributes constructed from the given ones.
Data Transformation: Normalization
Min-max normalization (see the employee salary attribute example):

v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A

z-score normalization (see the example of how it transforms employee salary):

v' = \frac{v - mean_A}{stand\_dev_A}

Normalization by decimal scaling:

v' = \frac{v}{10^{j}}, where j is the smallest integer such that \max(|v'|) < 1
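A short NumPy sketch of the three normalization methods; the salary-like values are hypothetical:

import numpy as np

v = np.array([12000.0, 16000.0, 54000.0, 73600.0, 98000.0])   # e.g., salaries

# Min-max normalization to the new range [0.0, 1.0]
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# z-score normalization
zscore = (v - v.mean()) / v.std(ddof=1)

# Decimal scaling: divide by 10^j, with j the smallest integer giving max(|v'|) < 1
j = 0
while np.abs(v).max() / (10 ** j) >= 1:
    j += 1
decimal = v / (10 ** j)

print(minmax, zscore, decimal, sep="\n")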
Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation
Data Reduction Strategies
A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set.
Data reduction obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
Data reduction strategies:
  Data cube aggregation
  Dimensionality reduction
  Numerosity reduction
  Discretization and concept hierarchy generation
Data Reduction Strategies: Data Cube Aggregation(1)
The lowest level of a data cube holds the aggregated data for an individual entity of interest, e.g., a customer in a phone-calling data warehouse.
Suppose we have collected data from an electronics shop consisting of sales per quarter for the years 2000, 2001, and 2002. If we are interested in annual sales (i.e., the total per year) rather than the total per quarter, the data can be aggregated so that the result summarizes total sales per year instead of per quarter. The resulting data set is smaller in size, without loss of any information necessary for the analysis task.
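A minimal pandas sketch of this kind of aggregation, using made-up quarterly sales figures:

import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [224, 408, 350, 586, 230, 420, 360, 600],
})

# Aggregate away the quarter dimension: one (smaller) row per year.
annual = quarterly.groupby("year", as_index=False)["sales"].sum()
print(annual)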
Figure: Sales data for a given branch of AllElectronics for the years 1997 to 1999. On the left, the sales are shown per quarter. On the right, the data are aggregated to provide the annual sales.
Figure: A data cube for sales at AllElectronics.
Data Reduction Strategies: Data Cube Aggregation(2)
Multiple levels of aggregation in data cubes further reduce the size of the data to deal with. Each cell holds an aggregate value corresponding to a data point in multidimensional space. A concept hierarchy may exist for each attribute, allowing data to be analyzed at multiple levels of abstraction; e.g., a hierarchy for branch may allow branches to be grouped into regions based on their address.
A data cube provides fast access to precomputed, summarized data, so it helps:
  Reference the appropriate levels.
  Use the smallest representation that is enough to solve the task.
Queries regarding aggregated information should be answered using the data cube, when possible.
Data Reduction Strategies: Dimensionality Reduction(1)
Feature selection (i.e., attribute subset selection): select a minimum set of features such that the probability distribution of the different classes given the values of those features is as close as possible to the original distribution given the values of all features.
E.g., a data set may consist of hundreds of attributes, many of which may be irrelevant to the mining task or redundant. When deciding whether a customer is likely to purchase a popular new CD, an attribute such as the customer's telephone number is likely to be irrelevant compared with attributes such as age and music taste.
Picking out useful attributes can be difficult and time consuming when the behavior of the data is unknown, and keeping irrelevant attributes in the data set can slow down the mining process.
Feature selection also reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand.
Data Reduction Strategies: Dimensionality Reduction(2)
Dimensionality reduction reduces the data size by removing irrelevant attributes. Typically, methods of attribute subset selection are applied. The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the distribution obtained using all attributes.
Benefits of a reduced set of attributes:
  Reduces the number of attributes appearing in the discovered patterns
  Helps make the patterns easier to understand
HOW CAN WE FIND A GOOD SUBSET OF THE ORIGINAL ATTRIBUTES?
Heuristic methods (needed because of the exponential number of possible subsets): heuristic methods that reduce the search space are commonly used for attribute subset selection. They are typically greedy: at each step they make the choice that looks best at the time, in the hope of leading to an optimal (or near-optimal) solution. Such (statistical) methods often assume that attributes are independent of one another. Many attribute evaluation measures can be used, such as the information gain measure used to build decision trees for classification.
Heuristic Feature Selection Methods
There are 2^d possible sub-features of d features. Several heuristic feature selection methods exist:
  Best single features under the feature independence assumption: choose by significance tests.
  Best step-wise feature selection: the best single feature is picked first; then the next best feature conditioned on the first, and so on.
  Step-wise feature elimination: repeatedly eliminate the worst feature.
  Best combined feature selection and elimination.
  Optimal branch and bound: use feature elimination and backtracking.
Techniques of basic heuristic methods of attribute subset selection:
  Step-wise forward selection
  Step-wise backward elimination
  Combining forward selection and backward elimination
  Decision-tree induction
STEPWISE FORWARD SELECTION
1. Start with an empty set of attributes { }.
2. The best of the original attributes is determined and added to the set.
3. At each subsequent step/iteration, the best of the remaining original attributes is added to the set.
E.g., forward selection:
Initial attribute set: {A1, A2, A3, A4, A5, A6}
{ } => {A1} => {A1, A4} => reduced attribute set: {A1, A4, A6}
STEPWISE BACKWARD ELIMINATION
1. This procedure starts with the full set of attributes.
2. At each step, it removes the worst of the attributes remaining in the set.
E.g., initial attribute set: {A1, A2, A3, A4, A5, A6}
{A1, A3, A4, A5, A6} => {A1, A4, A5, A6} => reduced attribute set: {A1, A4, A6}
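A rough Python sketch of greedy stepwise forward selection; the score() function is a placeholder for whatever subset-quality measure is used (e.g., cross-validated accuracy of a classifier trained on the candidate subset):

def forward_selection(attributes, score, k):
    """Greedily pick k attributes; score(subset) is assumed to return
    a quality measure (higher is better) for that attribute subset."""
    selected = []                          # start with the empty set { }
    remaining = list(attributes)
    while remaining and len(selected) < k:
        # at each step, add the best of the remaining original attributes
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# e.g. forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], score=my_cv_accuracy, k=3)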
Figure: Greedy (heuristic) methods for attribute subset selection
Data Reduction Strategies: Data Compression
String compression:
  There are extensive theories and well-tuned algorithms.
  Typically lossless.
  Only limited manipulation is possible without expansion.
Audio/video compression:
  Typically lossy compression, with progressive refinement.
  Sometimes small fragments of the signal can be reconstructed without reconstructing the whole.
Data Reduction Strategies: Data Compression
[Figure: the original data is reduced to compressed data; lossless compression recovers the original data exactly, while lossy compression recovers only an approximation of the original data]
Data Reduction Strategies: Wavelets
Discrete wavelet transform (DWT): linear signal processing, multiresolution analysis.
Compressed approximation: store only a small fraction of the strongest wavelet coefficients.
Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space.
DWT uses a hierarchical pyramid algorithm that halves the data at each iteration, resulting in fast computational speed.
Data Reduction Strategies: Wavelets
Method:
1) The length, L, of the input data vector must be an integer power of 2. This condition can be met by padding the data vector with zeros as necessary.
2) Each transform involves applying two functions. The first applies some data smoothing, such as a sum or weighted average. The second performs a weighted difference, which acts to bring out the detailed features of the data.
3) The two functions are applied to pairs of the input data, resulting in two sets of data of length L/2. In general, these represent a smoothed (low-frequency) version of the input data and its high-frequency content, respectively.
4) The two functions are recursively applied to the data sets obtained in the previous loop, until the resulting data sets are of length 2.
5) A selection of values from the data sets obtained in the above iterations is designated the wavelet coefficients of the transformed data.
Note: the transform matrix must be orthonormal, so that its inverse is just its transpose.
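A small sketch of one level of the Haar DWT, assuming the PyWavelets (pywt) package is available; the eight-point input vector is made up, and the result is a smoothed (approximation) set and a detail set, each of length L/2:

import pywt

data = [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]   # length is a power of 2

# One level of the Haar transform: an averaging (low-frequency) part and a
# differencing (high-frequency/detail) part, each of length L/2.
approx, detail = pywt.dwt(data, "haar")
print(approx, detail)

# Keeping only the strongest coefficients gives a compressed approximation;
# pywt.idwt(approx, detail, "haar") reconstructs the original signal.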
Figure: Examples of wavelet families. The number next to a wavelet name is the number of vanishing moments of the wavelet. This is a set of mathematical relationships that the coefficients must satisfy and is related to the number of coefficients.
Data Reduction Strategies: PCA
Principal Component Analysis (PCA): also known as the Karhunen-Loève (K-L) method.
Given N data vectors from k dimensions, find c <= k orthogonal vectors that can best be used to represent the data.
The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions).
Each data vector is a linear combination of the c principal component vectors.
Works for numeric data only.
Used when the number of dimensions is large.
Data Reduction Strategies: PCA
PCA procedure:
1. The input data are normalized, so that each attribute falls within the same range. This step helps ensure that attributes with large domains do not dominate attributes with smaller domains.
2. PCA computes c orthonormal vectors that provide a basis for the normalized input data. These are unit vectors, each pointing in a direction perpendicular to the others. These vectors are referred to as the principal components. The input data are linear combinations of the principal components.
3. The principal components are sorted in order of decreasing "significance" or strength. They essentially serve as a new set of axes for the data, providing important information about variance: the first axis shows the most variance among the data, the second axis shows the next highest variance, and so on (see the figure on the next page).
4. Since the components are sorted in decreasing order of "significance", the size of the data can be reduced by eliminating the weaker components, that is, those with low variance. Using the strongest principal components, it should be possible to reconstruct a good approximation of the original data.
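A compact NumPy sketch of this procedure on a small made-up data matrix (rows = objects, columns = attributes):

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

Xc = X - X.mean(axis=0)                     # step 1: center the input data
cov = np.cov(Xc, rowvar=False)              # covariance matrix of the attributes
eigvals, eigvecs = np.linalg.eigh(cov)      # orthonormal principal components

order = np.argsort(eigvals)[::-1]           # sort by decreasing "significance"
components = eigvecs[:, order]

c = 1                                       # keep only the strongest c components
reduced = Xc @ components[:, :c]            # N objects described by c values each
approx = reduced @ components[:, :c].T + X.mean(axis=0)   # approximate reconstruction
print(reduced.shape, approx.shape)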
Figure: Principal component analysis. Y1 and Y2 are the first two principal components of the given data, which were originally expressed in terms of the axes X1 and X2.
Data Reduction Strategies: Numerosity Reduction
Parametric methods:
  Assume the data fits some model; estimate the model parameters, store only the parameters, and discard the data (except possible outliers).
  Log-linear models: obtain the value at a point in m-dimensional space as a product over appropriate marginal subspaces.
Non-parametric methods:
  Do not assume models.
  Major families: histograms, clustering, sampling.
Data Reduction Strategies: Numerosity Reduction
Linear regression: Y = α + βX. The two parameters α and β specify the line and are estimated from the data at hand, applying the least-squares criterion to the known values Y1, Y2, ... and X1, X2, ....
Multiple regression: Y = b0 + b1 X1 + b2 X2. Many nonlinear functions can be transformed into this form.
Log-linear models: the multi-way table of joint probabilities is approximated by a product of lower-order tables. Probability: p(a, b, c, d) = α_{ab} β_{ac} χ_{ad} δ_{bcd}.
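A minimal NumPy sketch of the parametric idea: estimate α and β by least squares and keep only the two parameters instead of the raw data (the x/y values are made up):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Least-squares estimates of the line Y = alpha + beta * X
beta, alpha = np.polyfit(x, y, deg=1)    # polyfit returns the highest-degree coefficient first
print(alpha, beta)

# The data set can now be replaced by the two parameters (alpha, beta);
# individual values are reproduced (approximately) as alpha + beta * x.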
Histograms
A popular data reduction technique: divide the data into buckets and store the average (or sum) for each bucket.
Can be constructed optimally in one dimension using dynamic programming.
Related to quantization problems.
[Figure: a histogram of prices, with bucket counts (0-40) on the vertical axis and price values from roughly 10,000 to 90,000 on the horizontal axis]
Discretization: Equal-Width
Temperature values: 64 65 68 69 70 71 72 72 75 75 80 81 83 85
Equal-width bins (Low <= value < High):
Bin:    [64,67)  [67,70)  [70,73)  [73,76)  [76,79)  [79,82)  [82,85]
Count:     2        2        4        2        0        2        2
Discretization: Equal-Width may produce clumping
[Figure: equal-width binning of salaries in a corporation; almost all values fall in the first bin [0 - 200,000), while the last bin [1,800,000 - 2,000,000] contains only a single value]
Discretization: Equal-Height
Temperature values: 64 65 68 69 70 71 72 72 75 75 80 81 83 85
Equal height = 4, except for the last bin:
Bin:    [64 .. 69]  [70 .. 72]  [73 .. 81]  [83 .. 85]
Count:      4           4           4           2
Discretization: Equal-Height Advantages
Generally preferred because it avoids clumping.
In practice, "almost-equal" height binning is used, which avoids clumping and gives more intuitive breakpoints.
Additional considerations:
  don't split frequent values across bins
  create separate bins for special values (e.g., 0)
  use readable breakpoints (e.g., rounded breakpoints)
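In pandas, equal-width and (roughly) equal-height binning correspond to cut and qcut; a small sketch on the temperature values used above:

import pandas as pd

temps = pd.Series([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85])

equal_width  = pd.cut(temps, bins=7)    # 7 intervals of equal width
equal_height = pd.qcut(temps, q=4)      # 4 bins with (roughly) equal counts

print(equal_width.value_counts().sort_index())
print(equal_height.value_counts().sort_index())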
Histogram
V-Optimal histogram: the histogram with the least variance, where histogram variance is a weighted sum of the original values that each bucket represents and the bucket weight is equal to the number of values in the bucket.
MaxDiff histogram: considers the difference between each pair of adjacent values; a bucket boundary is established between the pairs having the β − 1 largest differences, where β is the user-specified number of buckets.
V-Optimal and MaxDiff histograms tend to be the most accurate and practical.
Clustering
Partition the data set into clusters, and store only a representation of each cluster.
Can be very effective if the data is clustered, but not if the data is "smeared".
Clustering can be hierarchical, with the result stored in multidimensional index tree structures.
There are many choices of clustering definitions and clustering algorithms.
Sampling
Allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data.
Choose a representative subset of the data: simple random sampling may perform very poorly in the presence of skew.
Develop adaptive sampling methods, e.g., stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database; used in conjunction with skewed data.
Note: sampling may not reduce database I/Os (data is read a page at a time).
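A short pandas/NumPy sketch of the sampling schemes mentioned here, applied to a hypothetical table with a skewed class column:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cls":   rng.choice(["A", "B"], size=1000, p=[0.9, 0.1]),   # skewed class distribution
    "value": rng.normal(size=1000),
})

srswor = df.sample(n=100, replace=False, random_state=0)   # simple random sample without replacement
srswr  = df.sample(n=100, replace=True, random_state=0)    # simple random sample with replacement

# Stratified sample: draw ~10% from each class, preserving the class proportions
stratified = df.groupby("cls").sample(frac=0.1, random_state=0)

print(len(srswor), len(srswr))
print(stratified["cls"].value_counts())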
Sampling
[Figure: raw data reduced by SRSWOR (simple random sampling without replacement) and by SRSWR (simple random sampling with replacement)]
Sampling
[Figure: raw data reduced to a cluster/stratified sample]
Figure: Sampling can be used for data reduction
Hierarchical Reduction
Use a multi-resolution structure with different degrees of reduction.
Hierarchical clustering is often performed but tends to define partitions of data sets rather than "clusters".
Parametric methods are usually not amenable to hierarchical representation.
Hierarchical aggregation:
  An index tree hierarchically divides a data set into partitions by the value range of some attributes.
  Each partition can be considered as a bucket.
  Thus an index tree with aggregates stored at each node is a hierarchical histogram.
Discretization
Three types of attributes:
  Nominal — values from an unordered set
  Ordinal — values from an ordered set
  Continuous — real numbers
Discretization: divide the range of a continuous attribute into intervals.
  Some classification algorithms only accept categorical attributes.
  Reduce data size by discretization.
  Prepare for further analysis.
Example: the values 10, 56, 68, 12, 45, 41, 23, 43, 13 discretized into equal-width intervals:
Range     Freq       Range     Freq
10 - 20     3        40 - 50     3
20 - 30     1        50 - 60     1
30 - 40     0        60 - 70     1
Outline
What is Data?
Why Preprocess the Data?
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Discretization and concept hierarchy generation
Discretization and Concept Hierarchy
Discretization reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
Concept hierarchies reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
Figure: A concept hierarchy for the attribute price.
Discretization and concept hierarchy generation for numeric data
  Binning (see earlier sections)
  Histogram analysis (see earlier sections)
  Clustering analysis (see earlier sections)
  Entropy-based discretization
  Segmentation by natural partitioning
Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

E(S, T) = \frac{|S_1|}{|S|}\,Ent(S_1) + \frac{|S_2|}{|S|}\,Ent(S_2)

The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g.,

Ent(S) - E(T, S) > \delta

Experiments show that it may reduce data size and improve classification accuracy.
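A sketch (standard library only) of evaluating candidate boundaries T under this criterion; the values and class labels are made up:

from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels))

values = [1, 2, 3, 8, 9, 10]
labels = ["lo", "lo", "lo", "hi", "hi", "hi"]

def split_entropy(T):
    left  = [c for v, c in zip(values, labels) if v <= T]
    right = [c for v, c in zip(values, labels) if v > T]
    return len(left) / len(labels) * entropy(left) + len(right) / len(labels) * entropy(right)

# Candidate boundaries are the midpoints between adjacent values;
# keep the one that minimizes E(S, T).
candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
best = min(candidates, key=split_entropy)
print(best, split_entropy(best))   # boundary 5.5 separates the classes perfectly (entropy 0)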
Figure: Automatic generation of a concept hierarchy for profit based on the 3-4-5 rule.
Concept hierarchy generation for categorical data
  Specification of a partial ordering of attributes explicitly at the schema level by users or experts
  Specification of a portion of a hierarchy by explicit data grouping
  Specification of a set of attributes, but not of their partial ordering
  Specification of only a partial set of attributes
Specification of a set of attributes
A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.

country              15 distinct values
province_or_state    65 distinct values
city                 3,567 distinct values
street               674,339 distinct values
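A tiny pandas sketch of this heuristic: order the attributes by their number of distinct values, with the most distinct at the bottom of the hierarchy (the example table is hypothetical):

import pandas as pd

df = pd.DataFrame({
    "country":           ["Canada", "Canada", "Canada", "USA", "USA"],
    "province_or_state": ["BC", "BC", "Ontario", "New York", "New York"],
    "city":              ["Vancouver", "Vancouver", "Toronto", "New York", "Buffalo"],
    "street":            ["1 Main St", "2 Oak St", "3 King St", "4 5th Ave", "5 Lake Dr"],
})

# Fewer distinct values -> higher (more general) level in the hierarchy.
hierarchy = df.nunique().sort_values().index.tolist()
print(hierarchy)   # ['country', 'province_or_state', 'city', 'street']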