Data Structures Notes for Lecture 17 Techniques of Data Mining By Samaher Hussein Ali 2007-2008
Anomaly Detection

What are anomalies/outliers? The set of data points that are considerably different from the remainder of the data.

Variants of Anomaly/Outlier Detection Problems
1. Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t.
2. Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x).
3. Given a database D containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D.
Applications: credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection.

Definition: Anomaly detection builds models of normal behavior (called profiles), which it uses to detect new patterns that significantly deviate from the profiles. Such deviations may represent actual intrusions or simply be new behaviors that need to be added to the profiles. The main advantage of anomaly detection is that it may detect novel intrusions that have not yet been observed. Typically, a human analyst must sort through the deviations to ascertain which represent real intrusions. A limiting factor of anomaly detection is the high percentage of false positives. New patterns of intrusion can be added to the set of signatures for misuse detection.

Anomaly detection can also be defined as follows: there are considerably more "normal" observations than "abnormal" observations (outliers/anomalies) in the data. Anomaly detection builds models of normal behavior and automatically detects significant deviations from it. Supervised or unsupervised learning can be used. In a supervised approach, the model is developed based on training data that are known to be "normal." In an unsupervised approach, no information is given about the training data. Anomaly detection research has included the application of classification algorithms, statistical approaches, clustering, and outlier analysis. The techniques used must be efficient and scalable, and capable of handling data of high volume, dimensionality, and heterogeneity.
Anomaly Detection Schemes

A. General Steps
1. Build a profile of the "normal" behavior
• The profile can be patterns or summary statistics for the overall population
2. Use the "normal" profile to detect anomalies
• Anomalies are observations whose characteristics differ significantly from the normal profile

B. Types of anomaly detection schemes
1. Graphical & Statistical-based
2. Distance-based
3. Model-based
1. Graphical Approaches
The following figure explains these methods.
2. Statistical-based – Likelihood Approach

(A) Assume the data set D contains samples from a mixture of two probability distributions:
• M (majority distribution)
• A (anomalous distribution)

(B) General Approach:
• Initially, assume all the data points belong to M
• Let L_t(D) be the log likelihood of D at time t
• For each point x_t that belongs to M, move it to A
1. Let L_{t+1}(D) be the new log likelihood.
2. Compute the difference, ∆ = L_t(D) – L_{t+1}(D)
3. If ∆ > c (some threshold), then x_t is declared as an anomaly and moved permanently from M to A

(C) Data distribution, D = (1 – λ) M + λ A

(D) M is a probability distribution estimated from data
• Can be based on any modeling method (naïve Bayes, maximum entropy, etc.)

(E) A is initially assumed to be a uniform distribution

(F) Likelihood at time t:
(G) Limitations of Statistical Approaches
• Most of the tests are for a single attribute
• In many cases, the data distribution may not be known
• For high dimensional data, it may be difficult to estimate the true distribution
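The likelihood approach of (A)–(F) carried through on constructed data, as an added worked example that is not part of the lecture: M is estimated from the data as a one-dimensional Gaussian, A is uniform over the observed range, and the change is computed as L_{t+1}(D) − L_t(D), the gain in log likelihood from moving x_t to A, so that a true anomaly gives a positive value. The sign convention, λ = 0.1, the threshold c = 5, the per-session byte counts and the repetition of the pass until no point moves are assumptions; the repeated pass shows how a moderate outlier (300) is masked by a larger one (1400) until the larger one has been moved to A.

```python
import math

# Constructed sample (not from the lecture): bytes transferred per user session.
# Ten ordinary sessions plus two unusual ones at 300 and 1400.
SESSION_BYTES = [98, 102, 105, 110, 99, 101, 104, 107, 103, 100, 300, 1400]


def fit_majority(points):
    """Estimate M from the data as a 1-D Gaussian (any modeling method could be used)."""
    n = len(points)
    mean = sum(points) / n
    var = max(sum((p - mean) ** 2 for p in points) / n, 1e-12)  # guard against zero spread
    return mean, var


def log_likelihood(majority, anomalous, lam, log_p_a):
    """Log likelihood of the current partition under D = (1 - lam) M + lam A."""
    mean, var = fit_majority(majority)
    ll = 0.0
    for x in majority:  # each point of M contributes log((1 - lam) * P_M(x))
        ll += math.log(1 - lam) - 0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
    ll += len(anomalous) * (math.log(lam) + log_p_a)  # each point of A: log(lam * P_A(x))
    return ll


def detect_anomalies(data, lam=0.1, c=5.0, max_passes=10):
    """Return the points moved from M to A, in the order they were flagged."""
    spread = (max(data) - min(data)) or 1.0
    log_p_a = -math.log(spread)  # A: uniform density over the observed range
    majority, anomalous = list(data), []
    for _ in range(max_passes):  # max_passes=1 is the single pass of step (B)
        moved = False
        for x in list(majority):
            if len(majority) < 3:  # keep enough points to estimate M
                break
            i = majority.index(x)
            l_t = log_likelihood(majority, anomalous, lam, log_p_a)
            del majority[i]  # tentatively move x from M to A
            l_next = log_likelihood(majority, anomalous + [x], lam, log_p_a)
            delta = l_next - l_t  # gain in log likelihood from the move
            if delta > c:
                anomalous.append(x)  # declared anomaly: moved permanently to A
                moved = True
            else:
                majority.insert(i, x)  # move rejected: x stays in M
        if not moved:
            break
    return anomalous


if __name__ == "__main__":
    print("single pass:", detect_anomalies(SESSION_BYTES, max_passes=1))
    print("repeated passes:", detect_anomalies(SESSION_BYTES))
```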
3. Distance-based Approaches

(A) Data is represented as a vector of features

(B) Three major approaches
• Nearest-neighbor based
• Density based
• Clustering based

Now, we explain the basic idea of the clustering based approach as follows:
1. Cluster the data into groups of different density
2. Choose points in small clusters as candidate outliers
3. Compute the distance between candidate points and non-candidate clusters
• If candidate points are far from all other non-candidate points, they are outliers