Download course of outliers detection and more Lecture notes Data Mining in PDF only on Docsity!
CEN 481 Introduction to Data Mining
Week 12
OUTLIER DETECTION
Fall 2019
Instructor: Dr. H. Esin ÜNAL
What Are Outliers?
Outlier (Anomaly) : A data object that deviates significantly from
the normal objects as if it were generated by a different mechanism
Ex.: Unusual credit card purchase, sports: Michael Jordon,
Wayne Gretzky, ...
Outliers are different from the noise data
Noise is random error or variance in a
measured variable
Noise should be removed before outlier
detection
Outliers are interesting: It violates the mechanism that
generates the normal data
What Are Outliers?
Applications:
Credit card fraud detection
Telecom fraud detection
Customer segmentation
Medical care
Public safety and security
Industry damage detection
Image processing
Sensor/Video network surveillance
Intrusion detection
Types of Outliers Outliers can be classified into three categories: Global outliers Contextual (or Conditional) outliers Collective outliers
- Contextual Outliers In a given data set, a data object is a contextual outlier if it deviates significantly with respect to a specific context of the object. Example: The temperature today is 28 ◦C. Is it exceptional (i.e., an outlier)?” It depends whether it is on summer or winter. In contextual outlier detection, the context has to be specified as part of the problem definition. Generally, in contextual outlier detection, the attributes of the data objects in question are divided into two groups: Contextual attributes : defines the context, e.g., time & location Behavioral attributes : characteristics of the object, used to evaluate whether the object is an outlier in the context to which it belongs., e.g., temperature, humidity, and pressure.
- Contextual Outliers
Unlike global outlier detection, in contextual outlier detection,
whether a data object is an outlier depends on not only the
behavioral attributes but also the contextual attributes.
Can be viewed as a generalization of local outliers —whose density
significantly deviates from its local area.
Contextual outlier analysis provides flexibility to users in that one
can examine outliers in different contexts, which can be highly
desirable in many applications.
- Collective Outliers Unlike global or contextual outlier detection, in collective outlier detection Consider not only behavior of individual objects, but also that of groups of objects Need to have the background knowledge on the relationship among data objects, such as a distance or similarity measure on objects. A data set may have multiple types of outlier. One object may belong to more than one type of outlier Global outlier detection is the simplest. Context outlier detection requires background information to determine contextual attributes and contexts. Collective outlier detection requires background information to model the relationship among objects to find groups of outliers.
Challenges of Outlier Detection
Modeling normal objects and outliers properly
Hard to enumerate all possible normal behaviors in an application
The border between normal and outlier objects is often a gray area
Application-specific outlier detection
Choice of distance measure among objects and the model of
relationship among objects are often application-dependent
E.g., clinic data: a small deviation could be an outlier; while in
marketing analysis, larger fluctuations
Dependency on the application type makes it impossible to
develop a universally applicable outlier detection method.
Outlier Detection Methods
Two ways to categorize outlier detection methods:
Based on whether user- labeled examples of outliers can be
obtained:
Supervised, semi-supervised vs. unsupervised methods
Based on assumptions about normal data and outliers :
Statistical, proximity-based, and clustering-based methods
Outlier Detection: Supervised Methods
Supervised methods model data normality and abnormality.
Modeling outlier detection as a classification problem:
The task is to learn a classifier that can recognize outliers.
Samples examined by domain experts used for training & testing
Methods for learning a classifier for outlier detection effectively:
Model normal objects and report those not matching the model as
outliers, or
Model outliers and treat those not matching the model as normal
Outlier Detection: Unsupervised Methods
In some application scenarios, objects labeled as “normal” or “outlier”
are not available. Thus, an unsupervised learning method has to be used.
Unsupervised outlier detection methods make an implicit assumption:
The normal objects are somewhat “clustered.”
They can form multiple groups, where each group has distinct features. An outlier is expected to be far away from any groups of normal objects
Weakness : Cannot detect collective outlier
effectively
Normal objects may not share any strong patterns (uniformly distributed), but the collective outliers may share high similarity in a small area
Outlier Detection: Unsupervised Methods As an example to this weakness: In some intrusion or virus detection normal activities are very diverse and many do not fall into high-quality clusters. Unsupervised methods may have a high false positive rate (=FP/N) i.e. they may mislabel many normal objects as outliers (intrusions or viruses in these applications), and let many actual outliers go undetected. Due to the high similarity between intrusions and viruses (i.e., they have to attack key resources in the target systems), modeling outliers using supervised methods may be far more effective. Many clustering methods can be adapted for unsupervised methods: The main idea: Find clusters, then outliers not belonging to any cluster Problem 1 : Hard to distinguish noise from outliers Problem 2 : It is often costly to find clusters first and then find outliers. Processing a large population of non-target data entries before touching the real meat. Newer methods: tackle outliers directly
Outlier Detection: Statistical Methods
Statistical methods (also known as model-based methods)
assume that the normal data follow some statistical model (a
stochastic model)
The data not following the model are outliers.
Example (right figure) :
First use Gaussian distribution to model the
normal data
For each object y in region R, estimate gD(y), the
probability of y fits the Gaussian distribution
If gD(y) is very low, y is unlikely generated by the
Gaussian model, thus an outlier.
Outlier Detection: Proximity-Based Methods
An object is an outlier if the nearest neighbors of the object are far away,
i.e., the proximity of the object is significantly deviates from the proximity
of most of the other objects in the same data set
Example (right figure):
Model the proximity of an object using its 3 nearest neighbors Objects in region R are substantially different from other objects in the data set. For the two objects in R, their second and third nearest neighbors are dramatically more remote than those of any other objects. Thus the objects in R are outliers.