Docsity
Docsity

Prepare for your exams
Prepare for your exams

Study with the several resources on Docsity


Earn points to download
Earn points to download

Earn points by helping other students or get them with a premium plan


Guidelines and tips
Guidelines and tips

course of outliers detection, Lecture notes of Data Mining

how to detect outliers in data mining

Typology: Lecture notes

2019/2020

Uploaded on 01/07/2020

fatima-aabadi
fatima-aabadi 🇹🇷

1 document

1 / 22

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
CEN 481 Introduction to Data Mining
Week 12
OUTLIER DETECTION
Fall 2019
Instructor: Dr. H. Esin ÜNAL
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16

Partial preview of the text

Download course of outliers detection and more Lecture notes Data Mining in PDF only on Docsity!

CEN 481 Introduction to Data Mining

Week 12

OUTLIER DETECTION

Fall 2019

Instructor: Dr. H. Esin ÜNAL

What Are Outliers?

 Outlier (Anomaly) : A data object that deviates significantly from

the normal objects as if it were generated by a different mechanism

Ex.: Unusual credit card purchase, sports: Michael Jordon,

Wayne Gretzky, ...

 Outliers are different from the noise data

Noise is random error or variance in a

measured variable

Noise should be removed before outlier

detection

 Outliers are interesting: It violates the mechanism that

generates the normal data

What Are Outliers?

Applications:

Credit card fraud detection

Telecom fraud detection

Customer segmentation

Medical care

Public safety and security

Industry damage detection

Image processing

Sensor/Video network surveillance

Intrusion detection

Types of Outliers Outliers can be classified into three categories:  Global outliersContextual (or Conditional) outliersCollective outliers

  1. Contextual Outliers In a given data set, a data object is a contextual outlier if it deviates significantly with respect to a specific context of the object.  Example: The temperature today is 28 ◦C. Is it exceptional (i.e., an outlier)?” It depends whether it is on summer or winter. In contextual outlier detection, the context has to be specified as part of the problem definition. Generally, in contextual outlier detection, the attributes of the data objects in question are divided into two groups:  Contextual attributes : defines the context, e.g., time & location  Behavioral attributes : characteristics of the object, used to evaluate whether the object is an outlier in the context to which it belongs., e.g., temperature, humidity, and pressure.
  1. Contextual Outliers

Unlike global outlier detection, in contextual outlier detection,

whether a data object is an outlier depends on not only the

behavioral attributes but also the contextual attributes.

Can be viewed as a generalization of local outliers —whose density

significantly deviates from its local area.

Contextual outlier analysis provides flexibility to users in that one

can examine outliers in different contexts, which can be highly

desirable in many applications.

  1. Collective Outliers Unlike global or contextual outlier detection, in collective outlier detection  Consider not only behavior of individual objects, but also that of groups of objects  Need to have the background knowledge on the relationship among data objects, such as a distance or similarity measure on objects. A data set may have multiple types of outlier. One object may belong to more than one type of outlier  Global outlier detection is the simplest.  Context outlier detection requires background information to determine contextual attributes and contexts.  Collective outlier detection requires background information to model the relationship among objects to find groups of outliers.

Challenges of Outlier Detection

 Modeling normal objects and outliers properly

 Hard to enumerate all possible normal behaviors in an application

 The border between normal and outlier objects is often a gray area

 Application-specific outlier detection

 Choice of distance measure among objects and the model of

relationship among objects are often application-dependent

 E.g., clinic data: a small deviation could be an outlier; while in

marketing analysis, larger fluctuations

 Dependency on the application type makes it impossible to

develop a universally applicable outlier detection method.

Outlier Detection Methods

Two ways to categorize outlier detection methods:

Based on whether user- labeled examples of outliers can be

obtained:

 Supervised, semi-supervised vs. unsupervised methods

Based on assumptions about normal data and outliers :

 Statistical, proximity-based, and clustering-based methods

Outlier Detection: Supervised Methods

 Supervised methods model data normality and abnormality.

Modeling outlier detection as a classification problem:

The task is to learn a classifier that can recognize outliers.

Samples examined by domain experts used for training & testing

Methods for learning a classifier for outlier detection effectively:

 Model normal objects and report those not matching the model as

outliers, or

 Model outliers and treat those not matching the model as normal

Outlier Detection: Unsupervised Methods

In some application scenarios, objects labeled as “normal” or “outlier”

are not available. Thus, an unsupervised learning method has to be used.

 Unsupervised outlier detection methods make an implicit assumption:

The normal objects are somewhat “clustered.”

They can form multiple groups, where each group has distinct features. An outlier is expected to be far away from any groups of normal objects

 Weakness : Cannot detect collective outlier

effectively

Normal objects may not share any strong patterns (uniformly distributed), but the collective outliers may share high similarity in a small area

Outlier Detection: Unsupervised Methods  As an example to this weakness: In some intrusion or virus detection  normal activities are very diverse and many do not fall into high-quality clusters.  Unsupervised methods may have a high false positive rate (=FP/N) i.e. they may mislabel many normal objects as outliers (intrusions or viruses in these applications), and let many actual outliers go undetected.  Due to the high similarity between intrusions and viruses (i.e., they have to attack key resources in the target systems), modeling outliers using supervised methods may be far more effective. Many clustering methods can be adapted for unsupervised methods: The main idea: Find clusters, then outliers not belonging to any cluster  Problem 1 : Hard to distinguish noise from outliers  Problem 2 : It is often costly to find clusters first and then find outliers. Processing a large population of non-target data entries before touching the real meat.  Newer methods: tackle outliers directly

Outlier Detection: Statistical Methods

 Statistical methods (also known as model-based methods)

assume that the normal data follow some statistical model (a

stochastic model)

The data not following the model are outliers.

 Example (right figure) :

 First use Gaussian distribution to model the

normal data

 For each object y in region R, estimate gD(y), the

probability of y fits the Gaussian distribution

 If gD(y) is very low, y is unlikely generated by the

Gaussian model, thus an outlier.

Outlier Detection: Proximity-Based Methods

An object is an outlier if the nearest neighbors of the object are far away,

i.e., the proximity of the object is significantly deviates from the proximity

of most of the other objects in the same data set

 Example (right figure):

 Model the proximity of an object using its 3 nearest neighbors  Objects in region R are substantially different from other objects in the data set.  For the two objects in R, their second and third nearest neighbors are dramatically more remote than those of any other objects.  Thus the objects in R are outliers.