Monday, 13 August 2018

Anomaly Detection in High-Dimensional Data: Angle-Based Outlier Detection Technique



Angle-Based Outlier Detection (ABOD)

Before getting into the ABOD method, let's understand what an outlier is, the different types of methods used to detect outliers, and how ABOD differs from other outlier detection methods.


According to Hawkins' definition, “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.”
There are three main types of outlier detection methods:

1) Statistical or model-based methods: these include parametric and non-parametric approaches.

2) Proximity-based methods: these can be classified into three categories:

  • Cluster-based methods
  • Distance-based methods
  • Density-based methods

3) Angle-based methods


Statistical models are a relatively simple way of identifying abnormal data points. Such outliers can be identified even with a box plot or by looking at extreme values under a normal distribution.
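As a small illustration of the box-plot rule, the sketch below uses base R's boxplot.stats(), which reports the values lying more than 1.5 times the box length beyond the hinges. The vector x is made-up toy data, used here only for illustration.

```r
# Toy data (hypothetical): six ordinary readings and one extreme value
x <- c(10, 12, 11, 13, 12, 11, 50)

# boxplot.stats() returns the points lying beyond the whiskers in $out
box_outliers <- boxplot.stats(x)$out
print(box_outliers)
```

This works well for a single variable, but it looks at one dimension at a time and says nothing about points that are unusual only in combination.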
 
Model-based and proximity-based approaches, however, rely on an assessment of distances in the full-dimensional Euclidean data space. In high-dimensional data, these approaches are bound to deteriorate due to the notorious “curse of dimensionality”. You can read this article to learn more about it: Distance & Density Based Clustering
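To see why distances become less informative, here is a minimal sketch in base R. It assumes points drawn uniformly from the unit cube (an assumption made for the demonstration, not something taken from a real dataset) and measures how different the farthest and nearest neighbours of a query point are. The helper relative_contrast is a hypothetical name introduced for this example.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Relative contrast (max - min) / min of the distances from one query
# point to n random points drawn uniformly from the d-dimensional unit cube
relative_contrast <- function(d, n = 500) {
  points <- matrix(runif(n * d), nrow = n)
  query  <- runif(d)
  dists  <- sqrt(rowSums(sweep(points, 2, query)^2))  # Euclidean distances
  (max(dists) - min(dists)) / min(dists)
}

dims <- c(2, 10, 100, 1000)
contrast <- sapply(dims, relative_contrast)
print(data.frame(dimension = dims, relative_contrast = contrast))
```

As the dimension grows, the contrast shrinks: the nearest and the farthest point end up at almost the same distance, which is what hurts purely distance-based outlier scores.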




The idea behind the ABOD algorithm is to find outliers based on the variance of the angles between the difference vectors of data objects in the dataset. This way, the effects of the “curse of dimensionality” are alleviated compared to purely distance-based approaches.

In the figure above, for an outlier point P, the angle between PX and PY for any two points X and Y from the database is substantially smaller than the corresponding angles seen from the inlier points Q and R. In other words, the angles seen from a far-away point are smaller than the angles seen from a point that lies among the data. If you think about it further, the variance of all the possible angles to the rest of the data points is also lower for far-away points than for points inside the data. Thus a data point with a low variance of angles is considered an outlier.
Angles are more stable than distances in high-dimensional spaces:
  • Object o is an outlier if most other objects are located in similar directions (low variance of angles)
  • Object o is not an outlier if many other objects are located in varying directions (high variance of angles)
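The intuition can be checked on toy data. The sketch below is only an illustration in base R: the data are made up (a 2-D Gaussian cluster plus one planted far-away point), angle_variance is a hypothetical helper, and it assumes there are no exact duplicate points.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Hypothetical toy data: 50 points around the origin plus one far-away point
cluster <- matrix(rnorm(100), ncol = 2)
data2d  <- rbind(cluster, c(10, 10))  # row 51 is the planted outlier

# Variance of the angles (in radians) at point i between the difference
# vectors to every pair of other points
angle_variance <- function(data, i) {
  diffs   <- sweep(data[-i, , drop = FALSE], 2, data[i, ])  # vectors from point i
  unit    <- diffs / sqrt(rowSums(diffs^2))                 # normalise each row
  cosines <- unit %*% t(unit)                               # cosine for every pair
  cosines <- cosines[upper.tri(cosines)]                    # each pair only once
  angles  <- acos(pmin(pmax(cosines, -1), 1))               # clamp rounding error
  var(angles)
}

var_outlier <- angle_variance(data2d, 51)
# Compare with the cluster point closest to the origin
var_inlier  <- angle_variance(data2d, which.min(rowSums(cluster^2)))
print(c(outlier = var_outlier, inlier = var_inlier))
```

Seen from the far-away point, all other points lie in roughly the same direction, so its angle variance comes out much lower than that of the central point.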


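In the actual implementation, the angle is not used on its own: it is also divided by the distances between the points, so that distance is taken into account as well (a point near the rest of the data may also see very small angles without being an outlier). For a point A and two other points B and C:
$\langle AB, AC \rangle$ is the dot product of the difference vectors AB and AC
$\lVert AB \rVert$ and $\lVert AC \rVert$ are the distances between A and B, and between A and C
So the cosine of the angle at A is
$\cos(AB, AC) = \frac{\langle AB, AC \rangle}{\lVert AB \rVert \, \lVert AC \rVert}$
and the cosine divided by the distances is
$\frac{\cos(AB, AC)}{\lVert AB \rVert \, \lVert AC \rVert} = \frac{\langle AB, AC \rangle}{\lVert AB \rVert^2 \, \lVert AC \rVert^2}$
To calculate the angle-based outlier factor of A, the variance of this quantity over all possible pairs B, C is taken. A lower value means more outlier-ness.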
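The calculation described above can be sketched directly in base R. This is a minimal illustration that follows the description in this post (a plain variance of the distance-weighted cosine), not the code of any particular package. The function abof and the toy data are hypothetical names and values introduced for the example.

```r
# Angle-based outlier factor of row i: variance, over all pairs (B, C) of
# other points, of <AB, AC> / (|AB|^2 * |AC|^2), where A is row i
abof <- function(data, i) {
  data  <- as.matrix(data)
  diffs <- sweep(data[-i, , drop = FALSE], 2, data[i, ])  # vectors AB for all B
  sq    <- rowSums(diffs^2)                               # squared distances |AB|^2
  keep  <- sq > 0                                         # skip exact duplicates of A
  diffs <- diffs[keep, , drop = FALSE]
  sq    <- sq[keep]
  weighted <- (diffs %*% t(diffs)) / outer(sq, sq)        # weighted cosine per pair
  var(weighted[upper.tri(weighted)])                      # each pair B, C only once
}

# Hypothetical toy data: a 2-D cluster plus one planted outlier in row 31
set.seed(7)
toy    <- rbind(matrix(rnorm(60), ncol = 2), c(8, 8))
scores <- sapply(seq_len(nrow(toy)), function(i) abof(toy, i))
print(which.min(scores))  # index of the point with the lowest factor
```

The point with the smallest score is the strongest outlier candidate. Note that this brute-force version looks at every pair of points for every point, so its cost grows cubically with the number of rows.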
Implementation of the ABOD method in R
# Sub-setting the data: keep the four numeric measurement columns
iris_dataset <- iris[, 1:4]

# Running ABOD (requires the abodOutlier package to be installed)
angular_distance <- abodOutlier::abod(iris_dataset, method = "complete")

# Plotting the data, coloured by the ABOD score of each point
library(ggplot2)
gg <- ggplot(data = iris_dataset, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(col = angular_distance))
plot(gg)

Here the darker points (smaller angular distance) are clearly visible as outliers (abnormal data points).

