Tuesday, 25 September 2018

Connectivity Based Outlier Detection and its implementation in R

Identifying abnormality in any industrial process, banking fraud, ad clicks etc is one of the major challenges for data scientist. There are many ways of detecting an abnormality.

different ways of detecting abnormalities through machine learning

There are many outlier detection techniques. One of these is connectivity based outlier factor. It is an improved version of LOF (local outlier factor) technique. 

data point away in linear set of data points
 should have been picked as outlier

The idea of Connectivity based outlier algorithm is to assign degree of outlier to each data point. This degree of outlier is called connectivity based outlier factor; COF of the data point. High COF value of data point represent the high probability of being an outlier.
Let’s understand COF step by step with an example.

Below diagram shows 9 data points in the plane. As we can see there are 2 data points P1 and P2 which are away from the trend line and seems outlier. The COF value for P1 and P2 should be higher than other data points in the trend line. Here we are taking k=5 nearest neighbor for COF calculation.
Following steps to compute the COF value for a data point P1.

1)   Find k nearest neighbor (k-NN) of the data point P. (k=5)

N5 (P1) = {P2, P5, P4, P7, P6} create set of all data points nearer to P1.

2)   Find Set based nearest (SBN) path: represent k nearest data points in order s={P1,P2,……., Pk}
SBN path = {P1, P2, P5, P4, P6, P7}, arrange data points in such a way that it should create a path, like P2 is the nearest data point from P1 then P5 is the nearest data point from P2, then either P6 or P4 can be choose as nearest data from P5 then P7 is the nearest data point from P6. All chosen data points must be available in nearest neighbor data points N5 (P1) set.

3)    Find set based nearest (SBN) trail: represent sequence of edges based on SBN path e={e1,e2, …,ek}. SBN trail = {(P1, P2), (P2, P5), (P5, P4), (P5, P6), (P6, P7)} arrange set of data points with respect to edges e1, e2, e3, e4, e5 respectively.

4)    Find the cost of SBN trail: represent the distance between 2 data point (edge value) - Cost description = {3, 2, 1, 1, 1} weight of each edge.

5)   Find Average chaining distance of the data point
dist(ei) denotes distance between 2 data points, an edge, ex-

Like P1, find average chaining distance for all 5 nearest neighbor P2, P4, P5, P6 and P7.
Formula Explanation:
Total no of edges = {(P1,P2),(P1,P4),(P1,P5),(P1,P6),(P1,P7),(P2,P4),(P2,P5),(P2,P6), (P2,P7),(P4,P5),(P4,P6),(P4,P7),(P5,P6),(P5,P7),(P6,P7)} =15

k(k+1)/2 = 5(5+1)/2=15

·       Sum of all edges weight during traversal of nearest data point
Ø Edge weight from P1 to P2 = 3
Ø Edge weight from P1 to P5 = 3+2 =5
Ø Edge weight from P1 to P4 = 3+2+1 =6
Ø Edge weight from P1 to P6 = 3+2+1+1 =7
Ø Edge weight from P1 to P7 = 3+2+1+1+1 =8
Total edge weight =(3+5+6+7+8) = 29

ac-dist(P1) = 29/15 = 1.933

6)     Find COF value of the data point-

COF is the ratio of average chaining distance of data point and the average of average chaining distance of k nearest neighbor of the data point.

 Like COF(P1), find COF for all the data points available in diagram, the data points having high COF values will be considered as outliers.

Darker data points showing most outlier data points. One can compare CBOF with Angle based outlier detection techniques ( ABOD).

No comments:

Post a Comment