Identifying
abnormality in any industrial process, banking fraud, ad clicks etc is one of
the major challenges for data scientist. There are many ways of detecting an
abnormality.

different ways of detecting abnormalities through machine learning

There are many outlier detection techniques. One of these is connectivity based outlier factor. It is an improved version of LOF (local outlier factor) technique.

Like COF(P1), find COF for all the data points available in diagram, the
data points having high COF values will be considered as outliers.

different ways of detecting abnormalities through machine learning

There are many outlier detection techniques. One of these is connectivity based outlier factor. It is an improved version of LOF (local outlier factor) technique.

data point away in linear set of data points should have been picked as outlier |

The idea of
Connectivity based outlier algorithm is to assign degree of outlier to each
data point. This degree of outlier is called connectivity based outlier factor;
COF of the data point. High COF value of data point represent the high
probability of being an outlier.

Let’s
understand COF step by step with an example.

Below
diagram shows 9 data points in the plane. As we can see there are 2 data points
P1 and P2 which are away from the trend line and seems outlier. The COF value
for P1 and P2 should be higher than other data points in the trend line. Here
we are taking k=5 nearest neighbor for COF calculation.

Following
steps to compute the COF value for a data point P1.

**1)****Find k nearest neighbor (k-NN)**of the data point P. (k=5)
N

_{5 }(P1) = {P2, P5, P4, P7, P6} create set of all data points nearer to P1.

**2)**

**Find Set based nearest (SBN) path:**represent k nearest data points in order s={P

_{1},P

_{2},……., P

_{k}}

SBN path = {P1,
P2, P5, P4, P6, P7}, arrange data points in such a way that it should create a
path, like P2 is the nearest data point from P1 then P5 is the nearest data
point from P2, then either P6 or P4 can be choose as nearest data from P5 then
P7 is the nearest data point from P6. All chosen data points must be available
in nearest neighbor data points N

_{5 }(P1) set.

**3)**

**Find set based nearest (SBN) trail:**represent sequence of edges based on SBN path e={e

_{1},e

_{2}, …,e

_{k}}. SBN trail = {(P1, P2), (P2, P5), (P5, P4), (P5, P6), (P6, P7)} arrange set of data points with respect to edges e1, e2, e3, e4, e5 respectively.

**4)**

**Find the cost of SBN trail:**represent the distance between 2 data point (edge value) - Cost description = {3, 2, 1, 1, 1} weight of each edge.

**5)**

**Find Average chaining distance of the data point**

dist(e

_{i}) denotes distance between 2 data points, an edge, ex-
Like P1, find average
chaining distance for all 5 nearest neighbor P2, P4, P5, P6 and P7.

Formula Explanation:

Total
no of edges = {(P1,P2),(P1,P4),(P1,P5),(P1,P6),(P1,P7),(P2,P4),(P2,P5),(P2,P6),
(P2,P7),(P4,P5),(P4,P6),(P4,P7),(P5,P6),(P5,P7),(P6,P7)} =15

k(k+1)/2 = 5(5+1)/2=15

·
Sum of
all edges weight during traversal of nearest data point

Ø Edge weight from P1 to P2 = 3

Ø Edge weight from P1 to P5 = 3+2 =5

Ø Edge weight from P1 to P4 = 3+2+1 =6

Ø Edge weight from P1 to P6 = 3+2+1+1 =7

Ø Edge weight from P1 to P7 = 3+2+1+1+1 =8

Total edge weight =(3+5+6+7+8) = 29

ac-dist(P1) = 29/15 = 1.933

**6)**

**Find COF value of the data point-**

COF
is the ratio of average chaining distance of data point and the average of
average chaining distance of k nearest neighbor of the data point.

Darker
data points showing most outlier data points. One can compare CBOF with Angle
based outlier detection techniques ( ABOD).

## No comments:

## Post a Comment