Anomaly Detection

What is Anomaly Detection?

  • Finding a subset of instances whose characteristics differ from the remainder of the data
  • Think Outliers in Data!!

Example Applications

  • Detecting fraudulent purchases on credit cards
  • Detecting deforestation using remote sensing data
  • Smart homes/buildings
    • Water or electricity theft detection
    • Pipe burst detection

Challenges in Anomaly Detection

  • Like finding a needle in a haystack
    • anomalies are very rare
  • The number of anomalies is usually unknown
  • The method is typically unsupervised
    • therefore validation is hard

Output of Anomaly Detection

Continuous-Valued Output

  • Every data instance is assigned a real-valued anomaly score

Binary-Valued Output

  • A threshold is needed to convert the anomaly score into a binary label
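
For example, a minimal sketch of converting scores to binary labels (the random scores and the 95th-percentile cutoff below are illustrative assumptions, not from the notes):

import numpy as np

scores = np.random.rand(100)                 # hypothetical anomaly scores
threshold = np.percentile(scores, 95)        # e.g., flag the top 5% as anomalies
labels = (scores > threshold).astype(int)    # 1 = anomaly, 0 = normal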

How do we Approach/Strategize for Anomaly Detection

  • Assume that there are many more “normal” instances than “anomalous” ones

  • The general approach is to build a profile of the normal behavior, then use that profile to flag instances that deviate from it as anomalies

Z-Score Approach

  • Think back to statistics and the Gaussian (normal) distribution: the z-score z = (x - mean) / std measures how many standard deviations a value lies from the mean, so values with a large absolute z-score (e.g., |z| > 3) can be flagged as anomalies, as in the sketch below
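
A minimal sketch of z-score anomaly detection (the synthetic data and the |z| > 3 cutoff are illustrative assumptions):

import numpy as np

x = np.random.randn(1000)                 # synthetic, roughly Gaussian data
x[0] = 8.0                                # plant an obvious outlier
z = (x - x.mean()) / x.std()              # z-score: std devs away from the mean
anomalies = np.where(np.abs(z) > 3)[0]    # flag points more than 3 std devs out
print(anomalies)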

Distance-Based Approach

  • In this method, the inputs are the data points and k, the number of nearest neighbors to consider

  • The approach is to compute the distance between every pair of data points; the anomaly score of a data point is given by its distance to its k-th nearest neighbor (the larger the distance, the more anomalous the data point is!)

Python Example of Distance-Based Approach

  • Compute the distance between every pair of data points
  • Anomaly score of a data point is given by its distance to the k-th nearest neighbor
    • The larger distance, the more anomalous is the data point
import pandas as pd

# each row of the CSV is one time series
data = pd.read_csv('Synthetic_control_sample.csv', header=None)
data.head()

Output:

          0        1        2        3        4        5         6        7        8       9  ...       50       51        52       53       54      55       56       57        58        59
0  0.448130  0.30272  0.39039  1.55520  1.36110  1.00610  0.636930 -0.51374 -0.69051 -1.5134  ... -0.89196 -0.73934  0.821470  1.13340  0.52480  1.5382  1.05790  0.10072 -0.079991 -0.734820
1  0.324930  0.92102  0.67606  1.56710  0.62195  0.23225  0.699970 -0.29080 -1.06150 -1.1291  ... -0.89692 -1.61170  0.049964 -0.20141  0.96575  1.5515  1.34120  0.54099  0.488130  0.075192
2  0.065106  0.27966  1.60660  0.90703  0.31790 -0.38201 -0.071902 -1.69230 -1.05300 -1.0928  ... -0.85534 -1.61720 -0.786690 -0.44217  0.61959  1.4380  1.21310  1.20440  0.411550 -0.733190
3 -0.197290  0.86487  0.91300  1.10690  1.13040  0.22366 -0.070158 -0.91154 -1.32590 -1.3727  ... -1.51920 -1.82530 -0.541960 -0.64238  0.20283  1.1598  1.76730  1.27050  0.200010 -0.351930
4 -0.295140  0.27611  1.36790  1.12880  0.68236  0.27383 -0.083935 -0.64006 -1.38620 -1.1307  ... -1.03950 -1.28350 -1.317100 -1.13480 -0.46492  0.5699  0.95651  0.87453  0.811540 -0.451390

5 rows × 60 columns

Let's graph our data above:

  • We can plot each row as a time series
import matplotlib.pyplot as plt
%matplotlib inline

fig = plt.figure()
fig.add_subplot(2,2,1)
plt.plot(data.iloc[5].T)
fig.add_subplot(2,2,2)
plt.plot(data.iloc[10].T)
fig.add_subplot(2,2,3)
plt.plot(data.iloc[45].T)
fig.add_subplot(2,2,4)
plt.plot(data.iloc[50].T)

Output:

insert image

Now let's plot the first 50 time series and the last 5 time series

fig = plt.figure()
fig.add_subplot(2,1,1)
plt.plot(data.iloc[:50].T)    # first 50 time series (normal)
fig.add_subplot(2,1,2)
plt.plot(data.iloc[50:55].T)  # last 5 time series (anomalous)

Output:

insert image

Now let's compute the pairwise distances between observations in n-dimensional space, using the correlation distance:

from scipy.spatial import distance

Y = distance.pdist(data.to_numpy(), 'correlation')  # condensed pairwise correlation distances
Y = distance.squareform(Y)                          # convert to a full n x n distance matrix
Y

Output:

array([[ 0.        ,  0.20237268,  1.16745543, ...,  1.07495438,
         1.02072118,  1.11979732],
       [ 0.20237268,  0.        ,  0.9807679 , ...,  1.09614197,
         1.01776417,  1.01390679],
       [ 1.16745543,  0.9807679 ,  0.        , ...,  0.94071103,
         0.97634081,  0.95014835],
       ..., 
       [ 1.07495438,  1.09614197,  0.94071103, ...,  0.        ,
         0.23309714,  0.29398105],
       [ 1.02072118,  1.01776417,  0.97634081, ...,  0.23309714,
         0.        ,  0.19070761],
       [ 1.11979732,  1.01390679,  0.95014835, ...,  0.29398105,
         0.19070761,  0.        ]])

Now let's compute each data point's distance to its k-th nearest neighbor and plot a histogram of the resulting anomaly scores:

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

knn = 6
index = np.argsort(Y, axis=1)    # sort each row by increasing distance
index = index[:,knn]             # k-th nearest neighbor (column 0 is the point itself)
knnDist = Y[np.arange(len(index)),index]  # distance to the k-th nearest neighbor
plt.hist(knnDist)

insert image

Identify the Outliers

outlier = np.flipud(np.argsort(knnDist))   # indices sorted by decreasing anomaly score
sort_dist = np.flipud(np.sort(knnDist))    # scores in decreasing order
p = pd.DataFrame(np.column_stack((outlier,sort_dist)),columns=['index','score'])
p.head()

Output:

   index     score
0   50.0  0.955814
1   53.0  0.945165
2   54.0  0.927089
3   51.0  0.902902
4   52.0  0.888707

Distance-based Outlier Detection (using scikit-learn)

from sklearn.neighbors import NearestNeighbors
import numpy as np
from scipy.spatial import distance

knn = 6
# n_neighbors = knn+1 because each point's nearest neighbor is the point itself
nbrs = NearestNeighbors(n_neighbors=knn+1, metric=distance.correlation).fit(data.to_numpy())
distances, indices = nbrs.kneighbors(data.to_numpy())
plt.hist(distances[:,knn])   # distance to the k-th nearest neighbor

insert image

outlier = np.flipud(np.argsort(distances[:,knn]))   # indices by decreasing anomaly score
sort_dist = np.flipud(np.sort(distances[:,knn]))    # scores in decreasing order

p = pd.DataFrame(np.column_stack((outlier,sort_dist)),columns=['index','score'])
p.head()

Output:

   index     score
0   50.0  0.955814
1   53.0  0.945165
2   54.0  0.927089
3   51.0  0.902902
4   52.0  0.888707

Model-Based Approach

  • Fit a model to the data; most models tend to capture the general characteristics of the data

  • Then apply the model to each data instance; the more anomalous an instance is, the easier it is to isolate from the rest of the data under the model

Isolation Forest

  • An isolation forest isolates points by recursively splitting on randomly chosen features at randomly chosen values; anomalies require fewer splits to isolate, so shorter average path lengths translate into higher anomaly scores

import numpy as np
X = np.array([0.1,0.8,0.84,0.87,0.89,0.92,0.95])   # toy 1-D data with one obvious outlier at 0.1
plt.plot(X,np.ones(len(X)),'ro')
plt.xlim(-0.1,1.1)
plt.ylim([0.95,1.5])

insert image
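
As a quick sanity check (not shown in the original notes), a minimal sketch of fitting an isolation forest on this toy array; scikit-learn expects a 2-D input, so X is reshaped into a single column:

from sklearn.ensemble import IsolationForest

toy_clf = IsolationForest(n_estimators=100, contamination=0.15, random_state=0)
toy_clf.fit(X.reshape(-1, 1))                # reshape 1-D data to an (n, 1) column
print(toy_clf.predict(X.reshape(-1, 1)))     # we would expect -1 for the isolated point 0.1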

from sklearn.ensemble import IsolationForest

clf = IsolationForest(n_estimators=100, max_samples=30, contamination=0.1)
clf.fit(data.to_numpy())
score = clf.predict(data.to_numpy())   # +1 = normal, -1 = anomaly
score

Output:

array([ 1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,
        1, -1, -1, -1])
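
If we want a continuous-valued output instead of binary labels (see “Output of Anomaly Detection” above), the fitted model also exposes decision_function; a minimal sketch, where lower values mean more anomalous:

scores = clf.decision_function(data.to_numpy())   # lower score = more anomalous
ranking = np.argsort(scores)                      # most anomalous instances first
print(ranking[:5])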

Exercises

Applying Anomaly Detection Algorithms to a Dataset

Load the dataset into a pandas DataFrame object

Note: the last column (6) corresponds to the true class of each data point (0: normal, 1: anomaly)


import pandas as pd

data = pd.read_csv('mammography.csv', header = None)
data.head()

Output:

        0       1       2       3       4       5  6
0 -0.5378 -0.3640  5.3141 -0.2190  0.2685  1.1080  0
1 -0.7844 -0.4702 -0.5916 -0.8596 -0.3779 -0.9457  0
2 -0.7844 -0.4702 -0.5916 -0.8596 -0.3779 -0.9457  0
3  0.3804 -0.0278 -0.4113  0.7261  3.5478  1.2421  0
4 -0.7844 -0.4702 -0.5916 -0.8596 -0.3779 -0.9457  0

Draw a scatter plot based on the 4th and 5th column in the DataFrame.

import matplotlib
%matplotlib inline

data.plot.scatter(x=3,y=4,c=6,colormap='cool')

Output:

insert image

Extract the last column of the dataframe and store it in a pandas Series object named classes. Remove the last column from the dataframe. Count the number of data points that belong to each class. You should also display the size of the remaining dataframe to prove that you have removed the last column.

classes = data[6]                 # true class labels (0: normal, 1: anomaly)
data = data.drop(columns=6)       # remove the label column from the features
print('Size of data:', data.shape)
print('Class distribution:')
print(classes.value_counts())

Output:

Size of data: (200, 6)
Class distribution:
0    195
1      5
Name: 6, dtype: int64

Anomaly Detection using Mahalanobis Distance
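
The notes leave this exercise blank; here is a minimal sketch, assuming we score each point by its Mahalanobis distance from the sample mean (using the inverse of the sample covariance matrix) and treat the largest distances as candidate anomalies:

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import distance

center = data.values.mean(axis=0)                # sample mean of the features
VI = np.linalg.inv(np.cov(data.values.T))        # inverse covariance matrix
maha = np.array([distance.mahalanobis(x, center, VI) for x in data.values])
plt.hist(maha)                                   # largest distances = candidate anomalies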

Anomaly Detection using Isolation Forest

  • Apply the isolation forest method to detect outliers!!
from sklearn.ensemble import IsolationForest

clf = IsolationForest(n_estimators=200, max_samples=50, contamination=0.025, random_state=1)

clf.fit(data.values)
result = clf.predict(data.values)   # +1 = normal, -1 = anomaly
result[result == 1] = 0             # relabel to match the true classes: 0 = normal
result[result == -1] = 1            # 1 = anomaly
result = pd.DataFrame(result, columns=['predicted'])
result

Output:

     predicted
0            0
1            0
2            0
3            0
4            0
...        ...
195          0
196          1
197          1
198          1
199          1

200 rows × 1 columns

Now let's check the accuracy as well as the confusion matrix of the prediction results against the true classes of the data points (rows are the true classes, columns are the predicted classes):

from sklearn.metrics import accuracy_score, confusion_matrix

print('Accuracy =', accuracy_score(classes,result))
cm = confusion_matrix(classes,result)
pd.DataFrame(cm)

Output:

Accuracy = 0.99

  0 1
0 194 1
1 1 4

Isolation Forest achieves a very high accuracy (0.99) on this dataset!!