Anomaly Detection
What is Anomaly Detection?
- Finding a subset of instances whose characteristics differ from the remainder of the data
- Think Outliers in Data!!
Example Applications
- Detecting fraudulent purchases on credit cards
- Detecting deforestation using remote sensing data
- Smart homes/buildings
- Water or electricity theft detection
- Pipe burst detection
Challenges in Anomaly Detection
- Like finding a needle in a haystack: anomalies are very rare
- The number of anomalies is usually unknown
- The method is unsupervised, which makes validation hard
Output of Anomaly Detection
Continuous-Valued Output
- Every data instance is assigned an anomaly score
Binary-Valued Output
- A threshold is needed to convert the anomaly score into a binary label, as sketched below
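For example (a minimal sketch on made-up scores; the 0.5 threshold is an arbitrary choice):

import numpy as np
# hypothetical anomaly scores (higher = more anomalous)
scores = np.array([0.10, 0.30, 0.20, 0.90, 0.15])
# any point scoring above the chosen threshold is labeled an anomaly
labels = (scores > 0.5).astype(int)   # 1: anomaly, 0: normal
print(labels)                         # [0 0 0 1 0]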
How do we Approach/Strategize for Anomaly Detection
- Assume that there are more "normal" instances than "anomalous" ones
- The general approach is to build a profile of the normal behavior and then use that profile to flag anomalies
Z-Score Approach
- Think back to statistics and the Gaussian (normal) distribution: points that lie many standard deviations from the mean are suspicious, as illustrated below
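A minimal sketch on made-up 1-D data (the sample and the 3-standard-deviation cutoff are illustrative assumptions):

import numpy as np
# made-up sample: 100 "normal" points around 10, plus one injected outlier
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(loc=10, scale=1, size=100), [25.0]])
# z-score: how many standard deviations each point lies from the mean
z = (x - x.mean()) / x.std()
# flag points whose |z| exceeds 3 (a common rule of thumb)
print(np.where(np.abs(z) > 3)[0])   # prints [100], the index of the injected outlier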
Distance-Based Approach
- Take as input the data points and k, the number of nearest neighbors to consider
- Compute the distance between every pair of data points; the anomaly score of a data point is its distance to its k-th nearest neighbor (the larger the distance, the more anomalous the data point!)
Python Example of Distance-Based Approach
- Compute the distance between every pair of data points
- Anomaly score of a data point is given by its distance to the k-th nearest neighbor
- The larger the distance, the more anomalous the data point
import pandas as pd
# each row is one time series of 60 observations
data = pd.read_csv('Synthetic_control_sample.csv', header=None)
data.head()
Output:
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | … | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.448130 | 0.30272 | 0.39039 | 1.55520 | 1.36110 | 1.00610 | 0.636930 | -0.51374 | -0.69051 | -1.5134 | … | -0.89196 | -0.73934 | 0.821470 | 1.13340 | 0.52480 | 1.5382 | 1.05790 | 0.10072 | -0.079991 | -0.734820 |
| 1 | 0.324930 | 0.92102 | 0.67606 | 1.56710 | 0.62195 | 0.23225 | 0.699970 | -0.29080 | -1.06150 | -1.1291 | … | -0.89692 | -1.61170 | 0.049964 | -0.20141 | 0.96575 | 1.5515 | 1.34120 | 0.54099 | 0.488130 | 0.075192 |
| 2 | 0.065106 | 0.27966 | 1.60660 | 0.90703 | 0.31790 | -0.38201 | -0.071902 | -1.69230 | -1.05300 | -1.0928 | … | -0.85534 | -1.61720 | -0.786690 | -0.44217 | 0.61959 | 1.4380 | 1.21310 | 1.20440 | 0.411550 | -0.733190 |
| 3 | -0.197290 | 0.86487 | 0.91300 | 1.10690 | 1.13040 | 0.22366 | -0.070158 | -0.91154 | -1.32590 | -1.3727 | … | -1.51920 | -1.82530 | -0.541960 | -0.64238 | 0.20283 | 1.1598 | 1.76730 | 1.27050 | 0.200010 | -0.351930 |
| 4 | -0.295140 | 0.27611 | 1.36790 | 1.12880 | 0.68236 | 0.27383 | -0.083935 | -0.64006 | -1.38620 | -1.1307 | … | -1.03950 | -1.28350 | -1.317100 | -1.13480 | -0.46492 | 0.5699 | 0.95651 | 0.87453 | 0.811540 | -0.451390 |
Let's graph our data above:
- We can plot each row as a time series
import matplotlib.pyplot as plt
%matplotlib inline
# plot three normal series (rows 5, 10, 45) and one anomalous series (row 50) in a 2x2 grid
fig = plt.figure()
fig.add_subplot(2,2,1)
plt.plot(data.iloc[5])
fig.add_subplot(2,2,2)
plt.plot(data.iloc[10])
fig.add_subplot(2,2,3)
plt.plot(data.iloc[45])
fig.add_subplot(2,2,4)
plt.plot(data.iloc[50])
Output: (a 2×2 grid of line plots, one per selected time series)
Now let's plot the first 50 time series and the last 5 time series:
fig = plt.figure()
fig.add_subplot(2,1,1)
plt.plot(data.iloc[:50].T)    # the first 50 (normal) time series
fig.add_subplot(2,1,2)
plt.plot(data.iloc[50:55].T)  # the last 5 (anomalous) time series
Output: (top panel: the first 50 time series; bottom panel: the last 5 time series)
Now let's compute the pairwise distances between observations in n-dimensional space:
from scipy.spatial import distance
# condensed vector of correlation distances between every pair of rows
Y = distance.pdist(data.to_numpy(), 'correlation')
# convert the condensed vector into a square, symmetric distance matrix
Y = distance.squareform(Y)
Y
Output:
array([[ 0. , 0.20237268, 1.16745543, ..., 1.07495438,
1.02072118, 1.11979732],
[ 0.20237268, 0. , 0.9807679 , ..., 1.09614197,
1.01776417, 1.01390679],
[ 1.16745543, 0.9807679 , 0. , ..., 0.94071103,
0.97634081, 0.95014835],
...,
[ 1.07495438, 1.09614197, 0.94071103, ..., 0. ,
0.23309714, 0.29398105],
[ 1.02072118, 1.01776417, 0.97634081, ..., 0.23309714,
0. , 0.19070761],
[ 1.11979732, 1.01390679, 0.95014835, ..., 0.29398105,
0.19070761, 0. ]])
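As expected, the distance matrix is symmetric with zeros on the diagonal, since every point is at distance zero from itself.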
Now let's compute each data point's distance to its k-th nearest neighbor and plot a histogram of the resulting anomaly scores:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
knn = 6
index = np.argsort(Y, axis=1)               # sort each row by increasing distance
index = index[:, knn]                       # column knn skips the point itself (distance 0), giving the k-th nearest neighbor
knnDist = Y[np.arange(len(index)), index]   # distance to the k-th nearest neighbor
plt.hist(knnDist)                           # histogram of the anomaly scores
Identify the Outliers
# sort the points by score in decreasing order; the highest scores are the strongest outlier candidates
outlier = np.flipud(np.argsort(knnDist))    # row indices, most anomalous first
sort_dist = np.flipud(np.sort(knnDist))     # the corresponding scores
p = pd.DataFrame(np.column_stack((outlier, sort_dist)), columns=['index', 'score'])
p.head()
Output:
|   | index | score |
|---|---|---|
| 0 | 50.0 | 0.955814 |
| 1 | 53.0 | 0.945165 |
| 2 | 54.0 | 0.927089 |
| 3 | 51.0 | 0.902902 |
| 4 | 52.0 | 0.888707 |
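The five highest-scoring points are rows 50-54, exactly the five time series we plotted separately above.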
Distance-based Outlier Detection (using scikit-learn)
from sklearn.neighbors import NearestNeighbors
import numpy as np
from scipy.spatial import distance
knn = 6
# ask for knn+1 neighbors because each point's nearest neighbor is itself
nbrs = NearestNeighbors(n_neighbors=knn+1, metric=distance.correlation).fit(data.to_numpy())
distances, indices = nbrs.kneighbors(data.to_numpy())
plt.hist(distances[:, knn])                 # distance to the k-th nearest neighbor
outlier = np.flipud(np.argsort(distances[:, knn]))
sort_dist = np.flipud(np.sort(distances[:, knn]))
p = pd.DataFrame(np.column_stack((outlier, sort_dist)), columns=['index', 'score'])
p.head()
Output:
|   | index | score |
|---|---|---|
| 0 | 50.0 | 0.955814 |
| 1 | 53.0 | 0.945165 |
| 2 | 54.0 | 0.927089 |
| 3 | 51.0 | 0.902902 |
| 4 | 52.0 | 0.888707 |
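The scores match the ones we computed by hand with scipy, confirming that the two implementations agree.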
Model-Based Approach
- Fit a model to the data; most models tend to capture the general characteristics of the data
- Then apply the model to each data instance: the more anomalous the instance, the easier it is to isolate from the rest of the data
Isolation Forest
import numpy as np
# toy 1-D example: 0.1 sits far from the cluster of points around 0.9
X = np.array([0.1, 0.8, 0.84, 0.87, 0.89, 0.92, 0.95])
plt.plot(X, np.ones(len(X)), 'ro')
plt.xlim(-0.1, 1.1)
plt.ylim([0.95, 1.5])
from sklearn.ensemble import IsolationForest
clf = IsolationForest(n_estimators=100, max_samples=30, contamination=0.1)
clf.fit(data.to_numpy())
score = clf.predict(data.to_numpy())   # predict returns +1 for inliers and -1 for outliers
score
Output:
array([ 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1,
1, -1, -1, -1])
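Here -1 marks a predicted anomaly: the forest flags rows 50 and 52-54 (four of the five true anomalies) but misses row 51 and raises false alarms on rows 6 and 27.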
Exercises
Applying Anomaly Detection Algorithms to a Dataset
Load the dataset into a pandas DataFrame object
Note: the last column (6) corresponds to the true class of each data point (0: normal, 1: anomaly)
import pandas as pd
data = pd.read_csv('mammography.csv', header=None)
data.head()
Output:
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | -0.5378 | -0.3640 | 5.3141 | -0.2190 | 0.2685 | 1.1080 | 0 |
| 1 | -0.7844 | -0.4702 | -0.5916 | -0.8596 | -0.3779 | -0.9457 | 0 |
| 2 | -0.7844 | -0.4702 | -0.5916 | -0.8596 | -0.3779 | -0.9457 | 0 |
| 3 | 0.3804 | -0.0278 | -0.4113 | 0.7261 | 3.5478 | 1.2421 | 0 |
| 4 | -0.7844 | -0.4702 | -0.5916 | -0.8596 | -0.3779 | -0.9457 | 0 |
Draw a scatter plot based on the 4th and 5th column in the DataFrame.
import matplotlib.pyplot as plt
%matplotlib inline
data.plot.scatter(x=3, y=4, c=6, colormap='cool')   # color each point by its class (column 6)
Output: (scatter plot of column 3 vs. column 4, colored by class)
Extract the last column of the dataframe and store it in a pandas Series object named classes. Remove the last column from the dataframe. Count the number of data points that belong to each class. You should also display the size of the remaining dataframe to prove that you have removed the last column.
classes = data[6]
data = data.drop(columns=6)   # remove the class column, keeping only the features
print('Size of data:', data.shape)
print('Class distribution:')
print(classes.value_counts())
Output:
Size of data: (200, 6)
Class distribution:
0 195
1 5
Name: 6, dtype: int64
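Note the heavy class imbalance: only 5 of the 200 points (2.5%) are anomalies, which is why contamination=0.025 is a sensible setting for the isolation forest below.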
Anomaly Detection using Mahalanobis Distance
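The Mahalanobis distance measures how far each point lies from the data mean while accounting for the covariance between features, so the points with the largest distances are candidate anomalies. A minimal sketch of one possible solution (not a reference answer):

import numpy as np
from scipy.spatial import distance
# mean vector and inverse covariance matrix of the six feature columns
mu = data.values.mean(axis=0)
VI = np.linalg.inv(np.cov(data.values, rowvar=False))
# Mahalanobis distance of every data point from the mean
md = np.array([distance.mahalanobis(row, mu, VI) for row in data.values])
# rank the points by distance; the largest distances are the outlier candidates
outlier = np.flipud(np.argsort(md))
pd.DataFrame({'index': outlier, 'score': md[outlier]}).head()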
Anomaly Detection using Isolation Forest
- Apply the isolation forest method to detect outliers!!
from sklearn.ensemble import IsolationForest
clf = IsolationForest(n_estimators=200, max_samples=50, contamination=0.025, random_state=1)
clf.fit(data.values)
result = clf.predict(data.values)
# remap the predictions to match the class encoding (0: normal, 1: anomaly)
result[result == 1] = 0
result[result == -1] = 1
result = pd.DataFrame(result, columns=['predicted'])
result
Output:
|   | predicted |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| … | … |
| 195 | 0 |
| 196 | 1 |
| 197 | 1 |
| 198 | 1 |
| 199 | 1 |

200 rows × 1 columns
Now let's check the accuracy as well as the confusion matrix of the prediction results compared to the true class of each data point:
from sklearn.metrics import accuracy_score, confusion_matrix
print('Accuracy =', accuracy_score(classes, result))
cm = confusion_matrix(classes, result)
pd.DataFrame(cm)
Output:
Accuracy = 0.99
|   | 0 | 1 |
|---|---|---|
| 0 | 194 | 1 |
| 1 | 1 | 4 |
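Reading the matrix (rows: true class, columns: predicted class): 194 of the 195 normal points and 4 of the 5 anomalies are classified correctly, with one false alarm and one missed anomaly.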
Isolation Forest has the higher accuracy of the two approaches!!