
Anomaly detection is used in many applications. It is a commonly used technique for fraud detection. It is also used in manufacturing to detect faulty systems such as aircraft engines, and to identify failing medical devices or machines in a data center. You can read more about anomaly detection on Wikipedia.

In this blog post, I will implement an anomaly detection algorithm and apply it to detect failing servers on a network.
The data is from the famous Machine Learning Coursera course by Andrew Ng.
The lab exercises in that course are in Octave/Matlab.
I am planning to do all the programming exercises in that course in **R**, and I have started with anomaly detection.

First, we will start with 2D data and detect anomalous servers based on two features. We will plot the 2D data and see the algorithm's performance on a 2D plot. You can download the data for the first part of the exercise in RData format from here.

The features measure the throughput (mb/s) and response latency (ms) of each server. We have 307 measurements.

We will use a Gaussian (normal) model to detect anomalous examples in our dataset. Specifically, we fit a multivariate normal distribution and flag servers with very low probability density as anomalous (outliers). You can read about the multivariate normal distribution on Wikipedia.

The probability density function (PDF) of a multivariate normal is $$f(x)=(2\pi)^{-\frac{k}{2}} |\Sigma|^{-\frac{1}{2}}e^{-\frac{1}{2}(x-\mu)^{\prime} \Sigma^{-1}(x-\mu)}$$

$\Sigma$ is the covariance matrix, $|\Sigma|$ is its determinant, $\mu$ is the mean vector, and $k$ is the number of columns (variables) of our data.
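As a quick sanity check (a minimal sketch, not part of the original exercise; `mvn_pdf` is an illustrative name), the PDF above can be evaluated directly in base R. With a diagonal $\Sigma$ the density factorizes into a product of univariate normal densities:

```
# Evaluate the multivariate normal PDF at a point x,
# given mean vector mu and (invertible) covariance matrix Sigma
mvn_pdf <- function(x, mu, Sigma) {
  k <- length(mu)
  d <- x - mu
  as.numeric((2 * pi)^(-k / 2) * det(Sigma)^(-0.5) *
               exp(-0.5 * t(d) %*% solve(Sigma) %*% d))
}

mu <- c(0, 0)
Sigma <- diag(c(1, 4))  # diagonal covariance: variances 1 and 4
x <- c(1, 2)

# With a diagonal Sigma, the joint density equals the product
# of the univariate densities (sd = sqrt of each variance)
mvn_pdf(x, mu, Sigma)
dnorm(1, 0, 1) * dnorm(2, 0, 2)
```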

In [647]:

```
library(magrittr) # to use piping %>%
library(ggplot2) # for plotting
library(MASS) # to calculate the pseudo-inverse of a matrix
library(caret) # to center our data by subtracting its mean
library(reshape2) # for data manipulation
```

In [618]:

```
load("data1.RData")
```

In [619]:

```
#All the variables below are matrices
X=data1$X
Xval=data1$Xval # This is cross-validation data
yval =data1$yval # This shows which rows in Xval are anomalous
```

In [620]:

```
head(X)
```

In [621]:

```
head(Xval)
```

In [622]:

```
table(yval)
```

In [623]:

```
X=as.data.frame(X)
names(X)=c("Latency (ms)", "Throughput (mb/s)")
XX=melt(X)
```

In [624]:

```
XX%>%ggplot(aes(x=value,fill=variable, color=variable))+
geom_density(alpha = 0.3)+ggtitle('Distribution of X')
```

As we can see, a normal distribution is a good approximation for our dataset, and hence we do not need to perform any transformations.
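Had a feature been heavily skewed, a simple transformation (such as a log) could make it closer to Gaussian before fitting the model. A small sketch with simulated data (`skewed` and the sample `skewness` helper are illustrative, not from the course code):

```
set.seed(1)
skewed <- rexp(1000, rate = 1)  # right-skewed simulated feature

# A log transform often makes right-skewed data closer to normal
transformed <- log(skewed + 1)

# Sample skewness: third standardized moment
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skewness(skewed)       # clearly positive (right-skewed)
skewness(transformed)  # much closer to 0 after the transform
```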

Next, let's see the distribution of the cross-validation data (Xval).

In [625]:

```
Xval=as.data.frame(Xval)
names(Xval)=c("Latency (ms)", "Throughput (mb/s)")
XXval = melt(Xval)
XXval%>%ggplot(aes(x=value,fill=variable, color=variable))+
geom_density(alpha = 0.3)+ggtitle('Distribution of Xval')
```

Now, let's create a 2D plot and see the distribution of Latency and Throughput.

In [628]:

```
X%>%ggplot(aes(x=`Latency (ms)`,y=`Throughput (mb/s)`))+
geom_point(color='blue')
```

Again, the probability density function (PDF) is given by $$f(x)=(2\pi)^{-\frac{k}{2}} |\Sigma|^{-\frac{1}{2}}e^{-\frac{1}{2}(x-\mu)^{\prime} \Sigma^{-1}(x-\mu)}$$

$\Sigma$ is the covariance matrix and $|\Sigma|$ is its determinant.

We will use the `caret` package to subtract the column means from the data points. This is called centering.
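For reference, the same centering can be done in base R with `scale()` (a minimal sketch on a toy matrix `m`, not the course data):

```
m <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2)

# scale() with scale = FALSE subtracts each column's mean
centered <- scale(m, center = TRUE, scale = FALSE)

# Equivalent manual centering with sweep()
manual <- sweep(m, 2, colMeans(m), "-")

all.equal(as.vector(centered), as.vector(manual))  # TRUE
colMeans(centered)  # both columns now have mean 0
```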

In [630]:

```
# Create preProcess object
preObj <- preProcess(X,method="center")
# Center the data- subtract the column means from the data points
X2 <- predict(preObj,X)
```

Now, let's calculate the variance of each column and arrange the variances in a diagonal matrix.

In [631]:

```
X2= as.matrix(X2)
sigma2=diag(var(X2)) # extract the variances of each column
sigma2
```

In [632]:

```
sigma2=diag(sigma2) # arrange the variances in a diagonal matrix
sigma2
```

In [633]:

```
A=(2*pi)^(-ncol(X2)/2)*det(sigma2)^(-0.5)
B = exp(-0.5 *rowSums((X2%*%ginv(sigma2))*X2))
p=A*B
```

In [687]:

```
p= p%>%as.data.frame()
names(p)= c('probability')
p%>%ggplot(aes(probability))+geom_density(fill='skyblue')+
ggtitle('Distribution of calculated probabilities')
```

From the density plot above, we see there are some computer servers with very low probabilities which could be outliers.

We can also plot a box plot of the probabilities. Here, I am also showing the individual points, which helps us see the probability of each server.

In [635]:

```
p%>%ggplot(aes(y=probability,x=1))+geom_boxplot()+
geom_jitter()+xlab('')+ggtitle("Box plot of calculated probabilities")
```

In [637]:

```
X= cbind(X,p)
ggplot(X, aes(x=`Latency (ms)`, y=`Throughput (mb/s)`, z=`probability`))+
geom_point()+ stat_density2d(color='red')
```

Recall that in `data1.RData`, **Xval** is the validation data and **yval** is the actual label of each validation example (whether the server was anomalous or not). We will use the **F1 score** to choose the threshold probability and to assess the accuracy of our model.

First, let's remove the mean of each column from Xval.

In [638]:

```
# Create preProcess object
Xval = as.data.frame(Xval)
preObj <- preProcess(Xval,method="center")
# Center the data- subtract the column means from the data points
Xval_centered <- predict(preObj,Xval)
```

Then, calculate its variance.

In [639]:

```
Xval_centered = as.matrix(Xval_centered)
sigma2=diag(var(Xval_centered))
sigma2 = diag(sigma2)
sigma2
```

Now, let's calculate **pval** of the cross-validation data **Xval**.

In [640]:

```
A=(2*pi)^(-ncol(Xval_centered)/2)*det(sigma2)^(-0.5)
B = exp(-0.5 *rowSums((Xval_centered%*%ginv(sigma2))*Xval_centered))
pval = A*B
```

The threshold is chosen using the `F1 score`, which is computed from `precision` and `recall`. Precision and recall, in turn, are calculated from the counts of `true positive`, `false positive` and `false negative` predictions, as defined below.

**tp** is the number of true positives: the ground truth label says it's an anomaly and our algorithm correctly classified it as an anomaly.

**fp** is the number of false positives: the ground truth label says it's not an anomaly, but our algorithm incorrectly classified it as an anomaly.

**fn** is the number of false negatives: the ground truth label says it's an anomaly, but our algorithm incorrectly classified it as not being anomalous.

The F1 score is computed using precision (prec) and recall (rec):

$$prec = \frac{tp}{tp+fp}$$

$$rec = \frac{tp}{tp+fn}$$

$$F1 = 2\cdot\frac{prec \cdot rec}{prec+rec}$$
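These formulas can be wrapped in a small helper (a sketch with hypothetical toy counts; `f1_score` is an illustrative name, not from the course code):

```
# F1 score from true positives, false positives and false negatives
f1_score <- function(tp, fp, fn) {
  prec <- tp / (tp + fp)
  rec  <- tp / (tp + fn)
  2 * prec * rec / (prec + rec)
}

# Toy example: 8 true positives, 2 false positives, 4 false negatives
# prec = 8/10 = 0.8, rec = 8/12 = 2/3, so F1 = 8/11
f1_score(8, 2, 4)
```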

Now, let's find the threshold probability that maximizes the `F1 score` through cross-validation.

In [641]:

```
bestEpsilon = 0
bestF1 = 0
F1 = 0
stepsize = (max(pval) - min(pval)) / 1000
for (epsilon in seq(from = min(pval), to = max(pval), by = stepsize)){
  predictions = (pval < epsilon) * 1
  tp = sum((predictions == 1) & (yval == 1))
  fp = sum((predictions == 1) & (yval == 0))
  fn = sum((predictions == 0) & (yval == 1))
  prec = tp / (tp + fp)
  rec = tp / (tp + fn)
  F1 = (2 * prec * rec) / (prec + rec)
  if (!is.na(F1) && F1 > bestF1){
    bestF1 = F1
    bestEpsilon = epsilon
  }
}
cat("\n bestF1 =", round(bestF1, 4))
cat("\n bestEpsilon =", round(bestEpsilon, 4))
```

The `best F1 score` from cross-validation is 0.875. We can use `bestEpsilon` above as the cut-off to classify the servers as anomalous or normal.

In [642]:

```
X$outliers= X$probability < bestEpsilon
X$outliers=as.factor(X$outliers)
head(X)
```

Let's see how many outliers we have according to our model.

In [643]:

```
table(X$outliers)
```

We see that we have six outliers and 301 normal servers.

Let's visualize which servers are outliers.

In [646]:

```
X%>%ggplot(aes(x=`Latency (ms)`,y=`Throughput (mb/s)`))+
geom_point(aes(color=outliers))+ggtitle('Anomaly Detection')
```

Next, let's apply the algorithm to a multidimensional dataset (`data2.RData`) with 11 features.

In [664]:

```
load("data2.RData")
```

In [665]:

```
X=data2$X # has 11 columns
Xval=data2$Xval # This is cross-validation data
yval =data2$yval # This shows which rows in Xval are anomalous
```

First, let's explore the distribution of the variables.

In [666]:

```
X%>%as.data.frame()%>%melt()%>%
ggplot(aes(x=value,color=variable))+geom_density()
```

In [667]:

```
cat("Number of observations and variables of X")
dim(X)
```

In [668]:

```
cat("Number of observations and variables of the cross-validation data")
dim(Xval)
```

In [669]:

```
cat("Labels of cross-validation data")
cat("\ny=0 corresponds to normal servers and y=1 corresponds to anomalous servers")
dim(yval)
```

The Xval measures are also nearly normally distributed as shown below.

In [670]:

```
Xval%>%as.data.frame()%>%melt()%>%
ggplot(aes(x=value,color=variable))+geom_density()
```

Now, following the steps we used above, let's calculate probabilities for X and Xval.

In [671]:

```
# Create preProcess object
X = as.data.frame(X)
preObj <- preProcess(X,method="center")
# Center the data- subtract the column means from the data points
X2 <- predict(preObj,X)
```

In [672]:

```
sigma2=cov(X2)
sigma2=diag(sigma2) # extract the variances of each column
sigma2=diag(sigma2) # arrange them in a diagonal matrix
```

Calculate probability for X:

In [515]:

```
X2= as.matrix(X2)
A=(2*pi)^(-ncol(X2)/2)*det(sigma2)^(-0.5)
B = exp(-0.5 *rowSums((X2%*%ginv(sigma2))*X2))
p = A*B
```

Next, let's calculate probabilities for the cross-validation data (Xval).

In [496]:

```
# Create preProcess object
Xval = as.data.frame(Xval)
preObj <- preProcess(Xval,method="center")
# Center the data- subtract the column means from the data points
Xval_centered <- predict(preObj,Xval)
```

In [499]:

```
Xval_centered = as.matrix(Xval_centered)
sigma2=diag(var(Xval_centered))
sigma2= diag(sigma2)
```

In [500]:

```
A=(2*pi)^(-ncol(Xval_centered)/2)*det(sigma2)^(-0.5)
B = exp(-0.5 *rowSums((Xval_centered%*%ginv(sigma2))*Xval_centered))
pval = A*B
```

Now, we can determine the best probability threshold and the associated `F1 score`.

In [502]:

```
bestEpsilon = 0
bestF1 = 0
F1 = 0
stepsize = (max(pval) - min(pval)) / 1000
for (epsilon in seq(from = min(pval), to = max(pval), by = stepsize)){
  predictions = (pval < epsilon) * 1
  tp = sum((predictions == 1) & (yval == 1))
  fp = sum((predictions == 1) & (yval == 0))
  fn = sum((predictions == 0) & (yval == 1))
  prec = tp / (tp + fp)
  rec = tp / (tp + fn)
  F1 = (2 * prec * rec) / (prec + rec)
  if (!is.na(F1) && F1 > bestF1){
    bestF1 = F1
    bestEpsilon = epsilon
  }
}
cat("\n bestF1 =", bestF1)
cat("\n bestEpsilon =", bestEpsilon)
```

Finally, let's calculate the number of anomalous and normal servers.

In [518]:

```
probability=p
X=cbind(X,probability)
X$outliers= X$probability < bestEpsilon
X$outliers=as.factor(X$outliers)
```

In [521]:

```
table(X$outliers)
```

This is the **R** version of a machine learning programming assignment that is given in **Octave/Matlab** in the Coursera course offered by Andrew Ng.