Anomaly Detection in Python and SQL (part 2)

When you are starting out in Data Science and building your skills, sooner or later you will want to play with Anomaly Detection, and the first thing you will notice when you look it up is the sheer number of ways to begin exploring the topic. We can group the general approaches to Anomaly Detection by category (or complexity):

  • Basic exploratory work. Here we see familiar examples like a histogram or a convenient box plot.
  • Statistical tools (often overlapping with exploratory analysis). The interquartile range and the Z-Score are well-known examples.
  • Machine learning. You will find both supervised and unsupervised algorithms. A common example is k-means clustering.
  • Deep learning and neural networks. Autoencoders are a typical example of such an algorithm.

 

By now, we are buried in choices and we realize we have a long road ahead of us, and we haven’t even begun talking about what anomalies are, or what kind of anomalies we’re interested in.

As a fellow DBA, I can begin by stating that the main problem is having too many choices to learn from; our end goal may not be to master every tool, but rather the ones we can easily put to use. I can certainly use Anomaly Detection tools and techniques for capacity planning or general monitoring of my applications, but we need to narrow our scope.

In this article, I’ll guide you through another common approach to Anomaly Detection. The first installment, which covers box plots and the interquartile range, can be found here:

https://www.idug.org/p/bl/ar/blogaid=950

On this occasion, we will explore another very common and flexible approach, known as Z-Score (and no, it doesn’t stand for Mainframe score). We will play with a bit of Python first, and afterwards we will use pure SQL for the same analysis.

The data

The data will have to be numeric, and preferably with no missing values. If you do have missing values, there are two quick ways to deal with this:

  • Substitute the average value of the column.
  • Carry the previous value forward. This is more often seen in time series analysis.

It is also important to note that zeroes will affect the outcome, so it is not recommended to substitute missing values with zeroes.
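As a quick illustration, here is a minimal pandas sketch of both options (the sample values and the column name y are made up for illustration; y happens to be the column name used later in this article):

# hypothetical sketch: two quick ways to fill a gap in a numeric column named y
import pandas as pd

df = pd.DataFrame({'y': [220, 220, None, 218, 347]})   # made-up sample with one missing value

df['y_mean_filled'] = df['y'].fillna(df['y'].mean())   # option 1: substitute the column average
df['y_ffilled'] = df['y'].ffill()                      # option 2: carry the previous value forward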

In my example, I will take the same data from the article mentioned above, so that you can compare the results on your own. It represents a listing of processor usage from another environment, which I loaded into a Db2 table. It contains 467 observations in total, and whenever I display data in this article I will only show the first records (10 at most).

 

Introducing Z-Score

Also known as the “standard score”, the Z-Score describes any observation (any data point) in terms of its relationship to the mean and standard deviation of the group of observations. It is quite a simple calculation that we can run from anywhere:

Z-Score = (x - mean) / standard deviation
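To make the formula concrete, here is a minimal hand-rolled sketch with NumPy (the sample values are made up for illustration):

import numpy as np

data = np.array([220.0, 220.0, 218.0, 220.0, 347.0])   # hypothetical sample
mean = data.mean()
std = data.std()    # population standard deviation (ddof=0), the same default SciPy's zscore uses

zscores = (data - mean) / std
print(zscores)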

The list of steps is as follows:

  • Calculate the Z-Score for every observation/data point.
  • Establish upper and lower thresholds. More on this later.
  • Evaluate each point against the thresholds:
    • Values with a Z-Score below the lower threshold are considered negative outliers.
    • Values with a Z-Score above the upper threshold are considered positive outliers.

Therefore, we have noticed a few points already:

  • We will have to evaluate all points one by one.
  • Thresholds will have to be meaningful to us; otherwise the evaluation will not provide any value.
  • We will end up with three categories of observations: normal, negative outlier and positive outlier.
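As a tiny sketch of that last evaluation step (the function name is hypothetical, and the thresholds are placeholders until we pick real ones later):

def label_point(z, lower=-1.5, upper=1.5):
    # classify a single observation by its Z-Score
    if z < lower:
        return 'negative outlier'
    if z > upper:
        return 'positive outlier'
    return 'normal'

print(label_point(-2.1))   # negative outlier
print(label_point(0.3))    # normal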

 

Working the example

First, I will show how you can quickly play with Z-Score in Python. I prefer to use the Anaconda distribution since it contains many libraries that you’ll need for further play. In this case, we require the following 3 libraries:

  • Pandas. We will use it to import the data into a DataFrame.
  • SciPy. It provides a Z-Score function that does the job in a single line.
  • (Optional) Scikit-learn. It provides scaling functions, and you can use it to go further into Machine Learning; a brief example appears after the Z-Score results below.

The flow is very simple, but we will do it step by step here:

Collect the data into a Pandas DataFrame

 

# read the data
import pandas as pd
df = pd.read_csv('VMStats.csv')  # I'm reading a CSV file here
df.head()

 

We will obtain the following table, with only one column (showing only the first 5 observations here):

 

     y
0  220
1  220
2  218
3  220
4  347
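Before calculating anything, it doesn't hurt to confirm that there are no missing values (see the note in the data section above); a quick sketch of that check could be:

print(df.isna().sum())   # count of missing values per column; we expect zero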

 

Now, we will calculate the Z-Score. Even though it’s a simple formula, the SciPy library gives us a function to apply it over the whole column at once:

# Now we use zscore on each entry
from scipy.stats import zscore
zscore_y = zscore(df['y'])
df['zscore_y'] = zscore_y
df.head(10)

 

I’ve also attached the Z-Score to its corresponding observation.

 

     y  zscore_y
0  220  0.354180
1  220  0.354180
2  218  0.331935
3  220  0.354180
4  347  1.766734
5  289  1.121631
6  217  0.320813
7  223  0.387548
8  316  1.421938
9  329  1.566530
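If you prefer the optional scikit-learn route mentioned earlier, its StandardScaler performs the same standardization (it also divides by the population standard deviation), so the resulting column should match the SciPy one. This is only a sketch and is not required for the rest of the article:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df['zscore_sklearn'] = scaler.fit_transform(df[['y']]).ravel()   # same values as df['zscore_y']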

 

The next step will be to evaluate the results before we establish our thresholds. For my particular example, the lowest Z-Score was -1.87 and the highest Z-Score was 1.84.

print('The minimum Z-Score found was {}'.format(min(df['zscore_y'])))
print('The maximum Z-Score found was {}'.format(max(df['zscore_y'])))

The minimum Z-Score found was -1.870313626490579

The maximum Z-Score found was 1.8445914255634075

 

The real magic comes from business understanding, or in this case, from you. Setting thresholds depends entirely on domain knowledge, so we will need to inspect the data, find meaningful boundaries, and then determine the Z-Score at those points.

This is how you can write all the results back to a CSV file (original values and their Z-Scores):

df.to_csv('Stats_results.csv')

 And then you will need to inspect the data to find meaningful thresholds and locate patterns (if any).
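One way to do that inspection in pandas (just a sketch, assuming the same DataFrame) is to look at the overall distribution and at the sorted extremes, where the boundary between normal and abnormal tends to show up:

print(df['y'].describe())                    # overall distribution summary

ordered = df[['y', 'zscore_y']].sort_values('y')
print(ordered.head(15))                      # lowest values and their Z-Scores
print(ordered.tail(15))                      # highest values and their Z-Scores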

For my example, based on my knowledge of the environment the data came from, these are the values I consider to be the boundaries, along with their Z-Scores:

(I'm displaying the range of values that marks the "line" between normal and abnormal, for both the lower and the upper margin.)

Value   Z-Score
   44  -1.60337
   45  -1.59225
   47  -1.57001
   48  -1.55888
   49  -1.54776
   50  -1.53664
  306   1.310713
  316   1.421938
  324   1.510917
  329   1.56653
  332   1.599897
  342   1.711122

 

You might have guessed already:

  • For lower anomalies, anything less than or equal to 50 will be considered anomalous, and the Z-Score threshold will be set to -1.5.
  • For upper anomalies, anything greater than or equal to 300 will be considered anomalous, and the Z-Score threshold will be set to 1.5.

Now I’ll try it out by assigning labels and counting how many observations we have for each label:

df['eval'] = 'normal value'   # start with everything labeled as normal, then override the outliers
df.loc[df['zscore_y'] <= -1.5, 'eval'] = 'negative_anomaly'
df.loc[df['zscore_y'] >= 1.5, 'eval'] = 'positive_anomaly'
df.groupby(['eval']).count()

 

eval               count
negative_anomaly      92
normal value         359
positive_anomaly      16

(And that gives me the total of my 467 observations)
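A quick sanity check (again, just a sketch) to confirm the three groups add up to the full data set:

counts = df['eval'].value_counts()
print(counts)
print('Total observations:', counts.sum())   # should equal len(df), 467 in this case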

 

It is important to note that anomalies have to be meaningful to the application and to the people looking at the numbers. Setting a proper threshold is therefore the most crucial step, and it should be reviewed to determine whether the evaluation returns sensible groups or whether the thresholds need to be readjusted.

 

Z-Score in SQL

If you have made it this far, you’re ready to do this same evaluation in SQL. The previous analysis was necessary to get to know our data and establish meaningful thresholds. Once that homework is done, as you may have guessed, it’s just a matter of a simple calculation and a CASE expression:

 

with zscore_params as (
  select avg(cpu) as mean, stddev(cpu) as std
  from <schema>.tbanx2                                  -- compute the mean and standard deviation once
)
select cpu,
       case
         when (cpu - mean) / std >  1.5 then 'pos_outlier'   -- same thresholds we found earlier in Python
         when (cpu - mean) / std < -1.5 then 'neg_outlier'
         else 'inlier'
       end as zscore_pred
from <schema>.tbanx2, zscore_params;

 

Of course, there are many ways to perform this same evaluation:

  • Using this same CTE.
  • Using a Stored Procedure.
  • Using an external User Defined Function.

 

Closing thoughts

From here, you can expand this simple approach and explore new worlds, such as:

  • Doing multivariate Anomaly Detection.
  • Anomaly Detection for streaming data.
  • Anomaly Detection in Time Series Analysis.

There are numerous ways to do Anomaly Detection, and it can even be considered its own branch of study, but as you have seen, many statistical tools rely on simple calculations that you can execute anywhere, and a little knowledge of other tools, such as Python or another language, can help you improve your data analysis skills.
