When you’re starting out with Data Science and building your skills, sooner or later you’ll want to play with Anomaly Detection, and the first thing you’ll notice when you look it up is the sheer number of ways you can begin exploring the topic. We can name some of the general approaches to Anomaly Detection by category (or complexity):
- Basic exploratory work. Here we see familiar examples like the histogram, or the convenient box plot.
- Statistical tools (also rooted in exploratory analysis). The interquartile range and the z-score are well-known examples.
- Machine learning. You will find both supervised and unsupervised algorithms. A common example is k-means clustering.
- Deep learning and neural networks. Autoencoders are a typical example of such an algorithm.
By now, we are buried in choices and we realize we have a long road ahead of us, and we haven’t even begun talking about what anomalies are, or what kind of anomalies we’re interested in.
As a fellow DBA, I can begin by stating that the main problem is having too many choices for learning: our end goal might not be to master every tool, but to master the ones we can easily use. I can certainly apply Anomaly Detection tools and techniques to Capacity Planning, or to general monitoring of my applications, but we need to narrow our scope.
In this article, I’ll guide you through another common approach to Anomaly Detection. The first installment of this series covered the usage of box plots and interquartile ranges:
https://www.idug.org/p/bl/ar/blogaid=950
On this occasion, we will explore another very common and flexible approach, known as the Z-Score (and no, it doesn’t stand for Mainframe score). We will play with a bit of Python first, and afterwards we will use pure SQL for the same analysis.
The data
The data will have to be numeric, and preferably with no missing values. If you do have missing values, there are two quick ways to deal with them:
- Substitute the average of the column for each missing value.
- Carry the previous value forward. This approach is more often seen in time series analysis.
It is also important to note that zero values affect the outcome, so inserting zeroes as substitutes for missing values is not recommended.
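As a quick sketch of both options (assuming a Pandas DataFrame with a single numeric column named y, with made-up values):

```python
import pandas as pd

# A tiny illustrative frame with one missing value (made-up data)
df = pd.DataFrame({'y': [220.0, None, 218.0, 220.0]})

# Option 1: substitute the column average for missing values
df_mean = df.fillna(df['y'].mean())

# Option 2: carry the previous value forward (common in time series)
df_ffill = df.ffill()
```

Note that ffill() simply repeats the last known observation, which tends to preserve the shape of a time series better than the global average does.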
In my example, I will take the same data used in the article mentioned above, so that you can compare the results on your own. It represents a listing of processor usage from another environment, which I loaded into a Db2 table. It contains 467 observations in total; whenever I display data in this article, I will print only the first records (10 at most).
Introducing the Z-Score
Also known as the “standard score”, the Z-Score describes any observation (any data point) in terms of its relationship to the mean and standard deviation of the group of observations. It is quite a simple calculation that we can run anywhere:

Z-Score = (x - mean) / standard deviation
The list of steps is as follows:
- Calculate the Z-Score for every observation/data point.
- Establish upper and lower thresholds. More on this later.
- Evaluate each observation against the thresholds:
  - Values with a Z-Score below the lower threshold will be considered negative outliers.
  - Values with a Z-Score above the upper threshold will be considered positive outliers.
From this, we can already notice a few points:
- We will have to evaluate all points one by one.
- The thresholds have to be meaningful to us; otherwise the evaluation will not provide any value.
- We will end up with three categories of observations: normal, negative outlier and positive outlier.
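The steps above can be sketched in a few lines of plain Python (made-up numbers and a hypothetical threshold of ±1.5; no libraries required):

```python
# Made-up sample data standing in for real observations
data = [220, 220, 218, 220, 347, 50, 400]

mean = sum(data) / len(data)
# Population standard deviation (the same convention scipy.stats.zscore uses)
std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5

def label(x, lower=-1.5, upper=1.5):
    """Step 1: compute the z-score; steps 2-3: compare against thresholds."""
    z = (x - mean) / std
    if z < lower:
        return 'negative outlier'
    if z > upper:
        return 'positive outlier'
    return 'normal'

labels = [label(x) for x in data]
```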
Working the example
First, I will show how you can quickly play with the Z-Score in Python. I prefer to use the Anaconda distribution, since it contains many of the libraries you’ll need for further exploration. In this case, we require the following three libraries:
- Pandas. We will use it to import the data into a DataFrame.
- SciPy. It provides a Z-Score function that does the job in only one line.
- (Optional) scikit-learn. It provides scaling functions, and you can go on to study Machine Learning with this library.
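As a side note on that optional third library: scikit-learn’s StandardScaler computes the same standardized values as scipy.stats.zscore, since both subtract the mean and divide by the (population) standard deviation. A small sketch with made-up numbers:

```python
import numpy as np
from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler

y = np.array([220.0, 220.0, 218.0, 220.0, 347.0])  # made-up sample

z_scipy = zscore(y)
z_sklearn = StandardScaler().fit_transform(y.reshape(-1, 1)).ravel()

# Both approaches agree element by element
same = np.allclose(z_scipy, z_sklearn)
```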
The flow is very simple, but we will do it step by step here:
Collect the data into a Pandas DataFrame
# read the data
import pandas as pd
df = pd.read_csv('VMStats.csv')  # I'm reading a CSV file here
df.head()
We will obtain the following table, with only one column (showing only the first 5 observations here):

     y
0  220
1  220
2  218
3  220
4  347
Now we will calculate the Z-Score. Even though it’s a simple formula, the SciPy library gives us a function to apply it over the whole column at once:
# Now we use zscore on each entry
from scipy.stats import zscore
zscore_y = zscore(df['y'])
df['zscore_y'] = zscore_y
df.head(10)
I’ve also attached each Z-Score to its corresponding observation.

     y  zscore_y
0  220  0.354180
1  220  0.354180
2  218  0.331935
3  220  0.354180
4  347  1.766734
5  289  1.121631
6  217  0.320813
7  223  0.387548
8  316  1.421938
9  329  1.566530
The next step will be to evaluate the results before we establish our thresholds. For my particular example, the lowest Z-Score was -1.87 and the highest Z-Score was 1.84.
print('The minimum ZScore found was {}'.format(min(df['zscore_y'])))
print('The maximum ZScore found was {}'.format(max(df['zscore_y'])))
The minimum ZScore found was -1.870313626490579
The maximum ZScore found was 1.8445914255634075
The real magic comes from business understanding, or in this case, from you. Setting thresholds depends completely on domain knowledge, so we will need to inspect the data, find meaningful boundaries we can set, and then determine the Z-Score at those points.
This is how you can print all the results back to a CSV file (original values and their Z-Scores):
df.to_csv('Stats_results.csv')
And then you will need to inspect the data to find meaningful thresholds and locate patterns (if any).
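If you prefer to stay in Python for that inspection, sorting by the z-score and summarizing its distribution gets you to the candidate boundaries quickly. A self-contained sketch with made-up data (in the article, df holds the real CPU readings):

```python
import pandas as pd
from scipy.stats import zscore

# Made-up sample standing in for the CPU data loaded earlier
df = pd.DataFrame({'y': [220, 220, 218, 220, 347, 50, 400, 289]})
df['zscore_y'] = zscore(df['y'])

# Summary of the z-score distribution
print(df['zscore_y'].describe())

# The most extreme observations at each end are the first candidates
# to review against your domain knowledge when choosing thresholds
ordered = df.sort_values('zscore_y')
print(ordered.head(3))   # lowest z-scores
print(ordered.tail(3))   # highest z-scores
```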
For my example, based on my knowledge of the environment the data came from, I identified the following thresholds, and I will show the Z-Score for each of them. (I'm displaying the range of values that marks the "line" from normal to abnormal, for both the lower margin and the upper margin.)
Value    ZScore
   44   -1.60337
   45   -1.59225
   47   -1.57001
   48   -1.55888
   49   -1.54776
   50   -1.53664
  306    1.310713
  316    1.421938
  324    1.510917
  329    1.56653
  332    1.599897
  342    1.711122
You might have guessed already:
- For lower anomalies, anything less than or equal to 50 will be considered anomalous, so the lower Z-Score threshold will be set to -1.5.
- For upper anomalies, anything greater than or equal to 300 will be considered anomalous, so the upper Z-Score threshold will be set to 1.5.
Now I’ll try it out by assigning labels and counting how many observations we have for each label:
df['eval'] = 'normal value'
df.loc[ df['zscore_y'] <= -1.5, 'eval'] = 'negative_anomaly'
df.loc[ df['zscore_y'] >= 1.5, 'eval'] = 'positive_anomaly'
df.groupby(['eval']).count()
eval              count
negative_anomaly     92
normal value        359
positive_anomaly     16
(And that gives me the total of my 467 observations)
It is important to note that anomalies have to be meaningful to the application and to the people looking at the numbers. Setting a proper threshold is therefore the most crucial step, and it should be reviewed to determine whether the evaluation returns sensible groups or the thresholds need to be readjusted.
The Z-Score in SQL
If you have made it this far, you’re ready to do this same evaluation in SQL. The previous analysis was necessary to get to know our data and establish meaningful thresholds. Once we have done that homework, as you may have guessed, it’s a matter of a simple calculation and a CASE expression:
with zscore_params as (
    select avg(cpu) as mean, stddev(cpu) as std
    from <schema>.tbanx2  -- I'm just saving the mean and std
)
select cpu,
       case
           when ( (cpu - mean) / std ) > 1.5 then 'pos_outlier'  -- the same thresholds found earlier in Python
           when ( (cpu - mean) / std ) < -1.5 then 'neg_outlier'
           else 'inlier'
       end as zscore_pred
from <schema>.tbanx2, zscore_params;
Of course, there are many ways to perform this same evaluation:
- Using this same CTE.
- Using a Stored Procedure.
- Using an external User Defined Function.
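One design alternative worth mentioning: since z = (x - mean) / std, a z-score threshold of ±1.5 can be translated back into raw CPU units as mean ± 1.5 × std, which would let the SQL compare CPU directly against two precomputed constants instead of recomputing the z-score for every row. A sketch with made-up statistics (take the real mean and standard deviation from your own table):

```python
# Hypothetical values; in practice, query these from your own data
mean, std = 188.0, 90.0
z_threshold = 1.5

upper_bound = mean + z_threshold * std  # CPU above this -> positive outlier
lower_bound = mean - z_threshold * std  # CPU below this -> negative outlier
```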
Closing thoughts
From here, you can expand this simple approach and explore new worlds, such as:
- Multivariate Anomaly Detection.
- Anomaly Detection for streaming data.
- Anomaly Detection in Time Series Analysis.
There are numerous ways to do Anomaly Detection, and it can even be considered a branch of study in its own right. But as you have seen, many statistical tools rely on simple calculations that you can execute anywhere, and some basic knowledge of another tool, such as Python, can help you improve your data analysis skills.