Anomaly detection with SQL on Db2
Anomaly detection is perhaps one of the hottest sub-topics in analytics and data science. There are many ways to get started in these fields, and one of the first steps that both disciplines have in common is knowing your data, and we can't stress that enough. Nowadays we hear about anomaly detection in machine learning, deep learning, monitoring systems, and all kinds of data analysis. While we usually relate the topic to machine learning and advanced applications, there are several simpler ways to tackle it.
So what exactly is anomaly detection?
To put it simply, it is the process of detecting outliers in your data. It can be achieved with simple approaches, such as applying common sense and business knowledge, or with statistical tools.
What we need for this case
We don't really need to be statisticians. In this article we'll explore one way to do anomaly detection with pure SQL in Db2, using built-in functions that became available in Db2 12 for z/OS and Db2 11.1 for LUW.
The basics of anomaly detection
Before we get to the SQL, I'm going to give you a brief introduction to one of the techniques for outlier detection.
Given the broad range of applications, this article/tutorial can be followed with any kind of continuous data (it will also work for discrete distributions, as long as the values are numerical, never categorical). In this article I'll be using CPU values from another environment, which I previously loaded into a Db2 table. For you, this could be integer data in one of your existing tables, or any data you wish to report on: CPU usage, time spent in an application, number of connections, and so on. This breadth of application is part of what makes it such a hot topic.
Introducing box plots
Box plots are a very useful visualization in descriptive statistics, designed to display the distribution of a set of measurements. To better understand my data, I generated a box plot for it.
A box plot packs a lot of information into one chart, which can be hard to explain in meetings, but it is very useful for a quick look at the data, and even more so when you group several box plots together.
To generate the plot, I used Python and the open-source visualization library plotly (I recommend you try it, since everything it generates is interactive).
As you can see from the labels and components, there are three key terms that we'll work with using SQL: quartiles, percentiles, and the IQR. But first, some short definitions:
Quartile (from Wikipedia): A type of quantile which divides the number of data points into four more or less equal parts.
Percentile (adapted from Wikipedia): The Xth percentile is the value below which X percent of the observations may be found. In my data, the 25th percentile is 153.5, and thus 25% of the observations fall below this number.
First quartile (Q1) = 25th percentile
Third quartile (Q3) = 75th percentile
IQR: The height of the box is what is called the "interquartile range". It is calculated as the difference between the 75th percentile and the 25th percentile (Q3 - Q1).
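As a quick illustration, the quartiles and the IQR can be computed in a few lines of Python. This is a sketch with hypothetical CPU values (not my actual data set); `statistics.quantiles` with `method='inclusive'` interpolates linearly between observations, much like PERCENTILE_CONT does:

```python
import statistics

# Hypothetical CPU measurements, standing in for the column in my Db2 table
cpu = [120, 135, 140, 150, 153, 160, 170, 175, 180, 900]

# quantiles(n=4) returns the three quartile cut points [Q1, Q2, Q3];
# method='inclusive' uses linear interpolation over the sorted data
q1, q2, q3 = statistics.quantiles(cpu, n=4, method='inclusive')
iqr = q3 - q1

print(q1, q3, iqr)  # 142.5 173.75 31.25
```

Note how the single large value (900) stretches the mean but barely moves the quartiles; this robustness is what makes the IQR attractive for outlier detection.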
Use domain/business knowledge
Before we go further, it's important to note that while there are many ways to do anomaly detection, there is no single standard for what counts as an anomaly, and the decision depends on domain knowledge and the goal or application of the analysis.
Anomaly detection using IQR
Using the IQR is a dispersion-based method: it detects outliers that fall far outside the IQR range. The formulas to detect an outlier (anomaly) are quite simple:
- If an observation is higher than the 75th percentile by more than 1.5 times the IQR, it is considered a positive outlier.
- Positive outlier threshold = Q3 + ( 1.5 * (Q3 - Q1) )
- If an observation is lower than the 25th percentile by more than 1.5 times the IQR, it is considered a negative outlier.
- Negative outlier threshold = Q1 - ( 1.5 * (Q3 - Q1) )
As mentioned before, the decision depends on business logic and domain knowledge. It's also possible for the cutoff line for negative outliers to be a negative number, which of course is impossible in my data. Therefore, using domain knowledge (that is, the number that would alert me or the operators), we can adjust the number of IQR deviations to use: the textbook 1.5 times the IQR, or even a fraction of the IQR. Some Python packages calculate outliers by default and display them in your chart as separate data points.
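The rule above, with an adjustable IQR multiplier, can be sketched as a small Python function. The quartile values in the calls below are hypothetical, and `k` is the fraction or multiple of the IQR you settle on with domain knowledge:

```python
def classify(value, q1, q3, k=1.5):
    """Label a value using the IQR rule; shrink k (e.g. to 0.5)
    when the standard 1.5 is too lenient for your domain."""
    iqr = q3 - q1
    if value > q3 + k * iqr:
        return 'pos_outlier'
    if value < q1 - k * iqr:
        return 'neg_outlier'
    return 'inlier'

# Hypothetical quartiles: Q1 = 142.5, Q3 = 173.75
print(classify(900, 142.5, 173.75))         # pos_outlier
print(classify(150, 142.5, 173.75))         # inlier
print(classify(130, 142.5, 173.75, k=0.1))  # neg_outlier under a tight margin
```

The last call shows the adjustment in action: with the standard k=1.5 the value 130 would be an inlier, but a tighter margin flags it.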
Anomaly detection by IQR using SQL
Now that we've covered the background, it's time to introduce two new built-in functions in Db2: PERCENTILE_CONT and PERCENTILE_DISC. Both functions return the value corresponding to the Xth percentile of a numeric column, with a key difference:
- PERCENTILE_CONT will treat the values as a continuous distribution and the resulting number may or may not be an actual value found in the observations.
- PERCENTILE_DISC will treat the values as a discrete distribution and the resulting number will always be an actual value found in the observations.
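The difference is easy to see with a tiny pure-Python sketch of the two behaviors. These are my own simplified implementations for illustration, not IBM code; they follow the usual definitions of the two functions:

```python
def percentile_disc(values, p):
    # Discrete: the smallest actual value whose cumulative
    # distribution reaches p (the PERCENTILE_DISC behavior)
    xs = sorted(values)
    n = len(xs)
    for i, x in enumerate(xs, start=1):
        if i / n >= p:
            return x

def percentile_cont(values, p):
    # Continuous: linear interpolation between the two
    # neighboring values (the PERCENTILE_CONT behavior)
    xs = sorted(values)
    rank = p * (len(xs) - 1)
    lo = int(rank)
    frac = rank - lo
    if lo + 1 < len(xs):
        return xs[lo] + frac * (xs[lo + 1] - xs[lo])
    return float(xs[lo])

data = [10, 20, 30, 40]
print(percentile_cont(data, 0.25))  # 17.5: not an actual observation
print(percentile_disc(data, 0.25))  # 10: always an actual observation
```

With the same input and the same percentile, the continuous version interpolates a value that is not in the data, while the discrete version always hands back a real observation.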
If Db2 finds at least one NULL among the values processed, you will get a warning (a basic data science principle: deal with missing values through imputation or by removing them).
Side note: What’s imputation?
In data science, we're first required to do a lot of feature engineering work before the actual analysis. Dealing with missing values is one of the first tasks performed, and imputation is a popular way of treating the data so that we don't have to delete those occurrences.
By imputing data, we replace the NULL values, typically with a statistic computed from the observed values, such as the mean, median, or mode of the column, or with a constant chosen using domain knowledge.
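For instance, median imputation can be sketched in a few lines of Python. This is a toy example with hypothetical readings, where None plays the role of a SQL NULL:

```python
import statistics

readings = [153, None, 160, 148, None, 170]

# Replace each missing value with the median of the observed values,
# which preserves the center of the distribution
observed = [r for r in readings if r is not None]
median = statistics.median(observed)
imputed = [median if r is None else r for r in readings]

print(imputed)  # [153, 156.5, 160, 148, 156.5, 170]
```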
An easy example, which we will use later, is to calculate both the Q1 and Q3 values from our values:
select PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY <column>) as percentile_25,
       PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY <column>) as percentile_75
from <table>
Q1 will take the argument 0.25 while Q3 will take the argument 0.75.
Extra note: If you execute this in Data Studio, the editor will highlight the statement as a syntax error, but you can safely ignore it.
SQL play for anomaly detection
The logic to follow will be very simple:
- Calculate the Q1 and Q3 values for our column of interest. Save the results as we will use them later.
- Calculate the lower and upper margins following the previously described formula. Using domain knowledge, you can either take the standard margin of the corresponding quartile and 1.5 times the IQR or adjust the fraction of IQR to use.
There are many ways to solve this, but for the sake of simplicity I will use a Common Table Expression to show you the concept.
We already have the percentile values from the previous example. Now we just need to save them in a CTE, and in the subsequent SELECT we will calculate and evaluate every processed value.
The complete statement is shown here:
with iqr as ( -- This is a CTE
    select percentile_25,
           percentile_75,
           (percentile_75 - percentile_25) as iqr
    from ( -- section that really gets the percentiles
        select PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY CPU) as percentile_25,
               PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY CPU) as percentile_75
        from <table>
    ) as p
) -- end of CTE definition
-- Now we select from the newly created CTE.
-- We are using a 1.5 IQR threshold initially, but
-- this MUST be adjusted according to the business rule.
select CPU, -- Show the observation
       case
           when float(CPU) >= percentile_75 + (iqr * 1.5) then 'pos_outlier'
           when float(CPU) <= percentile_25 - (iqr * 1.5) then 'neg_outlier'
           else 'inlier'
       end as prediction
from <table>, iqr
With this statement you should be able to label existing rows as inliers (values within the normal range), negative outliers, and positive outliers, or create your own thresholds for evaluation. After this, you can post-process the data in any way that suits your analysis goals:
- Just display on a query.
- Save and create a visualization. Even a standard pie-chart will do the job.
- Evaluate new values as they come into your application and create a prediction.
- Modify the existing table by adding an evaluation result for every row. In a very recent article here on IDUG, Brian Laube demonstrated a very creative use of DGTT and MERGE that can be used for this idea; the article is listed in the references below.
As I mentioned before, you must adjust the thresholds using domain knowledge. As I tested with other data sets (here I mean "data" in the generic sense, not to be confused with z/OS data sets), I saw that the lower margin for negative outliers was a negative number, which is nonsense in many cases. Here's a quick comparison using three different thresholds:
Note: The total number of observations is 467 for my data set.
| IQR fraction for evaluation | Number of inliers | Num. negative outliers | Num. positive outliers |
| --- | --- | --- | --- |
| 1.5 times the IQR (standard definition); e.g. negative outlier: Q1 - ( 1.5 * IQR ) | … | … | … |
| 1 IQR; e.g. negative outlier: Q1 - IQR | … | … | … |
| Half an IQR; e.g. negative outlier: Q1 - ( 0.5 * IQR ) | … | … | … |
Quick evaluation for this approach
Pros:
- Easy to apply and understand.
- Easy to adjust.
- We only needed SQL.
Cons:
- This analysis can only detect outliers at either extreme, not unusual values in between.
- Real-time prediction can be expensive: this CTE approach scans the full data set every time it runs, so it is not really suitable for real-time evaluation in a transactional application.
Doing analytics doesn't require Python programming; many of the tools we need are already within our grasp with simple SQL. Db2, on both the LUW and z/OS platforms, delivers new functionality that at first is not easy to spot in your applications or your everyday reporting, but sometimes we need just a small example or a small idea to get our hands on it and explore new possibilities. Today we learned that anomaly detection is not necessarily a machine learning problem. If you're still interested and want to see it as such, in a future article I will go over this same topic using machine learning with Isolation Forests for quick analysis and visualizations.
Gorrie, C. (2016, March 29). Three ways to detect outliers. Retrieved from http://colingorrie.github.io/outlier-detection.html.
Interquartile range. (2019, November 21). Retrieved from https://en.wikipedia.org/wiki/Interquartile_range.
Kirkman, T. W. (1996). Box Plot: Display of Distribution. Retrieved from http://www.physics.csbsju.edu/stats/box2.html.
Db2 LUW related documentation
PERCENTILE_DISC. (n.d.). Retrieved from https://www.ibm.com/support/knowledgecenter/SSEPGG_11.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0061838.html.
PERCENTILE_CONT. (n.d.). Retrieved from https://www.ibm.com/support/knowledgecenter/SSEPGG_11.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0061836.html.
Db2 for z/OS related documentation
PERCENTILE_DISC. (n.d.). Retrieved from https://www.ibm.com/support/knowledgecenter/en/SSEPEK_12.0.0/sqlref/src/tpc/db2z_bif_percentiledisc.html.
PERCENTILE_CONT. (n.d.). Retrieved from https://www.ibm.com/support/knowledgecenter/en/SSEPEK_12.0.0/sqlref/src/tpc/db2z_bif_percentilecont.html.
Laube, B. (n.d.). Db2 SQL for the DBA. Retrieved from https://www.idug.org/p/bl/ar/blogaid=943.