Anomaly detection with SQL on Db2

Anomaly detection is one of the hottest sub-topics in Analytics and Data Science. There are many ways to get started in these fields, and one of the first steps they have in common is knowing your data; we can’t stress that enough. Nowadays we hear about anomaly detection in Machine Learning, Deep Learning, monitoring systems, and all kinds of data analysis, and while we usually associate the topic with Machine Learning and advanced applications, there are several ways to tackle it.

So what exactly is anomaly detection?

To put it simply, it is the process of detecting outliers in your data. It can be achieved through approaches as simple as common sense backed by business knowledge, or with statistical tools.

What we need for this case

We don’t really need to be statisticians. In this article we’ll explore one way to do anomaly detection with pure SQL on Db2, using built-in functions that were made available in Db2 12 for z/OS and Db2 11.1 for LUW.

The basics of anomaly detection

Before we go into the SQL section, I’m going to give you a brief introduction into one of the techniques for outlier detection.

The data

Given the broad range of applications, this article/tutorial can be followed with any kind of continuous data (it will also work for discrete distributions, but the values must be numerical, never categorical). In this article I’ll be using CPU values from another environment, which I previously loaded into a Db2 table. For you, this could be integer data in one of your existing tables, or any data you wish to report on: CPU data, time spent in an application, number of connections, and so on. This breadth is part of why it is such a hot topic.

Introducing box plots

Box plots are a very useful visualization in descriptive statistics, designed to display the distribution of a set of measurements. To better understand my data, I’ll show you its box plot.

This chart packs a lot of information, which can be hard to explain in meetings, but it is very useful for a quick look at the data, and even more so when you group several box plots together.

[Figure: box plot of the CPU data (Box_plot_v2.png)]

To generate the plot, I used Python and the open source visualization library plotly (I recommend you try it, since everything it generates is interactive).

 

As you can see from the tags and components, there are three key terms that we’ll work with using SQL: quartiles, percentiles, and the IQR. But first, a short explanation:

Quartile (from Wikipedia): A type of quantile which divides the number of data points into four more or less equal parts.

Percentile (adapted from Wikipedia): The Xth percentile is the value below which X percent of the observations may be found. In my data, the 25th percentile is 153.5, meaning 25% of the observations fall below that value.

Therefore….

1st quartile (or Q1) = 25th percentile

3rd quartile (or Q3) = 75th percentile

IQR: The height of the box is what is called the “interquartile range”. It is calculated as the difference between the 75th percentile and the 25th percentile (or Q3 – Q1).

Use domain/business knowledge

Before we go further, it’s important to note that while there are many ways to do anomaly detection, there is no single standard for what counts as an anomaly; the decision depends on domain knowledge and on the goal or application of the analysis.

Anomaly detection using IQR

Using the IQR is a dispersion-based method, meaning it detects outliers that fall far outside the interquartile range. The formula to flag an outlier (anomaly) is quite simple:

  • If an observation is higher than the 75th percentile by a measure of 1.5 times the IQR, it is considered a positive outlier.
    • Positive outlier = Q3 + ( 1.5 * (Q3 – Q1) )
  • If an observation is lower than the 25th percentile by a measure of 1.5 times the IQR, it is considered a negative outlier.
    • Negative outlier = Q1 – ( 1.5 * (Q3 – Q1) )
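
To make the formula concrete, here is a quick worked example with hypothetical numbers (not my CPU data): if Q1 = 100 and Q3 = 200, then IQR = 200 – 100 = 100, so the positive outlier line sits at 200 + (1.5 * 100) = 350 and the negative outlier line at 100 – (1.5 * 100) = –50. Any observation above 350 or below –50 would be flagged.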

Considerations

As mentioned before, the decision depends on business logic and domain knowledge. It’s also possible that the cutoff line for a negative outlier is a negative number, which of course is impossible in my data. Therefore, using domain knowledge (that is, the number that would alert me or the operators), we can adjust the number of IQR deviations to use: the textbook definition of 1.5 times the IQR, or even a fraction of the IQR. Some Python packages even calculate outliers by default, and they can appear in your chart as separate data points.

 

Anomaly detection by IQR using SQL

Now that we’ve covered the background, it’s time to introduce two new built-in functions in Db2: PERCENTILE_CONT and PERCENTILE_DISC. Both functions return the value corresponding to the Xth percentile of a numeric column, with a key difference:

  • PERCENTILE_CONT will treat the values as a continuous distribution and the resulting number may or may not be an actual value found in the observations.
  • PERCENTILE_DISC will treat the values as a discrete distribution and the resulting number will always be an actual value found in the observations.
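
To see the difference in practice, here is a minimal sketch (reusing the placeholder schema/table and the CPU column from the examples below); with an even number of rows, PERCENTILE_CONT may return an interpolated median while PERCENTILE_DISC always returns one of the stored values:

select PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY CPU) as median_cont, -- may be interpolated
       PERCENTILE_DISC(0.50) WITHIN GROUP (ORDER BY CPU) as median_disc  -- always an actual value
  FROM <schema>.<table>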

If, among the values processed, Db2 finds at least one NULL value, you will get a warning (a Data Science principle: deal with missing values through imputation or by removing them).

Side note: What’s imputation?

In Data Science, a lot of feature engineering work is required before the actual analysis. Dealing with missing values is one of the first tasks performed, and imputation is a popular way of treating the data so that we don’t have to delete those occurrences.

By imputing data, we replace the NULL values by using one of the following options:

  • The value of the global average.
  • The value from the preceding observation.
  • The value from the following observation.
  • Another meaningful value, so that all missing values automatically fall into a separate category.
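
As a minimal sketch of the first option (again with the placeholder names used in the examples below), global-average imputation can even be done inline with COALESCE, so that NULLs never reach the percentile functions:

select COALESCE(CPU,
                (select AVG(CPU) FROM <schema>.<table>) -- AVG ignores NULLs: the global average of the non-missing values
       ) as cpu_imputed
  FROM <schema>.<table>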

 

An easy example, which we will use later, is to calculate both the Q1 and Q3 values from our data:

select PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY <column>) as percentile_25,
       PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY <column>) as percentile_75
  FROM <schema>.<table>

 

Q1 will take the argument 0.25 while Q3 will take the argument 0.75.

Extra note: If you execute this in Data Studio, the editor may highlight the preceding statement as a syntax error, but you can safely ignore it.

SQL play for anomaly detection

The logic to follow will be very simple:

  1. Calculate the Q1 and Q3 values for our column of interest, and save the results for later use.
  2. Calculate the lower and upper margins following the formula described earlier. Using domain knowledge, you can either take the standard margin (the corresponding quartile shifted by 1.5 times the IQR) or adjust the fraction of the IQR to use.

There are many ways to solve this, but for the sake of simplicity I will use a Common Table Expression to show you the concept.

We already got the percentile values from the previous example. Now we just need to save them in a CTE, and in the subsequent SELECT we will calculate and evaluate every processed value.

The complete statement is shown here:

with iqr as ( -- This is a CTE
    select percentile_25,
           percentile_75,
           (percentile_75 - percentile_25) as iqr
      from ( -- section that really gets the percentiles
        select PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY CPU) as percentile_25,
               PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY CPU) as percentile_75
          FROM <schema>.<table>
      )
) -- end of CTE definition
/*
   Now we need to select from the newly created CTE.
   We are using a 1.5 IQR threshold initially, but
   this MUST be adjusted according to the business rule.
*/
select CPU, -- Show the observation
       case
         when float(CPU) >= percentile_75 + (iqr * 1.5) then 'pos_outlier'
         when float(CPU) <= percentile_25 - (iqr * 1.5) then 'neg_outlier'
         else 'inlier'
       end as prediction
  from <schema>.<table>, iqr -- implicit cross join; the iqr CTE returns exactly one row

 

You should now be able to label existing rows as inlier (value within the normal range), negative outlier, or positive outlier, or to create your own thresholds for evaluation. After this, you can post-process the data in any way that suits your analysis goals:

  • Just display the results in a query.
  • Save them and create a visualization; even a standard pie chart will do the job.
  • Evaluate new values as they arrive in your application and create a prediction.
  • Modify the existing table by adding an evaluation result for every row. In a very recent article here on IDUG, Brian Laube demonstrated a very creative use of a DGTT and MERGE that can serve this idea (the article is listed in the references below); a sketch of this option follows this list.
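
For illustration only, here is a minimal sketch of that last option, not Brian’s exact method: it assumes the table has a unique key column ID and an already-added PREDICTION column (both hypothetical names), and it inlines the percentile subquery instead of staging the results in a DGTT:

MERGE INTO <schema>.<table> AS t
USING (
  select c.ID,
         case
           when float(c.CPU) >= p.percentile_75 + 1.5 * (p.percentile_75 - p.percentile_25)
             then 'pos_outlier'
           when float(c.CPU) <= p.percentile_25 - 1.5 * (p.percentile_75 - p.percentile_25)
             then 'neg_outlier'
           else 'inlier'
         end as prediction
    from <schema>.<table> c,
         ( select PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY CPU) as percentile_25,
                  PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY CPU) as percentile_75
             FROM <schema>.<table> ) p
) AS s
ON t.ID = s.ID
WHEN MATCHED THEN UPDATE SET PREDICTION = s.prediction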

As I mentioned before, you must adjust using domain knowledge. As I tested with other data sets (here I mean “data set” in the generic sense, not to be confused with z/OS data sets), I saw that the lower margin for negative outliers was a negative number, which is nonsense in many cases. Here’s a quick comparison using three different thresholds:

Note: The total number of observations is 467 for my data set.

IQR fraction for evaluation                                  Inliers   Neg. outliers   Pos. outliers
1.5 * IQR (standard definition; neg. line: Q1 - 1.5*IQR)        403          64               0
1.0 * IQR (neg. line: Q1 - IQR)                                 347         104              16
0.5 * IQR (neg. line: Q1 - 0.5*IQR)                             322         112              33

 

Quick evaluation for this approach

Advantages:

  • Easy to apply and understand.
  • Easy to adjust.
  • We only needed SQL.

Disadvantages:

  • This analysis can only detect outliers at either extreme of the distribution; it cannot flag unusual values in between.
  • Real-time prediction can also be very expensive. This approach using a CTE scans the full data set every time it’s executed, so it is not really suitable for real-time evaluation in a transactional application; one mitigation is sketched below.
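
One way to mitigate that cost (a sketch under my own assumptions; the THRESHOLDS table name and single-row layout are hypothetical) is to compute the fences once in a batch window and let the application check incoming values against the stored row:

-- Periodic batch step: compute and store the fences once
CREATE TABLE <schema>.THRESHOLDS (LOWER_FENCE DOUBLE, UPPER_FENCE DOUBLE);
INSERT INTO <schema>.THRESHOLDS (LOWER_FENCE, UPPER_FENCE)
  select percentile_25 - 1.5 * (percentile_75 - percentile_25),
         percentile_75 + 1.5 * (percentile_75 - percentile_25)
    from ( select PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY CPU) as percentile_25,
                  PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY CPU) as percentile_75
             FROM <schema>.<table> ) p;

-- Cheap per-value check at evaluation time (? is the incoming value, bound twice)
select case
         when ? >= UPPER_FENCE then 'pos_outlier'
         when ? <= LOWER_FENCE then 'neg_outlier'
         else 'inlier'
       end as prediction
  FROM <schema>.THRESHOLDS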

 

Final thoughts

Doing Analytics doesn’t require Python programming; many of the tools we need are already within our grasp with simple SQL. Db2 on both the LUW and z/OS platforms delivers new functionality that at first is not so easy to place in your applications or in your everyday reporting, but sometimes we just need a small example or a small idea to get our hands on it and explore new possibilities. Today we learned that anomaly detection is not exclusively a Machine Learning problem, but if you’re still interested and want to see it as such, in a future article I will revisit this same topic using Machine Learning with Isolation Forests for quick analysis and visualizations.

 

References

Gorrie, C. (2016, March 29). Three ways to detect outliers. Retrieved from http://colingorrie.github.io/outlier-detection.html.

Interquartile range. (2019, November 21). Retrieved from https://en.wikipedia.org/wiki/Interquartile_range.

Kirkman, T. W. (1996). Box Plot: Display of Distribution. Retrieved from http://www.physics.csbsju.edu/stats/box2.html.

Db2 LUW related documentation

PERCENTILE_DISC. (n.d.). Retrieved from https://www.ibm.com/support/knowledgecenter/SSEPGG_11.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0061838.html.

PERCENTILE_CONT. (n.d.). Retrieved from https://www.ibm.com/support/knowledgecenter/SSEPGG_11.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0061836.html.

Db2 for z/OS related documentation

PERCENTILE_DISC. (n.d.). Retrieved from https://www.ibm.com/support/knowledgecenter/en/SSEPEK_12.0.0/sqlref/src/tpc/db2z_bif_percentiledisc.html.

PERCENTILE_CONT. (n.d.). Retrieved from https://www.ibm.com/support/knowledgecenter/en/SSEPEK_12.0.0/sqlref/src/tpc/db2z_bif_percentilecont.html.

Laube, B. (n.d.). Db2 SQL for the DBA. Retrieved from https://www.idug.org/p/bl/ar/blogaid=943.


More examples of using these new V12 functions with EMP table

January 10, 2020 09:27 AM by Brian Laube

Great article Javier! 

You explained your perspective of this new V12 function.  I get it.

And then you explained IQR from an analytical perspective.  And I get it…

For those of us who like to do analytics with SQL alone then we can use the concepts you explain to do very advanced analytics.

 

I took your concepts and created two more SQL statements on the IBM-supplied sample EMP table. The first finds the average and percentile salaries for each WORKDEPT.

The second takes the pseudo-code SQL you provided above…. (I use more CTEs than you did… but I love CTEs nowadays… personal preference)

 

-- slightly more complex example of using new PERCENTILE_CONT function!
-- from EMP table -> find avg_salary and 75% and 25% percentile salary
--                   for each WORKDEPT
-- I think this is *interesting* -> the result shows each WORKDEPT!
SELECT workdept
, decimal(avg(salary),9,2) as avg_salary
, PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY SALARY) as PCT_75_SALARY
, PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY SALARY) as PCT_25_SALARY
, count(*) as cnt
FROM DSN81210.EMP
group by workdept
order by workdept
;

 

------------------------------------------------------------------------
------------------------------------------------------------------------
-- now, a more complex example -- similar to the pseudo sql in blog
--
-- find EMP outliers based upon salary!
-- >> who is surprisingly overpaid or underpaid (within a WORKDEPT)
--
-- first, find workdept avg_salary and percentile salaries
with cte_percentiles as
(SELECT workdept
, PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY SALARY) as PCT_75_SALARY
, PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY SALARY) as PCT_25_SALARY
, decimal(avg(salary),9,2) as avg_salary
, count(*) as cnt
FROM DSN81210.EMP
group by workdept
)
-- now, find workdept IQR for those percentiles
, cte_iqr as
(select workdept
, PCT_75_SALARY
, PCT_25_SALARY
, (PCT_75_SALARY - PCT_25_SALARY) as iqr
, avg_salary
, cnt
from cte_percentiles
)
-- now, go through EMP table again.  join to CTE_IQR
-- > identify which EMP rows have a SALARY outlier! (based upon IQR)
, cte_emp_predictions as
(select i.workdept, i.avg_salary
, i.PCT_75_SALARY
, i.PCT_25_SALARY
, i.iqr
, i.cnt as workdept_emp_cnt
, e.FIRSTNME, e.LASTNAME, e.JOB, e.EDLEVEL, e.SEX
, e.salary
, case when i.cnt = 1 then '*set of one*' /* not very interesting */
       when e.salary >= (i.PCT_75_SALARY + (i.iqr*1.5)) then 'outlier_pos'
       when e.salary <= (i.PCT_25_SALARY - (i.iqr*1.5)) then 'outlier_neg'
       else 'inlier'  /* not an anomaly -> boring */
  end as wd_prediction
from DSN81210.EMP e
inner join cte_iqr i
   on e.workdept = i.workdept
)
-- now ... at the end of this long SQL with many CTEs
-- .. let us look at these EMP in the final CTE -> pay attention to "wd_prediction"
select *
from cte_emp_predictions
--where wd_prediction like 'outlier%'  -- report all .. or whatever
order by workdept, salary desc
;

 

 

 

 
