Time series analysis for people in a rush, survival analysis

By Javier Estrada Benavides posted Aug 10, 2021 08:47 AM

  

Time series analysis for people in a rush, survival analysis

We’ve seen it in many situations and not just as DBAs. For many cases where we control and constantly watch metrics, whether it’s Db2 monitoring, application tuning, or even non-technical everyday events, there’s a “magic metric” that, if it reaches a certain number, it’s an indicator or a trigger for undesired events.

  • When the number of GETPAGE requests of a Buffer Pool is greater than X, you need to take an action.
  • When the total number of connections in DDF approaches 90% of the CONDBAT ZPARM, you need to be alerted.
  • We also have several indicators in our RTS tables to tell us when to run REORG or RUNSTATS.
  • When a number of connections stay in LOCKWAIT status for at least X amount of time, you’ll get a message.

How about some other scenarios?

  • The number of days passed since you opened a box of milk and it’s in the fridge.
  • The number of times you’ve been shaving your beard with the same blades.
  • And the list goes on…

Yes, it’s all Math, and to continue with our series of Time Series Analysis for people in a rush (that is, for people like us who don’t want to pick up a statistics course), we’ll have a look at one of the most used techniques and use it for our daily monitoring.

The first part can be found here:

https://www.idug.org/blogs/javier-estrada-benavides1/2021/01/26/time-series-analysis

Bear with me, as the nature of this concept is negative. We’ll talk about Survival Analysis and how convenient it is as a “quick use alarm”. It is a convenient statistical tool and it started in the Medical field to track (you guessed it right), the survival rate of a population given a certain diagnosis and history of other studies. This same principle can also apply to measure the expected use of machine parts, or the time it will take for a new customer to churn.

The principle

We will simply calculate the probability that a given event has not occurred at time t. What does it mean? How can I see it in better terms?

  • If I have a time sensitive data set, and..
  • I also have measurements at certain times that an event happened or not in individual observations, then…
  • I can have a general sense of “how likely an event X will happen given after T time”

Question: For how long can I use my new laptop until it begins to malfunction?

Example result: Your new laptop has 40% of chances that it will malfunction after 4 years of use.

 

How does it apply to me as a DBA?

We can use a bit of creativity and turn time into a metric, and track the event. Here’s one quick example:

Time = Metric we’re watching

Event = An action you need to take based on the metric you’re watching

Example: How likely it is that I need to run a COPY when my number of changed paged increases?

You can run this same example for your frequency in RUNSTATS/REORGs, resize of Buffer Pools, detecting the need of a new index even before you run an EXPLAIN, etc.

Now let’s have a more specific approach: In the previous article, I used the example of the number of DDF connections of a single application to tell its new trend. I can also say that if I see more than 400 connections, I might start to see slow performance. Now let’s see it in parts…

The metric: Number of connections to DDF

Event: Degradation in app performance

Question: How many connections can I safely see at any given time without compromising performance?

Example result: With no DBA action, X number of connections gives 80% of chances of stable performance.



Data required

You will need a collection of observations, and a simple indicator to tell if the event happened or not. Normally you would obtain this data set by doing stress tests of your application, or simple monitoring (and a bit of manual effort) will do the trick. Here is a sample of my data:

Ds

y

alert

3/4/2021

535

1

3/5/2021

350

0

3/6/2021

77

0

3/7/2021

216

0

3/8/2021

521

1

3/9/2021

65

0

3/10/2021

50

0

3/11/2021

475

1

3/12/2021

317

1

3/13/2021

237

0

3/14/2021

378

0

3/15/2021

432

1

Column info:

DS – The date where I obtained the sample: I always took the peak times for each day.

Y – The number of connections from the application.

Alert – A “1” indicates slow performance was observed, “0” indicates no report occurred.

Given this data (I obtained over 150 measurements), I will try to get a general chart to measure my “Survival rate”.

 

The code

Once again we will use Python and a couple of libraries for this case. The requirements are as follows:

  • Python version: I will be using 3.6 in this case. I suggest the Anaconda distribution since it already contains the next statistical library and other visualization libraries.
  • Python library SciPy. There are many libraries that support this tool, and we’ll be using one of the simplest ones, called lifelines. To install it you need to go to your command terminal and enter the following:
pip install lifelines

Or if you’re using Anaconda…

conda install –c conda-forge lifelines

  • A data visualization library. I’m using the Python Plotly library for this case. You can install it with the following command:
pip install plotly

 

The implementation is quite straightforward. We will generate a chart that will show the Survival Function (this is, the chances that the event will not happen given the metric used)

First we read the dataset:

from lifelines import KaplanMeierFitter

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

sns.set()

df = pd.read_csv('App1_2.csv')

df.head()

Now we will fit and plot the data. By invoking the fit method we’re already processing the data set and doing the calculations to obtain the Survival rate.

#create kmf object

kmf = KaplanMeierFitter()

 

#Fit the data

kmf.fit(df['y'], df['alert'], label='App performance confidence')

# Now we just simply plot the data

kmf.plot(ci_show=False)

The result is the following chart:

How to read this chart

The X axis shows the number of connections, while the Y axis shows the probability of “Survival” (or Survival Rate), this is: The probability that the event will NOT occur. In our case, it can be treated as “the confidence rate that we will not see performance degradation”. We can see a few examples:

  • Between 0 and 100 connections, I have 100% confidence in the stability of the application.
  • With more than 350 connections, the performance of this application will start to decay.
  • If I observe more than 500 connections from this application, I can expect performance issues.

Now, based on the previous chart and statements, here are some questions for you to think about:

  • How would you set an alert for this?
  • When would you start taking an action as a DBA?
    • And when is it too late to take an action?

How to obtain an interactive chart (that you can export since it’s HTML)

If you need to insert this simple chart in a dashboard or email report, an HTML version will be more suited for this. We can use the following code:

import plotly

import plotly.graph_objs as go

plotly.offline.init_notebook_mode(connected=False)

 

x_range = np.arange(150,650)

survival_prob = kmf.survival_function_at_times(times=np.arange(150,650))

 

chart = go.Scatter(x = x_range,y = survival_prob,mode='lines+markers',name='Survival function')

   

data = [chart]

 

layout = go.Layout(title = 'App Survival rate by number of connections',autosize = False,width = 600,

                       height = 600,xaxis = dict(title = 'Number of connections to DDF'),

                       yaxis = dict(title = 'Chances of survival'))

fig = go.Figure(data = data, layout = layout)

 

plotly.offline.iplot(fig)

The result can be exported to any HTML dashboard or email, and when viewed from a Python Jupyter terminal, it will allow for user interaction.



As usual, it requires previous homework to obtain a number of observations and label the data to show when the event occurred or not, but the result is a simple visualization that can be used to tell “when to expect a problem”. This is a simple scenario but it can be quite useful for quick reference by taking your favorite Db2 monitoring metric.

 

Closing thoughts

One of the main problems is not entirely on learning Statistics or become a Data Scientist, but on knowing if there are ready to use tools or ready to export solutions, and Time Series Analysis is a topic that very DBA should learn at some degree given that we have to create performance weekly reports or do Db2 Capacity Planning. Now we have the choice of many open tools that allow us to import our data and create dashboards, but at the same time, having a bit of control over our tools and techniques also allows to create our own customized reports and give us more flexibility not just as DBAs, but as storytellers.



0 comments
839 views

Permalink