Time Series Analysis and forecasting for people in a rush

By Javier Estrada Benavides posted Feb 03, 2021 01:00 AM

  
The interactive version of this article can be viewed here:

https://nbviewer.jupyter.org/github/Salzinni/idug_power_to_your_monitors/blob/master/Time_Series_Analysis_for_people_in_a_rush.ipynb



If you’re a fan of Star Trek you would probably say that space is the final frontier, but today I would argue that it’s actually time, as things we experience are sorted in our memory according to relative time.

The same holds true for our daily work in Db2, especially monitoring, since everything we see is time sensitive: Metrics we track, connections that come and go, utilities running, etc. Of course, the most common phone calls/emails we get are around time (my application is running slower, Db2 is responding slower, and all those comments we’ve all heard a million times already).

The goal here will be simple: We want to have a quick way to observe, understand and maybe forecast time sensitive data so that we, as DBAs, can have a good view before we even get the first performance complaint. In addition to that, we’ll add a couple more goals:

  • For anything we generate, we must have an interactive view.
  • We should also be able to export it to other dashboards or daily monitoring emails.

This can apply to any metric we have, so we can use it anywhere: CPU, elapsed time for transactions, number of connections, memory… anything we want to show. Surely, experience as DBAs can give us the “magical metric with the magical number to catch”.

The drawback: Before we start with this story, there’s one drawback I need to mention. Time Series Analysis is quite a big field, and there’s a huge number of books, frameworks, and approaches trying to give you the best techniques for a given goal, so our starting point must be univariate data. This means we’ll only be able to analyze one metric at a time, as a separate and independent entity (which is not always true in real life). There are books dedicated solely to the study of multivariate data, but as usual, we want to get started playing with the data we already have in front of us.

Jupyter notebook for this article

  Everything shown here was obtained from the following Jupyter notebook:

https://github.com/Salzinni/idug_power_to_your_monitors/blob/master/Time_Series_Analysis_for_people_in_a_rush.ipynb

Note: GitHub doesn’t quite render charts made with D3.js, so I suggest opening it with nbviewer instead. Follow this link:

https://nbviewer.jupyter.org/github/Salzinni/idug_power_to_your_monitors/blob/master/Time_Series_Analysis_for_people_in_a_rush.ipynb


The frameworks for this article

We’ll be using Python for these examples, and we will need the following libraries/frameworks:

  • I recommend installing Anaconda, since you’ll also get Jupyter and pandas.
  • Interactive charts are created with the Plotly library; their homepage has the reference and API description.

You can easily install it with pip:

pip install plotly
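
We’ll also use Facebook Prophet later in this article. At the time of writing it ships as the fbprophet package (an assumption about your environment; it typically needs pystan as well, so check the project’s install notes if pip alone fails):

pip install fbprophet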


Data for this article

For the following exercises, I’ll be using a collection of numbers that represent the average number of connections of a particular application (taken during the rush hours). Given that we’ll be working with univariate data we will only need two columns:

  • The date (or time)
  • The metric we’ll measure

We’ll be reusing this same data for the next article to “give it a different look with fresh eyes”.

Here’s a sample of the data just to show what it looks like:

import pandas as pd

# First we read the file; our data came as CSV
df = pd.read_csv('App1_2.csv')
df.head(10)

          ds    y
0   3/4/2019  535
1   3/5/2019  350
2   3/6/2019   77
3   3/7/2019  216
4   3/8/2019  521
5   3/9/2019   65
6  3/10/2019   50

The task

Let’s divide this exercise into several steps, and phrase them as the usual questions we ask when trying to understand time-sensitive data:

  • How can I quickly see it in a chart?
  • Can I see the average or trend as well?
  • Is there a way to identify patterns? I want to spot them with my eyes, instead of learning complicated math.
  • How can I export it? Doing it every time in Excel is not the best way.
  • Can I get a forecast?

By the end of this article, every question will be answered, and even if that looks like a lot, don’t worry: it can all be set up quickly.

Visualizing your data

First things first, let’s get a quick chart of our data. Plotly gives us a quick way to do this:

import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(go.Scatter(
                x = df.ds,
                y = df.y,
                name = 'Observations',
                line_color = 'deepskyblue',
                opacity = 0.8))

fig.update_layout(title_text="App 1 connections made to DDF",
                  xaxis_rangeslider_visible=True)
fig.show()


I highly suggest checking the notebook, because all charts made by Plotly are interactive by default; they include tooltips and buttons to zoom in and out and to export the chart as an image.

Right now we’ve already answered two questions:

  • We can now see the data with just a small block of code that can be reused for other files and other kinds of time series data.
  • Something I haven’t mentioned is that Plotly generates charts using D3.js; therefore, the output is HTML content, and it can be inserted into any dashboard or automated email (see the sketch below).
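
For example, a figure can be written to a standalone HTML file or returned as an HTML string for embedding. Here’s a minimal sketch; the file name is just an illustration:

# Sketch: exporting the Plotly figure for dashboards or emails.
fig.write_html('dashboard_chart.html')  # standalone HTML file

# Or get an HTML fragment to embed in an email body or a web page;
# full_html=False returns just a <div> instead of a complete page.
html_fragment = fig.to_html(full_html=False, include_plotlyjs='cdn')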

Instead of just showing the raw data, let’s add the trend, or moving average in the same visualization.

Side note: Moving average

The moving average is a simple algorithm, and it contains the core principle behind other, much more complex forecasting algorithms: it can be treated as a weighted average in which older values are penalized.

What does that mean and what is the reasoning?

The rule of thumb is that what I saw yesterday is most likely what I’ll see today, but what I saw last month is no longer representative, so I need to decide how many values I’ll really use to “represent or model” the current trend. This number of values is known as “the rolling window”.

We’ll add a trend with a rolling window of 7 observations. In simple terms, this means we’ll plot the weekly trend as another line.

So we’ll substitute the previous code with this block:

import plotly.graph_objects as go

# Create another column containing the trend with a window of 7 obs
df['SMA_7'] = df.iloc[:,1].rolling(window=7).mean()

fig = go.Figure()

# Add the first line, the observations
fig.add_trace(go.Scatter(
                x = df.ds,
                y = df.y,
                name = 'Observations',
                line_color = 'deepskyblue',
                opacity = 0.8))

# Add the second line, the trend
fig.add_trace(go.Scatter(
                x = df.ds,
                y = df['SMA_7'],
                name = 'MA 1 week',
                line_color = 'dimgray',
                opacity = 0.8))

fig.update_layout(title_text="App 1 connections made to DDF",
                  xaxis_rangeslider_visible=True)
fig.show()
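
Incidentally, the simple moving average above weights all seven observations equally. If you want older values explicitly penalized, as described in the side note, pandas offers an exponentially weighted variant. A minimal sketch (the EMA_7 column name is just an illustration):

# Sketch: exponentially weighted moving average, where recent
# observations carry more weight than older ones.
df['EMA_7'] = df['y'].ewm(span=7, adjust=False).mean()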


Nothing complicated up to this point. We have met these requirements:

  • Simple visualization.
  • Discovering the trend.
  • Generating charts that can be exported to almost anywhere.

 

Let’s try to answer the next question:

  • Can we learn a bit more about the data (i.e., does it come in cycles, and how big are they)?

 

Forecasting Time Series data

The topic of forecasting is quite deep, and you’ll need to know the nature of your data to select the appropriate forecasting algorithm, since there are a few pitfalls:

  • Seasonality: What I see today is different from what I see during Xmas.
    • If I use my data from Xmas for forecasting, I’ll be assuming very low activity and therefore end up with a biased analysis.
  • Not all data can actually be forecasted. We’ll skip this analysis for now, but if you’re interested, I suggest looking up the terms “auto-correlation” and “partial auto-correlation” (see the sketch below).
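
For a quick look at those two, statsmodels can plot both in a couple of lines. This is a minimal sketch, assuming statsmodels is installed and df is the dataset we loaded earlier:

# Sketch: auto-correlation and partial auto-correlation plots,
# used to judge whether (and how far back) past values explain future ones.
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

plot_acf(df['y'], lags=30)   # correlation of the series with its own lagged values
plot_pacf(df['y'], lags=30)  # same, with the effect of intermediate lags removed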

What we want is a quick solution we can use with most of our data, whether it comes from Db2 or somewhere else.

Topics for future study

Don’t forget what I described earlier: in their most basic forms, these algorithms resemble a weighted average, and the rest of their complexity compensates for factors where a simple average is not enough. Here are the names of the topics if you’re interested in how the most popular approaches work:

  • ARIMA: Auto-Regressive Integrated Moving Average. More often than not, ARIMA will give you a reliable and fast approach to forecasting. A tutorial can be found here (a short sketch follows the link):

 

https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/
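
As a rough illustration (separate from the tutorial above), recent versions of statsmodels ship an ARIMA implementation; the (1, 1, 1) order below is a placeholder you would tune for your own data:

# Sketch: fitting an ARIMA model and forecasting 30 future observations.
from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(df['y'], order=(1, 1, 1))  # (p, d, q): AR terms, differencing, MA terms
fit = model.fit()
print(fit.forecast(steps=30))            # the next 30 predicted values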

 

  • SARIMA: Seasonal Auto-Regressive Integrated Moving Average. If your data contains cycles or “seasons”, SARIMA is your best friend. Here’s a tutorial on how it can be used (again, a short sketch follows the link):

 

https://medium.com/analytics-vidhya/time-series-forecasting-sarima-vs-auto-arima-models-f95e76d71d8f
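
The seasonal variant looks almost the same in statsmodels, via its SARIMAX class; the extra seasonal_order argument describes the cycle. The weekly cycle of 7 observations below is just an assumption to match our data:

# Sketch: SARIMA via statsmodels' SARIMAX, adding a seasonal component.
from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(df['y'], order=(1, 1, 1),
                seasonal_order=(1, 1, 1, 7))  # (P, D, Q, s): s is the season length
fit = model.fit(disp=False)
print(fit.forecast(steps=30))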

 

Time Series Forecasting with Facebook Prophet

Regardless of the actual value of this social network, the Facebook team set out to create a user-friendly framework for Time Series forecasting that requires zero knowledge of its underlying algorithm, and they succeeded. We’ll see how this works and how easily you can use it.

Step #1: Establish floor and cap values

Prophet’s logistic growth model (which we’ll use here) needs a reference for both the minimum and maximum possible values. As you may have noticed, my application connects through DDF, so I can use the ZPARM CONDBAT as my maximum, and the minimum will be the default size of the JDBC connection pool in the server’s datasource. I’ll add both columns to our data and display a sample:

import pandas as pd
from fbprophet import Prophet

df = pd.read_csv('App1_2.csv')
df['floor'] = 5
df['cap'] = 700
df.head()

         ds    y  floor  cap
0  3/4/2019  535      5  700
1  3/5/2019  350      5  700
2  3/6/2019   77      5  700
3  3/7/2019  216      5  700
4  3/8/2019  521      5  700

Step #2: Create a future dataset to forecast on

Now we need to create an alternate dataset: a table containing future dates plus the same floor and cap columns (you might also notice that my data is a bit old, but it is “the future” relative to the original observations).

We will also be creating our instance of Prophet.

# This creates a Prophet instance
m = Prophet(growth='logistic', weekly_seasonality=True)

# We invoke fit over our original dataset
m.fit(df)
# Now we have our Prophet model :)

# Prophet creates the future dataset and we can specify how many
# observations we want; here I'm saying that I want 30 future observations
future = m.make_future_dataframe(periods=30)
future['floor'] = 5
future['cap'] = 700
future.tail()

             ds  floor  cap
173  2019-08-24      5  700
174  2019-08-25      5  700
175  2019-08-26      5  700
176  2019-08-27      5  700
177  2019-08-28      5  700

Step #3: Obtain more information (and the forecast predictions)

Of course we want the predictions, but we can also obtain more information from our own dataset. Do you want to see what else we can learn from it? Let’s see what Prophet can tell us.

# First... make a prediction over our forecast dataset
forecast = m.predict(future)

# We now plot the components
fig2 = m.plot_components(forecast)


We can see two important points here:

  1. The trend goes up, but the slope is small; values tend to stay in the same range.
  2. Based on the weekly view, my peak days are Mondays, and activity tends to go down day by day until the next Monday.

 

Step #4: Plot the forecast

We already have the forecast in our “forecast” dataset. We just have to plot it, and for that we have two options:

Option number 1: Simple way

Prophet already has a function to plot the forecast values:

fig1 = m.plot(forecast)


How to read this chart:

  • The black dots are the actual observations. You can see that they stop close to “2019-08”; that’s where the dataset ends, and the portion to the right shows the forecast.
  • The blue line represents the forecast. It continues past “2019-08” and shows what the future looks like.
  • The blue region, or shadow, is your confidence region (the numbers behind this chart appear in the sketch below).
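
If you’d rather read the numbers directly, the forecast DataFrame returned by m.predict() already contains them: yhat is the prediction, and yhat_lower/yhat_upper are the bounds of the confidence region.

# The forecast DataFrame includes the prediction (yhat)
# and its confidence interval bounds.
print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail())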

 

If you find the static chart not so friendly, how about an interactive version?

from fbprophet.plot import plot_plotly
import plotly.offline as py

py.init_notebook_mode()

fig_plotly = plot_plotly(m, forecast)
py.iplot(fig_plotly)


Notice a few things:

  • The slider at the bottom lets you zoom in on any region you want.
  • The controls in the top left corner limit the view to a week, a month, a year, or all the data.
  • Again, it’s generated as HTML, ready to be exported anywhere.

General recommendations

There are a few things to mention at this point. Of course, this kind of analysis works best when you have some knowledge of your data, since you already know whether it contains seasons or cycles, how big they are, and what window size to use for the moving average calculation. I always say that one of the benefits of being a DBA is that you already have a ton of data to play with, and you will definitely be required to play with it. And if you don’t know the data at all, these quick examples already give you a good start.

What’s next?

In the next article, we’ll expand this same data with one more column and turn it into a survival analysis scenario to plot the chances of alerts.

 

Closing thoughts

There are many books covering Time Series Analysis with many approaches, from simple averages to neural networks and beyond. This is exactly why I said in the beginning that time is the final frontier, but that doesn’t mean we have to learn rocket science. There’s always at least one way to start playing, and there’s always a team of people trying to provide the most approachable solutions and frameworks.

In any case, you should still be familiar with the basics, so that next time you start working on a Monday morning you’ll already have all of the previous week’s data in a pretty-looking dashboard, ready to face any emergency meeting.


