Exploring the rich world of SQL OLAP functions - Moving Averages by George Baklarz

Exploring the rich world of SQL OLAP functions - Moving Averages

 (*editorial note of the content committee: most of what is in the article is cross-platform, except the examples of LAG/LEAD and FIRST_VALUE/LAST_VALUE are not available in DB2 for z/OS)

Pity the poor developer or end user that has never had the chance to use "real" SQL. You know, the kind of SQL that does more than select a few rows with possibly a few predicates thrown in?

Well, some may say that it might be a good thing not having to know the details behind SQL! However, knowing some of the sophisticated grouping, ordering, and sorting options of SQL could make your life easier, and perhaps even result in faster results.

The Expanding SQL Syntax

 

The SQL language is rich in functionality and can often be complex to navigate. How did we get to this point in the language from our simple INSERT, UPDATE, DELETE, and SELECT beginnings?

 

Early relational database systems lacked analytic capabilities - for instance, the ability to calculate standard deviation, or generate moving averages. This type of calculation was typically performed in either a spreadsheet, or an application. There are a large number of products in the marketplace that generate their own cubes and reports by extracting data from a database (and often using some very poor extraction techniques!).

 

Running reports locally on a workstation resulted in two major bottlenecks:

 

  • Transferring large quantities of data to a workstation over a limited network
  • Compute-intensive calculations being run on slow processors with little memory and disk

 

Aside from adding more richness to the SQL language, the ability to generate CUBEs and summary information at the database level helped improve performance in a number of ways:

 

  • The larger server that the database was on was more efficient at computing the cubes, summaries, and other calculations that the workstation was challenged to do
  • The data was local to the processor so no unnecessary transfer of data was required when computing the result
  • Only the final computed rows would be returned to the application

 

Of course, some DBAs amongst us would also point out that this new capability increased the amount of headaches they have to deal with because of over-enthusiastic SQL fanatics who didn't have a clue about indexes and optimization!

 

We sometimes forget that the DB2 database engine can do a lot of work on our behalf rather than relying on our favorite spreadsheet to compute some of these values. You may find that using these functions can speed up your reporting. Or you can impress your close friends with your ability to write complex SQL!

 

On-Line Analytical Processing (OLAP)

 

OLAP functions provide a variety of capabilities, including the ability to return ranking, row numbering, and modify column functions to analyze a range of values within a column rather than the entire result set. Typically an OLAP function can be included in expressions in a select-list or the ORDER BY clause of a select-statement.

 

While there are lots of OLAP functions found within DB2, we are going to look at  moving averages and how you can make use of them.

 

Before we get into the details of the OLAP functions, we need to create a table and some sample data. The following example assumes that you have a connection to the SAMPLE database that is usually generated for you in a DB2 for Linux, Unix, and Windows installation.

 

Note: The SQL examples that are found in this article are available as a downloadable text file for you to use. Nothing is worse than having to type in a set of SQL and having it fail because of a missing comma or misspelled column name.  A link to the file is provided at the end of the article.

 

First we need to create the STOCKS table, which contains the stocks symbol (SYMBOL), the trading date (TRADINGDATE), and the closing price for the stock at the end of the day (CLOSEPRICE).

 

drop table stocks;

create table stocks (

  symbol char(4),

  tradingdate date,

  closeprice int

);

 

The following SQL uses recursion to generate data for the STOCKS table with just one stock  (IDUG). The stock price is 100 at the beginning of the year and then increased by a value of 1 for every day until January 10th.  The numbers have been kept simple so you can see how the calculations are done in the examples.

 

insert into stocks

   with temp1(nm,tx,s1) as

     (

     values ('IDUG', date('2015-01-01'),100)

     union all

     select 'IDUG', tx + 1 day, s1+1

       from temp1

       where tx < date('2015-01-10')

     )

   select * from temp1;

 

If you want to generate random numbers for your stocks, you can modify the SELECT statement to include the following CASE statement.

 

     SELECT 'IDUG',TX + 1 DAY, S1+ CASE

         WHEN RAND() >= .50 THEN INT(-10*RAND())

         ELSE INT(10 * RAND())

       END

 

The data in the stock table is shown below.

 

SELECT * FROM STOCKS;

 

SYMBOL TRADINGDATE         CLOSEPRICE

------ ------------------- -----------

IDUG   2015-01-01-00.00.00         100

IDUG   2015-01-02-00.00.00         101

IDUG   2015-01-03-00.00.00         102

IDUG   2015-01-04-00.00.00         103

IDUG   2015-01-05-00.00.00         104

IDUG   2015-01-06-00.00.00         105

IDUG   2015-01-07-00.00.00         106

IDUG   2015-01-08-00.00.00         107

IDUG   2015-01-09-00.00.00         108

IDUG   2015-01-10-00.00.00         109

 

30-Day Moving Average

 

We will start by looking at a moving average calculation.  Moving averages allow us to look at a "smoothed" curve of data points. There will always be spikes in data values, especially with data associated with the stock market! Moving averages can give us a better view of the long-term trend of the data rather than looking at individual data points.

 

The general format of a moving average involves the use of three operators within the SQL statement: OVER, PARTITION, and ROWS BETWEEN … PRECEEDING AND CURRENT ROW. 

 

  • OVER 
  • The OVER function is used in conjunction with a function like AVG. The OVER keyword instructs DB2 to calculate the average (or other function) over a range of values. We can specify a partition (i.e. a stock symbol), which tells us what column will be used to stop the moving average.
  • PARTITION 
  • The partition clause is used during processing to limit the calculation. A partition can be thought of a window into the data that the function is applied to. In the example we are going to use the STOCK column as the partition. Basically this means that every stock value in the table will have its own moving average value. If you don't specify a partition, all values in the table will be used, regardless of the stock symbol.
  • ORDER BY
  • ORDER BY must be specified within the OVER clause so that the data is ordered for the ROWS function. You can get the same partitioning results by specifying ORDER BY partition_column, ordering_column.
  • ROWS BETWEEN x PRECEDING AND CURRENT ROW
  • This is the clause that tells DB2 what values to use when computing the average value. The SQL specifies the number of rows before (PRECEDING) from the current row. So if we want a 30-day moving average, we would specify BETWEEN 29 PRECEDING AND CURRENT ROW.

 

Now that we have the basics of a moving average calculation, here is the SQL that will give us the 30-day moving average of our stock table:

 

[ 1] WITH V1(SYMBOL, TRADINGDATE, MOVINGAVG) AS

[ 2]   (

[ 3]   SELECT SYMBOL, TRADINGDATE,

[ 4]     AVG(CLOSEPRICE) [ 5]       OVER ( [ 6]         PARTITION BY SYMBOL

[ 7]         ORDER BY TRADINGDATE

[ 8]         ROWS BETWEEN 29 PRECEDING AND CURRENT ROW

[ 9]       )

[10]   FROM STOCKS

[11]     WHERE SYMBOL = 'IDUG' [12]   )

[13] SELECT SYMBOL, TRADINGDATE, DEC(MOVINGAVG,5,2) FROM V1

[14] ORDER BY SYMBOL, TRADINGDATE

 

The first portion of the SQL creates a temporary table called V1 that contains the stock symbol, the trading date, and the moving averages. These calculations will be done within the body of the WITH statement.

 

[ 1] WITH V1(SYMBOL, TRADINGDATE, MOVINGAVG) AS

[ 2]  ( [12]  )

[13] SELECT SYMBOL, TRADINGDATE, DEC(MOVINGAVG,5,2) FROM V1

[14] ORDER BY SYMBOL, TRADINGDATE

 

The SELECT statement after the WITH clause retrieves all of the data from the temporary table. You could always add a BETWEEN clause to limit the results to a specific range of data.

 

WHERE TRADINGDATE BETWEEN '2015-01-01' AND '2015-01-10'

 

The real work takes place within the WITH clause.

 

[ 3]   SELECT SYMBOL, TRADINGDATE,

[ 4]     AVG(CLOSEPRICE) [ 5]       OVER ( [ 6]         PARTITION BY SYMBOL

[ 7]         ORDER BY TRADINGDATE

[ 8]         ROWS BETWEEN 29 PRECEDING AND CURRENT ROW

[ 9]       )

[10]   FROM STOCKS

[11]     WHERE SYMBOL = 'IDUG'

 

The average function (AVG) is going to be applied to the CLOSEPRICE column. The OVER function specifies the following:

  • The partition is based on the stock symbol (think of it as where breaks would occur during the processing of an average)
  • Ordering of the rows should be done by trading date so that we have the rows in the proper order for the moving average
  • The average should take into account the 29 preceding rows from the current row being processed (for a total of 30 days)The FROM clause selects the specific stock values that we want to track, although we could remove that from the example and we would get moving averages for all of the stocks in the table. Running the query against our test table generates the following result.
  •  

 

SYMBOL TRADINGDATE         3

------ ------------------- -------           

IDUG   2015-01-01-00.00.00  100.00

IDUG   2015-01-02-00.00.00  100.50

IDUG   2015-01-03-00.00.00  101.00

IDUG   2015-01-04-00.00.00  101.50

IDUG   2015-01-05-00.00.00  102.00

IDUG   2015-01-06-00.00.00  102.50

IDUG   2015-01-07-00.00.00  103.00

IDUG   2015-01-08-00.00.00  103.50

IDUG   2015-01-09-00.00.00  104.00

IDUG   2015-01-10-00.00.00  104.50

 

You could add an additional twist to the select statement to show the delta between the moving average and the original value.

WITH V1(SYMBOL, TRADINGDATE, MOVINGAVG) AS

  (

  SELECT SYMBOL, TRADINGDATE,

    AVG(CLOSEPRICE) OVER (PARTITION BY SYMBOL ORDER BY TRADINGDATE

        ROWS BETWEEN 29 PRECEDING AND CURRENT ROW

    )

  FROM STOCKS

  ),

V2(SYMBOL, TRADINGDATE, ORIGINAL, MOVINGAVG, DELTA) AS

  (

  SELECT S.SYMBOL, S.TRADINGDATE, S.CLOSEPRICE,

    DEC(V1.MOVINGAVG,5,2),DEC(S.CLOSEPRICE-V1.MOVINGAVG,5,2)

  FROM V1, STOCKS S

    WHERE V1.SYMBOL=S.SYMBOL AND V1.TRADINGDATE=S.TRADINGDATE

  )

SELECT SYMBOL, TRADINGDATE, ORIGINAL, MOVINGAVG, DELTA FROM V2

ORDER BY SYMBOL, TRADINGDATE

 

There are many ways to generate these results, but a WITH statement is a convenient way of building up the SQL that we want to execute.  The results illustrate how our average is being computed over the range of values.

 

SYMBOL TRADINGDATE         ORIGINAL    MOVINGAVG DELTA

------ ------------------- ----------- --------- -------

IDUG   2015-01-01-00.00.00         100    100.00    0.00

IDUG   2015-01-02-00.00.00         101    100.50    0.50

IDUG   2015-01-03-00.00.00         102    101.00    1.00

IDUG   2015-01-04-00.00.00         103    101.50    1.50

IDUG   2015-01-05-00.00.00         104    102.00    2.00

IDUG   2015-01-06-00.00.00         105    102.50    2.50

IDUG   2015-01-07-00.00.00         106    103.00    3.00

IDUG   2015-01-08-00.00.00         107    103.50    3.50

IDUG   2015-01-09-00.00.00         108    104.00    4.00

IDUG   2015-01-10-00.00.00         109    104.50    4.50

 

Some of you may have wondered how we can have a moving 30 day average when our data doesn't actually contain 30 days of data prior to the current row. In the previous result, you'll notice that the first of January has an average of 100 since there was only one value involved. The 2nd has an average of 100.50 because our total is 201 for the two days and then divided by two. So the OVER function is intelligent enough to create a moving average based on the values that are available to the function.

 

The following results use a 5 day moving average, which shows that the average function is limited to 5 days. The delta will always be 2 after the 5th day.

 

SYMBOL TRADINGDATE         ORIGINAL    MOVINGAVG DELTA

------ ------------------- ----------- --------- -------

IDUG   2015-01-01-00.00.00         100    100.00    0.00

IDUG   2015-01-02-00.00.00         101    100.50    0.50

IDUG   2015-01-03-00.00.00         102    101.00    1.00

IDUG   2015-01-04-00.00.00         103    101.50    1.50

IDUG   2015-01-05-00.00.00         104    102.00    2.00

IDUG   2015-01-06-00.00.00         105    103.00    2.00

IDUG   2015-01-07-00.00.00         106    104.00    2.00

IDUG   2015-01-08-00.00.00         107    105.00    2.00

IDUG   2015-01-09-00.00.00         108    106.00    2.00

IDUG   2015-01-10-00.00.00         109    107.00    2.00

 

The LAG/LEAD Function

 

There will be situations where the data may contain NULL values and care must be taken on how these values are used during the computation of the moving average. Are NULL values supposed to be treated as zeros, or should our moving average eliminate them from our average calculation?

 

The following UPDATE statement will set January 3rd, 4th, and 10th weekend values of the stock to null. January 1st was kept as 100 since that was the starting point for generating all of the values.

 

UPDATE STOCKS

  SET CLOSEPRICE = NULL

  WHERE TRADINGDATE IN ('2015-01-03','2015-01-04','2015-01-10');

 

Re-issuing the 5-day moving average query gives some interesting results.

 

SYMBOL TRADINGDATE         ORIGINAL    MOVINGAVG DELTA

------ ------------------- ----------- --------- -------

IDUG   2015-01-01-00.00.00         100    100.00    0.00

IDUG   2015-01-02-00.00.00         101    100.50    0.50

IDUG   2015-01-03-00.00.00           -   100.50       -

IDUG   2015-01-04-00.00.00           -   100.50       -

IDUG   2015-01-05-00.00.00         104    101.67    2.33

IDUG   2015-01-06-00.00.00         105    103.33    1.67

IDUG   2015-01-07-00.00.00         106    105.00    1.00

IDUG   2015-01-08-00.00.00         107    105.50    1.50

IDUG   2015-01-09-00.00.00         108    106.00    2.00

IDUG   2015-01-10-00.00.00           -   106.50       -

 

The moving average function uses only non-null values to determine the moving average. For stock values that are null on a particular day, adding all of the non-null values and dividing by the count of non-null values will determine the average.

 

For instance, the moving average on the 10th is computed as:

 

105+106+107+108/4 = 106.50

 

There will be situations where you want the stock curve to be continuous (even though there was no trading on a holiday or weekend). In cases like this, you want to have the null closing price changed to the last valid closing price.  Rather than modifying our data, we can use the LAG function to get previous values used in a calculation.

 

The LAG function has the following syntax:

 

LAG (or LEAD)   (column, offset, default_value, 'RESPECT NULLS' | 'IGNORE NULLS')

 

  • Column This field is either the name of the column we want the value from, or an expression that includes the name of the column in it.
  • Offset The LAG function returns the value for the row that is offset "n" number of rows before the current row. A value of zero would be the current row, 1 would be the previous row, and so on. The default value for offset is 1.

 

  • Default value The default value is used in the event that a row does not exist at the offset (i.e. after the last row, or before the first row). For instance, for the first row in our table, an offset of 1 would not exist, so this default value would be used instead. The default value is null.
  • Nulls You can specify how NULLS should be handled during the processing of the values. If 'IGNORE NULLS' is specified, all rows with null values will be ignored during the calculation. If 'RESPECT NULLS' is used, then nulls will be considered.

 

The LEAD function is identical to the LAG function, but the values are taken from the following rows as opposed to the preceding ones.

 

The following SQL will create an intermediate result set that fills in the null values with the last valid ending stock price. The COALESCE function returns the first non-null result in a list. If the stock value exists, its value is returned by the function. Otherwise, we use the LAG function to find the first non-null value in the table prior to the current row.

 

WITH V1(SYMBOL, TRADINGDATE, EODVALUE) AS

   (

   SELECT SYMBOL, TRADINGDATE,

     COALESCE(

       CLOSEPRICE,

       LAG(CLOSEPRICE,1,CAST(NULL AS INT),'IGNORE NULLS')

         OVER (PARTITION BY SYMBOL

               ORDER BY TRADINGDATE

       )

     )

   FROM STOCKS

   )

SELECT * FROM V1

 

The results show that our null values are now replaced with the first non-null closing price prior to the current row.

 

SYMBOL TRADINGDATE         EODVALUE

------ ------------------- -----------

IDUG   2015-01-01-00.00.00         100

IDUG   2015-01-02-00.00.00         101

IDUG   2015-01-03-00.00.00         101

IDUG   2015-01-04-00.00.00         101

IDUG   2015-01-05-00.00.00         104

IDUG   2015-01-06-00.00.00         105

IDUG   2015-01-07-00.00.00         106

IDUG   2015-01-08-00.00.00         107

IDUG   2015-01-09-00.00.00         108

IDUG   2015-01-10-00.00.00         108

 

The portion of the SQL that includes the COALESCE function is now placed into our original moving average function.

 

WITH V1(SYMBOL, TRADINGDATE, MOVINGAVG) AS

   (

   SELECT SYMBOL, TRADINGDATE,

     AVG(        COALESCE(

         CLOSEPRICE,

         LAG(CLOSEPRICE,1,CAST(NULL AS INT),'IGNORE NULLS')

         OVER (PARTITION BY SYMBOL ORDER BY TRADINGDATE)

       )

     )      OVER (PARTITION BY SYMBOL ORDER BY TRADINGDATE

           ROWS BETWEEN 4 PRECEDING AND CURRENT ROW

          )

   FROM STOCKS

   ),

 V2(SYMBOL, TRADINGDATE, ORIGINAL, MOVINGAVG, DELTA) AS

   (

   SELECT S.SYMBOL, S.TRADINGDATE, S.CLOSEPRICE,

     DEC(V1.MOVINGAVG,5,2),DEC(S.CLOSEPRICE-V1.MOVINGAVG,5,2)

   FROM V1, STOCKS S

     WHERE V1.SYMBOL=S.SYMBOL AND

           V1.TRADINGDATE=S.TRADINGDATE

   )

 SELECT SYMBOL, TRADINGDATE, ORIGINAL, MOVINGAVG, DELTA FROM V2

 ORDER BY SYMBOL, TRADINGDATE

 

The SQL does become a bit complex to look at, but the at least the results make sense!

 

 

SYMBOL TRADINGDATE         ORIGINAL    MOVINGAVG DELTA

------ ------------------- ----------- --------- -------

IDUG   2015-01-01-00.00.00         100    100.00    0.00

IDUG   2015-01-02-00.00.00         101    100.50    0.50

IDUG   2015-01-03-00.00.00          -   100.67       -

IDUG   2015-01-04-00.00.00           -   100.75       -

IDUG   2015-01-05-00.00.00         104    101.40    2.60

IDUG   2015-01-06-00.00.00         105    102.40    2.60

IDUG   2015-01-07-00.00.00         106    103.40    2.60

IDUG   2015-01-08-00.00.00         107    104.60    2.40

IDUG   2015-01-09-00.00.00         108    106.00    2.00

IDUG   2015-01-10-00.00.00           -   106.80       -

 

The results show that the averages now use the substitute values (last valid closing price) rather than ignoring those rows in average calculations. There is probably some debate on whether these numbers are useful in a table like this. In fact, the results are misleading because the average didn't really go up over the weekend because the stock didn't trade. However, these data points are useful when you are plotting a trend graph. You get a smoother graph rather than having empty slots or large jumps in values.

 

 

Knowing how financial analysts work, they would be much happier to show you the LAG function as being the trend for the stock!

 


 

The FIRST_VALUE /LAST_VALUE Function

 

A couple of additional functions are included with the OLAP specification that you can use in your calculations. The LEAD and LAG functions deal with values "around" the current row that is being processed. The FIRST_VALUE function returns the first value in the evaluation window, while LAST_VALUE returns the last value. These are useful when you are trying to compare individual values in the table to the first or last value. Typically you would sort your table in ascending order and then FIRST_VALUE would correspond to the lowest value in the entire table.

 

We could use our stock table to figure out what the ratio of the daily closing price is compared to the original price of the stock at the beginning of the year.  Rather than worry about null values in this SQL, the example will assume we have the original stock table with valid closing prices for all dates. The following SQL will give us the ratio between the current stock price and the original price for the stock.

 

WITH V1(SYMBOL, TRADINGDATE, RATIO) AS

   (

   SELECT SYMBOL, TRADINGDATE,

     ( CLOSEPRICE -

       FIRST_VALUE(CLOSEPRICE)

       OVER (PARTITION BY SYMBOL ORDER BY TRADINGDATE)

     )  /

       FIRST_VALUE(CLOSEPRICE)

       OVER (PARTITION BY SYMBOL ORDER BY TRADINGDATE)

     * 100

  FROM STOCKS

    WHERE SYMBOL = 'IDUG'

 )

SELECT SYMBOL, TRADINGDATE, DEC(RATIO,5,2) FROM V1

ORDER BY SYMBOL, TRADINGDATE 

 

The corresponding result is shown below.

 

SYMBOL TRADINGDATE         3

------ ------------------- -------

IDUG   2015-01-01-00.00.00    0.00

IDUG   2015-01-02-00.00.00    1.00

IDUG   2015-01-03-00.00.00    2.00

IDUG   2015-01-04-00.00.00    3.00

IDUG   2015-01-05-00.00.00    4.00

IDUG   2015-01-06-00.00.00    5.00

IDUG   2015-01-07-00.00.00    6.00

IDUG   2015-01-08-00.00.00    7.00

IDUG   2015-01-09-00.00.00    8.00

IDUG   2015-01-10-00.00.00    9.00

 

There are ways to remove the duplication of the two OLAP statements. One approach is to use a WITH statement, but those clever SQL developers out there would probably just create a WITH statement that executes the following SQL to get the stock price.

 

SELECT CLOSEPRICE FROM STOCKS

ORDER BY TRADINGDATE ASC

FETCH FIRST ROW ONLY

 

You can insert this SQL into a WITH clause to calculate the same results.

 

WITH FIRSTVALUE(FIRST) AS (

  SELECT CLOSEPRICE FROM STOCKS

  ORDER BY TRADINGDATE ASC

  FETCH FIRST ROW ONLY

  )

SELECT SYMBOL, TRADINGDATE, DEC((CLOSEPRICE-FIRST)/FIRST * 100.0, 5, 2)

  FROM STOCKS, FIRSTVALUE

ORDER BY TRADINGDATE ASC

 

But while this SQL will work, it takes all of the fun of using OLAP SQL! That's one of the fascinating, yet frustrating aspects of the SQL language - there is always more than one way to get your answer. It's just figuring out if it is correct!

 

Summary

 

DB2 has a rich SQL syntax that includes a number of OLAP functions. These functions give you the ability to create moving ranges within calculations, deal with missing or null data values and insert minimum and maximum values into your math without having to use WITH statements.

 

Besides the increased computational power in SQL, these new OLAP functions can also improve response time by moving compute expensive processing into the database engine and reducing the overhead of moving non-summarized data back to the client.

 

Who knew that SQL could do more than INSERT, SELECT, DELETE, and UPDATE?

 

Sample SQL Link

 

http://www.idug.org/p/do/sd/sid=7977

 

Author

 

George Baklarz, B. Math, M. Sc, Ph.D. Eng.

 

George has spent 30 years in IBM working on various aspects of database technology. From 1987 to 1991 he worked on SQL/DS as part of the product planning department, system test team, performance team, and application development group. In 1991, Baklarz was part of the team that helped move the OS/2 ES database to Toronto to become part of the DB2 family of products. Since this time he has worked on vendor enablement, competitive analysis, product marketing, product planning, and technical sales support. He currently manages the Worldwide Core Database Technical Sales department.

Recent Stories
Some of my favorite tunings tips and tricks for DB2 by Joe Geller

SQL Tuning for Developers – Part 1

The Key To Mastering Db2 Locking On Row-Store Tables