SQL Percentiles

With the SQL OLAP RANK and DENSE_RANK functions being around for a while in SQL programming, it is quite easy to query for the TOP XX from data.  Take for example the following query that will return the TOP 5 employees out of the EMP table based on the column SALARY.  DENSE_RANK will be used due to the fact that some employees could have the same salary, and in the following example that is true.  You will see that the employees that make the top 5 salaries is actually 6 rows right now in the data as there are 2 who make the top salary of $52,750.00.

WITH X AS
  (SELECT EMPNO, SALARY,
         DENSE_RANK() OVER (ORDER BY SALARY DESC) AS RANK
    FROM EMP)
SELECT EMPNO, SALARY, RANK
FROM X
WHERE RANK < 6;  

 Result:

pic1.png

But when it comes to percentiles, how do we query and get results for say the top 5% or the bottom 10% of data? Let’s first look at some definitions.

Percentile:  A percentile is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. For example, the 20th percentile is the value below which 20% of the observations may be found. Wikipedia

Percentile Rank: The percentile rank of a score is the percentage of scores in its frequency distribution that are equal to lower than it.  For example, a test score that is greater than 75% of the scores of people taking the test is said to be at the 75th percentile, where 75 is the percentile rank. Wikipedia

Let’s take this same query above, and bring back the top 10% of employees based on their salary. You do this by:

  • Dividing their rank by the number of employees in the group (you have to know this or calculate it)
  • Multiply by 100, and then subtract that number from 100.          

For example, if there are 600 students in your grade and you are ranked 120th, then you are in the 80th percentile because (120/600)*100=20, and 100-20=80. You are then also in the top 20% of your class.

In our small table of employees (33 rows where there are 2 employees (‘000010’ and ‘000011’) that make the highest salary, let’s try to get the top 10%. But first let’s just return all the rows with their calculated percentiles.  Results show just the first 24.  

WITH TEMP_RANK AS                                             
  (SELECT EMPNO, LASTNAME, SALARY,                             
       SMALLINT(DENSE_RANK () OVER (ORDER BY SALARY ASC) ) AS RANK
     FROM EMP                                           
    )                                         
SELECT EMPNO, LASTNAME, SALARY, RANK,                         
             (RANK/33.0) * 100 AS PERCENTILE                         
FROM TEMP_RANK                                                 
ORDER BY PERCENTILE DESC                                       

pic2.png

If you do not know exactly your number within the group you are querying against, it can be calculated using an SQL Scalar Fullselect in the common table expression. In this group, we are including everyone in the table.

WITH TEMP_RANK AS                                               
   (SELECT EMPNO, LASTNAME, SALARY,                             
       SMALLINT(DENSE_RANK () OVER (ORDER BY SALARY ASC) ) AS RANK ,
       (SELECT DEC(COUNT(*),5,2)  FROM EMP) AS NUM_EMPS
     FROM EMP                                           
     )                                         
SELECT EMPNO, LASTNAME, SALARY, RANK,                         
              (RANK/NUM_EMPS) * 100 AS PERCENTILE                         
FROM TEMP_RANK                                                 
ORDER BY PERCENTILE DESC                                       

 NOTE:  When calculating the COUNT you need to cast it as a decimal because in SQL if you take a whole number and divide it by another whole number, the answer is only the whole number portion without the decimals.  We need the decimals here!            

The following query will get us the top 10% we are looking for. In order to accomplish this we need to calculate the percentile within the common table expression so we can add ‘Where’ logic to it.

WITH TEMP_RANK AS                                              
    (SELECT EMPNO, LASTNAME, SALARY,                             
        DENSE_RANK () OVER (ORDER BY SALARY ASC) AS RANK,
       (DENSE_RANK () OVER (ORDER BY SALARY ASC) / 33.0) * 100 AS PERCENTILE
     FROM EMP                                           
     )               
 SELECT EMPNO, LASTNAME, SALARY, RANK, PERCENTILE                                             
   FROM TEMP_RANK                                                 
  WHERE PERCENTILE >= 90                                         
  ORDER BY PERCENTILE DESC

 pic3.png

NOTE:  In our EMP data both employee ‘000010’ and ‘000011’ are named HAAS and have the same highest salaries.   

Again, if you need to calculate the count within your group, you add it within the common table expression:

WITH TEMP_RANK AS                                              
    (SELECT EMPNO, LASTNAME, SALARY,                             
                    DENSE_RANK () OVER (ORDER BY SALARY ASC)  AS  RANK,
                   (DENSE_RANK () OVER (ORDER BY SALARY ASC)  /
                         (SELECT DEC(COUNT(*),5,2) FROM EMP))  * 100 AS PERCENTILE
     FROM EMP                                           
     )                                         
  SELECT EMPNO, LASTNAME, SALARY, RANK,                         
          PERCENTILE                                             
   FROM TEMP_RANK                                                 
  WHERE PERCENTILE >= 90                                         
  ORDER BY PERCENTILE DESC                                       

You could also execute this logic by first ranking on ‘SALARY DESC’, and then change the WHERE logic on the column PERCENTILE.  For example the following also gets us the same employees that make up the top 10% of salaries.

WITH TEMP_RANK AS                                            
    (SELECT EMPNO, LASTNAME, SALARY,                             
       DENSE_RANK () OVER (ORDER BY SALARY DESC)  AS RANK,
       (DENSE_RANK () OVER (ORDER BY SALARY DESC)  /
                   (SELECT DEC(COUNT(*),5,2) FROM .EMP))    * 100 AS PERCENTILE 
    FROM EMP                                           
     )                                         
  SELECT EMPNO, LASTNAME, SALARY, RANK,                         
          PERCENTILE                                             
   FROM TEMP_RANK                                                 
  WHERE PERCENTILE <= 10                                         
  ORDER BY RANK                                    

pic4.png

 

Recent Stories
SQL Percentiles

Spring Batch using Db2

More on Microservice Architectures and Performance