SQL Tricks - Part 3 Using ROW_NUMBER to Assist in Deleting Duplicates

By Tony Andrews posted Dec 01, 2020 03:22 PM

  

 This topic comes from a presentation at IDUG EMEA from Marcel Scheibe that was a part of many topics in his presentation.  I thought it was appropriate for SQL Tricks topic.  Thanks Marcel!

Often times we have all seen duplicate rows in table data that occurred thru some incorrect SQL or program logic.  In this example the timestamp column makes up part of the composite key along with the employee number.  So how do we delete the duplicates?

  • Write a program and order the data through a cursor by primary key and execute deletes ?
  • Write a single SQL statement to fix the problem? That is what this article is about.   

  

 1)  Table of data with duplicates. PK=(EMPNO, EMPNO_TSP):

     

   

    As you can see, some rows have duplicates and some do not.  Because of this we cannot just zero in on the max timestamp for each. 

    What if there is more than one duplicate for some? 

    What if some did not have duplicates? 

     If there was only one duplicate, and EVERY row had a duplicate, then the following could take place to find for each EMPNO the maximum       
     timestamp value to delete. If this query ran, it would delete a row for every EMPNO in the table (5 rows).
   

DELETE 
FROM EMP2 
WHERE (EMPNO, EMPNO_TSP) IN  
        (SELECT E.EMPNO, E.EMPNO_TSP                
         FROM EMP2 E                                
         WHERE E.EMPNO_TSP =                        
                 (SELECT MAX(E2.EMPNO_TSP)          
                  FROM EMP2 E2                      
                  WHERE E2.EMPNO = E.EMPNO          
                  )                                  
        )                                        

 

 2)  Assign Row_Number

       Using the Row_Number ranking function along with ‘Partition By’ we can zero in on the ones that need to be deleted.
     

SELECT * FROM
   (SELECT EMPNO, EMPNO_TSP, ROW_NUMBER () 
       OVER (PARTITION BY EMPNO ORDER BY EMPNO_TSP) AS ROW_NBR 
    FROM EMP2 ORDER BY EMPNO) 
WHERE ROW_NBR > 1


 

  Now how do we use this for a delete statement?  Here is the query to delete where the row number > 1  and also to use the OLD TABLE so we see 
   what was deleted.

SELECT EMPNO, EMPNO_TSP                                        
FROM OLD TABLE                                                 
(  DELETE FROM EMP2                                            
   WHERE (EMPNO, EMPNO_TSP) IN                                 
     (SELECT EMPNO, EMPNO_TSP                                  
      FROM  (SELECT EMPNO, EMPNO_TSP, ROW_NUMBER ()            
             OVER (PARTITION BY EMPNO ORDER BY EMPNO_TSP)      
                                                    AS ROW_NBR 
             FROM EMP2                                         
             ORDER BY EMPNO                                    
            )                                                  
      WHERE ROW_NBR > 1                                        
     )                                                         
 )           

  Output shows the 3 rows that were deleted.  If there were more duplicates for an EMPNO, they also would get deleted since their Row_Number 
    would be greater than 1.

1 comment
1228 views

Permalink

Comments

Dec 02, 2020 04:06 AM

Another technique that might work is to use the RID function

DELETE FROM EMP2 E1
WHERE EXISTS
(SELECT 1
FROM EMP2 E2
WHERE E1.EMPNO = E2.EMPNO
AND RID(E1) > RID(E2)
)

This might also work if the rows are complete duplicates - that is, not even a time stamp to distinguish them.