Nested NOT EXISTS subqueries to anti-joins

TaiNing WANG

Nested NOT EXISTS subqueries to anti-joins
Hi,
I installed DB2 Express C 11.1 on 64-bit Ubuntu 14.04, with 32GB memory and one 1TB disk space.
Short story: DB2 doesn’t use hash (anti-) join for the following query, so it’s very slow. What can I do to make it fast?
SELECT  s_suppkey
FROM    supplier
WHERE  NOT EXISTS
        (
        SELECT  NULL
        FROM    (
                SELECT  ps_suppkey, ps_supplycost
                FROM    partsupp
                WHERE  NOT EXISTS
                        (
                        SELECT  NULL
                        FROM    part
                        WHERE  ps_partkey = p_partkey
                                AND p_name = 'cyan orchid indian cornflower saddle'
                        )
                ) AS pruned_partsupp
        WHERE  ps_suppkey = s_suppkey
                AND s_acctbal > 2500 * ps_supplycost
        )
Long story: When I run the above query on 1GB TPC-H data, it takes very long to get the result (hundreds of seconds).
The query in relational algebra is essentially: Supplier anti-join (Partsupp anti-join Part), i.e. two anti-joins.
The plan shown by db2exfmt (for both Optimization Level = 5 and 9):
---
        Total Cost:             4.15451e+06
        Query Degree:           1
                               Rows
                              RETURN
                              (   1)
                               Cost
                                I/O
                                |
                               5000
                              TBSCAN
                              (   2)
                            4.15451e+06
                               13222
                            /---+----\
                       0.711056       10000
                        FETCH    TABLE: TAINING
                        (   3)         SUPPLIER
                        6404.31        Q5
                        4596.61
         +----------------+--+--------------+
         80              4.97537e-06      800000
       IXSCAN              FETCH      TABLE: TAINING
       (   4)                      (   5)        PARTSUPP
       4238.25             20.3157          Q1
       4276.71                3
         |               /---+----\
       800000           1         200000
   INDEX: SYSIBM     IXSCAN   TABLE: TAINING
 SQL170928182919810  (   6)        PART
         Q1          13.5471        Q2
                        2
                       |
                     200000
                 INDEX: SYSIBM
---
From the plan, DB2 doesn’t use a join algorithm like hash join for (Partsupp anti-join Part).
I executed the same query on PostgreSQL 9.6.1, and the plan by PostgreSQL used hash join for both of the anti-joins. It ran in less than one second.
---
Memory parameters used in DB2:
bufferpool size = 5GB
sortheap = 250000 (1GB)
Memory parameters used in PostgreSQL:
shared_buffers = 5GB
work_mem = 1GB
---
I then tried to force hash join between Partsupp and Part using the following guideline in optimization profile in DB2.
<OPTGUIDELINES>
     <HSJOIN>
       <ACCESS TABLE='partsupp'/>
       <IXSCAN TABLE='part'/>
     </HSJOIN>
</OPTGUIDELINES>
    
Applying this optimization profile, db2exfmt showed that forcing hash join for partsupp and part was not allowed, because they were not in the same FROM clause (since anti-join was expressed as NOT EXISTS subquery):
---
Diagnostic Details:     EXP0015W  Invalid join request. Join refers to
                        tables that are not in the same FROM clause. Line
                        number "28", character number "14".
---
I thought maybe DB2 did not automatically convert NOT EXISTS to anti-join. I then tried the following optimization profile to enable the conversion:
<OPTGUIDELINES>
        <SUBQ2JOIN OPTION="ENABLE" />
</OPTGUIDELINES>
       
DB2 still did not apply this:
---
Diagnostic Details:     EXP0028W  The optimization guideline SUBQ2JOIN has
                        not been applied because semantic conditions are
                        not met or context was changed.
---
To see whether DB2 can handle query with only one anti-join, I extracted the inner NOT EXISTS subquery and issued it to DB2:
                SELECT  ps_suppkey, ps_supplycost
                FROM    partsupp
                WHERE  NOT EXISTS
                        (
                        SELECT  NULL
                        FROM    part
                        WHERE  ps_partkey = p_partkey
                                AND p_name = 'cyan orchid indian cornflower saddle'
                        );
DB2 recognized it as anti-join and used hash join in the plan. Thus, it seems the problem is only for two nested NOT EXISTS.
What do you think I might be doing wrong that causes DB2 not to use hash join for the query? Or is it the case that DB2 really cannot recognize two nested NOT EXISTS as two anti-joins? Any help would be much appreciated.
Regards,
TaiNing
Edited By:
TaiNing WANG[Organization Members] @ Nov 15, 2017 - 06:33 PM (Asia/Singapore)

TaiNing WANG

RE: Nested NOT EXISTS subqueries to anti-joins
(in response to TaiNing WANG)

I tried to use CTE as follows, but DB2 doesn't use hash join either.

WITH pruned_partsupp AS

(

        SELECT  ps_suppkey, ps_supplycost

        FROM    partsupp

        WHERE   NOT EXISTS

                (

                SELECT  NULL

                FROM    part

                WHERE   ps_partkey = p_partkey

                        AND p_name = 'cyan orchid indian cornflower saddle'

                )

)

 

SELECT  s_suppkey

FROM    supplier

WHERE   NOT EXISTS

        (

        SELECT  NULL

        FROM    pruned_partsupp

        WHERE   ps_suppkey = s_suppkey

                AND s_acctbal > 2500 * ps_supplycost

        );

Joe Geller

RE: Nested NOT EXISTS subqueries to anti-joins
(in response to TaiNing WANG)

You could rewrite the NOT EXISTS as an anti-join yourself.  That would be the simplest solution.  You also might be able to use an OPTGUIDELINE using the optimized query (which db2exfmt or Data Studio will show you). For this you would use TABID instead of TABLE in the guideline and instead of referencing the table's correlation tabid, you would use the correlation of the nested select.  That of course is more complicated and error prone, and it might still not be feasible.  That is why in these situations I usually rewrite the query as an anti-join.  Let me know if you need help with the re-write.

Joe

In Reply to TaiNing WANG:

Hi,
I installed DB2 Express C 11.1 on 64-bit Ubuntu 14.04, with 32GB memory and one 1TB disk space.
Short story: DB2 doesn’t use hash (anti-) join for the following query, so it’s very slow. What can I do to make it fast?
SELECT  s_suppkey
FROM    supplier
WHERE  NOT EXISTS
        (
        SELECT  NULL
        FROM    (
                SELECT  ps_suppkey, ps_supplycost
                FROM    partsupp
                WHERE  NOT EXISTS
                        (
                        SELECT  NULL
                        FROM    part
                        WHERE  ps_partkey = p_partkey
                                AND p_name = 'cyan orchid indian cornflower saddle'
                        )
                ) AS pruned_partsupp
        WHERE  ps_suppkey = s_suppkey
                AND s_acctbal > 2500 * ps_supplycost
        )
Long story: When I run the above query on 1GB TPC-H data, it takes very long to get the result (hundreds of seconds).
The query in relational algebra is essentially: Supplier anti-join (Partsupp anti-join Part), i.e. two anti-joins.
The plan shown by db2exfmt (for both Optimization Level = 5 and 9):
---
        Total Cost:             4.15451e+06
        Query Degree:           1
                               Rows
                              RETURN
                              (   1)
                               Cost
                                I/O
                                |
                               5000
                              TBSCAN
                              (   2)
                            4.15451e+06
                               13222
                            /---+----\
                       0.711056       10000
                        FETCH    TABLE: TAINING
                        (   3)         SUPPLIER
                        6404.31        Q5
                        4596.61
         +----------------+--+--------------+
         80              4.97537e-06      800000
       IXSCAN              FETCH      TABLE: TAINING
       (   4)                      (   5)        PARTSUPP
       4238.25             20.3157          Q1
       4276.71                3
         |               /---+----\
       800000           1         200000
   INDEX: SYSIBM     IXSCAN   TABLE: TAINING
 SQL170928182919810  (   6)        PART
         Q1          13.5471        Q2
                        2
                       |
                     200000
                 INDEX: SYSIBM
---
From the plan, DB2 doesn’t use a join algorithm like hash join for (Partsupp anti-join Part).
I executed the same query on PostgreSQL 9.6.1, and the plan by PostgreSQL used hash join for both of the anti-joins. It ran in less than one second.
---
Memory parameters used in DB2:
bufferpool size = 5GB
sortheap = 250000 (1GB)
Memory parameters used in PostgreSQL:
shared_buffers = 5GB
work_mem = 1GB
---
I then tried to force hash join between Partsupp and Part using the following guideline in optimization profile in DB2.
    
      
      
    
    
Applying this optimization profile, db2exfmt showed that forcing hash join for partsupp and part was not allowed, because they were not in the same FROM clause (since anti-join was expressed as NOT EXISTS subquery):
---
Diagnostic Details:     EXP0015W  Invalid join request. Join refers to
                        tables that are not in the same FROM clause. Line
                        number "28", character number "14".
---
I thought maybe DB2 did not automatically convert NOT EXISTS to anti-join. I then tried the following optimization profile to enable the conversion:
       
       
DB2 still did not apply this:
---
Diagnostic Details:     EXP0028W  The optimization guideline SUBQ2JOIN has
                        not been applied because semantic conditions are
                        not met or context was changed.
---
To see whether DB2 can handle query with only one anti-join, I extracted the inner NOT EXISTS subquery and issued it to DB2:
                SELECT  ps_suppkey, ps_supplycost
                FROM    partsupp
                WHERE  NOT EXISTS
                        (
                        SELECT  NULL
                        FROM    part
                        WHERE  ps_partkey = p_partkey
                                AND p_name = 'cyan orchid indian cornflower saddle'
                        );
DB2 recognized it as anti-join and used hash join in the plan. Thus, it seems the problem is only for two nested NOT EXISTS.
What do you think I might be doing wrong that causes DB2 not to use hash join for the query? Or is it the case that DB2 really cannot recognize two nested NOT EXISTS as two anti-joins? Any help would be much appreciated.
Regards,
TaiNing

Dave Nance

Nested NOT EXISTS subqueries to anti-joins
(in response to Joe Geller)
   I would suggest the SQL more along the lines of below. Also, on the part table, do you have an index on the p_name column?
SELECT  s_suppkey
FROM    supplier
WHERE  NOT EXISTS
            (
             SELECT  ps_suppkey, ps_supplycost
                FROM    partsupp
             WHERE  NOT EXISTS
                        (
                        SELECT  NULL
                        FROM    part
                        WHERE  ps_partkey = p_partkey
                                AND p_name = 'cyan orchid indian cornflower saddle'
                        )
               AND  s_acctbal > 2500 * ps_supplycost
               AND  ps_suppkey = s_suppkey
            )
 David Nance


From: Joe Geller <[login to unmask email]>
To: [login to unmask email]
Sent: Thursday, November 16, 2017 9:15 AM
Subject: [DB2-L] - RE: Nested NOT EXISTS subqueries to anti-joins

You could rewrite the NOT EXISTS as an anti-join yourself.  That would be the simplest solution.  You also might be able to use an OPTGUIDELINE using the optimized query (which db2exfmt or Data Studio will show you). For this you would use TABID instead of TABLE in the guideline and instead of referencing the table's correlation tabid, you would use the correlation of the nested select.  That of course is more complicated and error prone, and it might still not be feasible.  That is why in these situations I usually rewrite the query as an anti-join.  Let me know if you need help with the re-write.JoeIn Reply to TaiNing WANG:
Hi,I installed DB2 Express C 11.1 on 64-bit Ubuntu 14.04, with 32GB memory and one 1TB disk space.Short story: DB2 doesn’t use hash (anti-) join for the following query, so it’s very slow. What can I do to make it fast?SELECT  s_suppkeyFROM    supplierWHERE  NOT EXISTS        (        SELECT  NULL        FROM    (                SELECT  ps_suppkey, ps_supplycost                FROM    partsupp                WHERE  NOT EXISTS                        (                        SELECT  NULL                        FROM    part                        WHERE  ps_partkey = p_partkey                                AND p_name = 'cyan orchid indian cornflower saddle'                        )                ) AS pruned_partsupp        WHERE  ps_suppkey = s_suppkey                AND s_acctbal > 2500 * ps_supplycost        )Long story: When I run the above query on 1GB TPC-H data, it takes very long to get the result (hundreds of seconds).The query in relational algebra is essentially: Supplier anti-join (Partsupp anti-join Part), i.e. two anti-joins.The plan shown by db2exfmt (for both Optimization Level = 5 and 9):---        Total Cost:             4.15451e+06        Query Degree:           1                               Rows                              RETURN                              (   1)                               Cost                                I/O                                |                               5000                              TBSCAN                              (   2)                            4.15451e+06                               13222                            /---+----\                       0.711056       10000                        FETCH    TABLE: TAINING                        (   3)         SUPPLIER                        6404.31        Q5                        4596.61         +----------------+--+--------------+         80              4.97537e-06      800000       IXSCAN              FETCH      TABLE: TAINING       (   4)                      (   5)        PARTSUPP       4238.25             20.3157          Q1       4276.71                3         |               /---+----\       800000           1         200000   INDEX: SYSIBM     IXSCAN   TABLE: TAINING SQL170928182919810  (   6)        PART         Q1          13.5471        Q2                        2                       |                     200000                 INDEX: SYSIBM---From the plan, DB2 doesn’t use a join algorithm like hash join for (Partsupp anti-join Part).I executed the same query on PostgreSQL 9.6.1, and the plan by PostgreSQL used hash join for both of the anti-joins. It ran in less than one second.---Memory parameters used in DB2:bufferpool size = 5GBsortheap = 250000 (1GB)Memory parameters used in PostgreSQL:shared_buffers = 5GBwork_mem = 1GB---I then tried to force hash join between Partsupp and Part using the following guideline in optimization profile in DB2.                        Applying this optimization profile, db2exfmt showed that forcing hash join for partsupp and part was not allowed, because they were not in the same FROM clause (since anti-join was expressed as NOT EXISTS subquery):---Diagnostic Details:     EXP0015W  Invalid join request. Join refers to                        tables that are not in the same FROM clause. Line                        number "28", character number "14".---I thought maybe DB2 did not automatically convert NOT EXISTS to anti-join. I then tried the following optimization profile to enable the conversion:              DB2 still did not apply this:---Diagnostic Details:     EXP0028W  The optimization guideline SUBQ2JOIN has                        not been applied because semantic conditions are                        not met or context was changed.---To see whether DB2 can handle query with only one anti-join, I extracted the inner NOT EXISTS subquery and issued it to DB2:                SELECT  ps_suppkey, ps_supplycost                FROM    partsupp                WHERE  NOT EXISTS                        (                        SELECT  NULL                        FROM    part                        WHERE  ps_partkey = p_partkey                                AND p_name = 'cyan orchid indian cornflower saddle'                        );DB2 recognized it as anti-join and used hash join in the plan. Thus, it seems the problem is only for two nested NOT EXISTS.What do you think I might be doing wrong that causes DB2 not to use hash join for the query? Or is it the case that DB2 really cannot recognize two nested NOT EXISTS as two anti-joins? Any help would be much appreciated.Regards,TaiNing

Site Links: View post online   View mailing list online   Start new thread via email   Unsubscribe from this mailing list   Manage your subscription  

This email has been sent to: [login to unmask email] a data refresh task in less time than it takes to make a cup of coffee + save up to 90% in CPU
ESAi's BCV5 & XDM fast data refresh & Test Data Mgmt products will make you a hero to users. See
http://www.ESAIGroup.com/idug

Use of this email content is governed by the terms of service at:
http://www.idug.org/p/cm/ld/fid=2

TaiNing WANG

RE: Nested NOT EXISTS subqueries to anti-joins
(in response to Joe Geller)

Hi Joe,

Thank you for the reply! How do you rewrite it as explicit anti-join? I thought DB2 (and most if not all database systems) doesn't have an explicit anti-join operator, and NOT EXISTS is the standard syntax for anti-join?

Regards,

TaiNing

TaiNing WANG

RE: Nested NOT EXISTS subqueries to anti-joins
(in response to Dave Nance)

Hi David,

Thanks for the reply! No, I don't have an index on p_name. I tried the suggested simplification, but the plan didn't change. I think the big problem is that DB2 doesn't use a join algorithm like hash join for the query. On the other hand, Postgresql executes the query very fast using hash join, and doesn't need an index on p_name.

 

Regards,

TaiNing


In Reply to Dave Nance:

   I would suggest the SQL more along the lines of below. Also, on the part table, do you have an index on the p_name column?
SELECT  s_suppkey
FROM    supplier
WHERE  NOT EXISTS
            (
             SELECT  ps_suppkey, ps_supplycost
                FROM    partsupp
             WHERE  NOT EXISTS
                        (
                        SELECT  NULL
                        FROM    part
                        WHERE  ps_partkey = p_partkey
                                AND p_name = 'cyan orchid indian cornflower saddle'
                        )
               AND  s_acctbal > 2500 * ps_supplycost
               AND  ps_suppkey = s_suppkey
            )
 David Nance




Site Links: View post online   View mailing list online   Start new thread via email   Unsubscribe from this mailing list   Manage your subscription  

This email has been sent to: [login to unmask email] a data refresh task in less time than it takes to make a cup of coffee + save up to 90% in CPU
ESAi's BCV5 & XDM fast data refresh & Test Data Mgmt products will make you a hero to users. See
http://www.ESAIGroup.com/idug

Use of this email content is governed by the terms of service at:
http://www.idug.org/p/cm/ld/fid=2

Joe Geller

RE: Nested NOT EXISTS subqueries to anti-joins
(in response to TaiNing WANG)

Db2 does not have an explicit anti-join syntax, but there is a join that is equivalent to the NOT EXISTS.  The idea is to use a LEFT OUTER JOIN (which means take the left hand table regardless of whether or not there is a row in the right hand table) and have a WHERE clause that says, "only take the ones that do not have a row in the right hand table".  That predicate would be to compare a column of the right hand table to NULL (because if there is no row from the right, all of it's columns in the result set will be set to NULL).  Here is a general example:

SELECT A.COL1, A.COL2

  FROM TAB1 A

  WHERE NOT EXISTS (SELECT * FROM TAB2 B WHERE A.KEY1 = B.KEY1)

SELECT A.COL1, A.COL2

FROM TAB1 A LEFT OUTER JOIN TAB2 B ON A.KEY1 = B.KEY1

WHERE B.KEY1 IS NULL

If there is a matching row in B, it will fail the where clause predicate and the row will be skipped.  If there is no matching row in B, the row in A will be selected.



In Reply to TaiNing WANG:

Hi Joe,

Thank you for the reply! How do you rewrite it as explicit anti-join? I thought DB2 (and most if not all database systems) doesn't have an explicit anti-join operator, and NOT EXISTS is the standard syntax for anti-join?

Regards,

TaiNing

TaiNing WANG

RE: Nested NOT EXISTS subqueries to anti-joins
(in response to Joe Geller)

Hi Joe,

I see. It seems DB2 doesn't automatically recognize two NOT EXISTS as two anti-joins. I will play with optimization guidelines a bit more to try to force that. If that's too difficult or impossible to achieve, I will consider manually rewriting the query to OUTER JOIN followed by IS NULL selection. Thanks a lot!

Regards,

TaiNing