sql rewrite

Henrik Krakowski

sql rewrite

Hi

 

Db2 v11 on z

 

I have a simple (??) sql statement,which takes around 2 cpu minutes to execute  and I am wondering,whether there is possible to rewrite that one in order to lower the cpu cost.

It is an update statement ,containg correlated subquery on the table,which contains about 40,000 rows.

There is an index on columns  IDAO,COMPONENT.

The sql is like that:

 

UPDATE  TABTEST      A        

     SET COMPONENT = :HV1      

   WHERE EXISTS (             

  SELECT  IDAO                

    FROM TABTEST              

   WHERE IDAO = A.IDAO        

     AND COMPONENT = :HV2  )   

     AND COMPONENT <> :HV3;     

 

         Some details

COMPONENT column contains only 5 different values,varying from 33,000 to 400.

There are many duplicate rows with same values on these 2 columns

The accesspath shows

 

  1. Reading sequentially outer table (PREFETCH=S) and for every hit
  2. Reading via an index with 2 matching columns the inner table
  3. The estimated  cost in DSN_STATEMNT_TABLE is about 0.6 cpu second…

 

My undesratnding of this sql    is as follows

  1.   Select all values of IDAO where COMPONENT=:HV2
  2. Read all rows with IDAO from above and remove these rows where COMPONENT<> :HV3
  3. Update remaining rows  with COMPONENT= :HV1

 

 

I assume,that I can possibly improve the outcome by having in the inner part

 

SELECT      distinct  IDAO                

    FROM TABTEST              

   WHERE IDAO = A.IDAO        

     AND COMPONENT = :HV2 

 

Any other suggestions ?

Thanks in advance

 

Henrik Krakowski

 

 

 

Walter Jani&#223;en

AW: sql rewrite
(in response to Henrik Krakowski)
Hi

Correlated Subqueries are always Stage2 and therefore an index on the table in the main-query can’t be used, but you can change it to a non-correlated IN-subquery. Then you can get matching index access on TABTEST, but probably a worse access for the subquery.

Kind regards
Walter Janißen [standard_IBM+Champ+7+Yr+Analytics]

ITERGO Informationstechnologie GmbH
Anwendungsentwicklung
Technische Anwendungsarchitektur
Victoriaplatz 2
D-40198 Düsseldorf
[login to unmask email]<mailto:[login to unmask email]>

ITERGO Informationstechnologie GmbH
Vorsitzender des Aufsichtsrats: Christian Diedrich
Geschäftsführung: Dr. Bettina Anders (Vorsitzende),
Lothar Engelke, Ina Kirchhof, Dr. Michael Regauer
Sitz: Düsseldorf, Handelsregister: Amtsgericht Düsseldorf HRB 37996

Von: Henrik Krakowski <[login to unmask email]>
Gesendet: Donnerstag, 14. Juni 2018 10:13
An: [login to unmask email]
Betreff: [DB2-L] - sql rewrite


Hi



Db2 v11 on z



I have a simple (??) sql statement,which takes around 2 cpu minutes to execute and I am wondering,whether there is possible to rewrite that one in order to lower the cpu cost.

It is an update statement ,containg correlated subquery on the table,which contains about 40,000 rows.

There is an index on columns IDAO,COMPONENT.

The sql is like that:



UPDATE TABTEST A

SET COMPONENT = :HV1

WHERE EXISTS (

SELECT IDAO

FROM TABTEST

WHERE IDAO = A.IDAO

AND COMPONENT = :HV2 )

AND COMPONENT <> :HV3;



Some details

COMPONENT column contains only 5 different values,varying from 33,000 to 400.

There are many duplicate rows with same values on these 2 columns

The accesspath shows


1. Reading sequentially outer table (PREFETCH=S) and for every hit
2. Reading via an index with 2 matching columns the inner table
3. The estimated cost in DSN_STATEMNT_TABLE is about 0.6 cpu second…



My undesratnding of this sql is as follows
1. Select all values of IDAO where COMPONENT=:HV2
2. Read all rows with IDAO from above and remove these rows where COMPONENT<> :HV3
3. Update remaining rows with COMPONENT= :HV1





I assume,that I can possibly improve the outcome by having in the inner part



SELECT distinct IDAO

FROM TABTEST

WHERE IDAO = A.IDAO

AND COMPONENT = :HV2



Any other suggestions ?

Thanks in advance



Henrik Krakowski







-----End Original Message-----
Attachments

  • image001.png (2.6k)

bernd oppolzer

sql rewrite
(in response to Henrik Krakowski)
Hi,

first of all, the outer and the inner table are the same;
with about 40000 rows. The 2 minutes seem very strange to me;
there must be something bad going on.

UPDATE  TABTEST      A
     SET COMPONENT = :HV1
   WHERE EXISTS (
  SELECT  IDAO
   FROM TABTEST
   WHERE IDAO = A.IDAO
     AND COMPONENT = :HV2  )
     AND COMPONENT <> :HV3;

What the SQL wants to do:

check all rows having the same IDAO key; if there is one having
the value :HV2 in column COMPONENT, set all these rows to :HV1,
except the ones which have :HV3.

If there is a good index on (IDAO, COMPONENT) - best -
or only on (IDAO) and the number of rows per IDAO is small - still good -
the SQL should be very fast, because the subquery is index-only.
And you told us that it is executed using matching index scan
colcount 2. The TS scan for the outer table is normal and should
terminate within sub-seconds, given the 40.000 rows.

So, given the two minutes, there must be something really bad
happening (some sort of RID pool overflow or anything like that,
can't imagine). You should get a clue by monitoring your system.

BTW: changing the select list of the inner select to DISTINCT
shouldn't change anything. You could as well write

SELECT 1 FROM TABTEST ...

in the inner query; that's what I always do when using EXISTS,
because with EXISTS, the result of the inner query doesn't matter
at all. The subquery should terminate when the first result is found,
and you get additional benefit from the correlated subquery cache,
which executes the subquery only once, given the same input values,
and caches the results.

HTH, kind regards

Bernd
freelance DB2 consultant
(call me, if you need one)



Am 14.06.2018 um 10:13 schrieb Henrik Krakowski:
>
> Hi
>
> Db2 v11 on z
>
> I have a simple (??) sql statement,which takes around 2 cpu minutes to
> execute  and I am wondering,whether there is possible to rewrite that
> one in order to lower the cpu cost.
>
> It is an update statement ,containg correlated subquery on the
> table,which contains about 40,000 rows.
>
> There is an index on columns  IDAO,COMPONENT.
>
> The sql is like that:
>
> UPDATE  TABTEST      A
>
>      SET COMPONENT = :HV1
>
>    WHERE EXISTS (
>
>   SELECT  IDAO
>
>     FROM TABTEST
>
>    WHERE IDAO = A.IDAO
>
>      AND COMPONENT = :HV2  )
>
>      AND COMPONENT <> :HV3;
>
>          Some details
>
> COMPONENT column contains only 5 different values,varying from 33,000
> to 400.
>
> There are many duplicate rows with same values on these 2 columns
>
> The accesspath shows
>
> 1. Reading sequentially outer table (PREFETCH=S) and for every hit
> 2. Reading via an index with 2 matching columns the inner table
> 3. The estimated  cost in DSN_STATEMNT_TABLE is about 0.6 cpu second…
>
> My undesratnding of this sql    is as follows
>
> 1.   Select all values of IDAO where COMPONENT=:HV2
> 2. Read all rows with IDAO from above and remove these rows where
> COMPONENT<> :HV3
> 3. Update remaining rows  with COMPONENT= :HV1
>
> I assume,that I can possibly improve the outcome by having in the
> inner part
>
> SELECT      distinct  IDAO
>
>     FROM TABTEST
>
>    WHERE IDAO = A.IDAO
>
>      AND COMPONENT = :HV2
>
> Any other suggestions ?
>
> Thanks in advance
>
> *Henrik Krakowski*
>
>
> -----End Original Message-----

Michael Hannan

RE: sql rewrite
(in response to Henrik Krakowski)

Henrik,

DB2 can probably transform the access path from subquery to a "join" (unless DB2 prevents it due to same table access- I forget if that matters) if you have an index to suit access to TABTEST (currently in the subquery) first, by columns COMPONENT, IDAO (in addition to index commencing IDAO).

Even if does not transform, this index would still perform better.

Are there any referential integrity rules affected by the Update or any Triggers? These could make the performance worse than expected. 

Would be better to see the access path (and what columns are in the indexes used), your description of it is a little vague.

Michael Hannan,
DB2 Application Performance Specialist
CPT Global Ltd

Edited By:
Michael Hannan[Organization Members] @ Jun 14, 2018 - 12:55 PM (Europe/Berlin)
Michael Hannan[Organization Members] @ Jun 14, 2018 - 12:58 PM (Europe/Berlin)
Michael Hannan[Organization Members] @ Jun 14, 2018 - 01:01 PM (Europe/Berlin)
Michael Hannan[Organization Members] @ Jun 14, 2018 - 01:02 PM (Europe/Berlin)

bernd oppolzer

sql rewrite
(in response to bernd oppolzer)
Just some more ideas:

please show us all index definitions including statistics on table TABTEST;

it is crucial that the index is defined as (IDAO, COMPONENT)
and not - for example - as (COMPONENT, IDAO), because in the latter
case it would we bad, given the large number of duplicates in the COMPONENT
column. (Ok, if MATCHCOLS is really 2, this should not be the problem,
but anyway ... the COMPONENT predicate contains a host variable,
that could make a difference).

Or: if there is a second index containing only the COMPONENT column,
which is preferred by DB2 due to bad (manipulated) statistic values. I
had this
at a customer's site. When running RUNSTATS, the problem disappeared.

Consider running RUNSTATS and then EXPLAIN again.

Kind regards

Bernd



Am 14.06.2018 um 11:13 schrieb Bernd Oppolzer:
>
> Hi,
>
> first of all, the outer and the inner table are the same;
> with about 40000 rows. The 2 minutes seem very strange to me;
> there must be something bad going on.
>
> UPDATE  TABTEST      A
>      SET COMPONENT = :HV1
>    WHERE EXISTS (
>   SELECT  IDAO
>    FROM TABTEST
>    WHERE IDAO = A.IDAO
>      AND COMPONENT = :HV2  )
>      AND COMPONENT <> :HV3;
>
> What the SQL wants to do:
>
> check all rows having the same IDAO key; if there is one having
> the value :HV2 in column COMPONENT, set all these rows to :HV1,
> except the ones which have :HV3.
>
> If there is a good index on (IDAO, COMPONENT) - best -
> or only on (IDAO) and the number of rows per IDAO is small - still good -
> the SQL should be very fast, because the subquery is index-only.
> And you told us that it is executed using matching index scan
> colcount 2. The TS scan for the outer table is normal and should
> terminate within sub-seconds, given the 40.000 rows.
>
> So, given the two minutes, there must be something really bad
> happening (some sort of RID pool overflow or anything like that,
> can't imagine). You should get a clue by monitoring your system.
>
> BTW: changing the select list of the inner select to DISTINCT
> shouldn't change anything. You could as well write
>
> SELECT 1 FROM TABTEST ...
>
> in the inner query; that's what I always do when using EXISTS,
> because with EXISTS, the result of the inner query doesn't matter
> at all. The subquery should terminate when the first result is found,
> and you get additional benefit from the correlated subquery cache,
> which executes the subquery only once, given the same input values,
> and caches the results.
>
> HTH, kind regards
>
> Bernd
> freelance DB2 consultant
> (call me, if you need one)
>
>
>
> Am 14.06.2018 um 10:13 schrieb Henrik Krakowski:
>>
>> Hi
>>
>> Db2 v11 on z
>>
>> I have a simple (??) sql statement,which takes around 2 cpu minutes
>> to execute  and I am wondering,whether there is possible to rewrite
>> that one in order to lower the cpu cost.
>>
>> It is an update statement ,containg correlated subquery on the
>> table,which contains about 40,000 rows.
>>
>> There is an index on columns  IDAO,COMPONENT.
>>
>> The sql is like that:
>>
>> UPDATE  TABTEST      A
>>
>>      SET COMPONENT = :HV1
>>
>>    WHERE EXISTS (
>>
>>   SELECT  IDAO
>>
>>     FROM TABTEST
>>
>>    WHERE IDAO = A.IDAO
>>
>>      AND COMPONENT = :HV2  )
>>
>>      AND COMPONENT <> :HV3;
>>
>>          Some details
>>
>> COMPONENT column contains only 5 different values,varying from 33,000
>> to 400.
>>
>> There are many duplicate rows with same values on these 2 columns
>>
>> The accesspath shows
>>
>> 1. Reading sequentially outer table (PREFETCH=S) and for every hit
>> 2. Reading via an index with 2 matching columns the inner table
>> 3. The estimated  cost in DSN_STATEMNT_TABLE is about 0.6 cpu second…
>>
>> My undesratnding of this sql    is as follows
>>
>> 1.   Select all values of IDAO where COMPONENT=:HV2
>> 2. Read all rows with IDAO from above and remove these rows where
>> COMPONENT<> :HV3
>> 3. Update remaining rows  with COMPONENT= :HV1
>>
>> I assume,that I can possibly improve the outcome by having in the
>> inner part
>>
>> SELECT      distinct  IDAO
>>
>>     FROM TABTEST
>>
>>    WHERE IDAO = A.IDAO
>>
>>      AND COMPONENT = :HV2
>>
>> Any other suggestions ?
>>
>> Thanks in advance
>>
>> *Henrik Krakowski*
>>
>>
>> -----End Original Message-----
>
>
> -----End Original Message-----

Sam Baugh

sql rewrite
(in response to bernd oppolzer)
How about something like

UPDATE TABTEST A

SET COMPONENT = :HV1

WHERE COMPONENT <> :HV3

AND IDAO IN (SELECT DISTINCT IDAO FROM TABTEST WHERE COMPONENT = :HV2)

On Thu, Jun 14, 2018 at 4:13 AM, Bernd Oppolzer <[login to unmask email]>
wrote:

> Hi,
>
> first of all, the outer and the inner table are the same;
> with about 40000 rows. The 2 minutes seem very strange to me;
> there must be something bad going on.
>
> UPDATE TABTEST A
> SET COMPONENT = :HV1
> WHERE EXISTS (
> SELECT IDAO
> FROM TABTEST
> WHERE IDAO = A.IDAO
> AND COMPONENT = :HV2 )
> AND COMPONENT <> :HV3;
>
> What the SQL wants to do:
>
> check all rows having the same IDAO key; if there is one having
> the value :HV2 in column COMPONENT, set all these rows to :HV1,
> except the ones which have :HV3.
>
> If there is a good index on (IDAO, COMPONENT) - best -
> or only on (IDAO) and the number of rows per IDAO is small - still good -
> the SQL should be very fast, because the subquery is index-only.
> And you told us that it is executed using matching index scan
> colcount 2. The TS scan for the outer table is normal and should
> terminate within sub-seconds, given the 40.000 rows.
>
> So, given the two minutes, there must be something really bad
> happening (some sort of RID pool overflow or anything like that,
> can't imagine). You should get a clue by monitoring your system.
>
> BTW: changing the select list of the inner select to DISTINCT
> shouldn't change anything. You could as well write
>
> SELECT 1 FROM TABTEST ...
>
> in the inner query; that's what I always do when using EXISTS,
> because with EXISTS, the result of the inner query doesn't matter
> at all. The subquery should terminate when the first result is found,
> and you get additional benefit from the correlated subquery cache,
> which executes the subquery only once, given the same input values,
> and caches the results.
>
> HTH, kind regards
>
> Bernd
> freelance DB2 consultant
> (call me, if you need one)
>
>
>
> Am 14.06.2018 um 10:13 schrieb Henrik Krakowski:
>
> Hi
>
>
>
> Db2 v11 on z
>
>
>
> I have a simple (??) sql statement,which takes around 2 cpu minutes to
> execute and I am wondering,whether there is possible to rewrite that one
> in order to lower the cpu cost.
>
> It is an update statement ,containg correlated subquery on the table,which
> contains about 40,000 rows.
>
> There is an index on columns IDAO,COMPONENT.
>
> The sql is like that:
>
>
>
> UPDATE TABTEST A
>
> SET COMPONENT = :HV1
>
> WHERE EXISTS (
>
> SELECT IDAO
>
> FROM TABTEST
>
> WHERE IDAO = A.IDAO
>
> AND COMPONENT = :HV2 )
>
> AND COMPONENT <> :HV3;
>
>
>
> Some details
>
> COMPONENT column contains only 5 different values,varying from 33,000 to
> 400.
>
> There are many duplicate rows with same values on these 2 columns
>
> The accesspath shows
>
>
>
> 1. Reading sequentially outer table (PREFETCH=S) and for every hit
> 2. Reading via an index with 2 matching columns the inner table
> 3. The estimated cost in DSN_STATEMNT_TABLE is about 0.6 cpu second…
>
>
>
> My undesratnding of this sql is as follows
>
> 1. Select all values of IDAO where COMPONENT=:HV2
> 2. Read all rows with IDAO from above and remove these rows where
> COMPONENT<> :HV3
> 3. Update remaining rows with COMPONENT= :HV1
>
>
>
>
>
> I assume,that I can possibly improve the outcome by having in the inner
> part
>
>
>
> SELECT distinct IDAO
>
> FROM TABTEST
>
> WHERE IDAO = A.IDAO
>
> AND COMPONENT = :HV2
>
>
>
> Any other suggestions ?
>
> Thanks in advance
>
>
>
> *Henrik Krakowski*
>
>
>
>
>
>
>
> -----End Original Message-----
>
>
>
> -----End Original Message-----
>

bernd oppolzer

sql rewrite
(in response to Sam Baugh)
IMO, a subquery using EXIST will perform better in general in this case.

Compare

UPDATE TABTEST A
   SET COMPONENT = :HV1
 WHERE COMPONENT <> :HV3
   AND IDAO IN
(SELECT DISTINCT IDAO
FROM TABTEST
WHERE COMPONENT = :HV2);

with

UPDATE TABTEST A
   SET COMPONENT = :HV1
 WHERE COMPONENT <> :HV3
   AND EXISTS
(SELECT 1
FROM TABTEST
         WHERE IDAO = A.IDAO
AND COMPONENT = :HV2);

The subquery in the first case will materialize,
and it will be duplicate-eliminated. At least that was
true for earlier versions of DB2.

In both cases you will have a TS scan on the outer table;
in the first case DB2 looks into the materialized subquery result for
every row;
in the second case it checks the index of TABTEST (only once for every
input value combination, because of the corr. subq. cache).
The cost of the materialization and duplicate elimination is not needed.

I don't know, if today there is a sort of index access on the outer table,
based on the materialized subquery results (sort of nested join).
This will sure be possible with SELECTs, if the subquery doesn't
produce duplicates (the subquery is then rewritten to a nested loop join),
but here we have an UPDATE statement, and the two tables are the same,
which could complicate things further.

I would try both variants, use EXPLAIN, and compare the results.
In earlier versions, it was very clear from looking at the PLAN_TABLE
that the correlated variant was better, given usable indexes on the
columns (IDAO, COMPONENT). The non-correlated variant used
a workfile and a sort, and the correlated variant did not.

Kind regards

Bernd



Am 14.06.2018 um 16:42 schrieb Sam Baugh:
> How about something like
>
> UPDATE  TABTEST      A
>
>      SET COMPONENT = :HV1
>
>    WHERE COMPONENT <> :HV3
>
>     AND IDAO IN (SELECT DISTINCT IDAO FROM TABTEST WHERE COMPONENT = :HV2)
>
>
> On Thu, Jun 14, 2018 at 4:13 AM, Bernd Oppolzer <[login to unmask email]
> <mailto:[login to unmask email]>> wrote:
>
> Hi,
>
> first of all, the outer and the inner table are the same;
> with about 40000 rows. The 2 minutes seem very strange to me;
> there must be something bad going on.
>
> UPDATE  TABTEST      A
>      SET COMPONENT = :HV1
>    WHERE EXISTS (
>   SELECT  IDAO
>    FROM TABTEST
>    WHERE IDAO = A.IDAO
>      AND COMPONENT = :HV2  )
>      AND COMPONENT <> :HV3;
>
> What the SQL wants to do:
>
> check all rows having the same IDAO key; if there is one having
> the value :HV2 in column COMPONENT, set all these rows to :HV1,
> except the ones which have :HV3.
>
> If there is a good index on (IDAO, COMPONENT) - best -
> or only on (IDAO) and the number of rows per IDAO is small - still
> good -
> the SQL should be very fast, because the subquery is index-only.
> And you told us that it is executed using matching index scan
> colcount 2. The TS scan for the outer table is normal and should
> terminate within sub-seconds, given the 40.000 rows.
>
> So, given the two minutes, there must be something really bad
> happening (some sort of RID pool overflow or anything like that,
> can't imagine). You should get a clue by monitoring your system.
>
> BTW: changing the select list of the inner select to DISTINCT
> shouldn't change anything. You could as well write
>
> SELECT 1 FROM TABTEST ...
>
> in the inner query; that's what I always do when using EXISTS,
> because with EXISTS, the result of the inner query doesn't matter
> at all. The subquery should terminate when the first result is found,
> and you get additional benefit from the correlated subquery cache,
> which executes the subquery only once, given the same input values,
> and caches the results.
>
> HTH, kind regards
>
> Bernd
> freelance DB2 consultant
> (call me, if you need one)
>
>
>
> Am 14.06.2018 um 10:13 schrieb Henrik Krakowski:
>>
>> Hi
>>
>> Db2 v11 on z
>>
>> I have a simple (??) sql statement,which takes around 2 cpu
>> minutes to execute  and I am wondering,whether there is possible
>> to rewrite that one in order to lower the cpu cost.
>>
>> It is an update statement ,containg correlated subquery on the
>> table,which contains about 40,000 rows.
>>
>> There is an index on columns  IDAO,COMPONENT.
>>
>> The sql is like that:
>>
>> UPDATE  TABTEST      A
>>
>>      SET COMPONENT = :HV1
>>
>>    WHERE EXISTS (
>>
>>   SELECT  IDAO
>>
>>     FROM TABTEST
>>
>>    WHERE IDAO = A.IDAO
>>
>>      AND COMPONENT = :HV2  )
>>
>>      AND COMPONENT <> :HV3;
>>
>>          Some details
>>
>> COMPONENT column contains only 5 different values,varying from
>> 33,000 to 400.
>>
>> There are many duplicate rows with same values on these 2 columns
>>
>> The accesspath shows
>>
>> 1. Reading sequentially outer table (PREFETCH=S) and for every hit
>> 2. Reading via an index with 2 matching columns the inner table
>> 3. The estimated  cost in DSN_STATEMNT_TABLE is about 0.6 cpu
>> second…
>>
>> My undesratnding of this sql    is as follows
>>
>> 1.   Select all values of IDAO where COMPONENT=:HV2
>> 2. Read all rows with IDAO from above and remove these rows
>> where COMPONENT<> :HV3
>> 3. Update remaining rows  with COMPONENT= :HV1
>>
>> I assume,that I can possibly improve the outcome by having in the
>> inner part
>>
>> SELECT      distinct  IDAO
>>
>>     FROM TABTEST
>>
>>    WHERE IDAO = A.IDAO
>>
>>      AND COMPONENT = :HV2
>>
>> Any other suggestions ?
>>
>> Thanks in advance
>>
>> *Henrik Krakowski*
>>
>>
>> -----End Original Message-----
>
>
> -----End Original Message-----

Michael Hannan

RE: sql rewrite
(in response to bernd oppolzer)

In Reply to bernd oppolzer:

IMO, a subquery using EXIST will perform better in general in this case.

Compare

UPDATE TABTEST A
   SET COMPONENT = :HV1
 WHERE COMPONENT <> :HV3
   AND IDAO IN
(SELECT DISTINCT IDAO
FROM TABTEST
WHERE COMPONENT = :HV2);

with

UPDATE TABTEST A
   SET COMPONENT = :HV1
 WHERE COMPONENT <> :HV3
   AND EXISTS
(SELECT 1
FROM TABTEST
         WHERE IDAO = A.IDAO
AND COMPONENT = :HV2);

The subquery in the first case will materialize,
and it will be duplicate-eliminated. At least that was
true for earlier versions of DB2.

This is actually a complex topic. DB2 can transform Non-correlated subqueries into Correlated, and can transform Correlated subqueries into Join like processes. The rules were once published in the manual, but in later versions, the rules became so complex that they were only stated in vague terms, basically that transformation is possible.

When in doubt you have to look at your access path, understand it, and decide if it is good or not.

The old style of IN subqueries, when not transformed, was to materialise the IN list with Dups removed to a work file, then for each row of outer query search the materialised work file assisted by a sparse index. I would agree with Bernd that this is likely to have very bad performance unless the IN list is small.

I almost never use IN type subqueries in my own query SQLs, unless they are so simple that I am sure DB2 will transform them into something better. Correlated IN can be even worse, if old method is repeated for each correlation value.

Non-correlated IN subqueries are unfortunately very popular but I don't recommend them because of the danger of hitting a non-transformation case, with a large IN list.

Just my opinions that may not match everyone else. I prefer to write a correlated EXISTS subquery given a suitable index or best index, or an access path where DB2 built an in memory index  or table in memory , or an access path where DB2 transformed to a join like process (in reverse direction) when a good subquery lookup index was not there.

I say generally avoid using IN subqueries. It is very rare they will be the winner with an NCOSUB access path and two Sort Indicators. You don't want to see this access path too much. I consider it a last resort for a weak SQL when transform is too difficult. With transformation by Optimizer to something else, they may perform very well.

Not sure why Sam coded DISTINCT inside his IN subquery. IN already implies removal of duplicates. In old days, DISTINCT might have disabled certain transformations. These days I would need to see the access path to be sure I was happy with it, i.e.make sure it was transformed. Otherwise I think IN is risky.

Michael Hannan,
DB2 Application Performance Specialist
CPT Global Ltd

Edited By:
Michael Hannan[Organization Members] @ Jun 14, 2018 - 09:55 PM (Europe/Berlin)
Michael Hannan[Organization Members] @ Jun 14, 2018 - 10:01 PM (Europe/Berlin)
Michael Hannan[Organization Members] @ Jun 14, 2018 - 10:02 PM (Europe/Berlin)

Daniel Luksetich

sql rewrite
(in response to Michael Hannan)
Here is my general rule of thumb:

Which technique is best when? Non-correlated subqueries are typically best when:



* The subquery result is small compared to the outer query
* The subquery result isn’t excessively large
* The outer query uses matching index access for the predicate containing the subquery
* There’s no supporting index for the subquery.

Correlated subqueries are typically best when:

* The subquery result is potentially large compared to the outer query
* The outer query has no supporting index for the predicate containing the subquery
* The predicate inside the subquery definitely has a supporting index.

Joins are typically best when:

* There’s good matching index potential for outer query and subquery
* The subquery converted to a join doesn’t introduce duplicate rows in the result.



Taken from this article:

http://www.enterprisesystemsmedia.com/it-management/db2-for-z-os-sql-performance-choices



Ultimately you run a benchmark. Here are my guidelines for that:

https://www.idug.org/p/bl/et/blogid=278 https://www.idug.org/p/bl/et/blogid=278&blogaid=344 &blogaid=344



I am right now working on some SQL that runs counts on key values in tables based upon a non-correlated subquery that contains nearly 1 million rows. Surprise! Performance is excellent. No transformation, just a regular non correlated subquery on Db2 11 for z/OS.



Select from table1 where <key-column> in

(select <key-column> from table2) <-1 million row subquery result



Go figure…



Happy Friday!



Daniel L Luksetich

DanL Database Consulting



IBM GOLD Consultant

IBM Champion for Analytics

IDUG Content Committee Past-Chairman

IDUG DB2-L Administrator

IBM Certified Database Adminstrator – DB2 11 DBA for z/OS

IBM Certified System Administrator – DB2 11 for z/OS

IBM Certified Application Developer – DB2 11 for z/OS

IBM Certified Advanced Database Administrator – DB2 10.1 for Linux UNIX and Windows



From: Michael Hannan <[login to unmask email]>
Sent: Thursday, June 14, 2018 2:52 PM
To: [login to unmask email]
Subject: [DB2-L] - RE: sql rewrite



In Reply to bernd oppolzer:

IMO, a subquery using EXIST will perform better in general in this case.

Compare

UPDATE TABTEST A
SET COMPONENT = :HV1
WHERE COMPONENT <> :HV3
AND IDAO IN
(SELECT DISTINCT IDAO
FROM TABTEST
WHERE COMPONENT = :HV2);

with

UPDATE TABTEST A
SET COMPONENT = :HV1
WHERE COMPONENT <> :HV3
AND EXISTS
(SELECT 1
FROM TABTEST
WHERE IDAO = A.IDAO
AND COMPONENT = :HV2);

The subquery in the first case will materialize,
and it will be duplicate-eliminated. At least that was
true for earlier versions of DB2.

This is actually a complex topic. DB2 can transform Non-correlated subqueries into Correlated, and can transform Correlated subqueries into Join like processes. The rules were once published in the manual, but in later versions, the rules became so complex that they were only stated in vague terms, basically that transformation is possible.

When in doubt you have to look at your access path, understand it, and decide if it is good or not.

The old style of IN subqueries, when not transformed, was to materialise the IN list with Dups removed to a work file, then for each row of outer query search the materialised work file assisted by a sparse index. I would agree with Bernd that this is likely to have very bad performance unless the IN list is small.

I almost never use IN type subqueries in my own query SQLs, unless they are so simple that I am sure DB2 will transform them into something better. Correlated IN can be even worse, if old method is repeated for each correlation value.

Non-correlated IN subqueries are unfortunately very popular but I don't recommend them because of the danger of hitting a non-transformation case, with a large IN list.

Just my opinions that may not match everyone else. I prefer to write a correlated EXISTS subquery given a suitable index or best index, or an access path where DB2 built an in memory index or table in memory , or an access path where DB2 transformed to a join like process (in reverse direction) when a good subquery lookup index was not there.

I say generally avoid using IN subqueries. It is very rare they will be the winner with an NCOSUB access path and two Sort Indicators. With transformation by Optimizer to something else, they may perform very well.

Michael Hannan,
DB2 Application Performance Specialist
CPT Global Ltd



-----End Original Message-----

Michael Hannan

RE: sql rewrite
(in response to Daniel Luksetich)

In Reply to Daniel Luksetich:

I am right now working on some SQL that runs counts on key values in tables based upon a non-correlated subquery that contains nearly 1 million rows. Surprise! Performance is excellent. No transformation, just a regular non correlated subquery on Db2 11 for z/OS.

Select from table1 where in (select from table2) 

Go figure…

I recall your rules of thumb from last time.  Not everyone uses them and understands them.

Such a simple case should be transformed, so I would like to know what disabled that, e.g. no indexes on anything. Unfortunately a very simple test case is not always representative of the general case and worse case queries with more complexity. Yes it is possible that simple IN will perform. It is also possible that complex ones perform very badly. Certain complexities can disable Optimizer assists like in memory cache sparse index, etc.

Can the average programmer be a super expert on just which situations will give O.K. performance with IN? I doubt it.

The method the Old Correlated IN subquery used, with two sort indicators indicating Descending sort and removal of duplicates to work file, followed by search the work file with 1 page sparse index assist, was known to perform very badly for large IN  lists (a lot of Getpages per value searched). DB2 V11 has tried to help. We are not limited to a single page sparse index any longer and we have a large shared MXDTCACHE which maybe searched by hashing or
 binary search algorithm. It is possible that it performs O.K. but would be  surprised if it outperforms a transformed case (given any reasonable indexing).

If IN list is sorted and becomes outer table of a join and what was outer can now be looked up with an index matched that can be good. Maybe not be possible if original outer table had to be matched on other columns. So if not transformed, IN relies on building a hash lookup or sparse index to the search list to reduce the Workfile Getpages. Will perform best if entire search list fits in the Cache.

On the other hand I see so many IN subqueries at customer sites, performing very poorly, that I cannot really recommend the technique. I know the basic workings so can see why it performs so badly too. Yes there are some exceptions, which is basically for me, only when a join or correlated subquery is unworkable, since the rules otherwise become too complex. The worst cases of IN tend to be correlated ones, then INs where subquery selects multiple columns (row value expression), NOT INS (not transformable to join), finally any IN where outer table IN predicate is stage 2, does strong filtering, and IN list is large.

In the modern day, even missing an index is not always a big deal for Optimizer. It can build an in memory sparse index for joins and correlated EXISTs subqueries if it is really needed. Huge joins can use Merge Scan given good Equals Join predicates.

Do I ever need to use IN?  Hard to remember many occasions when I simply had to. Maybe in some cases for tables with no indexing I chose it, but was very reluctant to if IN list is large. How much memory Cache can I use for my query?

My message really, is before using an IN, just because you always do, think about the access path and understand why it will perform better than a correlated EXISTS or a join. Mostly it doesn't and sometimes the performance can be really  bad. I see those a lot and wonder why the programmer used an IN. The answer was because he likes IN, most likely or easier to code. The rules for deciding otherwise were too complicated.

IN subquery is my method of last resort, when nothing else is viable.

Someday soon, I will try to publish a benchmark of the performance of some of the bad INs, compared to a rewrite performing much better.

Now I know you will argue the Correlated EXISTS could perform badly too under circumstance of not having the right index and being a case not transformable to a join. It is usually far easier to tune that by adding a good index than to ask the customer to rewrite the SQL currently using a bad IN and retest the application. 

If the programmer's standards said "Avoid Using IN subquery" with maybe a few very specific exceptions, I would be happy. I see all the ugly performing INs out there (well fortunately only some of them). The DB2 Optimizer team has helped some of the INs to perform better but not all of them.

As always "It Depends", but certain features of DB2 I like to avoid.

I recently saw a Scrollable Cursor coded at a customer site and was surprised that the Cursor performed significantly worse than a normal cursor for same SQL and access path index and table, a lot more CPU per fetch. So do I really need this Gimmick of being able to fetch previous row or one in a specified position? I could not see why the programmer needed it. Looked like they just wanted the first row. Just wanted to try it. Cost was almost 10 times higher than what was expected for a comparable normal Cursor.

Use DB2 features wisely. Some things are for when performance does not matter.

A non-correlated subquery is executed up front, normally, which is Ok for a one off query but the cost of that better be  low if you are in an OLTP situation, where you SQL could be executed a very large number of times in a loop, or just a search query will only select the first few rows of the result set, so materialising a whole large IN list could be a waste of time. Executing a correlated subquery makes the look up cost a lot more related to how many outer rows are processed, the best case being to lookup one row only, the worst case being where the correlated subquery filtering is too strong and eliminates all the rows from a very large set, making a very long outer table scan. This could perform just as badly for a non transformed IN subquery as well.

Michael Hannan,
DB2 Application Performance Specialist
CPT Global Ltd

Edited By:
Michael Hannan[Organization Members] @ Jun 16, 2018 - 01:12 PM (Europe/Berlin)
Michael Hannan[Organization Members] @ Jun 16, 2018 - 01:40 PM (Europe/Berlin)

bernd oppolzer

sql rewrite
(in response to Michael Hannan)
See below ...


Am 16.06.2018 um 13:10 schrieb Michael Hannan:
> My message really, is before using an IN, just because you always do,
> think about the access path and understand why it will perform better
> than a correlated EXISTS or a join. Mostly it doesn't and sometimes
> the performance can be really  bad. I see those a lot and wonder why
> the programmer used an IN. The answer was because he likes IN, most
> likely or easier to code. The rules for deciding otherwise were too
> complicated.
>
> IN subquery is my method of last resort, when nothing else is viable.
>
> ...
>
> If the programmer's standards said "Avoid Using IN subquery" with
> maybe a few very specific exceptions, I would be happy. I see all the
> ugly performing INs out there (well fortunately only some of them).
> The DB2 Optimizer team has helped some of the INs to perform better
> but not all of them.

I was in charge for conceiving and doing the DB2 training at a large
insurance company
in Europe for more than 20 years (1992 to 2014, in fact). And I teached
all my students
not to use IN whenever possible, but to rewrite all non-correlated IN
subqueries to joins,
whenever possible ... and not hoping for DB2 doing it. We were very
successful
users of DB2 over the whole period of time, IMO due to the quality of
the DB2 training
we applied to all our developers. From day one of the SQL training, with
every newly
introduced SQL WHERE condition in the training session, we talked about its
performance issues, to make the students sensible for these topics.

> Use DB2 features wisely. Some things are for when performance does not
> matter.
>

Amen.

> Michael Hannan,
> DB2 Application Performance Specialist
> CPT Global Ltd
>
>

Terry Purcell

RE: sql rewrite
(in response to bernd oppolzer)

Good discussion.

I sincerely appreciate the attempts to set rules about when correlated or non-correlated (or join) is most appropriate - which Dan attempted to do. Or Michael suggesting avoiding non-correlated INs, or Bernd suggesting joins rather than INs where possible. My belief is that whatever works for you - go for it.

I also accept that the "transformation" rules are too complex now - and that is why we don't document the detail (as Michael pointed out).

The V9 enhancement where we can cost a query in its correlated and non-correlated form only applies to SELECTs. And inverting a non-correlated IN to be used as the outer so that  index access can be used may not be available if it will result in a halloween update (which is possible here).

Regarding non-correlated IN and sparse index - we have done a lot of work in V11 and V12 to improve sparse index (incl improving hash support) - such that sparse index can outperform direct index access under the right circumstances. But again - I accept that prior performance issues will validly dictate your rules of thumb.

For such a small table - it's unlikely that this is access path related (thus I think the subquery discussion is not the most important issue here). As mentioned - its often what we don't see in the access path that is the issue - lock contention, or number of indexes that must be updated, or matching index access is probing all DPSI parts (unlikely here with small table), or RI etc etc.

Since the subquery "looks" fine (matchcols=2) - I would start with trying to eliminate rows earlier in the process so that less rows are locked and also less subquery accesses are required.

Here is my suggested rewrite - although I again stress that there is something else going on (which Bernd also stated).


UPDATE  TABTEST      A        

     SET COMPONENT = :HV1      

   WHERE COMPONENT NOT IN (:HV1, :HV3)

   AND (COMPONENT = :HV2

     OR EXISTS (             

                SELECT  1                

                FROM TABTEST  B           

                WHERE B.IDAO = A.IDAO        

                AND   B.COMPONENT = :HV2  )   ;

 

Regards

Terry Purcell

bernd oppolzer

sql rewrite
(in response to Terry Purcell)
Hi Terry,

thanks a lot.

In the meantime the OP told us that the EXPLAIN cost estimate and the
PLAN TABLE access plan came from test environment, but the 2 minutes
run time came from another environment, where the index on the table
on the two relevant columns (IDAO and COMPONENT) was missing,
which of course explains the behaviour.

Kind regards

Bernd


Am 18.06.2018 um 00:20 schrieb Terry Purcell:
>
> Good discussion.
>
> I sincerely appreciate the attempts to set rules about when correlated
> or non-correlated (or join) is most appropriate - which Dan attempted
> to do. Or Michael suggesting avoiding non-correlated INs, or Bernd
> suggesting joins rather than INs where possible. My belief is that
> whatever works for you - go for it.
>
> I also accept that the "transformation" rules are too complex now -
> and that is why we don't document the detail (as Michael pointed out).
>
> The V9 enhancement where we can cost a query in its correlated and
> non-correlated form only applies to SELECTs. And inverting a
> non-correlated IN to be used as the outer so that  index access can be
> used may not be available if it will result in a halloween update
> (which is possible here).
>
> Regarding non-correlated IN and sparse index - we have done a lot of
> work in V11 and V12 to improve sparse index (incl improving hash
> support) - such that sparse index can outperform direct index access
> under the right circumstances. But again - I accept that prior
> performance issues will validly dictate your rules of thumb.
>
> For such a small table - it's unlikely that this is access path
> related (thus I think the subquery discussion is not the most
> important issue here). As mentioned - its often what we don't see in
> the access path that is the issue - lock contention, or number of
> indexes that must be updated, or matching index access is probing all
> DPSI parts (unlikely here with small table), or RI etc etc.
>
> Since the subquery "looks" fine (matchcols=2) - I would start with
> trying to eliminate rows earlier in the process so that less rows are
> locked and also less subquery accesses are required.
>
> Here is my suggested rewrite - although I again stress that there is
> something else going on (which Bernd also stated).
>
>
> UPDATE  TABTEST      A
>
>      SET COMPONENT = :HV1
>
>    WHERE COMPONENT NOT IN (:HV1, :HV3)
>
>    AND (COMPONENT = :HV2
>
>      OR EXISTS (
>
>                 SELECT  1
>
>                 FROM TABTEST  B
>
>                 WHERE B.IDAO = A.IDAO
>
>                 AND   B.COMPONENT = :HV2  )   ;
>
> Regards
>
> Terry Purcell
>
>

Peter Backlund

sql rewrite maybe [OT]
(in response to Terry Purcell)
Hello,

I noticed with satisfaction that Terry is using "... exists (select 1 from ..."
That reminds me of a story from when I was much younger...

I was consulting at a customer where they had some kind of SQL pre-processor which flagged "select *"
I told them that there is no actual select when it is used within the [not] exists select.
In order to avoid problems we changed their rules into using (select 1 from...)

Later, when I was teaching a class and told them to use (select 1 from...),
one of the not so well educated students asked "Doesn't that select the first column?"

So, from then on I always use(d) "(select 0 from ... "

Best regards,

Peter, now retired, but still interested!

On 2018-06-18 00:20, Terry Purcell wrote:

Good discussion.

I sincerely appreciate the attempts to set rules about when correlated or non-correlated (or join) is most appropriate - which Dan attempted to do. Or Michael suggesting avoiding non-correlated INs, or Bernd suggesting joins rather than INs where possible. My belief is that whatever works for you - go for it.

I also accept that the "transformation" rules are too complex now - and that is why we don't document the detail (as Michael pointed out).

The V9 enhancement where we can cost a query in its correlated and non-correlated form only applies to SELECTs. And inverting a non-correlated IN to be used as the outer so that  index access can be used may not be available if it will result in a halloween update (which is possible here).

Regarding non-correlated IN and sparse index - we have done a lot of work in V11 and V12 to improve sparse index (incl improving hash support) - such that sparse index can outperform direct index access under the right circumstances. But again - I accept that prior performance issues will validly dictate your rules of thumb.

For such a small table - it's unlikely that this is access path related (thus I think the subquery discussion is not the most important issue here). As mentioned - its often what we don't see in the access path that is the issue - lock contention, or number of indexes that must be updated, or matching index access is probing all DPSI parts (unlikely here with small table), or RI etc etc.

Since the subquery "looks" fine (matchcols=2) - I would start with trying to eliminate rows earlier in the process so that less rows are locked and also less subquery accesses are required.

Here is my suggested rewrite - although I again stress that there is something else going on (which Bernd also stated).


UPDATE  TABTEST      A        

     SET COMPONENT = :HV1      

   WHERE COMPONENT NOT IN (:HV1, :HV3)

   AND (COMPONENT = :HV2

     OR EXISTS (             

                SELECT  1                

                FROM TABTEST  B           

                WHERE B.IDAO = A.IDAO        

                AND   B.COMPONENT = :HV2  )   ;

Regards
Terry Purcell
Site Links: View post online   View mailing list online   Start new thread via email   Unsubscribe from this mailing list   Manage your subscription  

This email has been sent to: [login to unmask email]
Faster data refresh is here! The long waits and babysitting of unload/load jobs is over. Contact
ESAi to learn about BCV5 & XDM. Be a hero to users with fast on-demand test/QA data provisioning.See
http://www.ESAIGroup.com/idug


Use of this email content is governed by the terms of service at:
http://www.idug.org/p/cm/ld/fid=2
--

+--------------------------------+---------------------------------+
| Peter G Backlund | E-mail: [login to unmask email] |
+--------------------------------+---------------------------------+
| Phone: 0046(0)70 - 764 7554 | Skype: BacklundDB2 |
+--------------------------------+---------------------------------+
| May - September | October - April |
+------------------------------------------------------------------+
| Råsundavägen 94 | Calle Alella 55, Apt 2B |
| S-169 57 SOLNA | 03185 Torrevieja |
| Sweden | Spain |
+--------------------------------+---------------------------------+
| Playing with DB2 since 1981 ... and forever! |
+------------------------------------------------------------------+

Michael Hannan

RE: sql rewrite maybe [OT]
(in response to Peter Backlund)



In Reply to Peter Backlund:

I was consulting at a customer where they had some kind of SQL pre-processor which flagged "select *"
I told them that there is no actual select when it is used within the [not] exists select.
In order to avoid problems we changed their rules into using (select 1 from...) 

I don't ban 'SELECT *' from my own queries. We must remember the purpose of the rule, rather than follow rules religiously.

The point was, we don't want an SQL that will fail, in the case where the number of columns in the table changes.

Here are simplified dynamic SQL examples where I use SELECT * for brevity, but don't risk the problem:

WITH PT AS
(SELECT * FROM PLAN_TABLE
 WHERE  PROGNAME LIKE ?
 )
,PT2 AS
 (SELECT P.LAST_USED, P.PCTIMESTAMP,  P.CONTOKEN
                ,specific cols from X 
    FROM PT  X
  LEFT JOIN   SYSIBM.SYSPACKAGE P
  ON   P.LOCATION = ' '
  AND P.COLLID =X.COLLID
  AND P.NAME = X.PROGNAME
  AND P.VERSION = X.VERSION
)
  SELECT  *
   FROM PT2
  ORDER BY cols
  FETCH FIRST n ROWS ONLY
  WITH UR;


etc. 

If there are columns in PLAN_TABLE CTE (Common Table Expression)  that are not used by my subsequent query, the Optimizer will drop them out automatically, never actually retrieve them (and could even get an Index Only access path). In the future when new columns are added to PLAN_TABLE, the query still works.

The final Select is limited to what columns were earlier selected in PT2  CTE. It won't change until I modify PT2 to select some more columns, however there are less parts of my complex Query to maintain.

Sometimes we have to spell out the columns we want, sometimes we can abbreviate it. Being lazy to spellout a long list of columns several times, I often use a SELECT X.* plus some other columns. Unfortunately there is no way to specify I want X.* but dropping out columns A and B. I have to let the Optimizer dropout the columns automatically that I do not reference further.

The Optimizer can drop out whole Common Table Expressions if there is no subsequent reference to them. This is handy for cases where I don't want to maintain a lot of different variations of basically a similar query, but can just comment out a certain bit, referencing an earlier CTE.  Good example is where I normally go to DSN_PREDICAT_TABLE but found on some subsystem I had no Authority to Select from the table, or a suitable index was missing.

Table Functions help make queries flexible too. There are like a View but with parameter substitution into the depths of the query, since with a View, predicate pushdown into a particular subquery may not be possible.

 


 

Michael Hannan,
DB2 Application Performance Specialist
CPT Global Ltd