Halloween bug and Db2 zOS protection from it.

Michael Hannan

Halloween bug and Db2 zOS protection from it.

Does anyone have any, or know where can be found, any good documentation of how IBM are protecting us from the "Halloween Problem"?  As a dinosaur I have seen this problem in an application many years ago. Is some release, IBM introduced List Prefetch into some access paths for Cusrors with FOR UPDATE OF clause to protect us against finding same row twice in an Index Scan, and Terry Purcell hinted about some hidden workfile protecting Updates when List Prefetch  access path could not be used.

I did do a Google Search but did not find anything very informative for Db2 zOS. A lot for SQL Server.

I assume Halloween can still happen for UPDATE (not WHERE CURRENT OF) with predicates on same table as a Cursor is traversing in a non materialised access path, and have potential to retrieve same row twice due to index entry moved. Right or wrong? In what circumstances does IBM take special measures to protect us, and how does it work. Obviously I understand how List Prefetch protects us by materialising the scanned index RIDs.

Can we assume that Halloween still happens only for rather faulty Application code? or is that unfair? 

Normally when designing indexes for performance, we include updated columns, when necessary without much regard for a possible Halloween problem. I am trying to find out how risky this can be.

I don't ask the trivial questions, so please good detailed Technical info is what I am after.

Another possible case of retrieving same row twice would be TS Scan, and an Updated row moved to a later partition due to update of Partition key. 

Michael Hannan,
DB2 Application Performance Specialist

 

Edited By:
Michael Hannan[Organization Members] @ May 07, 2020 - 12:31 PM (Australia/Melbourne)
Michael Hannan[Organization Members] @ May 07, 2020 - 12:34 PM (Australia/Melbourne)

Andy Smith

RE: Halloween bug and Db2 zOS protection from it.
(in response to Michael Hannan)

Hello Michael

This is not a reply to your specific question regarding documentation for the Halloween Problem, but I'm sharing this with you because it is relevant and you might want to look out for it.
We recently encountered a major performance issue following a V11 package being rebound on Db2 12 (FL500), despite no access plan change.
It involved an SQL statement with a LPF access plan for a CURSOR using FOR UPDATE OF (index column) - i.e. the Halloween Problem.
The change in runtime behaviour observed was, upon reaching the RDS limit (25% of table), RID List processing was now (after REBIND on Db2 12) failing over to a Tablespace scan, whereas before (bound at v11) it was spilling over to a workfile.
Note that we have MAXTEMPS_RID=NOLIMIT.
We used REBIND SWITCH(PREVIOUS) to restore the V11 runtime structures for the Package, which re-introduced the spill to workfile upon RDS Limit being reached.
IBM advised us that in Db2 12, overflow to workfile is no longer performed when the RDS limit is exceeded, it's now a tablespace scan.
PH22633 was opened due to issues in this area - see APAR description.
Apparently this APAR will also fix the issue for SELECT FOR UPDATE OF - it will always spill to workfile and never failover to r-scan.

Regards
Andy

Michael Hannan

RE: Halloween bug and Db2 zOS protection from it.
(in response to Andy Smith)

Andy,

Thanks indeed. I don't mind a tangent at all, when the information is something special and has some detail. I don't know why they call this the "RDS Limit" but I guess must accept the terminology. I guess they believe if Catalog Stats (hopefully not too far out from correct number of rows) say that 25% of rows in the table must be visited, then the RID sort becomes costly and pointless, and TS Scan will be quicker. Scanning pages for rows is cheaper than direct access to the rows by RID pointers, even if the RIDs are in order, and predicate evaluation is cheaper too on table instead of index, despite decompression. Its not worth hogging the RID Pool as well.

So I have to ask, is falling back to TS Scan in this context so bad? The List Prefetch was not giving you the rows in any particular order anyway. In the old days, there were other much worse reasons for falling back to TS Scan such as not enough space in RID Pool. It once annoyed me (many versions ago) that an Equals match on the partitioning key List Prefetch could fall back to TS Scan when clearly Partition scan would have been adequate.

Using up the RID Pool is the really bad situation where falling back to TS Scan, could give very bad performance. 25% of the table, not really so bad. It made a lot more sense unless Table CARDF Stat is too low.I suppose its possible the 25% hit rows could all be in a 25% subset of partitions of the table, in which case TS Scan would be much worse. IBM assumes the 25% is likely spread throughout the TS.

I am surprised the problem occurred at Function level 500. I am not surprised by change of performance without change of access path. This has happened before for me too. PLAN_TABLE does not externalize every last piece of detail. In Package Copy Management, it is import to Retain Duplicates, so that a SWITCH back to old package Copy is possible, even when no access path change occurred.

I had to SWITCH back when MX/MI performance blew out in V10 (compat with V9 mode). The reason was V10 no longer discarded really bad unpromising legs of an MX when some other leg was quite good on its own. Instead of discarding the bad leg, all the RIDs collected would overflow to workfile in V10 and keep going. Although access path was a bad one, V9 was more forgiving and performed better by discarding the worthless legs of the MX/MI path. Interesting V12 Adaptive MX/MI fixes that problem and can discard non promising legs once again, and at runtime determine which leg of MX/MI path is filtering the best, or likely to return the least RIDs, and so even rearrange the order of the MX legs at runtime (I think of it as a semi dynamic access path but with far less overhead than a Prepare  and I like it). 

Now TS Scan would avoid the Halloween bug on Update, unless the partition key is updated and DB2 forced to move the row to another partition. I don't know if it has any further hidden mechanism to prevent revisiting a row moved in this way. In old days DB2 prevented something like a Halloween bug on partition key update by locking the from and to partitions and all in between. This protected not the Updater, but some other thread from reading the row twice on TS Scan. This was devastating for concurrency so I, for one, screamed for those locks to be taken away (and they were). My argument at the time was there was no such protection from reading a row twice when index scan was happening. The index entry could have been moved by another thread, and we don't use repeatable read. 

This leads to another tangent to Halloween. Why prevent a process from seeing the same row again to update, if another thread (not the one updating and moving the index entry) could read the same row twice during an Index  Scan, and therefore SUM the financial amounts to a wrong total? There was an old myth, don't use WITH UR unfortunately labelled "Dirty Read" because the summed results could be unreliable if someone backed out an UPDATE. The summed results will be more likely unreliable if another updater succeeded (and did not back out). WITH UR does not read pages half way updated. Page Latch prevents that. WITH UR results are in no way worse than than WITH CS results in my mind. In Data Sharing , with CS is not guaranteed to see the latest copy of a page. The member may be notified to invalidate its copy of a page very shortly after the page has already been processed. Anytime we read rows, someone could update a fraction later and render our results wrong. 

Any time we read rows via an index scan we could get the same row twice, if another process moved, deleted and re-inserted the index entry (on row update) and committed before our scan reaches the the new entry. given the Updater and reader are different Threads, technically this is not Halloween. Maybe this is what scares people from putting updatable columns in indexes, yet the reality is lots of indexes contain updatable columns. Getting Index Only paths can be an obsession at times.

Michael Hannan,
DB2 Application Performance Specialist

Edited By:
Michael Hannan[Organization Members] @ May 08, 2020 - 12:19 AM (Australia/Melbourne)

Terry Purcell

RE: Halloween bug and Db2 zOS protection from it.
(in response to Andy Smith)

Hi Andy,

Just a couple of minor clarifications here - since this response prompted a few emails directly to me regarding concern about the current Db2 12 behavior.

The situation where we utilize the 25% threshold is slightly more complicated - but we already had an internal performance issue that caused us to re-evaluate the runtime behavior for RID failover in V12. The biggest exposure here was when we reached 1 RID block (4000 RIDs) and that had already exceeded the bindtime estimate of 25% of the table. Had we not already hit the 25% - index probing and RTS could have been used to confirm what percentage of the table will qualify based upon current values and not the table size at bind time.

The fix in development will use index probing + RTS to always validate the approximate filtering at execution time - rather than using bind time values (this infrastructure was already there in V12 - but clearly not used enough). And the decision to revert to rscan vs writing RIDs to WF is based upon the locking behavior - but suffice to say that an UPDATE or SELECT FOR UPDATE OF that exceeds the RDS threshold (25%) based upon RTS will continue with WF.

In V12 - hitting DM limit (which is the maximum number of RIDs in a single RID map) or RID pool thresholds will all continue to use WF unless the RDS limit is projected to be hit. This is existing V12 behavior.

Also - the optimizer is unlikely to choose list PF if it is estimated that the query will exceed an RDS or DM threshold at bind time. Regardless of whether it is a SELECT or a SELECT FOR UPDATE OF. Again - this is existing V12 behavior.

I did provide the update on behalf of Db2 development to the IBMer who was in communication with your company. But we did not validate your scenario other than my understanding of the issue that was brought directly to my attention. Since this did not come via a SF case (PMR) to IBM - I cannot confirm the exact trigger. Regardless - it is expected this fix will address the issue.

Regards

Terry Purcell



In Reply to Andy Smith:

Hello Michael

This is not a reply to your specific question regarding documentation for the Halloween Problem, but I'm sharing this with you because it is relevant and you might want to look out for it.
We recently encountered a major performance issue following a V11 package being rebound on Db2 12 (FL500), despite no access plan change.
It involved an SQL statement with a LPF access plan for a CURSOR using FOR UPDATE OF (index column) - i.e. the Halloween Problem.
The change in runtime behaviour observed was, upon reaching the RDS limit (25% of table), RID List processing was now (after REBIND on Db2 12) failing over to a Tablespace scan, whereas before (bound at v11) it was spilling over to a workfile.
Note that we have MAXTEMPS_RID=NOLIMIT.
We used REBIND SWITCH(PREVIOUS) to restore the V11 runtime structures for the Package, which re-introduced the spill to workfile upon RDS Limit being reached.
IBM advised us that in Db2 12, overflow to workfile is no longer performed when the RDS limit is exceeded, it's now a tablespace scan.
PH22633 was opened due to issues in this area - see APAR description.
Apparently this APAR will also fix the issue for SELECT FOR UPDATE OF - it will always spill to workfile and never failover to r-scan.

Regards
Andy

Terry Purcell

RE: Halloween bug and Db2 zOS protection from it.
(in response to Michael Hannan)

To Michael's original question about risk of Halloween Update.

I am not aware of any external documentation regarding how Db2 handles (or avoids) the Halloween Update problem. I had already communicated to Michael that list PF is the typical approach used by the optimizer - if the index key can be updated such that a subsequent access (within the same SQL or process) will see the new row. Please note - there is no such thing as an index-only UPDATE - there can be an index-only access path for an UPDATE - but if the row is actually UPDATEd, the data row is accessed. Hence why list PF is often effective.

The optimizer could also choose an access path that does not involve an index where the key(s) are updated and potentially seen again.

I did confirm that the workfile approach for avoiding Halloween UPDATE became mainstream in V7 when "self-referencing" UPDATE/DELETE SQL support was delivered. And was also added to access paths where list PF was not possible - and example is IN-list access (ACCESSTYPE='N') - it will use the WF deferred UPDATE according to the optimizer code.

Going back any further to see how we handled Halloween UPDATE without the deferred update support (using workfiles) is not a productive use of mine or my teams time.

The fact that we haven't done much in this space in 20 years, and the fact that in my 17+ years in Optimizer development I haven't seen any customer problems related to Halloween UPDATE where Db2 is in control (such as an UPDATE or SELECT FOR UPDATE OF) - my conclusion is that Db2 is protecting you when it knows that Halloween UPDATE exposure exists.

Application induced Halloween UPDATE is still a risk as Michael implies. I have seen very few cases ever reported to Db2. And I do recall that Bonnie Baker had some great articles on programming practices for this issue in the 1990s and early 2000s - which I did search for and failed to find a few years ago the last time a customer saw this issue themselves (and I wanted to educate them).

There may be information in V7 redbooks about the workfile usage. I have not taken the time - since this is something anyone on the list can do. I searched the Db2 code instead - which is not something most of you can do.

Regards

Terry Purcell

In Reply to Michael Hannan:

Does anyone have any, or know where can be found, any good documentation of how IBM are protecting us from the "Halloween Problem"?  As a dinosaur I have seen this problem in an application many years ago. Is some release, IBM introduced List Prefetch into some access paths for Cusrors with FOR UPDATE OF clause to protect us against finding same row twice in an Index Scan, and Terry Purcell hinted about some hidden workfile protecting Updates when List Prefetch  access path could not be used.

I did do a Google Search but did not find anything very informative for Db2 zOS. A lot for SQL Server.

I assume Halloween can still happen for UPDATE (not WHERE CURRENT OF) with predicates on same table as a Cursor is traversing in a non materialised access path, and have potential to retrieve same row twice due to index entry moved. Right or wrong? In what circumstances does IBM take special measures to protect us, and how does it work. Obviously I understand how List Prefetch protects us by materialising the scanned index RIDs.

Can we assume that Halloween still happens only for rather faulty Application code? or is that unfair? 

Normally when designing indexes for performance, we include updated columns, when necessary without much regard for a possible Halloween problem. I am trying to find out how risky this can be.

I don't ask the trivial questions, so please good detailed Technical info is what I am after.

Another possible case of retrieving same row twice would be TS Scan, and an Updated row moved to a later partition due to update of Partition key. 

Michael Hannan,
DB2 Application Performance Specialist

 

bernd oppolzer

Halloween bug and Db2 zOS protection from it.
(in response to Terry Purcell)
I recall one story where a problem similar to Halloween Update occured
during an unexpected
access path change at my customer's site.

A cursor select (with order by) which controlled the execution of a
batch program changed from sort and workfile
to index scan (maybe better statistics or reorg of the index,
whatsoever). The application occasionally changed a
status field in the key (rightmost part of a multi-column key) from F to
M. Another application from M to F.

The M to F application had no problem, but the F to M application had
... it could not tolerate finding the
M key after having processed the F key successfully :-)

The customer did not understand why the application suddenly had a
problem, after having worked correctly for years ...
nothing changed :-)

Kind regards

Bernd


Am 08.05.2020 um 18:10 schrieb Terry Purcell:
> Application induced Halloween UPDATE is still a risk as Michael
> implies. I have seen very few cases ever reported to Db2.

Michael Hannan

RE: Halloween bug and Db2 zOS protection from it.
(in response to Terry Purcell)

In Reply to Terry Purcell:

There may be information in V7 redbooks about the workfile usage. I have not taken the time - since this is something anyone on the list can do. I searched the Db2 code instead - which is not something most of you can do.

Regards

Terry Purcell

Thanks Terry, I will check the V7 Red Book. I am thinking about writing a Native Stored Proc  as practice, and at same trying to provoke a Halloween Problem by coding in a bad way with UPDATE that is NOT where Current of Cursor.

I do recognise that there is nothing we can really do to protect us from faulty application code, DB2 SQL does give Naive developers a lot of rope to do stupid things.

I may consider trying to detect code likely to potentially contain a Halloween problem, and also to consider whether I can abstain from creating new tuning indexes that could lead to a Halloween, given the faulty coding already exists in the application code. We do have to protect ourselves from causing problems in a Production System.

I do find cases of Fetch Only or ambiguous Cursors with no materialising sorts or Prefetch and an updated column occurs in the access path index (not an equals matched column) and another UPDATE statement setting that column and has WHERE predicates, probably similar to those of the Cursor.    Then whether the column is ascending or descending in the index may affect potential for Halloween, depending if the update is lesser or greater value.

Hopefully a vast majority of the UPDATEs will test for a specific value of the updated column and therefore not find it again after update occurs. None of those silly 10% Salary Rise type updates, usually used in Halloween examples.

Might have to check each one more carefully to decide if Halloween could be lurking there or not. Looks like all these Cursors with no materialise and also there is an UPDATE without WHERE CURRENT OF but on same table could be a danger. I will consider changing the Cursor access path with OPTHINT to force a materialise of some sort if it seems practical, or hint access path not to choose the index including the Updated column. At least there are tactics available. 

I can also have a list maybe, of all danger tables with such application code, where I need to be careful about  where the updated column is put in indexes. 

I have a natural tendency to put non-matched updated columns at the end of index key for least movement of an index entry but that is not avoiding Halloween even if the new index entry is put straight after the old pseudo deleted entry, I think.

I do agree with Terry, that Halloween in practice should be very rare (I have seen just once years ago) and recall being the one to diagnose Halloween as the problem.  It is just that conservative people can exist in a site that will say: don't put an updated column in an index (because of Halloween). Then we find the site has hundreds of Indexes containing updated columns already.   The most common one being EFFECTIVE_TO_TIMESTMP  or EFFECTIVE_TO_DATE. These are commonly updated when the row gets superseded by a new row.

Michael Hannan,
DB2 Application Performance Specialist

Edited By:
Michael Hannan[Organization Members] @ May 09, 2020 - 11:18 PM (Australia/Melbourne)

Michael Hannan

RE: Halloween bug and Db2 zOS protection from it.
(in response to Michael Hannan)

Based on Terry's reference to V7 redbook on Performance Topics, I did find this relating to Self Referencing Delete/Update:
"Two-step processing of the outer statement may be required for correlated subqueries. The
first step creates a work file and inserts the RID for a delete or RID and column value for an
update. After all rows of the outer statement have been processed, the second step reads the
work file. For each row in the work file, DB2 repositions on the record in the base table
pointed to by the RID, and then either updates or deletes the row as required."

Also:

"DB2 V7 allows searched UPDATE and DELETE statements to use the target tables within
subqueries in the WHERE or SET clauses. This enhancement does not support UPDATE and
DELETE WHERE CURRENT OF when the cursor definition contains a self referencing
subquery."

It's not clear to me that this has relevance to the Halloween Problem, or should I say probably doesn't have relevance in the terms stated, although the problem addressed for self referencing UPDATEs is similar to a Halloween issue.

What I can see is that workfile created to postpone the actual Update is only for the duration of the UPDATE, and the UPDATE must be completed.   Such a technique would not help in the same form when executing the UPDATE numerous times while a Cursor on the table  is traversed, especially as COMMIT could occur with  Cursor could be still Open WITH HOLD.

Michael Hannan,
DB2 Application Performance Specialist

Michael Hannan

RE: Halloween bug and Db2 zOS protection from it.
(in response to bernd oppolzer)

In Reply to bernd oppolzer:

I recall one story where a problem similar to Halloween Update occured
during an unexpected
access path change at my customer's site.

A cursor select (with order by) which controlled the execution of a
batch program changed from sort and workfile
to index scan (maybe better statistics or reorg of the index,
whatsoever). The application occasionally changed a
status field in the key (rightmost part of a multi-column key) from F to
M. Another application from M to F.

The M to F application had no problem, but the F to M application had
... it could not tolerate finding the
M key after having processed the F key successfully :-)
The customer did not understand why the application suddenly had a problem, after having worked correctly for years ...
nothing changed :-)

Bernd, That I thinkmust be a Halloween problem.

I was tempted to make a sordid  joke out of this about difficulty of changing an F to an M; should I ?  well no; decided I had better censor myself.

Michael Hannan,
DB2 Application Performance Specialist