v11 on z/OS: Transient data anomaly

Philip Sevetson

v11 on z/OS: Transient data anomaly
I'm wondering whether WITH UR can possibly read committed data incorrectly, probably involving an index-only query, definitely involving a very-recently-committed unit of work.

We had a count come out wrong in DB2 this morning. What happened:

1) COBOL pgm runs many INSERTs to table A. Ends with RC=0.

2) Four seconds later, DSNTEP2 runs a count of certain rows (last 365 days by a date column) against table A (WITH UR). Count = x.

3) X does not have the expected value, and a program checks the result and flags it to our attention.

4) Seven hours later, we rerun the count against Table A. Count = y.

5) Y _is_ the expected value.

6) There are no other programs which are scheduled or built to insert rows to Table A, and no other scheduled jobs which run this program. We're pretty sure that no updates took place in A during the interval between the first and second counts.

What we have appears to be a case of a SQL anomaly - the same SQL COUNT statement giving us an apparently wrong result four seconds after a mass-insert program, and then giving us the expected result several hours later.

The combination of the change in result, the close timing of the COMMIT and the COUNT, and the use of WITH UR leave me wondering if the count error is a feature of UR. Does UR permit some kind of reading which might ignore non-externalized index pages?

I'm opening a PMR with IBM on this as soon as my sysprogs get to their desks and re-enable my IBM account, so it isn't critical that this be answered here. If someone knows, of course, we're interested. If we need to do something like _not_ running our control report queries WITH UR, or if this is actually not a WAD/Feature, we need to change our code accordingly.

Comments are welcome.

--Thanks,

Phil Sevetson
**This e-mail, including any attachments, may be confidential, privileged, or otherwise legally protected. It is intended only for the addressee. If you received this e-mail in error or from someone who was not authorized to send it to you, do not disseminate, copy, or otherwise use this e-mail or its attachments. Please notify the sender immediately by reply e-mail and delete the e-mail from your system.**

Walter Janißen

AW: v11 on z/OS: Transient data anomaly
(in response to Philip Sevetson)
Hi Phil

What is your setting of SKIPUNCI?
Has the table in question row level locking?
What is the package's value of IMMEDWTITE?

I think in datasharing, if you don't bind with IMMEDWRITE YES a statement running on a different member than the inserter might well don't see entries.

Kind regards
Walter Janißen

ITERGO Informationstechnologie GmbH
Anwendungsentwicklung
Technische Anwendungsarchitektur
Victoriaplatz 2
D-40198 Düsseldorf
Tel.: +49(0)211/477-2928
Fax: +49(0)211/477-6441
[login to unmask email]<mailto:[login to unmask email]>

ITERGO Informationstechnologie GmbH
Vorsitzender des Aufsichtsrats: Christian Diedrich
Geschäftsführung: Dr. Bettina Anders (Vorsitzende),
Lothar Engelke, Ina Kirchhof, Dr. Michael Regauer
Sitz: Düsseldorf, Handelsregister: Amtsgericht Düsseldorf HRB 37996

Von: Sevetson, Phil <[login to unmask email]>
Gesendet: Freitag, 6. Juli 2018 14:37
An: [login to unmask email]
Betreff: [DB2-L] - v11 on z/OS: Transient data anomaly

I'm wondering whether WITH UR can possibly read committed data incorrectly, probably involving an index-only query, definitely involving a very-recently-committed unit of work.

We had a count come out wrong in DB2 this morning. What happened:

1) COBOL pgm runs many INSERTs to table A. Ends with RC=0.

2) Four seconds later, DSNTEP2 runs a count of certain rows (last 365 days by a date column) against table A (WITH UR). Count = x.

3) X does not have the expected value, and a program checks the result and flags it to our attention.

4) Seven hours later, we rerun the count against Table A. Count = y.

5) Y _is_ the expected value.

6) There are no other programs which are scheduled or built to insert rows to Table A, and no other scheduled jobs which run this program. We're pretty sure that no updates took place in A during the interval between the first and second counts.

What we have appears to be a case of a SQL anomaly - the same SQL COUNT statement giving us an apparently wrong result four seconds after a mass-insert program, and then giving us the expected result several hours later.

The combination of the change in result, the close timing of the COMMIT and the COUNT, and the use of WITH UR leave me wondering if the count error is a feature of UR. Does UR permit some kind of reading which might ignore non-externalized index pages?

I'm opening a PMR with IBM on this as soon as my sysprogs get to their desks and re-enable my IBM account, so it isn't critical that this be answered here. If someone knows, of course, we're interested. If we need to do something like _not_ running our control report queries WITH UR, or if this is actually not a WAD/Feature, we need to change our code accordingly.

Comments are welcome.

--Thanks,

Phil Sevetson
**This e-mail, including any attachments, may be confidential, privileged, or otherwise legally protected. It is intended only for the addressee. If you received this e-mail in error or from someone who was not authorized to send it to you, do not disseminate, copy, or otherwise use this e-mail or its attachments. Please notify the sender immediately by reply e-mail and delete the e-mail from your system.**
-----End Original Message-----

Larry Jardine

v11 on z/OS: Transient data anomaly
(in response to Walter Janißen)
Regardless of the reasons why the count didn't match, if your program depends on a precise value, then you should not be coding "with ur".

Larry Jardine
Database Advisor
Aetna

From: Walter Janißen [mailto:[login to unmask email]
Sent: Friday, July 06, 2018 9:23 AM
To: [login to unmask email]
Subject: [EXTERNAL] [DB2-L] - AW: v11 on z/OS: Transient data anomaly

**** External Email - Use Caution ****
Hi Phil

What is your setting of SKIPUNCI?
Has the table in question row level locking?
What is the package's value of IMMEDWTITE?

I think in datasharing, if you don't bind with IMMEDWRITE YES a statement running on a different member than the inserter might well don't see entries.

Kind regards
Walter Janißen

ITERGO Informationstechnologie GmbH
Anwendungsentwicklung
Technische Anwendungsarchitektur
Victoriaplatz 2
D-40198 Düsseldorf
Tel.: +49(0)211/477-2928
Fax: +49(0)211/477-6441
[login to unmask email]<mailto:[login to unmask email]>

ITERGO Informationstechnologie GmbH
Vorsitzender des Aufsichtsrats: Christian Diedrich
Geschäftsführung: Dr. Bettina Anders (Vorsitzende),
Lothar Engelke, Ina Kirchhof, Dr. Michael Regauer
Sitz: Düsseldorf, Handelsregister: Amtsgericht Düsseldorf HRB 37996

Von: Sevetson, Phil <[login to unmask email]>
Gesendet: Freitag, 6. Juli 2018 14:37
An: [login to unmask email]
Betreff: [DB2-L] - v11 on z/OS: Transient data anomaly

I'm wondering whether WITH UR can possibly read committed data incorrectly, probably involving an index-only query, definitely involving a very-recently-committed unit of work.

We had a count come out wrong in DB2 this morning. What happened:

1) COBOL pgm runs many INSERTs to table A. Ends with RC=0.

2) Four seconds later, DSNTEP2 runs a count of certain rows (last 365 days by a date column) against table A (WITH UR). Count = x.

3) X does not have the expected value, and a program checks the result and flags it to our attention.

4) Seven hours later, we rerun the count against Table A. Count = y.

5) Y _is_ the expected value.

6) There are no other programs which are scheduled or built to insert rows to Table A, and no other scheduled jobs which run this program. We're pretty sure that no updates took place in A during the interval between the first and second counts.

What we have appears to be a case of a SQL anomaly - the same SQL COUNT statement giving us an apparently wrong result four seconds after a mass-insert program, and then giving us the expected result several hours later.

The combination of the change in result, the close timing of the COMMIT and the COUNT, and the use of WITH UR leave me wondering if the count error is a feature of UR. Does UR permit some kind of reading which might ignore non-externalized index pages?

I'm opening a PMR with IBM on this as soon as my sysprogs get to their desks and re-enable my IBM account, so it isn't critical that this be answered here. If someone knows, of course, we're interested. If we need to do something like _not_ running our control report queries WITH UR, or if this is actually not a WAD/Feature, we need to change our code accordingly.

Comments are welcome.

--Thanks,

Phil Sevetson
**This e-mail, including any attachments, may be confidential, privileged, or otherwise legally protected. It is intended only for the addressee. If you received this e-mail in error or from someone who was not authorized to send it to you, do not disseminate, copy, or otherwise use this e-mail or its attachments. Please notify the sender immediately by reply e-mail and delete the e-mail from your system.**
-----End Original Message-----

-----End Original Message-----

This e-mail may contain confidential or privileged information. If you think you have received this e-mail in error, please advise the sender by reply e-mail and then delete this e-mail immediately. Thank you.

This e-mail may contain confidential or privileged information. If you think you have received this e-mail in error, please advise the sender by reply e-mail and then delete this e-mail immediately. Thank you. Aetna

Philip Sevetson

v11 on z/OS: Transient data anomaly
(in response to Walter Janißen)
System SKIPUNCI=NO
Tablespace LOCKSIZE=ANY
Package IMMEDWRITE NO
This system is not built with datasharing.


From: Walter Janißen [mailto:[login to unmask email]
Sent: Friday, July 06, 2018 9:23 AM
To: [login to unmask email]
Subject: [DB2-L] - AW: v11 on z/OS: Transient data anomaly

Hi Phil

What is your setting of SKIPUNCI?
Has the table in question row level locking?
What is the package's value of IMMEDWTITE?

I think in datasharing, if you don't bind with IMMEDWRITE YES a statement running on a different member than the inserter might well don't see entries.

Kind regards
Walter Janißen

ITERGO Informationstechnologie GmbH
Anwendungsentwicklung
Technische Anwendungsarchitektur
Victoriaplatz 2
D-40198 Düsseldorf
Tel.: +49(0)211/477-2928
Fax: +49(0)211/477-6441
[login to unmask email]<mailto:[login to unmask email]>

ITERGO Informationstechnologie GmbH
Vorsitzender des Aufsichtsrats: Christian Diedrich
Geschäftsführung: Dr. Bettina Anders (Vorsitzende),
Lothar Engelke, Ina Kirchhof, Dr. Michael Regauer
Sitz: Düsseldorf, Handelsregister: Amtsgericht Düsseldorf HRB 37996

Von: Sevetson, Phil <[login to unmask email]>
Gesendet: Freitag, 6. Juli 2018 14:37
An: [login to unmask email]
Betreff: [DB2-L] - v11 on z/OS: Transient data anomaly

I'm wondering whether WITH UR can possibly read committed data incorrectly, probably involving an index-only query, definitely involving a very-recently-committed unit of work.

We had a count come out wrong in DB2 this morning. What happened:

1) COBOL pgm runs many INSERTs to table A. Ends with RC=0.

2) Four seconds later, DSNTEP2 runs a count of certain rows (last 365 days by a date column) against table A (WITH UR). Count = x.

3) X does not have the expected value, and a program checks the result and flags it to our attention.

4) Seven hours later, we rerun the count against Table A. Count = y.

5) Y _is_ the expected value.

6) There are no other programs which are scheduled or built to insert rows to Table A, and no other scheduled jobs which run this program. We're pretty sure that no updates took place in A during the interval between the first and second counts.

What we have appears to be a case of a SQL anomaly - the same SQL COUNT statement giving us an apparently wrong result four seconds after a mass-insert program, and then giving us the expected result several hours later.

The combination of the change in result, the close timing of the COMMIT and the COUNT, and the use of WITH UR leave me wondering if the count error is a feature of UR. Does UR permit some kind of reading which might ignore non-externalized index pages?

I'm opening a PMR with IBM on this as soon as my sysprogs get to their desks and re-enable my IBM account, so it isn't critical that this be answered here. If someone knows, of course, we're interested. If we need to do something like _not_ running our control report queries WITH UR, or if this is actually not a WAD/Feature, we need to change our code accordingly.

Comments are welcome.

--Thanks,

Phil Sevetson
**This e-mail, including any attachments, may be confidential, privileged, or otherwise legally protected. It is intended only for the addressee. If you received this e-mail in error or from someone who was not authorized to send it to you, do not disseminate, copy, or otherwise use this e-mail or its attachments. Please notify the sender immediately by reply e-mail and delete the e-mail from your system.**
-----End Original Message-----

-----End Original Message-----
**This e-mail, including any attachments, may be confidential, privileged, or otherwise legally protected. It is intended only for the addressee. If you received this e-mail in error or from someone who was not authorized to send it to you, do not disseminate, copy, or otherwise use this e-mail or its attachments. Please notify the sender immediately by reply e-mail and delete the e-mail from your system.**

Philip Sevetson

v11 on z/OS: Transient data anomaly
(in response to Larry Jardine)
Apparently not. We weren't aware that a table with *no* uncommitted work would be unstable under isolation UR - I thought that UR read from the bufferpool manager like everyone else, and the only possible explanation I can see for this is that it's reading obsolete pages _after_ their changed state is present in the bufferpool, let alone externalized.

--Phil

From: Jardine, Lawrence J [mailto:[login to unmask email]
Sent: Friday, July 06, 2018 9:29 AM
To: [login to unmask email]
Subject: [DB2-L] - RE: v11 on z/OS: Transient data anomaly

Regardless of the reasons why the count didn't match, if your program depends on a precise value, then you should not be coding "with ur".

Larry Jardine
Database Advisor
Aetna

From: Walter Janißen [mailto:[login to unmask email]
Sent: Friday, July 06, 2018 9:23 AM
To: [login to unmask email]
Subject: [EXTERNAL] [DB2-L] - AW: v11 on z/OS: Transient data anomaly

**** External Email - Use Caution ****
Hi Phil

What is your setting of SKIPUNCI?
Has the table in question row level locking?
What is the package's value of IMMEDWTITE?

I think in datasharing, if you don't bind with IMMEDWRITE YES a statement running on a different member than the inserter might well don't see entries.

Kind regards
Walter Janißen

ITERGO Informationstechnologie GmbH
Anwendungsentwicklung
Technische Anwendungsarchitektur
Victoriaplatz 2
D-40198 Düsseldorf
Tel.: +49(0)211/477-2928
Fax: +49(0)211/477-6441
[login to unmask email]<mailto:[login to unmask email]>

ITERGO Informationstechnologie GmbH
Vorsitzender des Aufsichtsrats: Christian Diedrich
Geschäftsführung: Dr. Bettina Anders (Vorsitzende),
Lothar Engelke, Ina Kirchhof, Dr. Michael Regauer
Sitz: Düsseldorf, Handelsregister: Amtsgericht Düsseldorf HRB 37996

Von: Sevetson, Phil <[login to unmask email]>
Gesendet: Freitag, 6. Juli 2018 14:37
An: [login to unmask email]
Betreff: [DB2-L] - v11 on z/OS: Transient data anomaly

I'm wondering whether WITH UR can possibly read committed data incorrectly, probably involving an index-only query, definitely involving a very-recently-committed unit of work.

We had a count come out wrong in DB2 this morning. What happened:

1) COBOL pgm runs many INSERTs to table A. Ends with RC=0.

2) Four seconds later, DSNTEP2 runs a count of certain rows (last 365 days by a date column) against table A (WITH UR). Count = x.

3) X does not have the expected value, and a program checks the result and flags it to our attention.

4) Seven hours later, we rerun the count against Table A. Count = y.

5) Y _is_ the expected value.

6) There are no other programs which are scheduled or built to insert rows to Table A, and no other scheduled jobs which run this program. We're pretty sure that no updates took place in A during the interval between the first and second counts.

What we have appears to be a case of a SQL anomaly - the same SQL COUNT statement giving us an apparently wrong result four seconds after a mass-insert program, and then giving us the expected result several hours later.

The combination of the change in result, the close timing of the COMMIT and the COUNT, and the use of WITH UR leave me wondering if the count error is a feature of UR. Does UR permit some kind of reading which might ignore non-externalized index pages?

I'm opening a PMR with IBM on this as soon as my sysprogs get to their desks and re-enable my IBM account, so it isn't critical that this be answered here. If someone knows, of course, we're interested. If we need to do something like _not_ running our control report queries WITH UR, or if this is actually not a WAD/Feature, we need to change our code accordingly.

Comments are welcome.

--Thanks,

Phil Sevetson
**This e-mail, including any attachments, may be confidential, privileged, or otherwise legally protected. It is intended only for the addressee. If you received this e-mail in error or from someone who was not authorized to send it to you, do not disseminate, copy, or otherwise use this e-mail or its attachments. Please notify the sender immediately by reply e-mail and delete the e-mail from your system.**
-----End Original Message-----

-----End Original Message-----
This e-mail may contain confidential or privileged information. If you think you have received this e-mail in error, please advise the sender by reply e-mail and then delete this e-mail immediately. Thank you.
This e-mail may contain confidential or privileged information. If you think you have received this e-mail in error, please advise the sender by reply e-mail and then delete this e-mail immediately. Thank you. Aetna

-----End Original Message-----
**This e-mail, including any attachments, may be confidential, privileged, or otherwise legally protected. It is intended only for the addressee. If you received this e-mail in error or from someone who was not authorized to send it to you, do not disseminate, copy, or otherwise use this e-mail or its attachments. Please notify the sender immediately by reply e-mail and delete the e-mail from your system.**

Walter Jani&#223;en

AW: v11 on z/OS: Transient data anomaly
(in response to Philip Sevetson)
Hi

If I understand you right: There is a program which inserts lots of rows. 4 seconds after this program successfully ended a count(*) WITH UR was executed giving the wrong number. No datasharing. Then I don't have any idea. Either the data is in the bufferpool or externalized to disk. In either case COUNT(*) WITH UR should give the correct number.

Kind regards
Walter Janißen [standard_IBM+Champ+7+Yr+Analytics]

ITERGO Informationstechnologie GmbH
Anwendungsentwicklung
Technische Anwendungsarchitektur
Victoriaplatz 2
D-40198 Düsseldorf
[login to unmask email]<mailto:[login to unmask email]>

ITERGO Informationstechnologie GmbH
Vorsitzender des Aufsichtsrats: Christian Diedrich
Geschäftsführung: Dr. Bettina Anders (Vorsitzende),
Lothar Engelke, Ina Kirchhof, Dr. Michael Regauer
Sitz: Düsseldorf, Handelsregister: Amtsgericht Düsseldorf HRB 37996

Von: Sevetson, Phil <[login to unmask email]>
Gesendet: Freitag, 6. Juli 2018 18:14
An: '[login to unmask email]'
Betreff: [DB2-L] - RE: v11 on z/OS: Transient data anomaly

Apparently not. We weren't aware that a table with *no* uncommitted work would be unstable under isolation UR - I thought that UR read from the bufferpool manager like everyone else, and the only possible explanation I can see for this is that it's reading obsolete pages _after_ their changed state is present in the bufferpool, let alone externalized.

--Phil

From: Jardine, Lawrence J [mailto:[login to unmask email]
Sent: Friday, July 06, 2018 9:29 AM
To: [login to unmask email]<mailto:[login to unmask email]>
Subject: [DB2-L] - RE: v11 on z/OS: Transient data anomaly

Regardless of the reasons why the count didn't match, if your program depends on a precise value, then you should not be coding "with ur".

Larry Jardine
Database Advisor
Aetna

From: Walter Janißen [mailto:[login to unmask email]
Sent: Friday, July 06, 2018 9:23 AM
To: [login to unmask email]<mailto:[login to unmask email]>
Subject: [EXTERNAL] [DB2-L] - AW: v11 on z/OS: Transient data anomaly

**** External Email - Use Caution ****
Hi Phil

What is your setting of SKIPUNCI?
Has the table in question row level locking?
What is the package's value of IMMEDWTITE?

I think in datasharing, if you don't bind with IMMEDWRITE YES a statement running on a different member than the inserter might well don't see entries.

Kind regards
Walter Janißen

ITERGO Informationstechnologie GmbH
Anwendungsentwicklung
Technische Anwendungsarchitektur
Victoriaplatz 2
D-40198 Düsseldorf
Tel.: +49(0)211/477-2928
Fax: +49(0)211/477-6441
[login to unmask email]<mailto:[login to unmask email]>

ITERGO Informationstechnologie GmbH
Vorsitzender des Aufsichtsrats: Christian Diedrich
Geschäftsführung: Dr. Bettina Anders (Vorsitzende),
Lothar Engelke, Ina Kirchhof, Dr. Michael Regauer
Sitz: Düsseldorf, Handelsregister: Amtsgericht Düsseldorf HRB 37996

Von: Sevetson, Phil <[login to unmask email]<mailto:[login to unmask email]>>
Gesendet: Freitag, 6. Juli 2018 14:37
An: [login to unmask email]<mailto:[login to unmask email]>
Betreff: [DB2-L] - v11 on z/OS: Transient data anomaly

I'm wondering whether WITH UR can possibly read committed data incorrectly, probably involving an index-only query, definitely involving a very-recently-committed unit of work.

We had a count come out wrong in DB2 this morning. What happened:

1) COBOL pgm runs many INSERTs to table A. Ends with RC=0.

2) Four seconds later, DSNTEP2 runs a count of certain rows (last 365 days by a date column) against table A (WITH UR). Count = x.

3) X does not have the expected value, and a program checks the result and flags it to our attention.

4) Seven hours later, we rerun the count against Table A. Count = y.

5) Y _is_ the expected value.

6) There are no other programs which are scheduled or built to insert rows to Table A, and no other scheduled jobs which run this program. We're pretty sure that no updates took place in A during the interval between the first and second counts.

What we have appears to be a case of a SQL anomaly - the same SQL COUNT statement giving us an apparently wrong result four seconds after a mass-insert program, and then giving us the expected result several hours later.

The combination of the change in result, the close timing of the COMMIT and the COUNT, and the use of WITH UR leave me wondering if the count error is a feature of UR. Does UR permit some kind of reading which might ignore non-externalized index pages?

I'm opening a PMR with IBM on this as soon as my sysprogs get to their desks and re-enable my IBM account, so it isn't critical that this be answered here. If someone knows, of course, we're interested. If we need to do something like _not_ running our control report queries WITH UR, or if this is actually not a WAD/Feature, we need to change our code accordingly.

Comments are welcome.

--Thanks,

Phil Sevetson
**This e-mail, including any attachments, may be confidential, privileged, or otherwise legally protected. It is intended only for the addressee. If you received this e-mail in error or from someone who was not authorized to send it to you, do not disseminate, copy, or otherwise use this e-mail or its attachments. Please notify the sender immediately by reply e-mail and delete the e-mail from your system.**
-----End Original Message-----

-----End Original Message-----
This e-mail may contain confidential or privileged information. If you think you have received this e-mail in error, please advise the sender by reply e-mail and then delete this e-mail immediately. Thank you.
This e-mail may contain confidential or privileged information. If you think you have received this e-mail in error, please advise the sender by reply e-mail and then delete this e-mail immediately. Thank you. Aetna

-----End Original Message-----
**This e-mail, including any attachments, may be confidential, privileged, or otherwise legally protected. It is intended only for the addressee. If you received this e-mail in error or from someone who was not authorized to send it to you, do not disseminate, copy, or otherwise use this e-mail or its attachments. Please notify the sender immediately by reply e-mail and delete the e-mail from your system.**
-----End Original Message-----
Attachments

  • image001.png (2.6k)

Daniel Luksetich

v11 on z/OS: Transient data anomaly
(in response to Philip Sevetson)
Phil,

Is this data sharing?

Dan



Daniel L Luksetich

DanL Database Consulting



IBM GOLD Consultant

IBM Champion for Analytics

IDUG Content Committee Past-Chairman

IDUG DB2-L Administrator

IBM Certified Database Adminstrator – DB2 11 DBA for z/OS

IBM Certified System Administrator – DB2 11 for z/OS

IBM Certified Application Developer – DB2 11 for z/OS

IBM Certified Advanced Database Administrator – DB2 10.1 for Linux UNIX and
Windows



From: Sevetson, Phil <[login to unmask email]>
Sent: Friday, July 6, 2018 11:14 AM
To: '[login to unmask email]' <[login to unmask email]>
Subject: [DB2-L] - RE: v11 on z/OS: Transient data anomaly



Apparently not. We weren’t aware that a table with *no* uncommitted work
would be unstable under isolation UR – I thought that UR read from the
bufferpool manager like everyone else, and the only possible explanation I
can see for this is that it’s reading obsolete pages _after_ their changed
state is present in the bufferpool, let alone externalized.



--Phil



From: Jardine, Lawrence J [mailto:[login to unmask email]
Sent: Friday, July 06, 2018 9:29 AM
To: [login to unmask email] <mailto:[login to unmask email]>
Subject: [DB2-L] - RE: v11 on z/OS: Transient data anomaly



Regardless of the reasons why the count didn’t match, if your program
depends on a precise value, then you should not be coding “with ur”.



Larry Jardine
Database Advisor

Aetna



From: Walter Janißen [mailto:[login to unmask email]
Sent: Friday, July 06, 2018 9:23 AM
To: [login to unmask email] <mailto:[login to unmask email]>
Subject: [EXTERNAL] [DB2-L] - AW: v11 on z/OS: Transient data anomaly



**** External Email - Use Caution ****

Hi Phil



What is your setting of SKIPUNCI?

Has the table in question row level locking?

What is the package’s value of IMMEDWTITE?



I think in datasharing, if you don’t bind with IMMEDWRITE YES a statement
running on a different member than the inserter might well don’t see
entries.



Kind regards
Walter Janißen

ITERGO Informationstechnologie GmbH
Anwendungsentwicklung
Technische Anwendungsarchitektur
Victoriaplatz 2
D-40198 Düsseldorf
Tel.: +49(0)211/477-2928
Fax: +49(0)211/477-6441
<mailto:[login to unmask email]> [login to unmask email]

ITERGO Informationstechnologie GmbH
Vorsitzender des Aufsichtsrats: Christian Diedrich
Geschäftsführung: Dr. Bettina Anders (Vorsitzende),
Lothar Engelke, Ina Kirchhof, Dr. Michael Regauer
Sitz: Düsseldorf, Handelsregister: Amtsgericht Düsseldorf HRB 37996



Von: Sevetson, Phil <[login to unmask email] <mailto:[login to unmask email]> >
Gesendet: Freitag, 6. Juli 2018 14:37
An: [login to unmask email] <mailto:[login to unmask email]>
Betreff: [DB2-L] - v11 on z/OS: Transient data anomaly



I’m wondering whether WITH UR can possibly read committed data incorrectly,
probably involving an index-only query, definitely involving a
very-recently-committed unit of work.



We had a count come out wrong in DB2 this morning. What happened:

1) COBOL pgm runs many INSERTs to table A. Ends with RC=0.

2) Four seconds later, DSNTEP2 runs a count of certain rows (last 365
days by a date column) against table A (WITH UR). Count = x.

3) X does not have the expected value, and a program checks the result
and flags it to our attention.

4) Seven hours later, we rerun the count against Table A. Count = y.

5) Y _is_ the expected value.

6) There are no other programs which are scheduled or built to insert
rows to Table A, and no other scheduled jobs which run this program. We’re
pretty sure that no updates took place in A during the interval between the
first and second counts.



What we have appears to be a case of a SQL anomaly – the same SQL COUNT
statement giving us an apparently wrong result four seconds after a
mass-insert program, and then giving us the expected result several hours
later.



The combination of the change in result, the close timing of the COMMIT and
the COUNT, and the use of WITH UR leave me wondering if the count error is a
feature of UR. Does UR permit some kind of reading which might ignore
non-externalized index pages?



I’m opening a PMR with IBM on this as soon as my sysprogs get to their desks
and re-enable my IBM account, so it isn’t critical that this be answered
here. If someone knows, of course, we’re interested. If we need to do
something like _not_ running our control report queries WITH UR, or if this
is actually not a WAD/Feature, we need to change our code accordingly.



Comments are welcome.



--Thanks,



Phil Sevetson

**This e-mail, including any attachments, may be confidential, privileged,
or otherwise legally protected. It is intended only for the addressee. If
you received this e-mail in error or from someone who was not authorized to
send it to you, do not disseminate, copy, or otherwise use this e-mail or
its attachments. Please notify the sender immediately by reply e-mail and
delete the e-mail from your system.**

-----End Original Message-----



-----End Original Message-----

This e-mail may contain confidential or privileged information. If you think
you have received this e-mail in error, please advise the sender by reply
e-mail and then delete this e-mail immediately. Thank you.

This e-mail may contain confidential or privileged information. If you think
you have received this e-mail in error, please advise the sender by reply
e-mail and then delete this e-mail immediately. Thank you. Aetna



-----End Original Message-----

**This e-mail, including any attachments, may be confidential, privileged,
or otherwise legally protected. It is intended only for the addressee. If
you received this e-mail in error or from someone who was not authorized to
send it to you, do not disseminate, copy, or otherwise use this e-mail or
its attachments. Please notify the sender immediately by reply e-mail and
delete the e-mail from your system.**

-----End Original Message-----

Philip Sevetson

v11 on z/OS: Transient data anomaly
(in response to Daniel Luksetich)
Dan,

It's not. It's a single subsystem image, not data sharing, not even single-member. Also, so far it's a one-time-only incident.

--Phil

From: Daniel L Luksetich [mailto:[login to unmask email]
Sent: Monday, July 09, 2018 9:21 AM
To: [login to unmask email]
Subject: [DB2-L] - RE: v11 on z/OS: Transient data anomaly

Phil,
Is this data sharing?
Dan

Daniel L Luksetich
DanL Database Consulting

IBM GOLD Consultant
IBM Champion for Analytics
IDUG Content Committee Past-Chairman
IDUG DB2-L Administrator
IBM Certified Database Adminstrator - DB2 11 DBA for z/OS
IBM Certified System Administrator - DB2 11 for z/OS
IBM Certified Application Developer - DB2 11 for z/OS
IBM Certified Advanced Database Administrator - DB2 10.1 for Linux UNIX and Windows

From: Sevetson, Phil <[login to unmask email]>
Sent: Friday, July 6, 2018 11:14 AM
To: '[login to unmask email]' <[login to unmask email]>
Subject: [DB2-L] - RE: v11 on z/OS: Transient data anomaly

Apparently not. We weren't aware that a table with *no* uncommitted work would be unstable under isolation UR - I thought that UR read from the bufferpool manager like everyone else, and the only possible explanation I can see for this is that it's reading obsolete pages _after_ their changed state is present in the bufferpool, let alone externalized.

--Phil

From: Jardine, Lawrence J [mailto:[login to unmask email]
Sent: Friday, July 06, 2018 9:29 AM
To: [login to unmask email]<mailto:[login to unmask email]>
Subject: [DB2-L] - RE: v11 on z/OS: Transient data anomaly

Regardless of the reasons why the count didn't match, if your program depends on a precise value, then you should not be coding "with ur".

Larry Jardine
Database Advisor
Aetna

From: Walter Janißen [mailto:[login to unmask email]
Sent: Friday, July 06, 2018 9:23 AM
To: [login to unmask email]<mailto:[login to unmask email]>
Subject: [EXTERNAL] [DB2-L] - AW: v11 on z/OS: Transient data anomaly

**** External Email - Use Caution ****
Hi Phil

What is your setting of SKIPUNCI?
Has the table in question row level locking?
What is the package's value of IMMEDWTITE?

I think in datasharing, if you don't bind with IMMEDWRITE YES a statement running on a different member than the inserter might well don't see entries.

Kind regards
Walter Janißen

ITERGO Informationstechnologie GmbH
Anwendungsentwicklung
Technische Anwendungsarchitektur
Victoriaplatz 2
D-40198 Düsseldorf
Tel.: +49(0)211/477-2928
Fax: +49(0)211/477-6441
[login to unmask email]<mailto:[login to unmask email]>

ITERGO Informationstechnologie GmbH
Vorsitzender des Aufsichtsrats: Christian Diedrich
Geschäftsführung: Dr. Bettina Anders (Vorsitzende),
Lothar Engelke, Ina Kirchhof, Dr. Michael Regauer
Sitz: Düsseldorf, Handelsregister: Amtsgericht Düsseldorf HRB 37996

Von: Sevetson, Phil <[login to unmask email]<mailto:[login to unmask email]>>
Gesendet: Freitag, 6. Juli 2018 14:37
An: [login to unmask email]<mailto:[login to unmask email]>
Betreff: [DB2-L] - v11 on z/OS: Transient data anomaly

I'm wondering whether WITH UR can possibly read committed data incorrectly, probably involving an index-only query, definitely involving a very-recently-committed unit of work.

We had a count come out wrong in DB2 this morning. What happened:

1) COBOL pgm runs many INSERTs to table A. Ends with RC=0.

2) Four seconds later, DSNTEP2 runs a count of certain rows (last 365 days by a date column) against table A (WITH UR). Count = x.

3) X does not have the expected value, and a program checks the result and flags it to our attention.

4) Seven hours later, we rerun the count against Table A. Count = y.

5) Y _is_ the expected value.

6) There are no other programs which are scheduled or built to insert rows to Table A, and no other scheduled jobs which run this program. We're pretty sure that no updates took place in A during the interval between the first and second counts.

What we have appears to be a case of a SQL anomaly - the same SQL COUNT statement giving us an apparently wrong result four seconds after a mass-insert program, and then giving us the expected result several hours later.

The combination of the change in result, the close timing of the COMMIT and the COUNT, and the use of WITH UR leave me wondering if the count error is a feature of UR. Does UR permit some kind of reading which might ignore non-externalized index pages?

I'm opening a PMR with IBM on this as soon as my sysprogs get to their desks and re-enable my IBM account, so it isn't critical that this be answered here. If someone knows, of course, we're interested. If we need to do something like _not_ running our control report queries WITH UR, or if this is actually not a WAD/Feature, we need to change our code accordingly.

Comments are welcome.

--Thanks,

Phil Sevetson
**This e-mail, including any attachments, may be confidential, privileged, or otherwise legally protected. It is intended only for the addressee. If you received this e-mail in error or from someone who was not authorized to send it to you, do not disseminate, copy, or otherwise use this e-mail or its attachments. Please notify the sender immediately by reply e-mail and delete the e-mail from your system.**
-----End Original Message-----

-----End Original Message-----
This e-mail may contain confidential or privileged information. If you think you have received this e-mail in error, please advise the sender by reply e-mail and then delete this e-mail immediately. Thank you.
This e-mail may contain confidential or privileged information. If you think you have received this e-mail in error, please advise the sender by reply e-mail and then delete this e-mail immediately. Thank you. Aetna

-----End Original Message-----
**This e-mail, including any attachments, may be confidential, privileged, or otherwise legally protected. It is intended only for the addressee. If you received this e-mail in error or from someone who was not authorized to send it to you, do not disseminate, copy, or otherwise use this e-mail or its attachments. Please notify the sender immediately by reply e-mail and delete the e-mail from your system.**
-----End Original Message-----

-----End Original Message-----
**This e-mail, including any attachments, may be confidential, privileged, or otherwise legally protected. It is intended only for the addressee. If you received this e-mail in error or from someone who was not authorized to send it to you, do not disseminate, copy, or otherwise use this e-mail or its attachments. Please notify the sender immediately by reply e-mail and delete the e-mail from your system.**

Michael Hannan

RE: v11 on z/OS: Transient data anomaly
(in response to Philip Sevetson)

 In Reply to Philip Sevetson:

Apparently not. We weren't aware that a table with *no* uncommitted work would be unstable under isolation UR - I thought that UR read from the bufferpool manager like everyone else, and the only possible explanation I can see for this is that it's reading obsolete pages _after_ their changed state is present in the bufferpool, let alone externalized.

Phil,

People get unnecessarily afraid of WITH UR. So I don't agree with Larry. "Unstable" seems an emotive word. WITH UR still takes a share latch to avoid an "unstable" page (in process of being updated by another thread). So data read should not be corrupt. It avoids the Lock Avoidance (and locking), by not worrying if data is committed or not.

WITH UR works just fine to see result of committed work (in addition to uncommitted).

Data Sharing is different. Neither   UR not CS are guaranteed to see the latest updates immediately. It is guaranteed when take a logical Lock, but some SQLs have lock avoidance. SELECT COUNT(*) is one of those unless RR. This is based on the logic that the Update elsewhere could just as easily have run just after the SELECT COUNT. It is mere coincidence which occurs first. So in Data Sharing SELECT with CS advantage over UR is quite marginal, in my personal opinion when performing counts or making lists of rows.

Anyway you are not DS so I am digressing. In non DS, UR is normally guaranteed to see the latest data updates, whether committed or NOT.

The most likely cause of the problem, is someone hoaxed you, by deleting (or adding) some rows just before your SELECT COUNT, and rolled back just after your SELECT COUNT. I don't mean this is really the problem, but just pointing out a theoretical possibility.

Otherwise a possible DB2 fault, however examine all of the assumptions very carefully.

Michael Hannan,
DB2 Application Performance Specialist
CPT Global Ltd

Edited By:
Michael Hannan[Organization Members] @ Jul 10, 2018 - 10:10 AM (Europe/Berlin)
Michael Hannan[Organization Members] @ Jul 10, 2018 - 12:11 PM (Europe/Berlin)

Walter Jani&#223;en

AW: v11 on z/OS: Transient data anomaly
(in response to Michael Hannan)
Michael

I don’t understand what you don’t agree with in my response?

Kind regards
Walter Janißen [standard_IBM+Champ+7+Yr+Analytics]

ITERGO Informationstechnologie GmbH
Anwendungsentwicklung
Technische Anwendungsarchitektur
Victoriaplatz 2
D-40198 Düsseldorf
[login to unmask email]<mailto:[login to unmask email]>

ITERGO Informationstechnologie GmbH
Vorsitzender des Aufsichtsrats: Christian Diedrich
Geschäftsführung: Dr. Bettina Anders (Vorsitzende),
Lothar Engelke, Ina Kirchhof, Dr. Michael Regauer
Sitz: Düsseldorf, Handelsregister: Amtsgericht Düsseldorf HRB 37996

Von: Michael Hannan <[login to unmask email]>
Gesendet: Dienstag, 10. Juli 2018 10:09
An: [login to unmask email]
Betreff: [DB2-L] - RE: v11 on z/OS: Transient data anomaly


In Reply to Philip Sevetson:
Apparently not. We weren't aware that a table with *no* uncommitted work would be unstable under isolation UR - I thought that UR read from the bufferpool manager like everyone else, and the only possible explanation I can see for this is that it's reading obsolete pages _after_ their changed state is present in the bufferpool, let alone externalized.

Phil,

People get unnecessarily afraid of WITH UR. So I don't agree with Walter. "Unstable" seems an emotive word. WITH UR still takes a share latch to avoid an "unstable" page (in process of being updated by another thread). So data read should not be corrupt. It avoids the Lock Avoidance (and locking), by not worrying if data is committed or not.

WITH UR works just fine to see result of committed work (in addition to uncommitted).

Data Sharing is different. Neither UR not CS are guaranteed to see the latest updates immediately. It is guaranteed when take a logical Lock, but some SQLs have lock avoidance. SELECT COUNT(*) is one of those unless RR. This is based on the logic that the Update elsewhere could just as easily have run just after the SELECT COUNT. It is mere coincidence which occurs first. So in Data Sharing SELECT with CS advantage over UR is quite marginal, in my personal opinion when performing counts or making lists of rows.

Anyway you are not DS so I am digressing. In non DS, UR is normally guaranteed to see the latest data updates, whether committed or NOT.

The most cause of the problem, is someone hoaxed you, by deleting (or adding) some rows just before your SELECT COUNT, and rolled back just after your SELECT COUNT. I don't mean this is really the problem, but just pointing out a theoretical possibility.

Otherwise a possible DB2 fault, however examine all of the assumptions very carefully.

Michael Hannan,
DB2 Application Performance Specialist
CPT Global Ltd

-----End Original Message-----
Attachments

  • image001.png (2.6k)

Michael Hannan

RE: AW: v11 on z/OS: Transient data anomaly
(in response to Walter Janißen)



In Reply to Walter Janißen:

Michael

I don’t understand what you don’t agree with in my response?

Kind regards
Walter Janißen [standard_IBM+Champ+7+Yr+Analytics]

Apologies, due to the mess of indented replies, I got the wrong one. I will correct it.So actually I should not have been disagreeing with you. You said quite a few things that match my views.

Michael Hannan,
DB2 Application Performance Specialist
CPT Global Ltd

Edited By:
Michael Hannan[Organization Members] @ Jul 10, 2018 - 12:16 PM (Europe/Berlin)

alain pary

RE: v11 on z/OS: Transient data anomaly
(in response to Philip Sevetson)

Hi Phil , 

If X < Y then  you lose some insert or a delete happen between you commit or your count .

Do you run a dsn1logp to be sure of your assesment about the commit and the count.

or find the ddl with a third-party tools .

 

regards 

 

Alain

Philip Sevetson

v11 on z/OS: Transient data anomaly
(in response to alain pary)
Alain,

I understand what you’re saying. However:

This is a government payroll system, inside a fairly hardened/secured mainframe environment. The function of this table is to record certain payroll related information for several hundred thousand people, and it has financial and contract value. If someone inserted or deleted in between the two times, then (1) we’ve got a major security breach, and (2) someone’s probably going to jail for criminal fraud or larceny. Also, there is no informational benefit in our system for taking the data out and then putting it back, which would be the only possible cause for the later counts matching.

The alteration of data would have to have taken place in four seconds of elapsed time, on a reasonably secure production system, without alerting anyone involved and without causing a problem with the follow-on query. I understand your reasoning, but it’s circumstantially a lot less likely than a synchronization failure in the DBMS.

I will look at the DSN1LOGP and SYSLGRNX as soon as I can identify the RBAs involved, but there’s not much doubt about what I’m not going to find. I’ll buy you a beer somewhere if you’re right and I’m wrong.

--Thanks,

Phil


From: alain pary [mailto:[login to unmask email]
Sent: Tuesday, July 10, 2018 10:38 AM
To: [login to unmask email]
Subject: [DB2-L] - RE: v11 on z/OS: Transient data anomaly


Hi Phil ,

If X < Y then you lose some insert or a delete happen between you commit or your count .

Do you run a dsn1logp to be sure of your assesment about the commit and the count.

or find the ddl with a third-party tools .



regards



Alain

-----End Original Message-----
**This e-mail, including any attachments, may be confidential, privileged, or otherwise legally protected. It is intended only for the addressee. If you received this e-mail in error or from someone who was not authorized to send it to you, do not disseminate, copy, or otherwise use this e-mail or its attachments. Please notify the sender immediately by reply e-mail and delete the e-mail from your system.**

Peter Conlin

v11 on z/OS: Transient data anomaly
(in response to Philip Sevetson)
Phil,
Does the table has unique key?
You mentioned indices as a possible factor in the initial post.
Could COUNT(*) with UR be optimized against the index (versus the table)?
If so, perhaps the COMMIT is being satisfied (at least in the mind of the UR SELECT) against the table alone, rather than against table+index?
Maybe not yet externalized catalog/statistics stats are used for optimization of such queries?
Peter