
“Guilty Until Proven Innocent”

By Anthony Ciabattoni posted Jan 27, 2018 06:49 AM

  

The evolution of technology has driven the need for highly available systems and
applications. Customer loyalty is directly related to the user experience, which includes
application functionality, performance and availability. Customer-facing applications are
often held to Service Level Agreements (SLAs) that are tied to application response times
as well as the number of failed transactions in a specified timeframe.


Highly skilled Subject Matter Experts (SMEs) are required to design, implement and maintain
applications and systems for performance and availability. These individuals have deep
technology and internal organizational experience and oftentimes specialize in a
specific infrastructure component. In addition to their technology knowledge, these
professionals are typically highly passionate and confident. The ability to make fast
and correct decisions is another important characteristic needed to minimize customer disruptions.


Negative customer experiences such as slow application response, failed transactions or missed
SLAs will lead an organization to an “all hands on deck” methodology to correct the problem
and resume a positive customer experience. Establishing a “war room” populated with SMEs
is a common procedure to help restore normal customer experiences.
Executives will apply intense pressure to the troubleshooting team to restore availability as
quickly as possible. SMEs are forced to make quick, sometimes premature, decisions, relying on
intuition and experience to diagnose the root cause of the customer-impacting event and take
appropriate action. In time, the initial quick diagnosis will be either proven or disproven via
detailed root cause analysis.


Over the years, as a customer as well as while supporting my customers, I have participated in
hundreds of events that negatively impacted customers. Once a “war room” is established with
SMEs, a popular initial diagnosis is to quickly assign the blame or guilt to the z/OS platform or,
more specifically, Db2. In our society, a person is innocent until proven guilty, so why is it so
easy to initially assign guilt to Db2?


To understand this behavior, the transaction flow needs to be analyzed.
Database (Db2) processing is the last infrastructure component in a transaction sequence (see
diagram below). It is easy for all the other infrastructure SMEs to point the finger at the last
component in a transaction sequence, unless there are solid numbers to prove otherwise.

anthony 1.jpg

After the initial problem diagnosis, efficiently restoring an application to a normal status
following an “outage” or performance slowdown relies on strong troubleshooting skills and
methodology. Establishing a baseline of what is normal, and deciding at what granularity to
compare against it, are important elements in reducing the time to identify and resolve the issue.


Db2 accounting Class 1, Class 2 and Class 3 data can oftentimes be used to identify whether Db2 or a
component outside of Db2 is the root cause of the problem. The diagrams and description below
show how Class 1, Class 2 and Class 3 times are accumulated.

anthony2.jpg

anthony 3.jpg
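
The relationship between the three classes can also be written as simple arithmetic. The following is a minimal Python sketch, not part of any Db2 tooling, assuming the times are already averaged per transaction and expressed in seconds. The field names and sample values are hypothetical. It relies on the commonly described relationships that time outside Db2 is Class 1 minus Class 2, and that Not Accounted for time is Class 2 elapsed time minus Class 2 CPU time minus Class 3 suspension time.

```python
from dataclasses import dataclass


@dataclass
class AccountingTimes:
    """Average times per transaction, in seconds (hypothetical field names)."""
    class1_elapsed: float   # application + Db2 elapsed time
    class1_cpu: float       # application + Db2 CPU time
    class2_elapsed: float   # elapsed time spent inside Db2
    class2_cpu: float       # CPU time spent inside Db2
    class3_suspend: float   # total wait (suspension) time recorded by Db2


def breakdown(t: AccountingTimes) -> dict:
    """Split the Class 1 times into outside-Db2 and in-Db2 buckets."""
    return {
        "elapsed_outside_db2": t.class1_elapsed - t.class2_elapsed,
        "cpu_outside_db2": t.class1_cpu - t.class2_cpu,
        "in_db2_cpu": t.class2_cpu,
        "in_db2_suspend": t.class3_suspend,
        # In-Db2 elapsed time that is neither CPU nor a recorded wait.
        "not_accounted": t.class2_elapsed - t.class2_cpu - t.class3_suspend,
    }


# Made-up values for illustration only.
sample = AccountingTimes(0.050, 0.004, 0.020, 0.003, 0.012)
for name, seconds in breakdown(sample).items():
    print(f"{name:22s}{seconds:9.6f}")
```

Keeping the derived buckets separate makes it easier to see at a glance whether growth in elapsed time sits inside or outside Db2.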

Organizations that have implemented a performance warehouse will have baseline metrics for
their most important top x OLTP transactions and their most valued batch programs. This gives
them the ability to capture data during a problem time and quickly compare it to what is normal.
The level of granularity will vary depending on what is being researched. For example, OLTP
could be as granular as 1-minute increments, while the batch environment could be less
granular (5, 15 or 30 minutes).
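
A baseline comparison of this kind can be sketched in a few lines of Python. This is an illustration only: the metric names, the sample values and the 50 percent threshold are assumptions, and a real performance warehouse would supply its own baseline figures.

```python
def percent_change(baseline: float, observed: float) -> float:
    """Percentage change of observed relative to baseline."""
    if baseline == 0:
        return float("inf") if observed > 0 else 0.0
    return (observed - baseline) / baseline * 100.0


def flag_deviations(baseline: dict, observed: dict,
                    threshold_pct: float = 50.0) -> dict:
    """Return the metrics whose increase over baseline exceeds threshold_pct."""
    flagged = {}
    for metric, base_value in baseline.items():
        change = percent_change(base_value, observed.get(metric, 0.0))
        if change > threshold_pct:
            flagged[metric] = change
    return flagged


# Made-up averages in seconds: a normal interval versus a problem interval.
normal = {"class1_elapsed": 0.050, "class2_elapsed": 0.020}
problem = {"class1_elapsed": 0.150, "class2_elapsed": 0.021}
print(flag_deviations(normal, problem))
```

With these sample values only `class1_elapsed` is flagged, since the Class 2 elapsed time barely moved.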


The upcoming examples use Db2 Accounting Class 1, Class 2 and Class 3 traces to help identify whether
Db2 is or is not the root cause of the problem. The examples are based on a specific CICS/Db2
transaction (OLTP) captured in 1-minute increments via a Db2 batch accounting report. The Db2
accounting report is then tailored with REXX into easily readable output that is uploaded to a
Microsoft Excel spreadsheet for analysis. In lieu of a performance warehouse, the analysis will
compare normal 1-minute increments with abnormal 1-minute increments.
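
When no warehouse baseline exists, one way to separate normal from abnormal intervals is to compare each 1-minute increment against the median of all captured increments. The Python sketch below is a stand-in for the spreadsheet analysis, not the REXX tooling used for the examples. The row layout, the metric name, the sample values and the factor of 2 are all assumptions.

```python
from statistics import median


def abnormal_minutes(rows, metric, factor=2.0):
    """Return the interval labels where `metric` exceeds factor x the median.

    rows: list of dicts, one per 1-minute interval, each holding an
    "interval" label and numeric metric values (hypothetical layout).
    """
    typical = median(row[metric] for row in rows)
    return [row["interval"] for row in rows if row[metric] > factor * typical]


rows = [  # made-up values, seconds per transaction
    {"interval": "10:01", "class1_elapsed": 0.050},
    {"interval": "10:02", "class1_elapsed": 0.048},
    {"interval": "10:03", "class1_elapsed": 0.210},
    {"interval": "10:04", "class1_elapsed": 0.052},
]
print(abnormal_minutes(rows, "class1_elapsed"))
```

The median is only a reasonable stand-in for "normal" when most of the captured intervals are healthy. If the capture window is dominated by the problem period, a known-good interval should be used instead.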


Example #1 – Increased Db2 Accounting Class 1 elapsed time
• Db2 is typically innocent
• Elongated transaction elapsed time is due to something outside of Db2 (Class 1).
• Potential causes include waiting for a 2-phase commit with CICS, waiting on a commit
notification from a non-z/OS application server or a network issue.

anthony 4.jpg

Example #2 – Increased Db2 Accounting Class 1 CPU time

• Db2 is typically innocent
• Increased z/OS CPU time (Class 1) is due to a z/OS process but is not attributed to a Db2 process.
• Potential causes could be CPU spent in COBOL/CICS or another z/OS process.

anthony ex2.jpg

Example #3 – Increased Db2 Accounting Class 2 elapsed time
• In this case, the increased Class 2 elapsed time does not pinpoint the root cause; further
research is needed.
• Db2 is unable to classify where the time is being spent, and the time is placed in the Not
Accounted for bucket.
• An increase in Not Accounted for Time typically points to an event or configuration
outside of Db2.
• Potential causes include a CPU-starved system, a WLM policy that needs to be configured or
tuned appropriately or, on some occasions, monitoring products that lead to increased Not
Accounted for Time.

anthony ex3.jpg

Example #4 – Increased Db2 Accounting Class 2 and Class 3 elapsed time
• Increased Class 3 elapsed time does not pinpoint the root cause; further research is
needed.
• The captured QTRN transaction below is a skinny distributed native SQL stored
procedure.
• Increased average sync I/O elapsed time can be associated with a storage subsystem issue;
additional storage research should occur.
• Db2 is typically a victim and innocent when the average sync I/O time increases.
• The customer root cause identified below was an incorrect configuration of new Parallel
Access Volumes (PAV) storage volumes, resulting in a spike in average sync I/O time.

anthony ex4.jpg

Example #5 – Increased Db2 Accounting Class 2 CPU time
• Db2 is typically guilty when Class 2 CPU increases.
• In the scenario below, the average number of getpages per DML statement increases substantially.
This could be a data-driven event where Db2 needs to access more data for the desired
results.
• A variety of reasons could result in increased Class 2 CPU; increased DML per transaction
and increased updates/deletes per transaction are a few additional items to research.

anthony ex5.jpg

Example #6 – Increased Db2 Accounting Class 2 and Class 3 elapsed time
• Increased Class 3 elapsed time does not pinpoint the root cause; further research is
needed.
• The increase in Class 3 Db2 latch elapsed time is due to Db2 locking/latching contention.
• Db2 is typically guilty.
• Additional research is needed to identify the root cause.

anthony ex6.jpg
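
The six examples can be condensed into a small first-pass triage. The Python sketch below is an illustration under stated assumptions, not a definitive diagnostic: the bucket names, the 50 percent threshold and the sample values are hypothetical, and each hint only repeats the "typically innocent" or "typically guilty" reading from the matching example.

```python
# Hint for each bucket, keyed by hypothetical bucket names.
HINTS = {
    "elapsed_outside_db2": "Example 1: time spent outside Db2; Db2 typically innocent",
    "cpu_outside_db2": "Example 2: CPU spent outside Db2; Db2 typically innocent",
    "not_accounted": "Example 3: check CPU starvation, WLM policy, monitoring products",
    "avg_sync_io": "Example 4: check the storage subsystem; Db2 typically a victim",
    "class2_cpu": "Example 5: Db2 typically guilty; check getpages and DML counts",
    "lock_latch_wait": "Example 6: Db2 typically guilty; check lock/latch contention",
}


def triage(baseline: dict, observed: dict, threshold_pct: float = 50.0) -> list:
    """Return (bucket, hint) pairs for buckets that grew beyond the threshold."""
    findings = []
    for bucket, hint in HINTS.items():
        base, now = baseline[bucket], observed[bucket]
        if base > 0 and (now - base) / base * 100.0 > threshold_pct:
            findings.append((bucket, hint))
    return findings


# Made-up averages in seconds; only the average sync I/O time has grown.
normal = {"elapsed_outside_db2": 0.030, "cpu_outside_db2": 0.001,
          "not_accounted": 0.005, "avg_sync_io": 0.001,
          "class2_cpu": 0.003, "lock_latch_wait": 0.002}
problem = dict(normal, avg_sync_io=0.004)

for bucket, hint in triage(normal, problem):
    print(f"{bucket}: {hint}")
```

A flagged bucket is a starting point for research rather than a verdict, in line with the "further research is needed" notes in Examples 3, 4 and 6.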

 

In summary, database (Db2) processing will continue to initially be “Guilty Until Proven
Innocent”. Db2 Class 1, Class 2 and Class 3 accounting traces can be effectively used to identify
whether Db2 is the victim of an outage or the cause of an outage. Proactively creating a performance
baseline and troubleshooting methodology using the Db2 accounting traces can help
minimize the time to identify the root cause of an outage. In time and with success, Db2
accounting traces can evolve from an instrument to prove Db2 innocence into an important
part of an overall troubleshooting methodology.

Anthony Ciabattoni
IBM SWAT team
