A self-fulfilling scenario: The risk of a "death spiral" when using CPU limits within a shared LPAR

Since I have been on a WLM dispatcher theme for the last few blog entries, I thought I would also write up this topic, as it is an interesting albeit rare one. In this story, we will see how using CPU limits in a shared LPAR can sometimes result in a downward “death spiral”, with the hypervisor continuously reducing the amount of CPU available to DB2 and thus severely affecting DB2 performance.


The Players

Hypervisor adjusts CPU entitlement based on actual CPU consumption

When a logical partition (LPAR) is using a pool of CPUs shared dynamically with other partitions on the same machine, it is referred to as a shared LPAR. Each shared LPAR has an initial CPU entitlement, which is its starting allocation of the underlying physical CPUs; if it uses less than that amount, the entitlement can be adjusted downward over time so that the idle CPU can be used by the other LPARs on the machine. The hypervisor is the entity that manages the allocation of physical CPU to each partition sharing the same pool, and it makes adjustments to CPU entitlement based on the observed CPU consumption of each shared LPAR.

CPU limits expressed as a percentage of CPU available to DB2

Using the WLM dispatcher to apply CPU limits against executing work directs DB2 to prevent the CPU consumed by that work from exceeding the stated limit; the CPU limit itself is expressed as a percentage of the CPU available to DB2 as a whole.


The Scenario

A CPU limit of 50% is placed on a service class containing low priority work within a DB2 system executing in a shared LPAR environment. The DB2 system has an initial entitlement to the equivalent of 2 CPUs in this environment.

The CPU limit works wonderfully and constrains the low priority work to only 50% of the CPU entitlement for DB2 (i.e. 50% of 2 CPUs, or 1 CPU). However, the rest of the work running in DB2 does not need that much CPU at this time, so the total DB2 CPU consumption is only 75% of its entitlement (i.e. 1.5 CPUs).

The hypervisor notices this low consumption and decides to reduce the amount of CPU available to DB2 in order to move the idle CPU to another LPAR that could benefit from it. Now, DB2 has an entitlement equivalent to 1.5 CPUs.

The change in entitlement causes a ripple effect within DB2 itself: the CPU limit, expressed as a percentage of CPU available to DB2, implicitly changes and now limits the low priority work to 50% of 1.5 CPUs. Unfortunately, this again results in DB2 not using its full entitlement, as the other work still does not need the CPU.

The hypervisor notices this low consumption and decides to reduce the amount of CPU available to DB2…and so on.

This cycle continues until the defined LPAR CPU minimum is reached at which point it stabilizes. The outcome of all this is that the low priority work is continuously squeezed for CPU and its performance continuously degrades throughout the process.
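The feedback loop above can be sketched as a tiny simulation. This is purely illustrative Python, not anything DB2 or the hypervisor actually executes; the function name, the assumption that the hypervisor shrinks the entitlement straight down to observed consumption each round, and the fixed 0.5-CPU demand from the unconstrained work are all mine, chosen to mirror the numbers in the scenario.

```python
def simulate_death_spiral(entitlement=2.0, cpu_limit_pct=0.50,
                          other_work_cpus=0.5, lpar_minimum=0.5):
    """Iterate the entitlement feedback loop until it stabilizes.

    Assumed (simplified) hypervisor policy: each round, the entitlement
    is reduced to the observed consumption, but never below the
    configured LPAR minimum.
    """
    history = [entitlement]
    while True:
        # The CPU limit caps low priority work at a percentage of the
        # *current* entitlement, so the cap shrinks with the entitlement.
        low_priority = cpu_limit_pct * entitlement
        consumed = low_priority + other_work_cpus
        # Hypervisor reclaims the idle CPU, down to the LPAR minimum.
        new_entitlement = max(consumed, lpar_minimum)
        if abs(new_entitlement - entitlement) < 1e-9:
            break  # stabilized: consumption matches entitlement
        entitlement = new_entitlement
        history.append(entitlement)
    return history

# Entitlement ratchets down each round: 2.0, 1.5, 1.25, 1.125, ...
print(simulate_death_spiral())
```

With these illustrative numbers each round hands back half of the reclaimed CPU as a smaller cap on the low priority work, so the entitlement ratchets downward round after round; where it finally settles depends on how much CPU the unconstrained work demands and on the configured LPAR minimum.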


The Lesson

What did we learn from this story? Well, first of all, hopefully to be cautious using CPU limits in a shared LPAR environment as they may have unintended consequences! :)

If the work being constrained is the majority of your CPU usage on the system or the unconstrained work cannot fully consume the rest of the CPU entitlement, then you could end up seeing symptoms similar to what I described above.

The recommendation we make when applying CPU limits in a shared LPAR environment is to define the minimum CPU allocation for the LPAR at a level that limits how far you are willing to let the “death spiral” affect DB2 performance, should it occur.
