The AI Query Optimizer in Db2

Posted By: Calisto Zuzarte

This article describes how the Db2 optimizer is evolving into the next-generation Db2 AI optimizer. Query optimizers have evolved from heuristic rule-based algorithms to cost-based algorithms, and the next step in this evolution is AI-based optimization. Subsequent blog articles will dive into more detail on the underlying functionality.

 
Background and Motivation 

IBM pioneered the concept of relational databases, SQL, and query optimization. Back in the 1970s, researchers at the IBM research lab at Almaden, including Ted Codd, Don Chamberlin, and Pat Selinger, developed the relational database and envisioned querying the database system through SQL, a declarative language that did not burden the user with specifying how to access and process the data beyond saying which tables to access and join to fulfill the query request. SQL has stood the test of time and remains one of the most successful languages, still widely used after more than 40 years. Prior to this, users needed to know exactly how the data was organized and specify how to access and process it. Pat Selinger, the primary author of the pioneering SIGMOD 1979 paper on query optimization, is credited with one of the most influential papers in the database field. The concepts behind the relational database, the SQL language, and cost-based optimization have endured and flourished over four decades. Database systems in modern-day applications handle data volumes so large and provide capabilities so advanced that one cannot but be in awe of these early visionaries from IBM.

The core idea in database optimization is that, when presented with a query that expresses what response is needed, the query optimizer figures out how to access and process the data. The query execution plan produced by the optimizer guides the execution engine in how to access and process the data in an optimal way. With the pioneering ideas and implementation in Db2, there was a major shift from explicit user instructions, and later from rule-based approaches, to a cost-based query optimizer. Cost-based optimizers have ruled the roost over the last 40 years. In the meantime, while optimizer technology in general has been doing great, coping with diverse environments, varied configurations, massive increases in data volume, application complexity, and users' expectations of very fast responses is stretching it at the seams.

The traditional Db2 query optimizer is a hand-coded statistical model that tries to mimic the behavior of the execution engine to compute costs to the extent that it can. It uses estimates of physical characteristics (for example, CPU speed, sequential and random I/O rates, network speed, and calibrated per-row costs of various low-level operations) and logical characteristics (for example, statistics about the data, computed numbers of rows processed, clustering, ordering of data, schema characteristics, and constraint information) of the database system to compute the costs of the various sub-plans. The final query execution plan is the one that the query optimizer believes is optimal.
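To make the idea concrete, here is a minimal sketch, not Db2's actual formulas, of how a cost-based optimizer might combine physical characteristics (I/O rates, per-row CPU cost) with logical ones (page counts, row counts, predicate selectivity) to compare two access sub-plans. All constants and function names are illustrative assumptions.

```python
# Hypothetical cost units; a real optimizer calibrates these per platform.
SEQ_IO_COST = 1.0       # cost per sequential page read
RAND_IO_COST = 4.0      # random reads assumed ~4x more expensive
CPU_COST_PER_ROW = 0.01 # per-row processing cost

def table_scan_cost(pages, rows):
    """Full scan: read every page sequentially, process every row."""
    return pages * SEQ_IO_COST + rows * CPU_COST_PER_ROW

def index_scan_cost(matching_rows, index_pages_touched):
    """Index access: traverse the index, then one random fetch per match."""
    return (index_pages_touched * SEQ_IO_COST
            + matching_rows * RAND_IO_COST
            + matching_rows * CPU_COST_PER_ROW)

# A table of 10,000 pages and 1,000,000 rows:
print(table_scan_cost(10_000, 1_000_000))  # 20000.0
# A selective predicate matching 500 rows favors the index scan:
print(index_scan_cost(500, 10))            # 2015.0
# A non-selective predicate matching 500,000 rows favors the full scan:
print(index_scan_cost(500_000, 10))        # 2005010.0
```

The optimizer enumerates such sub-plans, costs each one, and keeps the cheapest, so the accuracy of the row-count estimates feeding these formulas directly determines plan quality.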

Some of these hardware characteristics may not exactly match the hardware actually used by customers. Tweaking and calibrating the traditional optimizer code to account for advances in hardware, configurations, and new runtime capabilities is a major effort. Furthermore, there are many simplifying assumptions, such as column independence (that values in one column are not statistically correlated with values in another column) and inclusion (that a column from one table will have values that all exist in a column of another table that has more distinct values). These simplifying assumptions affect the accuracy of the estimates and can affect the optimal plan choice.
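A toy illustration (not Db2 code) shows how the column independence assumption can mislead cardinality estimates when columns are in fact correlated, here a table of cars where every make determines its model:

```python
# Two perfectly correlated columns: make (index 0) and model (index 1).
rows = [("Honda", "Civic")] * 50 + [("Toyota", "Corolla")] * 50

def selectivity(col, value):
    """Fraction of rows where the given column equals the value."""
    return sum(1 for r in rows if r[col] == value) / len(rows)

# Independence assumption: multiply the individual selectivities.
estimated = selectivity(0, "Honda") * selectivity(1, "Civic") * len(rows)

# Reality: every Honda row is a Civic row, so the predicates overlap fully.
actual = sum(1 for make, model in rows
             if make == "Honda" and model == "Civic")

print(estimated)  # 25.0  (0.5 * 0.5 * 100 rows)
print(actual)     # 50
```

The estimate is off by a factor of two here; with more correlated predicates the error compounds multiplicatively, which is one reason a model trained on the actual data can outperform hand-coded formulas.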

As such, both in terms of optimal plan choice and from an optimizer development point of view, it is desirable to overcome the limitations of these simplifying assumptions and to reduce the amount of laborious tweaking of the existing statistical model. From a database user's viewpoint, the increasing complexity of workloads and increasing data volumes make any performance instability more noticeable. Demand for automatic tuning is also increasing as the number of skilled performance experts decreases with less centralized database systems. At the same time, significant advances in AI have made it feasible to revolutionize the query optimizer. Machine learning (ML), as one methodology to enable AI, has rapidly advanced in technology and feasibility, allowing us to realize the goal of infusing AI into the database optimizer.

 
Artificial Intelligence (AI) and Machine Learning (ML)

Artificial Intelligence (AI) is the simulation of human intelligence in machines that are programmed to think like humans. With advancements in hardware and software, AI is increasingly being used to perform tasks that mimic, and often exceed, what humans can do. One way to do this is to use ML, which gives AI systems the ability to automatically learn and improve from experience without being explicitly programmed. This is done by feeding relevant data, suitably encoded, to ML algorithms that build a model by detecting complex patterns. These models, when adequately trained, can make predictions even on input they may not have seen before. The major benefit of ML is the ability to train and build a model on the workloads and data actually used by the end user. As such, many of the assumptions and formulae adopted in the traditional optimizer are not needed.

Models may be used to cover various aspects of the database optimizer. The algorithms and input features of the models can vary, and the output inferred from a model and its accuracy depend on how well it is trained. While training the model on the actual workload and data is ideal, too much time might be spent training the model. As such, training a model on a user-supplied workload or an automatically generated workload may be sufficient initially. To get better accuracy and to keep the model current, refining the model based on the actual workload, using runtime and optimizer feedback, is an important consideration. Degradation in accuracy as the data and workload change is another reason the model may need to be retrained.

Given the complexity of the Db2 optimizer, a “rip and replace” of the traditional optimizer with an AI optimizer would be a significant effort. Infusing AI into the Db2 optimizer is best done using a divide and conquer approach. This is achieved over time by using specific models to replace targeted optimizer functionalities. The approach adopted was to enhance key functions in the optimizer that are suited to using AI techniques to provide better and more stable performance with the query execution plans that it produces.

 

AI Optimizer Goals

It is difficult, if not impossible, to replace the entire existing traditional optimizer model in a short time with a completely new AI-based model without causing some churn with query performance regression in mission-critical production environments. The approach that is adopted in Db2 is a gradual infusion of AI for various aspects of the traditional optimizer over time. Integration with the traditional optimizer is key to a smooth transition towards use in a production environment. As design goals, the following were considered:

  • Making the optimizer component development in Db2 simple.
  • Making query optimization tuning simple by automating everything.
  • Integrating within the Db2 optimizer in a phased approach to minimize instability and deliver mature functionality quickly. The traditional optimizer is used for any aspect not handled by the model.
  • Achieving reasonable (a) model size (b) training time (c) accuracy (d) prediction time (query optimization overhead) and (e) robustness.

Infusing AI into the Db2 optimizer is a gradual transformation of the traditional optimizer's critical features. The AI capabilities will obviate the need to constantly tweak the traditional optimizer model with each new feature implemented in Db2 or with each new hardware platform or configuration supported by Db2. While some models may be pre-built before Db2 is shipped, the models will, as much as possible, be trained and built on-site on the specific hardware, configuration, workload, and data characteristics of the environment where Db2 is being used.

The following AI goals seemed appropriate to consider:

  • When many models need to be stored (for example at a table level) an aggressive model size target of less than 50 KB per model is a good goal to consider. This is aggressive considering popular boosting algorithms used in ML can produce models of several megabytes in size when trained on large amounts of data containing complex correlations and interdependencies. When fewer models are needed, this constraint may be relaxed.
  • Training time needs to be reasonable. When many models might need to be trained, significant attention needs to be paid to limit the training time to a few minutes (for example, 2 minutes per table if it is at a table level). When fewer models are needed, this constraint can be relaxed.
  • Accuracy needs to be such that using the ML models results in significantly improved performance compared to that using the traditional optimizer.
  • Prediction time should not add any significant overhead on the optimization time. It should not exceed a few microseconds.
  • Any tradeoff between accuracy and training time needs to be evaluated on a case-by-case basis and balanced in the context of the traditional optimizer’s accuracy.
  • Additional memory consumption during model training should be limited to about 100 MB, and model training should be restricted to a single CPU, or a small fraction of the available CPUs, if trained concurrently alongside user workloads.
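These budgets are concrete enough to check mechanically. The following is a hedged sketch, with an invented stand-in "model" and illustrative helper names, of how a candidate model could be validated against the size and training-time targets above:

```python
import pickle
import time

MAX_MODEL_BYTES = 50 * 1024  # aggressive target: < 50 KB per model
MAX_TRAIN_SECONDS = 120      # ~2 minutes per table

def train_toy_model(samples):
    """Stand-in for real training: memorize per-value frequencies."""
    start = time.perf_counter()
    freq = {}
    for v in samples:
        freq[v] = freq.get(v, 0) + 1
    model = {"freq": freq, "n": len(samples)}
    return model, time.perf_counter() - start

def predict(model, value):
    """Estimate the selectivity of an equality predicate on `value`."""
    return model["freq"].get(value, 0) / model["n"]

# Train on 10,000 synthetic values drawn from 100 distinct keys.
model, train_time = train_toy_model([i % 100 for i in range(10_000)])

size_ok = len(pickle.dumps(model)) < MAX_MODEL_BYTES
time_ok = train_time < MAX_TRAIN_SECONDS
print(size_ok, time_ok, predict(model, 7))  # True True 0.01
```

A real per-table model would of course be far more sophisticated, but the same harness style, serialize, measure, and compare against the budgets, applies whatever the model internals are.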

Conclusions

Database query optimization is a challenging field, and for those deeply involved in query optimization and performance, it is an exciting journey. The pioneers from IBM got the ball rolling with their vision of the query optimizer in the 1970s. AI infusion in the optimizer extends that vision beyond what they might have imagined at that time.

Stay tuned for a series of blogs that cover details of AI infusion in the Db2 optimizer that will be rolled out in Db2 12.1. These subsequent blogs will cover (a) the architecture and some experimental results, (b) functionality from an external user point of view and (c) information about the machine learning model.


Calisto Zuzarte has been in Db2 development since Db2 was born. He has mainly been driving innovation in the optimizer components, including the query rewrite engine and the plan optimizer. In the last few years, he has been involved in the evolution of the optimizer to the next-generation AI Query Optimizer. More recently, he drives the optimization strategies in the Query Optimizer as-a-service, enabling significant query performance improvements in Presto C++ in Lakehouse environments.