Modern indexes for modern data
The age of data is here, and though it actually started several decades ago, we can say that data is evolving, perhaps too quickly, and the need to understand it is increasing. There is a lot of research on new trends, new ways to store a table, new ways to extract it, and of course, with different ways to process data there are new propositions for index structures so that it can satisfy the requirements of its underlying DBMS and the nature of its applications (e.g. OLAP or OLTP applications, in memory databases and more). This article explores a few index proposals posted through publications located at the arXiv Cornell University Library with the purpose to illustrate the challenges we are facing if we are to evolve at the same rate our data does.
This type of index is intended for OLTP main-memory parallel databases and the paper located here (http://www.cs.cmu.edu/~./dga/papers/hybrid-index-sigmod2016.pdf) discusses its implementation on H-Store, an experimental in memory DBMS developed from a team of several universities (which you can try for yourself if you’re running in Ubuntu). Though main-memory databases could be considered another branch of databases and applications in comparison to banking environments with Db2 for z/OS and not entirely adequate for millions of transactions per second, it is true that the tendency to favor memory usage is gaining popularity, which we have seen in the Mainframe world in upgrades such as moving structures above the bar or backing up buffer pools with real memory.
For online and heavy transactional workloads, it makes sense that newly added or modified rows are to be accessed more frequently until they become static, where they will be most often accessed on ranged value queries or aggregate operations, thus confirming their static nature. This idea led to the creation of what the authors called “Hybrid indexes”, which they further developed into a three part index:
- A dynamic structure to store the hot data (i.e. the newly inserted and newly modified data, more likely to be accessed in the next transactions).
- A compacted structure (read only) to store static historical data. Now it is considered to be historical and since its nature is to be static.
- A bloom filter, which will redirect a query to the right structure containing the qualified rows so that a query does not have to scan both index structures.
The new hybrid index is very versatile in its internal implementation as the data structures can vary from a B++ tree or linked list to more complex structures. This paper discusses how such data structures can be converted into a compact structure suited for the static stage and several approaches to provide a light data merge process which will migrate from the dynamic stage to the static stage.
The general diagram for read queries is as follows:
The bloom filter receives the predicate and evaluates whether there are qualified rows and where they reside to redirect the search into the proper stage and lastly return it to the RDS. If the predicate from the query looks for non-existent data, the bloom filter will avoid any further scan to the index. And as previously mentioned, since both stages reside on different sources, it is required to count with constant merges from the dynamic stage to the static stage where it could be a ratio threshold or timed merge (and the authors recommend a ratio threshold for this operation).
The improved response time goes from 4% to 20% depending on the underlying structures used (B tree or another data structure) and the data type stored (tested on integers and chars), with a significant increase in response time between compressed and compact structures on the static stage.
The key point from this proposal is the ability to present an index that can fit into memory, which is an interesting approach used by H-Store and other in memory DBMSs. Hopefully in the near future we could see such an approach to be taken to other types of DBMSs since it’s tailored for transactional workloads.
Essentially, an index is a data structure that, when scanned, it returns a pointer or key to the location of a record in a table (or a column in the case of a columnar index), or, taken from a different angle, it is a model where the target feature is a key to a specific record or range of records and such a target feature is a continuous variable, while on the other hand, a bloom filter is in essence a binary classifier. Taking this approach could lead to a very interesting discovery: An index is a model that can be learned (or if you will, it can be machine learned). A team from Google recently posted a paper (found here https://arxiv.org/abs/1712.01208) proposing a new type of index that would combine the world of Machine Learning and Database Systems, which they named “Learned Indexes”. Their intention is to create or promote a new field of research on databases and the ever evolving Machine Learning field.
Their paper opens up a new possibility for hardware since nowadays there is a great deal of research towards making faster machine learning predictions and having a faster model training time, which often point to redirecting neural network training operations to GPU processors or to the new Google processor TPU (a Tensor Processing Unit).
Side note: What is a TPU?
As an introductory view, a neural network can be visualized as a series of matrix multiplications, and the output of a neuron can have the following form:
z = f( w1x1 + w2x2 + … + wnxn + b )
Where w is the input weight, b stands for the bias, the function f is an activation function for such neuron (sigmoid, Relu, tanh or other) and z is the output of the neuron (called logit).
The process of training a network involves a large sum of multiplications that are too costly depending on the number of layers for the network, and it can even sum up to millions of multiplications. For such a problem, and knowing that most of Google solutions are based on neural networks, they created a chip that can specifically deal with these operations by also providing a compiler that translates TensorFlow API calls to the instruction set of this new chip, hence called Tensor Processing Unit.
Viewed from the machine learning perspective, an index can be taken as a regression problem and a bloom filter is a binary classifier, though there is an important consideration, which is that a database index must not return false negatives (however, it could actually be acceptable to have false positives, in a low rate of course, which would result in unnecessary scans returning a +100 SQLCODE, though not as costly as having a false negative). In any case, the authors consider it to be complemented with the aid of another more traditional structure (and it would then be taken as another type of hybrid index).
Therefore, by switching terms, a B-Tree structure can evolve into a linear regressor or a neural network, a scan can return a min and max page pointer (and thus it could be considered the error function for the model) and rebuilding an index can be morphed into model training. On the other hand, if the tablespace is well clustered then learning the distribution of the data is similar to calculating from its CDF (Cumulative Distribution Function) and therefore it could be translated into a statistical problem to solve.
On the inside, the authors propose a strategy of a recursive index, where each layer takes the key as an input and it selects the model for the next layer until a prediction is returned, and for every layer advanced the index will reduce its error function. This is a flexible design as the top layer can be a neural network and the lowest layers could be linear regression models or even traditional B trees to reduce the complexity and number of operations required. Such an index can be considered to be a recursive hybrid index and its worst possible throughput would be that of a B tree index if all layers would be turned into B trees.
As expected from a machine learning problem, tuning hyperparameters and the model complexity can make a significant difference (however, a grid search can provide the most optimal hyperparameters), and using a more complex model can lead to unnecessary overhead that can be better solved by linear regression models over neural networks (and the same applies for networks: a single layer network can sometimes provide better response than a complex network). With this layout, experiments were done on secondary learned indexes using columns from three data types with the following results, taking a B-tree index in comparison:
Data type of the index column Space savings Response time (best result only)
Real numbers Up to 99% 69% faster
Integers Up to 99% 52% faster
Strings Up to 99% 44% faster
This proposal can be further extended into hash learned indexes or bloom learned filters, each with their own unique challenges: Using a CDF approach for the hash case, and as for the bloom filter, it could be translated into learning a function f such that it can deliver the same zero false negative rate (FNR) and perhaps even a zero false positive rate (FPR) as well.
However, for future development and research, dealing with overfitting models remains as a trade off. For a general Machine Learning model, overfitting means that the model will “memorize” the distribution and patterns of the training data to such level that its accuracy will be almost perfect but its ability to generalize over new unseen data will go down. For the case of indexes, this statement means that a learned index for a read only table could deliver optimal results as it has taken advantage of the distribution of its data, but it might be unable to process new inserted or updated data, and at the same time, the question remains: How often should I retrain an index?
Although this paper recommends learned indexes as a new type of index for Data Warehouses or read only indexes that need to be rebuilt once a day, the authors propose a few alternatives that could be further explored to tackle the challenge of creating an updatable learned index and how this generalizes to new changes in the data.
The new generation of databases began a few years and we hear about it almost every day, while at the same time new data trends are boosting database research to deliver suitable, innovative solutions that can evolve and meet new demands that arise overnight. The appearance of similar papers created what we now call Hadoop, MapReduce, Spark, Dockers and many other technologies that are changing our mindset regarding data manipulation and the way we now look at data, so it is worth staying tune to new proposals, new discoveries, new solutions, and contribute to fresh ideas so that we can experience a more uniform evolution.
Kraska, A. Beutel, Ed. Chi., J. Dean, and N. Polyzotis (2017). The Case for Learned Index Structures. ArXiv.org Cornell University Library, 1-27. Retrieved January 2, 2018, from https://arxiv.org/abs/1712.01208
Zhang, D. G. Andersen, A. Pavlo, M. Kaminsky, L. Ma, and R. Shen. Reducing the storage overhead of main-memory OLTP databases with hybrid indexes. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016 K. S. (2017, May 12).
An in-depth look at Google’s first Tensor Processing Unit (TPU). Retrieved January 3, 2018, from https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
Buduma, N. (2017). Fundamentals of Deep Learning: Designing Next-generation Artificial Intelligence Algorithms / c Nikhil Buduma. Beijing, Boston, Farnham, Sebastopol, Tokyo: OReilly.
Muller, A. C., & Guido, S. (2017). Introduction to machine learning with python: a guide for data scientists. Sebastopol, CA: OReilly.