by Shaikh Quader
Many data scientists start their machine learning (ML) projects on an open-source stack installed on their laptop. They load a comma-separated values (CSV) file into a pandas DataFrame, explore and clean the data, engineer new features, build and tune a few different models, and finally choose the best one. This mode of working breaks down for enterprise-grade machine learning on massive data: the dataset is too big to move quickly and too big to fit on a single machine. IBM Db2, equipped with in-database machine learning capabilities, can help enterprises learn from data at scale.
Two major obstacles in handling big data are (1) slow data access and (2) limited scalability. The rest of this article discusses both challenges and explains how Db2 helps address them.
Slow access — copying big data across the network is slow
Network latency comes into the picture when big data moves from one system to another. When transferring data over a network, costs accumulate across several steps:
a. Executing the query at the data source to retrieve the necessary data records
b. Serializing the data before sending it over the network (a good analogy: Amazon packaging your order before shipping it to you)
c. Moving the data over the network. Depending on how busy the network is, actual transfer times vary.
d. Deserializing (unpacking) and loading the data into RAM once it reaches its destination, for example, a Python runtime. The time to load depends on the available RAM and CPU cycles on the destination system.
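Steps (b) and (d) are easy to underestimate. The short sketch below times serialization and deserialization of a one-million-row result set using Python's standard-library pickle module; it ignores query and network time entirely, and the row shape is made up for illustration.

```python
import pickle
import time

# A stand-in "result set": one million (id, value) rows.
rows = [(i, i * 0.5) for i in range(1_000_000)]

# Step (b): serialize the rows before they can travel over the wire.
t0 = time.perf_counter()
payload = pickle.dumps(rows)
serialize_s = time.perf_counter() - t0

# Step (d): deserialize (unpack) on the receiving side.
t0 = time.perf_counter()
restored = pickle.loads(payload)
deserialize_s = time.perf_counter() - t0

print(f"payload: {len(payload) / 1e6:.1f} MB, "
      f"serialize: {serialize_s:.2f}s, deserialize: {deserialize_s:.2f}s")
```

Even before any bytes hit the network, both ends pay CPU and memory costs that grow with the size of the data.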
Recently, we retrieved 100M rows (25GB) from a database to a separate development system over a high-speed corporate network. It took 75 minutes for the data to show up in a pandas DataFrame. The lifetime cost of data movement is much higher still once we multiply this by the number of times models need to be refreshed as the source data is updated, and by the number of data science projects reusing the same data. Can this data movement cost be eliminated?
Eliminate Data Transfer cost by Building and Deploying your ML Models inside Db2
IBM Db2, with its recent “Nebula” (version 11.5.4) release, brings a suite of new capabilities, including built-in machine learning algorithms. These let you build and deploy ML models inside the database, where the data already resides. Db2 supports both supervised and unsupervised ML use cases:
a. Classification: Db2 includes Decision Tree and Naïve Bayes algorithms for building classification models, supporting use cases such as customer churn prediction.
b. Regression: for predicting a quantity (like the price of a house), Db2 comes with the Linear Regression algorithm.
c. Clustering: Db2 provides the ubiquitous K-Means algorithm for clustering unlabelled data points, for example, grouping customers based on their purchase behavior.
All of these capabilities are available for in-database machine learning as of Db2 11.5.4.
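In Db2, these algorithms are exposed as SQL stored procedures, so training a model boils down to a single CALL statement against the table that holds the data. The helper below builds such a statement as a string. The IDAX.GROW_DECTREE procedure name and its name=value parameter style follow the conventions of the Db2 ML library, but treat them as illustrative, and the table and column names are hypothetical; consult the Db2 11.5.4 documentation for the exact signatures.

```python
def grow_dectree_sql(model: str, intable: str, id_col: str, target: str) -> str:
    """Build the CALL statement that trains a decision tree inside Db2.

    IDAX procedures take a single string of comma-separated name=value
    parameters; the names used here follow the Db2 ML library's
    conventions, so verify them against your Db2 version's documentation.
    """
    params = f"model={model}, intable={intable}, id={id_col}, target={target}"
    return f"CALL IDAX.GROW_DECTREE('{params}')"

# Hypothetical churn-prediction setup: table CUSTOMERS, label column CHURNED.
sql = grow_dectree_sql("churn_model", "CUSTOMERS", "CUST_ID", "CHURNED")
print(sql)
```

Because the statement runs where the data lives, no rows ever leave the database during training.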
Scalability — when the Data is bigger than the Machine
Once the data arrives at the development system, the next challenge unfolds: insufficient computing resources to handle it. At the time of writing, most consumer-level laptops come with 16GB of built-in RAM, which puts a hard limit on the size of the dataset (well under 16GB in practice) these machines can process. For larger datasets, you would need a bigger development system or a distributed computing environment. Moreover, pandas, a popular Python package for data analysis, doesn't scale beyond a single compute node, so its computation is limited to that node's CPU and RAM. In this situation, many data scientists work off a subset (sample) of the dataset, which can lead to suboptimal models. Wouldn't it be nice to be able to use a larger compute footprint as the data gets bigger?
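A quick back-of-the-envelope check makes the RAM ceiling concrete. The sketch below uses a rough 2x overhead multiplier for in-memory DataFrame representations (parse-time copies, index structures, padding); that factor is an assumption for illustration, not a pandas guarantee.

```python
GB = 1024 ** 3

def fits_in_memory(n_rows: int, bytes_per_row: int,
                   ram_bytes: int, overhead: float = 2.0) -> bool:
    """Rough check of whether a dataset can be loaded on one machine.

    `overhead` is an assumed multiplier for the in-memory representation
    (copies made while parsing, indexes, padding), not an exact figure.
    """
    return n_rows * bytes_per_row * overhead < ram_bytes

# 100M rows at ~250 bytes each (~25 GB raw, as in the earlier example)
# against a 16 GB laptop: it will not fit.
print(fits_in_memory(100_000_000, 250, ram_bytes=16 * GB))  # False
```

The exact multiplier varies by data types and parser, but the conclusion rarely changes: once the raw data exceeds a fraction of RAM, single-machine processing stalls.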
Scale out Machine Learning Tasks with Db2
The benefits of in-database machine learning in Db2 aren't limited to reducing network latency. It also gives data scientists the computational horsepower of the hardware where the database system is installed: processing power, memory, and, if the dataset is partitioned across multiple Db2 nodes, distributed processing. Over the past three decades, IBM Db2 has continually optimized its query processing, most recently with ML-based query optimization for even faster query response times. Data scientists can now take advantage of this compute horsepower and database optimization when performing big-data ML tasks.
The ML algorithms inside Db2 are designed to work in a distributed environment when needed. Db2 decomposes a machine learning algorithm into smaller steps, translates them into SQL queries where possible, and pushes those queries to each data partition for parallel execution. When all nodes have completed their assigned tasks on their partition of the data, the head node combines the partial results into a single model. This parallelism significantly accelerates model building.
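The partition-and-combine pattern just described can be sketched in a few lines. The toy example below computes a global mean (a building block of algorithms such as linear regression and K-Means) by collecting a partial sum and count from each "partition" and merging them at a "head node"; Python lists and a thread pool stand in for Db2 data partitions purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_stats(partition):
    # Work pushed down to one data partition: a local sum and row count.
    return sum(partition), len(partition)

def combine(partials):
    # Head-node step: merge per-partition results into one global answer.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

# Three lists stand in for three Db2 data partitions.
partitions = [[1.0, 2.0], [3.0, 4.0, 5.0], [6.0]]

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(partial_stats, partitions))

print(combine(partials))  # 3.5
```

The key property is that each partition's work is independent, so adding nodes shrinks the wall-clock time of the expensive per-row pass, while the final combine step touches only a handful of small partial results.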
As ML becomes mainstream for enterprise use cases, more and more companies will try to create AI applications. When working with large data, it's impractical to expect everyone to stand up separate infrastructure for AI model development and deployment. With IBM Db2's in-database ML capabilities, Db2 users get a head start: they can build and deploy AI models near the data and easily integrate those models with their existing business applications. In the end, this helps Db2 customers save on IT costs and accelerates the process of turning data into assets.
For more details on Db2’s in-database Machine Learning capabilities, please check out the following resources:
5-min Demo: Building and Deploying a Linear Regression Model inside IBM Db2
5-min Demo: Building and Deploying a Decision Tree Classifier inside IBM Db2
Product Documentation: Db2 11.5.4 Knowledge Center