Database services in the cloud are gaining increasing popularity for development, test, and production systems. One of the key technical challenges for cloud database adoption is fast and secure movement of large data volumes into the cloud. This article presents 7 methods for moving data into Db2 on Cloud, Db2 Warehouse on Cloud, and Db2 Hosted. We discuss the pros and cons of each of these methods in terms of performance and functional capabilities and conclude with guidelines that help you choose the right method for your needs.
The shift to cloud computing is inevitable. For many companies and use cases the question is not if the data will ever be stored in the cloud, but when. Although there may be certain applications and types of data that will be kept on premises for a long time to come, the benefits of cloud computing in terms of agility, elasticity, and cost are becoming increasingly powerful. At the same time cloud security is making rapid advances both in terms of regulations and in terms of technical implementations such as VPN tunnels into private cloud networks. These developments continue to bring more and more data into the cloud.
The IBM portfolio of cloud database offerings can be divided roughly into two segments:
- The Db2-based offerings: Db2 on Cloud, Db2 Warehouse on Cloud, and Db2 Hosted.
- The Compose platform for managed hosting of open source databases such as MySQL, PostgreSQL, MongoDB, Redis, ScyllaDB, JanusGraph, and others.
This article focuses on moving data into the Db2-based services in the cloud. No matter whether you are just starting to test Db2 in the cloud or are already embarking on development or production projects in the cloud, one thing is for sure: one of the first tasks on your hands is to move data into the cloud database. This article describes a select set of techniques for moving data into Db2 databases in the cloud.
Options for Moving Data into a Db2 Database in the Cloud
Figure 1 illustrates several different options for writing data into the tables of a Db2 database in the cloud. Some of these options populate Db2 tables directly, others move the data into a cloud storage location such as IBM Cloud ObjectStorage or Amazon S3 and from there into the Db2 database.
In most cases the bottleneck is bringing the data into the cloud. Loading the data from a cloud storage location into the database is usually very fast by comparison, especially if the cloud storage is in the same cloud data center as the database.
Since a Db2 database in the cloud behaves much like a traditional Db2 database on-premises, many of the same techniques for loading data can be used, such as ETL tools, the LOAD CLIENT command, or federation. Additionally, new tools such as Data Connect (Data Refinery) and Lift have emerged. And, if you need to ship very large data volumes over a large distance then sending a pile of disks with a courier service (“Cloud Mass Data Migration”) can be faster than a data upload over a WAN connection.
Figure 1: Options for Moving Data into a Db2 Database in the Cloud
Db2 on Cloud, Db2 Warehouse on Cloud, and Db2 Hosted have some differences that play a role in this context. For example, Db2 Hosted is not a fully managed database service and it does not have a graphical web console like Db2 on Cloud and Db2 Warehouse on Cloud. Hence, loading through the web console is not an option for Db2 Hosted. However, Db2 Hosted leaves all the control (and responsibility) for the database administration to the user which allows you to customize backup and restore operations to move databases from ground to cloud or reverse. Figure 2 summarizes which techniques for moving data into the cloud are applicable to each of the three Db2 offerings in the cloud. The grey tick marks in Figure 2 indicate that the option is not supported at the time of writing this article but likely to become available in the future.
Figure 2: Load options for each of the three Db2 offerings in the cloud
Certainly, there are more options for moving data into a Db2 cloud database than the ones shown in Figure 1 and Figure 2. For example, data could be moved with a traditional replication tool, with the IBM Data Conversion Workbench, with custom SQL applications, or other techniques, all of which we are not going to discuss in this article. Db2 Hosted also allows you to upload data files to its file system with sftp or scp followed by a load operation.
In the following, let’s look in more detail at the seven techniques listed in Figure 2.
Load from Desktop (Web GUI)
Db2 on Cloud and Db2 Warehouse on Cloud have a web console that includes a Load menu, which is shown in Figure 3. The option "My Computer" on the left allows you to select and load a file from a file system that is accessible from the machine where the browser is running the web console. The other options (loading from Amazon S3, Softlayer, and Lift) are discussed in subsequent sections.
This load interface allows you to load only one data file at a time and is useful for adhoc load jobs with the ease of a simple GUI. If you want to load multiple data files in a single load job then you can run a LOAD CLIENT command from a local command line processor (CLP) connecting to the Db2 server in the cloud.
Figure 3: Load Menu in the Web Console
After you select the source data file the web console offers you to select a target table to load into or to create a new table based on the columns and data types detected in the source file. Of course you can adjust the automatically inferred column names and data types as needed before loading the data.
What is Aspera?
A critical feature in the Db2 (Warehouse) on Cloud web console in Figure 3 is the option to enable Aspera. Aspera is a technology to accelerate the transfer of large data files over long network distances. One of the key characteristics of Aspera is that it does not use the TCP/IP protocol and avoids the associated bottlenecks.
The main throughput bottleneck of the TCP protocol is its flow rate control mechanism in which the sender requires an acknowledgement for each packet sent to the receiver. If such an acknowledgement is not received quickly due to high network latency, the sender assumes that it is sending packets too fast and reduces the send rate. This often results in a transfer rate that is much slower than necessary.
Depending on network conditions, the use of Aspera can improve the upload speed to the cloud by one to two orders of magnitude. The first time you enable Aspera in the web console you need to download and install the Aspera browser plugin of about 20 MB.
For more details on Aspera, see http://asperasoft.com/technology/transport/fasp/.
Then you launch the load and monitor its progress in the web console and, optionally, in the Aspera plugin (Figure 4).
Figure 4: Monitoring the load job in the web console
Load from Cloud
The “Load from Cloud” option can also use the web console, except that the input file resides in the cloud rather than on premises. Therefore, in the load dialog of the web console you choose either Amazon S3 or IBM Softlayer as the location of the input file. Then you specify the authentication endpoint together with the user name and access key for the cloud storage location, as illustrated in Figure 5.
Figure 5: Select a cloud storage location to load from
Command Line Option
Alternatively, loading data files from Amazon S3 or IBM Softlayer ObjectStorage can also be done from the command line, such as a CLP session connected to the Db2 database in the cloud. Figure 6 shows a sample LOAD command and some of its output.
call sysproc.admin_cmd('LOAD FROM
myfolder/tabledata.csv OF DEL
MESSAGES ON SERVER
INSERT INTO myschema.mytable');
SELECT SQLCODE, MSG
FROM TABLE(SYSPROC.ADMIN_GET_MSGS('62127043_1346722415_')) AS MSG
Figure 6: LOAD command to load data from Softlayer or Amazon S3
Let’s look at the LOAD command in Figure 6 in more details. First, note that this type of LOAD command needs to be submitted through the admin_cmd stored procedure. Second, the location of the input file needs to follow a particular syntax (shown in red), which for Softlayer has the following structure:
And third, this LOAD job will not generate all messages in the same way as you are used to from a local Db2 installation. Therefore we recommend that you specify the MESSAGES ON SERVER option of the LOAD command so that messages are collected and retained at the Db2 server. This also triggers the LOAD utility to spit out two SQL statements that you can use later. The first is a query to read out the messages from your load job. The second is a stored procedure call to remove those messages from the server.
The moveToCloud Script
Regardless of whether you invoke the load in the web console or in the CLP, this “Load from Cloud” option assumes that the input file already resides in the selected cloud storage location. For example, another component of your solution might store data files in the cloud on a regular basis for Db2 or other systems to consume.
If you want to upload existing local data files to Amazon S3 or IBM Softlayer ObjectStorage you can use, for example, the moveToCloud script that is available in the Db2 on Cloud Knowledge Center:
The moveToCloud script can run on Unix, Linux, or Windows with Cygwin. It requires Perl 5.10 or higher, cURL, and gzip. The script automatically splits large data files into smaller pieces and compresses them for the upload process. The split and compressed files are automatically uncompressed and recombined when they are loaded into your database. Figure 7 shows an example of how the moveToCloud script can be invoked.
moveToCloud.pl -source largeFile.csv
Figure 7: Sample invocation of the moveToCloud script
Lift is an IBM tool specifically designed for ground-to-cloud database migrations. Lift CLI is the second and vastly improved version of the Lift tool. Unlike the first version of Lift, Lift CLI is a text-based command line tool that you download from https://lift.mybluemix.net/ and execute locally to migrate data. Lift CLI runs on Windows, MacOS, RHEL, SUSE, and Ubuntu.
Lift CLI has several desirable properties:
- Aspera is built into Lift and automatically used to accelerate the data upload. No extra download of Aspera is required. (Fast)
- Lift CLI automatically recovers from network issues such as dropped connections and can resume broken uploads. (Resilient)
- 256-bit encryption is used for the data transport. (Secure)
- Lift CLI offers just 6 basic commands to migrate data into the cloud. (Simple)
- The 6 lift commands have a broad range of optional parameters that you can use if needed to customize the migration steps. (Flexible)
- Lift CLI is free. (Nice)
As of the time of writing, the only target database supported by Lift CLI is Db2 Warehouse on Cloud.
Figure 8 illustrates the basic steps of migrating data with the Lift CLI tool. The lift extract command can be used to export data from a source database into flat files in CSV format. However, you can also use any proprietary export tool to unload data into a CSV file. The lift put command is used to upload a data file –with Aspera acceleration- to a landing zone in the file system of the Db2 Warehouse on Cloud server. Then, the lift load command can be used to load the uploaded file into a Db2 table.
Figure 8: Overview of the Lift CLI tool
The table in Figure 9 summarizes the six available Lift CLI commands.
Lift Command Description
lift extract Extracts data from a source database into one or more CSV files
lift put Uploads data to the landing zone on the Db2 Warehouse on Cloud server
lift load Loads the data from the landing zone into a Db2 table
lift df Shows the amount of space available in the landing zone
lift ls Lists the files in the cloud landing zone
lift rm Removes file(s) from the cloud landing zone
Figure 9: Summary of Lift CLI commands
Figure 10 show an example of a lift put command and its output during the upload process. The –f option of the lift put command specifies the file to upload. The –pf option specifies a properties file that contains key value pairs of additional parameters that you might not want to type out on the command line for each invocation of the lift command. For example, the properties file might contain the database and hostname of the target database, the user name and password to authenticate to the target database, and so on.
The output of the lift put command is simple but useful. It includes a “progress bar” with an increasing number of # signs as well as a numeric progress indicator counting the Megabytes or Gigabytes already uploaded. Additionally, the output shows the current throughput in Mb/s and the estimated remaining time for completion. Further details are captured in a log file on disk.
Figure 10: Screenshot of a lift put command
Another technique for moving data from a local database to a cloud database -or between any two databases for that matter- is federation. Federation allows one database (the federation server) to access tables in other databases as if these tables were local tables. The federation server needs three ingredients to setup access to remote tables:
- Host name, port number, and database name of the remote database
- User name and password to access the remote database
- A local nickname definition for each remote table that should appear as if it was a local table. A nickname is an alias that looks like a local table but actually refers to a remote table.
Figure 11 shows two possible options for setting up federation between a local database and a Db2 Warehouse on Cloud database. In both cases the objective is to move data from the source table (yellow) into the target table (blue).
In option 1 the local database acts as the federation server and defines a local nickname that represent the target table in the Db2 Warehouse on Cloud. In the local database you can then copy the data from the source table to the nickname, using for example INSERT INTO SELECT * FROM . Inserting rows into the nickname actually transmits the data to the target system which inserts the data into the target table.
In option 2 the Db2 Warehouse on Cloud acts as the federation server and defines a nickname that represents the source table from your local database. In the cloud database you can then copy the data from the nickname to the target table, using for example INSERT INTO SELECT * FROM or a load-from-cursor operation. Reading rows from the nickname actually pulls the data from the source system.
Figure 11: Two options for setting up federation. Which one is better?
Now the obvious question is which of these two options is better? There are functional and performance considerations to answer this question.
Functionally, options 1 and 2 differ in the direction of the required network connection. In option 1, writing into the nickname in your on-prem database creates a network connection to the Db2 Warehouse on Cloud server. This is outbound network traffic from your on-prem data center. In option 2, reading from the nickname in Db2 Warehouse on Cloud means that the cloud database initiates a connection to your on-prem database, possibly setting off alarm bells with your local firewall or system administrator. Therefore, option 2 requires additional network and security considerations. For example, setting up a Secure Gateway is one option to establish a secure connection with fine grained access control.
In terms of performance, option 2 can easily be 10x to 50x faster than option 1. The reason is that fetching rows from a remote table is significantly more efficient than inserting rows into a remote table. Therefore, we strongly recommend option 2.
Data Connect / Data Refinery
Data Connect is a fully managed cloud-based application that enables users to read data from one or multiple data sources, evaluate the data quality, cleanse and transform the data as needed, and write the output to a target system. The data sources can be in the cloud or on premises or both. The target system can also be in the cloud or on premises (Figure 12). Data Connect is meant to be an easy-to-use self-service data preparation tool rather than a full-fledged ETL product.
The Data Refinery is the next generation of Data Connect and offers much tighter integration with other cloud services in the IBM Watson Data Platform. We expect that the Data Refinery will eventually fully replace Data Connect. In the remainder of this section we refer to Data Connect but everything is equally applicable to the Data Refinery as well.
Figure 12: High Level view of Data Connect / Data Refinery
Data Connect includes a broad range of connectors to access a diverse set of data sources and targets. This includes Db2 on all platforms, Db2 in the cloud, most other relational databases, Big Data systems such as Hortonworks, Cloudera Impala, and Hive, as well as cloud file storage systems such as Amazon S3 or IBM ObjectStorage.
When you add data sources to a Data Connect job, the tool reads a sample of the source data (1000 rows per source) and automatically analyzes, profiles, and classifies the data to visualize the content and quality of the data. For each input column, Data Connect calculates quality scores, value distributions, and classification metrics that help users understand the data they are working with. This is illustrated by the screenshot in Figure 13.
Data from multiple data sources can be combined with join, merge, or union operations. Additionally, data can be formatted, transformed, and cleansed with more than 60 data shaping operations. For example, you can remove duplicates, remove rows with NULL or empty values in specific columns, trim white space, change number or date formats, combine or remove columns, compute new columns, or define custom filters to move only a subset of the data. If the data source is a relational database like Db2, the filters are pushed down as SQL predicates when reading from the source. Under the covers Data Connect generates a Spark job that performs the actual data wrangling. Jobs can be scheduled for immediate, delayed, or periodic execution to perform the specified data transformation and movement.
Figure 13: Data quality and data distribution metrics in the Data Refinery
Clearly, Data Connect / Data Refinery offers much more than plain data movement “from A to B”. Some of the benefits include:
- Convenient UI for basic to moderate data preparation and cleasing jobs
- Easy to shape, cleanse, filter, trim, deduplicate, convert, format, transform your data
- Can read from many data sources, not just flat files or Db2
- Can combine data from multiple diverse data sources
- Can populate multiple tables in one job
However, the price for all of these benefits is performance. This tool has much more focus on data preparation and cleansing than maximum bulk load performance into the cloud. If the target is a relational database such as Db2 on Cloud, insert statements are performed rather than LOAD jobs.
Data Connect / Data Refinery can be a great tool for populating a Db2 database in the cloud when data cleansing and access to diverse data sources is more important than maximum bulk load performance for very large data volumes.
If you want to migrate a local Db2 database to a Db2 Hosted instance you can also use backup and restore operations with remote storage access to move the database. The general idea is that you create a backup of your local database in an IBM Cloud ObjectStorage followed by a restore into your Db2 Hosted database.
Such a migration takes the following steps:
- Create an IBM Cloud ObjectStorage account and a container in the same data center as your Db2 Hosted server.
- Use the CATALOG STORAGE ACCESS command in your local Db2 database to create an alias that points to your Cloud ObjectStorage container.
- Use the BACKUP DATABASE command to backup your local database to the storage alias that you created in step 2.
- Use the CATALOG STORAGE ACCESS command in your Db2 Hosted database in the cloud to create an alias that points to the same Cloud ObjectStorage container.
- Use the RESTORE DATABASE command in Db2 Hosted to restore the backup image from step 3 using the storage alias that you created in step 4.
Note that remote backup/restore is supported with both IBM Cloud ObjectStorage and Amazon S3. However, for data migration purposes it is best to use a cloud storage location that is as close as possible to your target database in the cloud. For example, if you are using Db2 Hosted on AWS then Amazon S3 would be the more suitable storage location for the backup.
For more details on using remote storage locations, see:
Cloud Mass Data Migration
If tens or hundreds of Terabytes of data need to be moved into the cloud at one time, chances are that the network is a significant bottleneck and the upload would take weeks - even with Aspera acceleration. In this case it can be faster to copy the data onto portable disks, compressed and encrypted, and ship the disks to the IBM Cloud data center of your choice. IBM offers two such options: Disk Transfer Service and Cloud Mass Data Migration.
The Disk Transfer Service (aka Data Transfer Service) allows you to send your own USB 2.0 or USB 3.0 compatible disks to an IBM Cloud data center. This process is initiated through a support ticket. When the data center receives your disks the IBM staff will connect them to your database server or private cloud network and mount the file systems. Subsequently you can initiate LOAD operations to move the data into your database tables. Note that that you cannot manipulate the files or file names on your disk once the disk has been shipped and attached. Therefore it is recommend that you test your LOAD commands with small samples of the data files before you ship the disk. After the data load the disks will be returned to you.
For further information see:
While the Disk Transfer Service uses customer provided USB drives, the Cloud Mass Data Migration uses a large-capacity portable storage device provided by IBM (Figure 14). This device has a usable capacity of 120 TB configured as RAID 6. RAID 6 is similar to RAID 5 but uses two parity blocks instead of just one to achieve even higher fault tolerance. The storage device uses LZ4 compression and AES 256-but encryption by default.
When you receive the device you can connect it to your network as a Network File System (NFS) or Server Message Block (SMB) share. Then you copy your data onto the device and request return shipment to the IBM data center – usually via UPS next day delivery. In the data center your data files will be copied to Cloud ObjectStorage from where the data can be loaded into Db2 tables. After successful migration your data will be deleted from the storage device with a 4-pass DOD-level data wipe procedure.
For further information see:
Figure 14: Portable storage device for Cloud Mass Data Migration
The most critical factors for the performance of moving large data volumes over a long distance network are the network latency and bandwidth. Be aware of what those boundaries are for your location by performing a speed test between your local network and the cloud data center. At http://www.softlayer.com/data-centers you find a pull down list of all IBM data centers per continent with a link to launch a speed test for the data center of your choice. Figure 15 shows a sample result of a speed test to the IBM data center in Frankfurt. The result of the speed test gives you an initial indication of the upload performance that you can expect.
Figure 15: Speed test to an IBM cloud data center (Frankfurt in this example)
The IBM Db2 servers in the cloud usually have a 1 Gbps connection to the public network; some of the larger Db2 configurations in the cloud have a 10 Gbps connection. Hence, that side of the network is typically not the bottleneck. Usually the bottleneck is on the route between your location to the IBM data center or within your own company network. You can use the Softlayer looking glass (http://lg.softlayer.com/) to gain more insight into the network path and its performance, such as:
- The trace route and all the intermediate networks hops. Fewer hops are (often) better.
- A breakdown of the network latency along the hops. This might help you understand network performance within your company vs. outside the comnpany network.
- The looking glas can analyze the network performance to any visible machine, not just the laptop on which your browser is running. For example, you could look at the traceroute between a specific database or file server in your company and the cloud data center.
Figure 16 shows an example of using the Softlayer looking glass to analyze the trace route from the IBM data center in Frankfurt to a local server (IP address) of my choice. The output below reveals that there are seven network hops. The first two hops from the Frankfurt data center have a low average latency of 1.7 and 2.4 msec, respectively. The subsequent hops are more costly with an average latency between 6.8 and 12.5 msec, with no packet loss.
Figure 16: Example of the Softlayer Looking Glass
In addition to the speedtest and the looking glas we recommend to perform small data movement tests before embarking on a major data migration project. Use a few Gigabytes of sample data to experiment and measure different options and settings.
Comparison and Guidelines
The table in Figure 17 provides a high level comparison of the different methods for moving data into the cloud that we discussed in this article. Some of the results visible in Figure 17 include:
- If you are looking for options with a graphical user interface then the Load dialog in the Db2 (Warehouse) on Cloud web console as well as Data Connect / Data Refinery are good options. However, these two options are not suitable if you prefer a command line interface where jobs can be coded and scripted.
- Of the methods described in this paper, Data Connect / Data Refinery is the only option that offers a good scheduling capability to run jobs at specific and/or repeating points in time. The Db2 (Warehouse) on Cloud web console allows you the schedule load jobs from a cloud location such as IBM Softlayer ObjectStorage or Amazon S3.
- If you want the data movement into the Db2 cloud database to also perform data cleansing or transformations then Data Connect / Data Refinery is a good choice – short of using a full-fledged ETL tool such as Datastage or Informatica. Federation also supports some data manipulation because the SQL statement that reads data from a source table can include predicates, functions, arithmetic, or any SQL-based data manipulation.
- Our performance tests have shown that Lift with its automatic use of Aspera offers the best upload performance, followed by load through the web console with the Aspera plugin. The Cloud Mass Data Migration service offers the best performance when the data volume is too large for any other option to be feasible.
- Data Connect / Data Refinery offers the broadest functionality in terms of supported data sources and data shaping operations at the expense of maximum performance. Still, it’s a great choice for moderate data volumes or incremental data movement when the functional richness of the tool is required.
Figure 17: Comparison of the seven methods
In this article we have described seven methods for moving data into a Db2 database in the cloud. These options differ in their functional capabilities, usage model (GUI vs, command line), and performance. For an initial bulk data movement from ground to cloud the best options are usually the Lift CLI or the Cloud Mass Data Migration service. Loading through the web console is convenient and reasonably fast but difficult to automate and not available for Db2 Hosted. Federation is a very flexible approach, not only for ground to cloud but also for cloud to cloud data movement between Db2 databases in the cloud. However, make sure that the target database acts as the federation server to achieve better performance. Data Connect / Data Refinery provides great capabilities and ease of use, esp. if you have non-Db2 data sources or require data cleansing. Don’t forget to perform a speed test to discover your physical network limitations and perform small upload tests (few GBs) before trying to move larger data volumes.
Documentation for loading data into Db2 databases in the cloud