Exploiting Amazon S3 with DB2 for LUW

It is pretty obvious to even the most casual observers of the IT landscape that there is a big focus on “cloud computing”, which is defined on the IBM website as “the delivery of on-demand computing resources—everything from applications to data centers—over the Internet on a pay-for-use basis”.

DB2 for LUW has undergone a lot of changes to become more “cloud friendly”. There has been a focus on ease of use and autonomics, which make running DB2 in a cloud environment easier, and the Data Server Manager (DSM) console has been adapted to provide a useful front end for IBM’s Software as a Service (SaaS) offerings under the dashDB banner.

This article focuses on one particularly useful aspect of the changes made to support cloud architectures – the ability for DB2 to read to and write from cloud object storage facilities directly.  DB2 directly supports two object storage facilities – SoftLayer ObjectStore and Amazon Simple Storage Service (S3). The principles and processes are very similar. However the IBM documentation focuses heavily on using SoftLayer (not surprising since this is IBM’s offering). This article will focus on how to set up and use DB2 with Amazon S3.

Introduction to Cloud Storage

There are different types of storage available from most cloud providers, including Amazon:

  • Block storagestorage that is the cloud equivalent of regular disks / file systems sourced from a SAN. You create block storage “drives” and attach them to cloud servers to provide persistent storage i.e. storage which is retained over cloud server reboots. Amazon AWS calls this Elastic Block Storage (EBS).
  • Network file systemsstorage that is the cloud equivalent of file systems sourced from Network Attached Storage (NAS), conforming to the NFS protocol. Can be mounted on multiple machines, either cloud servers or machines outside the cloud environment altogether. Can be used for situations where files need to be accessed from multiple locations. In a DB2 context a potential use for this would be for load copy datasets in an HADR environment. Amazon AWS has recently started to offer this facility, which they call Elastic File System (EFS).
  • Object storage – this is primarily what this article is about. It stores objects (files and associated metadata) using an API. Internal contents of the files are not directly accessible. Amazon AWS calls this Simple Storage Service (S3). We will say much more about this later.
  • Archive storagesimilar to object storage, but designed for long term, large scale, low cost archiving. To keep the cost to a minimum, access times are greater both because the hardware used is cheaper and the amount of network bandwidth dedicated to retrieving the objects from this storage is capped. Amazon AWS calls this type of storage Amazon Glacier.
  • Content Delivery Network (CDN) storagea specialized form of object storage used to deliver web assets efficiently and reliably. Many common components used by many web applications are available from a variety of CDN providers – for example the commonly used jQuery JavaScript library is normally sourced from a CDN rather than deployed as part of a bespoke application build. Amazon AWS provides this facility as part of their Amazon CloudFront offering.

The latest trend in cloud storage is to combine the capabilities of multiple types of storage to provide hybrid facilities. For example, Amazon combines S3 and Glacier to provide a multi-tier archiving solution. There has also been a focus on providing facilities that bridge the gap between on premise and cloud-based environments; recent changes in this space have included storage that can be accessed equally well from both environments and pricing changes to encourage this form of hybrid usage.

More About Amazon S3 Object Storage

As mentioned briefly above, cloud object storage stores “objects”, which are basically files and associated metadata.  Manipulation of the objects is through an API, originally programmatic, but more recently also RESTful APIs. Amazon S3 was the first offering in this space (introduced in 2006), and the S3 API is widely supported – so much so that even rival services such as SoftLayer Object Store have started to offer support of the S3 API in some of their environments.

Amazon S3 provides versioning capabilities, which opens up a range of possibilities for exploitation. Although DB2 does not use this capability, it is something that you need to be aware of. If you define your object store incorrectly, you could end up retaining objects (and paying for the storage involved) which you thought you had removed. This problem is probably familiar to users of Tivoli Storage Manager (TSM) with DB2, since the default definitions in TSM always retain one copy of a file, even when it has been released by the source location (many customers found they were retaining DB2 backups in TSM, even when DB2 thought they were gone, because of this default configuration).

Typical use cases of object storage with DB2 include backups, data exports, load, import and ingest files and archive log storage. We will see shortly that IBM has implemented direct support for all of these apart from archive log storage.

Amazon S3 Terminology

To understand the discussion that follows, a brief description of Amazon S3 terminology is required.

An object in Amazon S3 can be addressed using a URL (in the default region) of the format –

http://<bucket>.s3.amazonaws.com/<folder-path>/<object>

Other Amazon regions have similar URLs, but “s3.amazonaws.com” is replaced by a region specific reference.

At the top level of the Amazon S3 hierarchy is the “bucket”. It is the bucket that is associated with your Amazon account (determining who pays the charges for the storage used) and is also the aggregation level for usage reporting.

Within a bucket, you can arrange your objects in “folders”, much the same as in a regular file system.  The “folder-path” can be multiple levels deep.

Finally, there is the object itself, which is basically the name of the individual file that has been stored in Amazon S3.

There are security controls at all levels of the hierarchy. Public access can be given to a whole bucket if wished, but on the other hand access can be restricted to an individual file in a particular container.

To access non-public areas of Amazon, S3 requires two authentication components which Amazon calls the “Access Key ID” and the “Secret Access Key ID”. These are produced using the S3 console. It is worth noting that keys can only be saved at generation time – it is not possible to go back and retrieve these later, so if they are lost, a new set of keys must be generated.

It is worth pointing out at this point that Amazon S3, like most Amazon AWS services, is available in different geographic regions around the world. The default region is in the US (it is actually the Amazon US East region, which is in Virginia). If you are required by local regulations to store your data in a particular geographical region then you should ensure that you specify this when setting up your buckets and when using the DB2 commands to interact with S3.

Using DB2 with Cloud Storage Prior to Direct Support

Even before DB2 introduced direct support for Amazon S3, many customers were exploiting it. To do so, they had to develop their own scripts, calling the Amazon S3 API, and had to have enough local storage available to accommodate the files temporarily. While it is possible to use the native API directly, there are modules or packages available for most programming languages to simplify the process. As an example of what can be done, here is a Perl script that uses the Amazon::S3 module from CPAN to store a DB2 backup in S3 -

#!/usr/bin/perl
use Amazon::S3;   
my $aws_access_key_id     = ‘????'; # (20 chars)
my $aws_secret_access_key = ‘????'; # (40 chars)  
my $bucket_name           = 'scotdb.private';
my $bucket_backup_dir     = 'DB2Backups';
my $local_backup_dir      = '/db2back';
my $s3 = Amazon::S3->new({   
  aws_access_key_id     => $aws_access_key_id,    
  aws_secret_access_key => $aws_secret_access_key,   
  retry => 1,   
});

# Get a list of the bucket’s current files
# (in the backup directory in the bucket)
my $bucket = $s3->bucket($bucket_name);   
my $response = $bucket->list_all   
  or die $s3->err . ": " . $s3->errstr;
my @remote_backup_files;   
foreach my $key ( @{ $response->{keys} } ) {   
  $all_remote_files{$key->{key}} = 1;   
  if ($key->{key} =~
    m/^$bucket_backup_dir\/([-\w\s\.\/]+)$/) {   
    push @remote_backup_files, $1 if !($1 =~ /.*\/$/);   
  }   
}

# Add new files to bucket
foreach (@local_backup_files) {   
  my $local_file = $_;   
  if (!grep(/^$local_file$/, @remote_backup_files)) {   
    my $response = $bucket->add_key_filename(
"$bucket_backup_dir/$local_file",   
      "$local_backup_dir/$local_file",   
      { content_type => 'application/octet-stream', },        
    );   
    if ($response) {print "Added $local_file\n"; }   
    else {print "FAILED $local_file\n";}   
  }
}

Another example of this technique can be found at an early IDUG blog post, which describes how to write a DB2 user exit to handle archive logging to S3: http://www.idug.org/p/bl/et/blogaid=5. Because of the age of this post some cautionary notes are required.  The DB CFG parameter USEREXIT was discontinued in DB2 V9.7; from V9.5 onward the correct way to specify a user exit is to be used is to specify a value of USEREXIT for the DB CFG parameter LOGARCHMETH1. The user exit script must be called db2uext2 and must reside in the sqllib/adm directory. It is important that the script handles both of the requests that DB2 can send (ARCHIVE and RETRIEVE) and that it produces error messages are produced that DB2 can understand. It is also important that files are copied (not moved) to the archive, as DB2 expects them to be left in place for it to clean up in the active log directory.

First Signs of Integration - db2RemStgMgr

The first signs that IBM was developing closer integration with Amazon S3 came in DB2 V10.5 Fix Pack 5. At that point, a new standalone command “db2RemStgMgr” appeared. So far, this command is still undocumented in the DB2 Knowledge Center; however simply running the command without any options will give the syntax.

The db2RemStgMgr command is a command line interface for interacting with cloud storage, with functionality roughly equivalent to the db2adutl command for interacting with TSM.

The general format of the command for use with an S3 target is -

db2RemStgManager S3 <action> container=<bucket> <parameters>

Any mandatory parameters which are not specified on the command line are prompted for. This is useful if you don’t want to put the authentication parameters in clear text. The parameters for the credentials are auth1 (for the Access Key ID) and auth2 (for the Secret Access Key ID).

The actions which are available are –

  • put : uploads from local file system to cloud storage
  • get : downloads from cloud storage to local file system
  • delete : removes object from cloud storage
  • list : lists the contents of a bucket
  • getObjectInfo : retrieves information about the object

Here is an example of listing the full contents of a bucket called scotdb.sample -

db2RemStgManager S3 list auth1=<mykey1> auth2=<mykey2> container=scotdb.sample

===========================================================================
Total number of files fetched = 3.
---------------------------------------------------------------------------
Size(bytes) Object Keys                                                    
----------- --------------------------------------------------------------- 
0           DB2Backups/
25247744    DB2Backups/DPFSI001.0.db2inst1.DBPART000.20161006162924.001
6762704896  DB2Backups/DPFSI002.0.db2inst1.DBPART000.20161006162932.001
===========================================================================

It is also possible filter what is shown by list using the prefix parameter -

db2RemStgManager S3 list auth1=<mykey1> auth2=<mykey2> container=scotdb.sample prefix=DB2Backups/DPFSI001

===========================================================================
Total number of files fetched = 1.
---------------------------------------------------------------------------
Size(bytes) Object Keys                                                    
----------- --------------------------------------------------------------- 
25247744    DB2Backups/DPFSI001.0.db2inst1.DBPART000.20161006162924.001

Unless you specify otherwise, the command will only search buckets stored in the default S3 bucket location, which is US East (Virginia). To list a bucket in another Amazon region, you must specify the region using the “server” parameter.  For example, to find a bucket called scotdb.eu in the EU West (Ireland) region, you would run -

db2RemStgManager S3 list server=s3-eu-west-1.amazonaws.com auth1=<mykey1> auth2=<mykey2> 
container=scotdb.eu

===========================================================================
Total number of files fetched = 3.
---------------------------------------------------------------------------
Size(bytes) Object Keys                                                    
----------- --------------------------------------------------------------- 
0           DB2Backups/
177090560   DB2Backups/SAMPLE.0.db2inst1.DBPART000.20170105132156.001
177090560   DB2Backups/SAMPLE.0.db2inst1.DBPART000.20170106013801.001
===========================================================================

More Complete Integration in DB2 for LUW V11.1

Direct integration between the DB2 engine and Amazon S3 started to appear in DB2 for LUW V11.1.  Partial documentation for this feature appeared in the DB2 Knowledge Center when DB2 V11.1 became generally available, although not all of the documented functions were actually available for use with Amazon S3 until Fix Pack 1 (the GA release was more complete for SoftLayer).

At this point, DB2 can directly access cloud storage for backups, loads and ingests.  This is obviously still a work in progress. There is additional functionality available in the latest dashDB offerings, and we can expect to see this filtering through to on-premise DB2 soon.

Registering A Cloud Storage Location With DB2

A new command “CATALOG STORAGE ACCESS” is provided by DB2 to register a cloud storage location with DB2.  The syntax of the command is -

>>-CATALOG STORAGE ACCESS--ALIAS--alias-name--------->

>-VENDOR--+-SOFTLAYER-+--SERVER--+-DEFAULT--+-------->
          '-S3--------'          '-endpoint-'  

>-USER-storage-user-ID---PASSWORD-storage-password--->

>-+-------------------------------+-+---------------+>
'-CONTAINER-container-or-bucket-' '-OBJECT-object-'  

>-+-----------------------+-------------------------><
  '-+-DBGROUP--group-ID-+-'  
    '-DBUSER--user-ID---'

This command, much like the CATALOG NODE and CATALOG DATABASE commands, takes effect at an instance level. This means that any catalogued storage access alias is potentially available to any database on that instance. By default, a storage access alias is only accessible to users with SYSADM privileges.  However, the DBGROUP and DBUSER options allow either a group or user to be specified which can have access, allowing aliases to be created for use by non-privileged users.

When registering an alias pointing to Amazon S3, the <storage-user-ID> is the Access Key ID and the <storage-password> is the Secret Access Key ID. These credentials are stored in the same keystore as is used by native encryption, which is specified by the DBM CFG parameters KEYSTORE_TYPE and KEYSTORE_LOCATION. If these parameters are not set (i.e. if you haven’t already set up the instance for native encryption) then the CATALOG STORAGE ACCESS ALIAS command will fail.

Here is an example of registering an alias called “scotdbus” for a bucket “scotdb.scotdbus” located in the default Amazon S3 location (US East 1 Virginia) -

CATALOG STORAGE ACCESS ALIAS scotdbus VENDOR S3 
SERVER DEFAULT
USER <Access Key ID> PASSWORD <Secret Access Key ID>
CONTAINER scotdb.scotdbus;

To register a bucket in a non-default region, the SERVER parameter must be used to specify the appropriate region as documented at http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region.

For example, to catalog an alias “scotdbeu” for a bucket “scotdb.scotdbeu” in the EU West 1 (Ireland) region the command would be -

CATALOG STORAGE ACCESS ALIAS scotdbeu VENDOR S3 
SERVER s3-eu-west-1.amazonaws.com
USER <Access Key ID> PASSWORD <Secret Access Key ID>
CONTAINER scotdb.scotdbeu;

It is also possible to catalog an alias which points to an Amazon region, without identifying a specific bucket, by omitting the CONTAINER parameter altogether. For example to register an alias “s3euw1” which could be used to refer to any bucket with the EU West 1 (Ireland) region the command would be -

CATALOG STORAGE ACCESS ALIAS s3euw1 VENDOR S3 
SERVER s3-eu-west-1.amazonaws.com
USER <Access Key ID> PASSWORD <Secret Access Key ID>;

Whether you have registered an alias pointing to a specific bucket or a whole region leads to some subtle differences in the URLs used to refer to these aliases within DB2 commands.

To view all registered aliases on an instance, use the “LIST STORAGE ACCESS” command. To remove an alias, use “UNCATALOG STORAGE ACCESS ALIAS <alias>”.

It is possible to rotate the master key in the keystore used to encrypt the Amazon S3 login credentials using “ROTATE MASTER KEY FOR STORAGE ACCESS”.

Referring to Storage Access Alias in DB2 Commands

Once you have defined a storage access alias, it can then be referred to in various DB2 commands (currently the utilities BACKUP, RESTORE, LOAD and INGEST). This is done by providing a URL starting with the “DB2REMOTE://” identifier. There are two subtly different URL formats, depending on whether the alias you have defined is a “bucket alias” or a “region alias” (as described above).  The formats are –

DB2REMOTE://<bucket-alias>//<storage-path>/<file-name>

or

DB2REMOTE://<region-alias>/<bucket>/<storage-path>/<file-name>

Notice the highlighted parts of the URLs, particularly the two slashes when using a “bucket alias”. This is required because the “bucket alias” is the equivalent of a “region alias” followed by the “bucket”.

So, for example, consider the following object –

http://mybucket.s3.amazonaws.com/myfolder/myobject

It is possible to configure a region alias, in this case for the default region; assume this was called “s3region”. It is also possible to configure a bucket alias for the bucket in this example (“mybucket”); assume this was called “s3bucket”.

Then we could refer to the object using either –

DB2REMOTE://s3region/mybucket/myfolder/myobject

or

DB2REMOTE://s3bucket//myfolder/myobject

There does not appear to be any performance implications of either option. Which option is chosen will probably depend on who needs to access the alias – it is likely that the more restricted bucket aliases will be provided for non-privileged users to use, since this limits the scope of what they can access/

Note that the specification of an individual object in the URL is optional, being only required when DB2 needs to be given an individual file name (e.g. for a LOAD).  For backup and restore operations the URL only needs to go down to folder level, since DB2 controls the names of the files produced or consumed (as with all other executions of these commands).

Backing up a Database to Amazon S3

Let us assume that we have defined the following storage access aliases –

  • scotdb : bucket alias to bucket scotdb.backups on s3-eu-west-1.amazonaws.com
  • s3euw1 : region alias for s3-eu-west-1.amazonaws.com

To backup the SAMPLE database to a folder DB2Backups on the scotdb.backups bucket, either of the following two commands could be used -

BACKUP DATABASE SAMPLE TO DB2REMOTE://scotdb//DB2Backups

or

BACKUP DATABASE SAMPLE TO DB2REMOTE://s3euw1/scotdb.backups/DB2Backup

Note, as previously highlighted, the double slash (‘//’) when using a bucket alias.

As with all backups, DB2 determines the filename to be used (contrary to current documentation).

Loading from an Amazon S3 Bucket

Again, assume that “scotdb” has been catalogued as a bucket alias for bucket scotdb.loadfiles.  To load a file loaddata.ixf from folder loadfiles in this bucket, the following command would be used -

LOAD FROM DB2REMOTE://scotdb//loadfiles/loaddata.ixf OF IXF
REPLACE INTO MYTABLE … ;

This time, the file name as well as the bucket and folder path needs to be specified.

The equivalent command using region alias s3euw1 would be -

LOAD FROM DB2REMOTE://s3euw1/scotdb.loadfiles/loadfiles/loaddata.ixf OF IXF
REPLACE INTO MYTABLE … ;

Managing Cloud Storage Costs

It is important to understand the various factors in the Amazon S3 pricing model in order to make the most cost effective use of this resource. As a basic principle, Amazon S3 favors using services which are collocated physically (in the same Amazon region) – this is very much in line with pricing models from most cloud service providers.

Here are the basic principles of Amazon S3 pricing –

  • There is a charge per gigabyte of storage used. Large storage users (currently about 1 TB and larger) get a small price reduction per gigabyte used.
  • Uploads from external locations into Amazon S3 are free.
  • Downloads from S3 to servers in the same Amazon region are free.
  • There are charges for downloading to servers in other Amazon regions and to external locations (external charges are approximately four times higher than inter-region charges).
  • There is a very small charge for using API calls. This should not be a problem unless a process is written to constantly poll S3.
  • For downloads, there is a “requester pays” option which might be useful in some circumstances – for example if providing data services to clients.

Translating these principles into usage with DB2 the following should be noted –

  • If using servers located in Amazon AWS, then basically the only cost associated with using Amazon S3 is for the storage used (assuming the servers are all in the same Amazon region)
  • If using servers located outside Amazon, then there are no data transfer charges for backing up databases. However, there will be data transfer charges if a restore is required. Therefore it is good solution for environments where restores are infrequent, but may not be cost effective for environments where backups are frequently restored (e.g. where backups are used for restoring databases for running tests)
  • Unless you have servers in Amazon, it is probably not cost effective to store load files in Amazon S3 as there will be data transfer charges every time a load is carried out. However if you have to distribute data to customers then it may be worth exploring the “requester pays” option (or at least build the data transfer costs into your pricing for providing data).

To illustrate how much data download charges can affect the overall bill, consider the example of storing 1 TB of data in Amazon S3. At current prices, the cost of storing this for one month is approximately $30. However, if you were to transfer this 1 TB from S3 to an external server, then the cost just for transferring the data would be approximately $90 i.e. three times the cost of simply storing the file for a month.

Security and Data Location Considerations

All data transfers to and from Amazon S3 are encrypted (using HTTPS).

AWS S3 offers encryption at rest, but it isn’t clear whether DB2 exploits this. It is therefore recommended that DB2’s backup encryption be used.

Be very careful if your company has legal requirements for physical location of data.  Remember that the default location for most Amazon services is in the US. So you must ensure when creating buckets that these are created in the appropriate Amazon region. Then, when cataloging aliases, ensure that the URL for the appropriate region is specified using the SERVER keyword.

DB2’s Cloud Storage Support is a Work in Progress

The direct support for Amazon S3 is a great help, particularly for those who are already hosting services in Amazon AWS.   Similar functionality with SoftLayer also helps for those hosting services in that environment (and when using IBM Bluemix services hosted on SoftLayer).

However, there are still some missing pieces which are required before the services can be considered complete –

  • Ability to EXPORT directly to cloud storage
  • Support for direct archive logging (archive and retrieval) directly to cloud storage

There is also a need for DB2's instrumentation and error handling to fully recognize and support the cloud storage facilities, specifically –

  • Full information about bucket / folder paths in LIST HISTORY and associated SQL functions
  • Progress status of utilities needs to available when writing to cloud storage
  • Error messages returned need to be improved. Currently errors are returned as 8-character hex codes, which are internal DB2 codes. These codes either need to be documented as they currently exist or transformed into something that is suitable for external use.

Documentation also needs to be completed for the cloud storage facilities.  There is still no documentation for db2RemStgMgr. But more important is the fact that there is no documentation of considerations for the server environment when using cloud storage.   For example, it isn’t documented whether there is any local storage used as a temporary buffer when using cloud storage, and if so how much and where that storage needs to be available.   Information on the overall architecture and its impact on local servers would help in capacity planning.

Conclusion

IBM has made good progress in making DB2 more “cloud friendly”, but there are still some enhancements required to support the full range of services offered by on-premise storage management solutions such as Tivoli Storage Manager.

Care should be taken as to where and how cloud environments are configured to avoid regulatory problems concerning “out of region” storage. An understanding of the Amazon S3 pricing model is necessary to prevent unexpected costs, particularly when downloading and transferring data.

 

Recent Stories
DB2 12 In-Memory Feature: Fast Index Traversal

Looking at the DB2 12 Enhanced Merge testing

On the waves of DB2 12 for z/OS