It is rare that Information Management topics become discussion points in the mainstream media; however, Big Data is a popular topic that has even made it onto the “Start the Week” discussion program on BBC Radio 4: right at the heart of the establishment. The discussion addressed Big Data only in terms of its social implications, especially around privacy and targeted marketing, rather than the technicalities of what Big Data actually entails, but it brought home how important it is to get to grips with this new set of technologies.
This has been reinforced in recent weeks with a major social media campaign from IBM to raise awareness of Big Data in the Information Management community. This culminated in a briefing at the Almaden Research Center in April where the first fruits of IBM Research (codenamed BLU) were shared with an audience of technologists and analysts.
Among the first tranche of products to benefit from this research effort is DB2. In particular, the upcoming release of DB2 for LUW (10.5, formerly known as “Kepler”) will include exciting new functionality to extend DB2's capabilities in the Big Data space.
DB2's first support for Big Data functionality came in DB2 10.1 with the provision of an RDF triple graph store. This was a small first step, and one that addressed a specific need within an IBM product family (certain Rational tools rely heavily on RDF triples).
DB2 10.5 takes Big Data support a huge step forward with the addition of “BLU Acceleration.” Initial results show dramatic performance improvements for certain types of Business Intelligence workload. At the IDUG DB2 Tech Conference in Orlando, Florida, we have been getting to grips with what BLU Acceleration is and how it can be implemented.
What is BLU Acceleration? At the most basic level it is a columnar data store. In other words, data for a table is physically stored on disk in column rather than row order, meaning that for many analytical queries I/O is significantly reduced since only the required columns are retrieved. This in itself is a significant step forward for large-scale analytic environments, but IBM has wrapped a number of other interesting technologies around this basic feature which make it even better. We will look at some of these features in the following paragraphs.
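To make the I/O saving concrete, consider a hypothetical sales table (the table and column names here are invented for illustration): with column organization, an aggregate query that touches two columns reads only those two columns from disk, rather than every complete row.

```sql
-- Hypothetical column-organized (BLU) table; names are illustrative.
CREATE TABLE SALES
  (SALE_DATE  DATE,
   STORE_ID   INTEGER,
   PRODUCT_ID INTEGER,
   QUANTITY   INTEGER,
   AMOUNT     DECIMAL(11,2))
  ORGANIZE BY COLUMN;

-- Only the STORE_ID and AMOUNT columns need to be read from disk;
-- the other three columns are never touched.
SELECT STORE_ID, SUM(AMOUNT) AS TOTAL_SALES
  FROM SALES
 GROUP BY STORE_ID;
```

With a traditional row store, the same query would have to read every page of the table, including the columns it never uses.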
The first key feature is “actionable compression.” IBM has developed compression algorithms for use on BLU columns which allow many common predicates to be evaluated against the data without having to decompress it: this ability is what is meant by “actionable.” The algorithms are also designed so that the most frequently occurring values are compressed the most, thus saving as much space as possible. Since one of the main benefits of compression is that it reduces I/O, actionable compression is a major contributor to the performance improvements offered by BLU. Indeed, data presented at the IDUG DB2 Tech Conference by a customer involved in the DB2 10.5 Early Access Program showed how significant this feature is to the overall performance of BLU.
The next feature of BLU Acceleration is that it exploits the SIMD (Single Instruction Multiple Data) capabilities of modern CPUs. What this means is that a single CPU instruction can act upon multiple data items at the same time. For example, if we were performing a comparison on a column to check if it had a particular value (e.g. “WHERE YEAR_MADE = '2010'”) then we could check multiple instances of the column in one CPU cycle thus reducing the number of cycles required to check all column instances.
Many column store products in the marketplace today are essentially in-memory solutions: you must have enough memory to store all the data. BLU Acceleration does not have this requirement. Although it will use as much memory as you have available to give maximum performance, it will also work quite well when your data volume exceeds the memory available. This capability is complemented by a set of buffer pool page selection techniques designed specifically for this type of environment, where victim pages are selected not on an LRU (least recently used) basis but with the aim of keeping I/O to a minimum.
“Data skipping” is a feature also designed to reduce I/O. By recording metadata about “chunks” of data records (about 2,000 records per “chunk”), DB2 is able to skip over chunks that don't contain anything of interest to the current query. It's like putting household goods into storage, packed in large boxes: you write details of what is in each box on the outside, so that when you need to find your beachwear the day before you go on vacation you don't have to open every box.
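As a sketch of how this helps (reusing the kind of hypothetical SALES table described earlier, with invented names), a query with a tight range predicate never reads any chunk whose recorded minimum and maximum SALE_DATE fall entirely outside the range:

```sql
-- If the chunk metadata shows that all SALE_DATE values in a chunk
-- lie outside March 2013, that chunk is skipped without being read.
SELECT SUM(AMOUNT) AS MARCH_SALES
  FROM SALES
 WHERE SALE_DATE BETWEEN '2013-03-01' AND '2013-03-31';
```

For data loaded in roughly date order, most chunks fail the min/max check and the query touches only a small fraction of the table.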
One encouraging aspect of the BLU Acceleration implementation is the amount of emphasis placed on maximizing the use of available resources. This goes right down to the level of ensuring that, wherever possible, every byte of the CPU registers and every available CPU cycle is exploited.
Perhaps the best thing about the BLU Acceleration implementation is that it is very easy to implement. Simply define a table as “ORGANIZE BY COLUMN” to create a BLU table. If you have a database where most of the tables would be best in columnar format, you can make this the default using the DFT_TABLE_ORG database configuration parameter (you would then create any row-organized tables you needed using the new “ORGANIZE BY ROW” clause). You can also set the registry variable DB2_WORKLOAD to a value of ANALYTICS so that any database created in the instance will have BLU tables as the default, as well as setting a number of other parameters to values friendly to analytical workloads.
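A minimal sketch of these options, assuming a hypothetical database called SALESDB (the object names are invented; only the clauses and parameters named above come from the release itself):

```sql
-- Per table: request columnar organization explicitly.
CREATE TABLE SALES_SUMMARY
  (STORE_ID INTEGER,
   TOTAL    DECIMAL(15,2))
  ORGANIZE BY COLUMN;

-- Per database: make columnar the default for new tables...
UPDATE DB CFG FOR SALESDB USING DFT_TABLE_ORG COLUMN;

-- ...after which row organization must be requested explicitly:
CREATE TABLE AUDIT_LOG
  (LOG_TS  TIMESTAMP,
   MESSAGE VARCHAR(200))
  ORGANIZE BY ROW;

-- Per instance (run from the operating system command line, before
-- the database is created): db2set DB2_WORKLOAD=ANALYTICS
```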
While BLU Acceleration is by far the biggest new feature within DB2 10.5 it is by no means the only one. There are a number of other important features that have been introduced.
Perhaps the most significant of these is the ability to combine pureScale and HADR. This combination is a real world-beater in terms of scalability, high availability and resilience. In a typical scenario, a pureScale cluster would provide local resilience and the ability to scale up (and down) as required, while a second pureScale cluster would be maintained as a disaster recovery option, at transcontinental distances if required. The pureScale offering has also been enhanced to allow members to be added and removed, and fixpacks to be applied as rolling upgrades, without interrupting service. It truly is moving towards a continuous availability solution.
Other improvements in DB2 10.5 broadly fall into the category of SQL compatibility. These include an “EXCLUDE NULL KEYS” option for “CREATE INDEX” (bringing significant storage savings for sparse indexes), expression-based indexes (i.e. indexes including functions) and extended row size support (i.e. being able to store more data in a row than fits in the maximum page size).
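The first two of these can be illustrated as follows (the table, column and index names are invented for the example):

```sql
-- Sparse index: rows with a NULL PROMO_CODE generate no index
-- entries, so the index stays small.
CREATE INDEX IX_PROMO ON ORDERS (PROMO_CODE) EXCLUDE NULL KEYS;

-- Expression-based index: the index stores the result of the
-- expression, supporting case-insensitive lookups.
CREATE INDEX IX_NAME ON CUSTOMER (UPPER(LASTNAME));

-- A query such as this can then use IX_NAME:
SELECT CUST_ID FROM CUSTOMER WHERE UPPER(LASTNAME) = 'SMITH';
```

Before expression-based indexes, a predicate like UPPER(LASTNAME) = 'SMITH' would typically force a scan, or require a generated column to be maintained alongside the data.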
But perhaps the most exciting of all is a late addition to the DB2 10.5 announcements (so hot off the press that it may not actually be available until a fixpack after GA): a JSON data store, with support for accessing it through both SQL functions and native APIs. IBM announced at the IDUG DB2 Tech Conference in Orlando, Florida that a technology preview of this functionality should be available around June. It's great to see that IBM has reacted rapidly to the demand for JSON support and has fast-tracked this development effort. What is even more promising is that this support will be available not only for DB2 for LUW but also for DB2 for z/OS.
IBM is currently running a closed beta of this new release. Initial comments from those involved in the beta program have been very favourable and I look forward to this exciting new version of DB2 becoming generally available later this year.
Of course, the place where much of this great information was shared by IBM was at an IDUG DB2 Tech Conference. The IDUG DB2 Tech Conference in Orlando, Florida held last week had a number of keynotes and sessions devoted to this new technology. Expect to see even more, including early user experiences, at the IDUG DB2 Tech Conference in Barcelona this October. Plan to attend to get up to speed with this great new release.