ACID transactions in Hadoop vs. ACID transactions in DB2
ACID transactions have been a standard for databases since the seventies. The standard requires transactions to have the following properties:
- Atomicity – the database may only be affected when the transaction completes correctly
- Consistency – transactions will bring the database from one valid state to another one
- Isolation – Concurrency should not impact the result of the transaction
- Durability – A transaction should stay committed once it has been committed
Until recently, this ACID compliancy wasn’t fulfilled by Hadoop transactions, but the advent of Hive 2 (June 2016) and others has changed this. Now you can have an equally valid acidity on Hadoop as you would expect on DB2.
This brings up once more the question on the relevance of DB2. Should we still invest in this technology or is it at the end of its lifetime?
This article will provide you with an overview of the main reasons why DB2 is still relevant in the current era of Big Data.
The Origin of the Data
A very important element in the story of Hadoop-based transactions versus DB2 transactions is the origin of the data. Even if we do not touch the technical aspects of both types of transactions, we can already notice major differences between them.
DB2 data originates from other online transactions. The data in these databases has been structured according to relevant standards, allowing good quality assurance. This data is called enterprise data, as it forms one of the major assets of the enterprise.
Hadoop data originates from any possible resource. It is a garbage collector that is being structured by applications in a more natural way. The data it stores is the result of analytics, providing an interpreted picture of the reality. This data can become enterprise data, but isn’t considered as such until someone decides it needs to be.
One can understand these two types of data are situated in two completely different paradigms: one in the paradigm of structured relational data and the other in what is commonly known as ‘big data’. These two paradigms complement each other.
Limitations of Transactions in Hive
BEGIN, COMMIT, ROLLBACK
Ever since the introduction of ACID transactions in Hive, all transactions are considered to be auto-committed. BEGIN, COMMIT and ROLLBACK statements are forbidden. In DB2 we take these statements for granted and flexibility would be decreased significantly if we would choose to rely solely on Hive.
In Hive only snapshot level isolation is supported; this means that data is read from the database and maintained as a snapshot in the transaction. The transaction can be committed only when the original data didn’t change intermediately. If the data overlaps with another set, this can lead to so-called write-write issues.
In DB2, isolation is more diverse. The isolation levels are implemented by means of a locking mechanism. Only with the UR isolation level will you risk dirty reads. Cursor Stability will read only committed data, preventing you from reading parts of the data that is currently being changed. Read Stability will make sure that the data targeted by the queries in your transaction remains unchanged and Repeatable Read reserves all related data sources for your transaction.
If you compare both isolation mechanisms, you can see that the Hive implementation focuses on concurrency while the DB2 implementation focuses on consistency and isolation. Both have their reason to exist and snapshot isolation is to be found in the relational world too (e.g., in SQL Server).
Oracle and Microsoft implemented a combined version of the snapshot isolation with Read Committed, adding a certain of degree locking to the snapshot isolation in order to improve concurrency without weakening the isolation and consistency.
IBM has also seen the importance of favoring concurrency for some workloads and recently gave us the option to choose a “concurrent access resolution” of “currently committed” allowing the users to read the committed version of data that is currently being inserted, updated or deleted by a transaction that has not yet completed. This mechanism does not yet work for updates in DB2 for z/OS, but the completion of this implementation will probably follow soon.
Transactions in Hive can only exist in a given context. Tables need to be internal and in the Optimized Row Columnar (ORC) file format. The ACID nature of a transaction is controlled by a session. Writing to tables that are designated to be ‘ACID’ from a non-ACID transaction is not possible.
DB2 assumes all transactions on the database to be ACID whatever their source or target, internal or federated and whatever the parameters used to create the table.
The Apache Hive project describes clearly three use cases in which Hive can be used:
- Streaming ingestion of data – As Hadoop can capture hundreds of records per second and Hive can only import in large portions, ACID transactions allow Hadoop to write in the same partitions as those where readers are currently reporting on.
- Slow changing dimensions – In a typical star schema setup, dimension data can change over time because of evolutions in the data domain
- Data restatement – Data cleansing that focuses on the correction of data following business events
Do we need more information to conclude that DB2 has a lot of other use cases which could also cover the slow changing dimensions and the data restatement? It is clear that Hive is not designed to do Online Transaction Processing and DB2 is. It is only for very heavy analytics that Hive comes into play.
Recently, Splice Machine and other companies did launch RDBMS implementations on Hadoop, providing similar features as DB2. The difference is the hardware the RDBMS runs on and the technology that is used to achieve the RDBMS functions (e.g., HBase and Spark).
Although the use case of RDBMS on Hadoop is quite strong, the implementation of such a Hadoop RDBMS requires similar or equally complex measures as you need to implement your traditional DB2.
Scaling out on commodity hardware is one of the key selling points, but this can be equally achieved by using DB2 federation, IDAA or integration with BigInsights or other Hadoop products. Moreover, DB2 will allow you to scale your OLTP environment according to your needs, allowing you to save money wherever you need it, whereas Hadoop assumes (for data integrity) at least three copies of the data and hence, maybe cheaper, but definitely requires more hardware and support licenses, undoing the cost benefits.
Considering the above, RDBMS on Hadoop is more of an enhancement to SQL-on-Hadoop than it is a replacement for your traditional RDBMS.
HBase is often regarded as a database that is fit for transactions. Unfortunately, HBase itself is not ACID compliant. Following this void, Yahoo! coded a transaction processor for HBase, namely Omid. Omid leverages standard HBase functions with Snapshot Isolation. Phoenix is the SQL interface to HBase. The combination Phoenix, Omid, HBase could help you to establish similar functionality as an RDBMS.
Phoenix is ANSI SQL compliant, except for INTERSECT and MINUS and several built-in functions that you need to program yourself.
The downside of Phoenix/Omid/HBase (in opposition to DB2) is once more isolation (cf. the discussion on Snapshot Isolation above). As this combination is oriented towards analytics, not towards online transaction processing, and ACID compliancy is only foreseen for similar use cases as described for Hive, DB2 remains the primary choice for your enterprise applications.
HAWQ (a.k.a. Pivotal HDB) is derived from the same database engine as is used by Netezza, Greenplum and PostgreSQL and provides robust ANSI SQL compliancy. Its guarantee for ACID transactions and its combination with massively parallel processing technology could make you believe that you have found the ideal replacement for DB2.
Unfortunately, HAWQ, as with most Hadoop products, is designed for analytics and not for online transactions.
Despite its compliance with ANSI SQL and its ACID transactions, HAWQ does not support the use of UPDATE and DELETE, following the Write-Once-Read-Many paradigm that exists in Hadoop, rendering the database useless for enterprise applications.
DB2 and Hadoop were very different in nature at the initial release of Hadoop ten years ago. Recent evolutions made DB2 include semi-structured data and broaden its isolation options, whereas SQL-on-Hadoop is becoming increasingly more ANSI SQL 2011 compliant and is increasingly performing better.
The cost models of the technologies differ. With DB2 you can start on a very small scale with very little to no license costs and very little hardware. For a production-ready Hadoop implementation, you need at least five master nodes and eight to ten slave nodes for a small cluster. This initial investment for Hadoop is rather large, although the costs will decrease significantly following the growth of the amount of data you want to process. At a given point, maintaining all data in DB2 will likely be more expensive, but in that case you could scale out to less costly options (e.g., IDAA, Cloudant).
Although at first sight they may look much alike, Hadoop and DB2 both serve different purposes. Hadoop is meant for in-depth analytics on vast amounts of data, whereas relational databases are primarily intended for OLTP (Enterprise applications) and OLAP on moderate amounts of the data.
Will Hadoop eventually replace DB2? There is definitely a tendency towards a merger of both the realms of Big Data and OLTP, but so far no valid replacements offering major benefits and providing more or identical features as DB2 have been introduced.
What the future will bring is yet to unfold, but it is evident that this evolution will continue and that we will see even more elasticity in the data realm, combining the scaling up and scaling out of databases following a company’s requirements.
It is good that IBM is being challenged in a domain in which it traditionally prevails. This opens the way for new types of implementations and innovative use cases and allows DB2 to evolve and grow in the same way it did for the past 36 years.