Using Kafka/Kinesis for DB2 replication

Rui Chen

Using Kafka/Kinesis for DB2 replication

Hi Folks,

Just curious if anybody has tried using Kafka/Kinesis for Db2 (LUW on the Linux platform) replication before; it would be great if you could point us to some references, e.g. how LOB/XML behaves, latency, admin effort, etc. We understand CDC can publish directly to Kafka, but the replay component sounds like the tricky part. Does IBM certify any replay tool?

Not sure if we would use it any time soon, since it's a whole new world and we don't have much experience with it. Still, Kafka/Kinesis sound like tempting alternatives to IBM's native replication solutions, especially since we don't have a stringent replication latency requirement.

BTW, some benefits Kafka/Kinesis would bring to the replication pipeline:

  1. a free record of which DML statements were applied to the replica, up to the limit of the Kafka/Kinesis data retention policy, which could help identify the troublemakers causing high replication latency;
  2. easy scale-out to a large number of replicas: just spin one up from a recent source snapshot and read from Kafka/Kinesis to catch up.

Pete Suhner

RE: Using Kafka/Kinesis for DB2 replication
(in response to Rui Chen)

Hi Rui,
I have recently tinkered with Kafka, using Db2 data as a producer for Kafka topics, and attached a few different consumers to it. This was limited to a proof of concept; the components I used were Kafka, Flume (for the NG connector to access the database via JDBC) and, obviously, Zookeeper for cluster management.

The setup was mostly straightforward with only minor hiccups, and the performance was very good. However, from a relational guy's perspective, it has various limitations:

  • The NG connector has issues with LOBs and even native XML columns
    • This can possibly be overcome with other existing Kafka connectors (e.g. Confluent's, maybe others) which avoid the Flume component - I have not (yet) tested these
  • The JDBC-based approach only queries the table for new rows (see the sketch after this list)
    • Latency can be controlled by parameterizing the polling frequency
    • i.e. this is basically for "WORM"-type data: detection of updates depends on keys (timestamps might work), and deletions are not detected at all
    • According to the Confluent documentation, their JDBC adapters are implemented on the same concept and can therefore be assumed to have a similar limitation in this area
    • A trigger-based approach with staging tables could be implemented to overcome this conceptual issue, but it needs the respective changes to the data model, affects the workload on the database and would be all manual work
  • A log-reader-based Kafka connector is available and would be highly preferable, but comes at a cost, as you mention (vendor CDC capture component)
  • Kafka can basically support "at least once" or "at most once" delivery, but cannot guarantee "exactly once"
    • The possibility of duplicate deliveries must be considered
  • Advanced tooling for efficient configuration and management of larger environments is not available (or at least unknown to me)
  • I did not look at the available monitoring solutions, but assume Kafka can integrate with typical Hadoop stack monitoring (one would need to check what level of detail these provide in terms of replication consistency, etc.)
  • Alternatives like Debezium (which also builds on top of Kafka) are still at a very early stage, but at least show that there is movement in this area
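
To make the polling mechanism concrete, here is a minimal sketch of what the JDBC-based approach boils down to - purely illustrative, with a hypothetical APP.ORDERS table, a CHANGED_AT timestamp column and made-up connection details; the Db2 JDBC driver and the Kafka Java client are assumed to be on the classpath. Note how deletions simply never appear in the result set:

import java.sql.*;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PollingProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             Connection con = DriverManager.getConnection("jdbc:db2://dbhost:50000/SAMPLE", "user", "pwd");
             PreparedStatement ps = con.prepareStatement(
                 "SELECT ID, PAYLOAD, CHANGED_AT FROM APP.ORDERS WHERE CHANGED_AT > ? ORDER BY CHANGED_AT")) {

            // Last timestamp seen; simplified - rows sharing the boundary timestamp could be
            // missed, and this value would have to be persisted across restarts in a real setup.
            Timestamp lastSeen = Timestamp.valueOf("1970-01-01 00:00:00");

            while (true) {
                ps.setTimestamp(1, lastSeen);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // Key by primary key so all versions of a row land in the same partition.
                        producer.send(new ProducerRecord<>("orders-topic",
                                rs.getString("ID"), rs.getString("PAYLOAD")));
                        lastSeen = rs.getTimestamp("CHANGED_AT");
                    }
                }
                Thread.sleep(5000); // the polling frequency is the latency knob mentioned above
            }
        }
    }
}

In a real setup the last-seen key bookkeeping is essentially what the Flume/Confluent connectors manage for you.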

Given the limitations I saw on the producer side, I did not delve deeper into the consumer part, as my scope was more on asynchronous data streaming from a database into a Hadoop stack.

Best regards,

Pete Suhner
IDUG Board of Directors
IBM Champion for Analytics

Pete Suhner

RE: Using Kafka/Kinesis for DB2 replication
(in response to Pete Suhner)

...ah, one more point: schema changes can go undetected, as you specify the SQL statement that selects the data to be replicated. This can provide a certain stability against schema extensions.

Using "SELECT * " is possible and will automatically propagate schema extensions.

However, altering or dropping existing columns will result in error situations which seem difficult to tackle on the Kafka level and the consumer side. Obviously, the latter situation also applies when you specify the column names.
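
One way to at least fail fast on the producer side is a small schema guard run before each publishing cycle. This is just a sketch against the same hypothetical APP.ORDERS table as in my earlier example (the expected column list is an assumption, not something the connectors give you out of the box):

import java.sql.*;
import java.util.*;

public class SchemaGuard {
    // The columns the downstream consumers were built against (hypothetical).
    private static final List<String> EXPECTED = Arrays.asList("ID", "PAYLOAD", "CHANGED_AT");

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:db2://dbhost:50000/SAMPLE", "user", "pwd");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT * FROM APP.ORDERS FETCH FIRST 1 ROWS ONLY")) {

            ResultSetMetaData md = rs.getMetaData();
            List<String> actual = new ArrayList<>();
            for (int i = 1; i <= md.getColumnCount(); i++) {
                actual.add(md.getColumnName(i));
            }

            // Trailing extensions (new columns) are tolerated; a dropped or renamed
            // column fails fast here instead of poisoning the topic downstream.
            if (!actual.containsAll(EXPECTED)) {
                throw new IllegalStateException(
                    "Source schema drifted: expected " + EXPECTED + ", found " + actual);
            }
        }
    }
}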

Best regards,

Pete Suhner
IDUG Board of Directors
IBM Champion for Analytics

Rui Chen

RE: Using Kafka/Kinesis for DB2 replication
(in response to Pete Suhner)

Hi Pete,

Thanks a lot for your detailed explanation; glad to know the topic is already on Db2 experts' radar :) Debezium sounds like a nice tool, especially if it attracts more attention on the replication apply/replay side and, of course, adds Db2 support.

Sounds like the capture side is relatively straightforward, provided we pick the right CDC connector (or just publish to Kafka/Kinesis/DynamoDB Streams/etc.).

Guess the apply side would be trickier, considering the complexity of arranging parallelized replay, transactional guarantees (rollback/commit should a message replay fail), and dependencies across multiple transactions where there are unique constraints, enforced foreign keys, or conflicts. A single replay thread would make things easier, but I can imagine people asking for parallelization...
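
To make concrete what I mean by a single replay thread: a rough sketch of how I picture the apply loop, reusing the hypothetical orders-topic / APP.ORDERS names from Pete's example (Kafka Java client and Db2 JDBC driver assumed). It gets a transactional guarantee by committing the database work before the Kafka offsets, and the MERGE keeps the at-least-once redelivery Pete mentioned harmless:

import java.sql.*;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SingleThreadApplier {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "db2-replica-applier");
        props.put("enable.auto.commit", "false"); // offsets advance only after the DB commit
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection con = DriverManager.getConnection("jdbc:db2://replica:50000/SAMPLE", "user", "pwd")) {
            con.setAutoCommit(false);
            consumer.subscribe(Collections.singletonList("orders-topic"));

            // MERGE makes the apply idempotent, so duplicate deliveries are harmless.
            PreparedStatement upsert = con.prepareStatement(
                "MERGE INTO APP.ORDERS t USING (VALUES (?, ?)) s (ID, PAYLOAD) ON t.ID = s.ID " +
                "WHEN MATCHED THEN UPDATE SET t.PAYLOAD = s.PAYLOAD " +
                "WHEN NOT MATCHED THEN INSERT (ID, PAYLOAD) VALUES (s.ID, s.PAYLOAD)");

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                try {
                    for (ConsumerRecord<String, String> r : records) {
                        upsert.setString(1, r.key());
                        upsert.setString(2, r.value());
                        upsert.executeUpdate();
                    }
                    con.commit();          // replica changes become durable first ...
                    consumer.commitSync(); // ... then offsets; a crash in between only re-applies
                } catch (SQLException e) {
                    con.rollback();        // the whole batch rolls back and will be redelivered
                    throw e;
                }
            }
        }
    }
}

Parallelizing this while preserving dependencies across transactions (unique constraints, enforced FKs) is exactly where I'd expect it to get hairy; keying messages by primary key would at least keep all versions of one row in order within a partition.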

Anyway, thanks again for all the details; we'll report back if something interesting pops up.