Selecting Appropriate Data Architecture Technologies

By John Maenpaa posted Dec 06, 2017 01:28 PM


One of the more difficult parts of developing a data architecture for a new project is choosing which technologies to use. You'll have application developers that have their opinions, management with a different view, and legacy technology already in place. The people supporting the infrastructure, of course, would like you to stick with something you already have. If you do go with something new, then you need to define how that interfaces with your existing platforms and figure out who will support it. Additionally, you need to take into account the likely future requirements that will come along after the new project is implemented.

We're mostly DB2 people here, so we'll start with the assumption that you have one or more databases in DB2. In many ways, DB2 provides an ideal repository for business applications to reliably store data for business critical operations. Let's keep our businesses running and continue with DB2 for our operational data.

With DB2, we have the ability to query data during live production activity with little impact to operations. We can search using primary and alternate keys supported by indexes that are built simply by defining them as part of the database. These indexes, however, do not support massive index searching where the data we're looking for could be in other fields (perhaps inside a VARCHAR column used to store comment text) or across an unexpected set of fields. As the requests for this capability are made, we could add more indexes, but that would impede performance in other ways and these requirements are not usually part of the main business applications. If we could get our data into an optimized index/search platform then we can meet this need if it comes along.

A second issue with DB2 is the desire to purge off older transactions so that we can maintain good performance in our running systems. Here we tend to strike a balance between keeping data available for reporting purposes and removing it quickly to optimize performance. Many years ago, we'd unload these from the database and store them on tape datasets. More recently, DB2 archive tables have become an option. But it can be less expensive to store those archival transactions in a data lake. As a bonus, we can offload data more quickly if the data lake can handle the reporting workload. So, some type of big data technology will be useful and if it also supports the index/search capability then we're on our way to augmenting DB2. We'll add a data lake to our configuration to support long-term storage and large search capabilities.

Now we need an efficient means of copying our data from DB2 into the data lake. The first approach often selected involves using a database extract tool to make a point in time copy of the original. Basically, dump the data from the tables and load it into a new set. If the original is infrequently updated, this can be a useful mechanism for copying data. It does not require much more than a client network connection to the source. But we probably want a good portion of our data to flow from the source into the target.

A second possibility we can use to get a flow of data is using data replication tools. The good part of data replication tools is that we don't have to modify the source application to get the data. The tools usually have a log scraper on the source end to pull recent updates, a queue transport layer, and a target update component. The difficulties of replication tools include high administrative costs, difficulty creating something other than a copy of the original, and difficulty dealing with transactional consistency across tables. Full data replication is expensive to operate in a high-volume environment.

A third possibility is modifying the source application to send appropriate data updates and build something that can take these transmitted records and store them. There are tools that will work with streaming technologies to pull the records, invoke optional encapsulated data translation logic, and store the results in one or more targets. We've also used this method to successfully replicate DB2 for z/OS data into DB2 for LUW in a high-volume environment.

The data lake can probably handle much of our reporting needs, but it isn't always the fastest way to produce reports or run business intelligence queries. A data warehouse still makes sense as a place to implement complex queries. This could, of course, be in another copy of DB2 or in a different RDBMS depending on your environment. If we chose the third data replication option, we can try to use the streaming data to feed both the data lake and a data warehouse. If we chose to use a data replication product, then we're probably going to want to land the data somewhere and build transformation routines from that.

What do we have so far? We have our primary application database, one of three replication mechanisms, a way to stream the data from the source to multiple targets, a data lake target with indexing capability, and a data warehouse. We can draw this up as a conceptual data architecture showing the critical flow of data and the systems involved. If you already have a set of technologies that integrate with all of these in your environment, you may be in luck and can just go with those. Or, you may need to re-evaluate each of the products to verify it will work for your project if there are questions about scalability, reliability, or availability. For projects that don't need all of the components (e.g., there's no need for data warehouse integration) you can mark those components as being potential future growth options.

At this point you should be able to do very rough cost estimates on a proof of concept to prove out the choices that relate to the project. The proof of concept would address the feasibility of using the selected product (or products) for each purpose. It should verify that the new products interact properly with the products that already exist. This is a good opportunity to discover the size of the environment that will be needed to support the production implementation so you can be sure the project has enough funding to be implemented.

The key to getting your product decisions right is to think beyond the immediate project. Consider the ways your business tends to grow from simple projects into complex interrelated systems. Prove the products do what they are supposed to do and integrate the way you need them to. Push for the solution that gets you to the right solution rather than a quick solution that will need to be rewritten.