I’d like to draw some parallels and highlight important differences between the architecture of the mainframe and the architecture of a data lake. Hopefully, this will help us understand where and how performance gains can be made in one environment versus the other.
First, let’s look at the mainframe. Each machine has multiple CPUs that support hundreds of multi-tasked programs. A single application will generally use a portion of the available CPU, but multithreading allows us to use more than one for very large tasks. The ability to switch quickly and efficiently between programs is a shining capability here. Each program uses an address space that retains register and memory mapping throughout its execution. Virtual memory is mapped into physical memory so that it only uses what it currently needs.
The ability to hold a large amount of data in memory is a major factor when looking at the performance of applications (or analytical queries) that need a lot of data. Portions of this memory are shared across address spaces to optimize sharing. Combining the large memory available with DB2 for z/OS allows us to satisfy many diverse data access requirements.
The mainframe is central to an ecosystem that channel attached disk controllers, tape drives, high speed network interfaces, and other peripheral devices. Unlike desktop computers and simple server configurations, the mainframe typically does not include internal disk drives. Instead, it uses an external disk subsystem capable of holding many terabytes of data. An important consideration here is that the mainframe can have hundreds of channels that can connect the mainframe processors to the disks. Each available channel increases the number of parallel I/O operations that can occur. The more we can run in parallel, the more data we can push from the disks to the applications that are running.
In relation to data lakes, it is the filesystem that is the most interesting. The filesystem places a logical layer on top of the physical disks. All files on the mainframe exist in a shared namespace where each file has a unique name. Each file is stored on one or more logical volumes and information about its location is recorded in a “catalog” and in a VTOC (volume table of contents). File names use a structured naming convention that is used to establish ownership and access privileges in the security system. The good part about storing your data on the mainframe is that it is available to any program that runs there. I’ll come back to this point later.
On top of the filesystem, we run DB2 for z/OS to provide an efficient data access mechanism for application programs and ad hoc query users. As a DBMS, DB2 takes the logical view of tables and fits that into files, including indexes that optimize data retrieval requirements.
On the mainframe, the keys to performance are to run many parallel programs against many files. As much as possible, you want to have the data in memory as close to the application program as possible. Within the disk subsystem, you want to have the data that is being retrieved in the memory cache before the I/O request is issued from the mainframe. Within DB2, you want the data in the bufferpools before the running SQL statement needs to see it. Any time the data is not immediately available in memory, we end up with a delay while the hardware interfaces do their work. When you do need to perform physical I/O, you want the least amount of data passed upwards as possible if the program is already waiting, but if it isn’t waiting yet then pre-retrieving data can also improve overall performance (by ensuring the soon to be needed data is already in memory).
UNIX servers, commonly used to run distributed DB2 for LUW, have a simpler architecture when compared with the mainframe. We still have a machine with processors and a physical I/O layer. The disk controller can be internal (local disks) or external (NAS/SAN) to the server. We should expect local drives to perform better than external drives, but the total storage capacity would be less. Regardless of the location of the disk controller, these machines usually have a smaller number of channels (up to 32 compared to hundreds for the mainframe) available between the processors and the devices.
On UNIX servers, the filesystem is limited to that system. The physical location of each file is stored within the local filesystem using a hierarchical directory structure. Sharing across nodes requires network-based file sharing services. Here we have diverged from the mainframe’s ability of any program to simply access a particular file because that file could be on a different server.
Here, the keys to performance are like those on the mainframe. Keep the data in memory as close to the application as possible.
If you have more work than one server can handle, you likely must add more servers. That will require that you spread the data across servers using some type of data partitioning mechanism. If you use DB2 to do the partitioning, then administrative overhead is a little higher, but your use can scale well.
In a multi-node configuration, the use of the network for data access adds to the time it takes to transfer the data. Network connections are usually slower than disk channels and each server usual has fewer network interfaces. Commonly, the database will be placed on a separate server from the application, forcing data access to go across the network. Using stored procedures to move application logic (or even just data filtering logic) to the database server improves overall performance.
Data Lake Architecture
There is the view that the mainframe and UNIX server architectures are more expensive than necessary. With this in mind, is it possible to use lower-cost hardware to accomplish the same (or perhaps more) work? Doing so requires that we cluster machines together and spread the work out across those machines. The architecture of each machine is simpler. The software that runs on each machine should also be simpler. Data nodes within the cluster would have local disks to optimize I/O, though the smaller servers would likely have fewer I/O channels resulting in limits to how much data should be stored on a node. Additionally, to handle the potential outages that occur, we duplicate the data across multiple data nodes.
The distributed filesystem layer gives us a cross-server view of our files. The data gets partitioned across the data nodes and across disks where software manages the location of each record of the files. The physical location of each file is still stored within each node, but there is usually also a master node that tracks the location of data across nodes.
Edge nodes provide the execution environment for our application programs, or the nodes may be host generic services that our distributed applications call. Comparing this with the mainframe architecture, the edge node would be the mainframe and the data nodes would be the disk subsystems. The big difference is the ability to push filtering logic down into the data nodes rather than needing the data nodes to push the data upward for filtering. When you look at it this way, you can see that there is a lot more flexibility to what we might accomplish within a data node compared to the simpler predictive caching of data in a disk subsystem.
The distributed filesystem is where we start to see the distributed platforms starting to resemble the mainframe. Once again, you want all your data available in the environment so that you can use the power of that environment. Labeling this cluster of services as a “data lake” becomes a marketing tool to convince the business and IT areas to take advantage of this shared infrastructure resource.
You’ll notice in this diagram that the SQL Execution occurs on the edge node along with the application program. If you are using an SQL-like tool in this type of configuration, some of the work that is normally done by the DBMS internally would be pushed down into the data nodes as map-reduce filtering logic. The higher-level SQL processing would need to occur at the point where the data comes together. This is going to be like our DB2-based considerations of whether predicates are stage 1, stage 2, etc.
The key to performance in this type of server cluster is to push as much work as possible down into the data nodes. The idea is to scale out horizontally as much as possible. With sufficient memory resources in each node, perhaps you could examine millions of records in the same time that a standard UNIX server might filter thousands. Also, keep the data in memory at all levels as much as possible and bring the processing to the data.
If you need better performance in your cluster, add more nodes and distribute the data across them. Additionally, the duplication of data across nodes allows filtering logic to run in more than one node if there are sufficient resources to do so. Adding more duplicate copies can help alleviate hot spots.
Like UNIX server configurations, network latency should be considered. Minimizing the amount of data going across the network is a distinct advantage of the map/reduce implementation. Using a dedicated network within a cluster could minimize the impact of other network activity on our throughput.
Performance Similarities and Differences
Across the architectures, some simple rules ensure optimum performance for a single process:
- Keep the data in memory
- Filter the data at the lowest level possible
- Use as many parallel processes as possible
Then, to allow for concurrent applications, you need to make sure you don’t let a single process monopolize all the resources. If you need to accomplish a lot of work, you’re going to need a lot of processing power. Each of these architectures scales differently and has different cost considerations. Any of them is a major investment.
Application design needs to take the platform into account. If you use a data lake, you don’t want a record-oriented program on one edge node looking at every single record. If you have a DBMS, let it do as much of the work as possible.