By David P. Morris, Jr., Leidos Health Software Development Analyst
Db2 Batch processing has long been considered a mainframe only domain. Many enterprise domain applications require bulk processing of large volumes of data for mission critical environments. But, what if we could offload some of the batch process handling of mainframe jobs to Linux servers using Java? In the past, Java developers created custom built Java batch applications that did not followed any standards, were often buggy, and could not be easily repurposed for other Java batch processing applications.
The Java Specification Request (JSR) 352 standard started in 2013 specifies the programming model for batch applications and a runtime for execution of batch jobs. Partnerships were established with the spec lead for the Java Batch Specification, who provided major influence on the development of this standard by using their extensive experience in supporting batch processing. Other partnerships offered significant contributions to the standard as well as collaboration on the development of a JSR 352 implementation batch framework. Decades of experience with batch processing frameworks with COBOL/Mainframe, C++/Unix shaped the framework architecture. These partners committed resources to driver support, enhancements, and a roadmap for the JSR 352 implementation. The standard is fully supported in a comprehensive batch framework.
JSR 352 Batch Architecture
JSR 352 defines APIs with which to implement batch business logic. The ItemReader, ItemProcessor, and ItemWriter define the basic runtime contracts for reading, processing, and writing data items for each batch step. Batch steps are organized into jobs, which defines the run sequence. The JobOperator runtime interface runs and interact with jobs. Job executions are stored in a Spring Batch metadata repository, enabling queries of current and historical job status.
The key concepts of JSR 352 batch processing standard visualized in above diagram:
- A Job is an instance that encapsulates an entire batch process.
- A Step is a domain object that encapsulates an entire batch process.
- JobOperator provides an interface to manage job processing including operational commands, such as start, restart, and stop as well as job repository commands, such as retrieval of job and step executions.
- JobRepository contains metadata about jobs currently running and jobs that have run. The JobRepository metadata describes details such as job status, job completions, job failures, developer defined data. The JobOperator accesses this repository.
- Reader-Processor-Writer pattern is the primary pattern used and is called Chunk-oriented processing. Chunk-oriented uses an ItemReader reads one item at a time, ItemProcessor processes the item based on business logic, the ItemWriter aggregates the data and persists based on the ‘chunk’ number of items and the transaction is committed.
- A JobLauncher is a simple interface for launching a Job with a set of job parameters
Spring Batch is a lightweight, open source JSR 352 Java framework for batch processing. It is based on the same development approach from the Spring Framework, build applications from “plain old Java objects” (POJOs), and apply enterprise services non-invasively to POJOs. The framework provides a robust enterprise-scale solution.
Spring Batch API supports:
- Transaction management
- Chunk based processing
- Declarative I/O for reading and writing resources
- Web based administrative interface
Spring Batch Advantages:
- Implementations covering vast options for input and output (File, JDBC, NoSQL, JMS)
- Can be executed from shell commands
- Ability to start/stop/restart jobs and maintain state between executions
- Ability to skip and retry records as they are being processed
- Message notification
- Ability to scale
- Clear separation of concerns
- Batch Environment
- Developers concentrate on the business logic, let the framework take care of the infrastructure
- Common core services as interfaces
- Default implementations provided ‘out of the box’
Spring Batch processing fundamentally is ‘chunk’ oriented process driven. Chunk oriented refers to reading data one at a time, creating chunks of data within a transaction boundary, and persisting those chunks to a database or other file system.
One item is read from an ItemReader, handed to an ItemProcessor, and aggregated. Once the number of items read equals the commit interval, the entire chunk is written out in the ItemWriter, then the transaction is committed.
A job consists of one to many steps. Each step has an ItemReader, an Item Processor, and an ItemWriter. The chunk is defined within each step.
Depending on your use case with your Db2 data, you may have only one step defined for your transaction boundary. The extract, transform, load (ETL) project had only one job with one step defined. The batch application read one row from a Db2 table using a cursor in the ItemReader to get primary and secondary keys A lookup is performed for child Db2 tables in the itemProcessor, and the data is aggregated and business logic is applied also in the ItemProcessor. After a certain commit interval defined by the ‘chunk’ size for the transaction, the data is then persisted data in JSON format in the ItemWriter as a ‘chunk’. A database commit is also executed on a chunk boundary, and information is persisted in the meta-data repository. The itemReader position is preserved via a held cursor.
The ItemProcessor in the ETL project contains all the business logic to process the data and required reading secondary tables in Db2 for processing. Creating another step to use an ItemReader to reading secondary tables was not feasible in our use case because it would have violated the transaction boundaries and been impossible to chunk.
For each chunk, the transaction is started, the chunk is processed, a checkpoint is taken, and then the chunk ends. Chunk processing with Spring Batch metadata support provides the developers the ability to checkpoint/restart jobs, skip data, retry, and persist application data in the Spring Batch Metadata repository.
Fetch Size should be used to give the Db2 JDBC driver a hint as to the number of rows that should be fetched from the database when more rows are needed for the ResultSet. If the fetch size is zero, the JDBC driver ignores the value. However, if you are dealing with large datasets and have a query that returns a large ResultSet, keep in mind the entire ResultSet will be read into memory. This would not be a good idea leaving the default fetch size set to 0. In most use cases, the number of rows to be fetched should be limited to the number of rows that are required by the application when more rows are needed. If the fetch size is smaller, the Db2 JDBC drive will make additional roundtrips to the Db2 database to process the same amount of data. Therefore, if the chunk size is very small the fetch size should be a larger multiple of the chunk size.
Performance on the ETL project was improved by updating existing Db2 queries and Java DAO classes to implement parameter markers. Just a minor tweak to Db2 SQL queries resulted in a huge performance gain when processing millions of rows of data.
Spring Batch Metadata
Spring batch provides a schema to monitor batch job information on what is occurring during batch application runs in terms of functionality, who is using it, performance, and detailed stack trace of warnings/errors. Spring batch infrastructure maintains monitored job information in a database.
A Db2 schema is provided by Spring batch:
CRUD operations must be granted to the Spring Batch applications for the meta-data. It is advised to create one meta-data database per application within the same schema as the data tables.
The following schema diagram is defined:
Spring batch job execution information is persisted in the meta-data.
- BATCH_JOB_INSTANCE: This table holds all information relevant to a JobInstance, and serves as the top of the overall hierarchy.
- BATCH_JOB_EXECUTION_PARAMS: This maintains information related to each set of job parameters. It maintains the key/value pairs of job parameters to be passed to a job.
- BATCH_JOB_EXECUTION: This maintains the job execution information for each instance of the job. It maintains individual records for each execution of the batch job by connecting with BATCH_JOB_INSTANCE.
- BATCH_STEP_EXECUTION: This maintains the step execution information for each step of a job execution. It connects with BATCH_JOB_EXECUTION to maintain the step execution information for each job execution instance.
- BATCH_JOB_EXECUTION_CONTEXT: This is the information needed for each job execution. This is unique for each execution, so the same information as that of the previous run is considered for retry jobs. Hence, it connects with BATCH_JOB_EXECUTION to maintain an instance per execution.
- BATCH_STEP_EXECUTION_CONTEXT: This is similar to BATCH_JOB_EXECUTION_CONTEXT, except that it maintains the context information for each step execution. Hence, it connects with BATCH_STEP_EXECUTION to maintain the unique instance with each step execution instance.
Application specific data can be persisted in the BATCH_JOB_EXECUTION_CONTEXT and/or the BATCH_STEP_EXECUTION_CONTEXT.
Batch applications perform key background and bulk-oriented processing tasks for the Db2 enterprise. JSR 352 defines a powerful programming model and runtime for the Java platform. Spring Batch fully supports the JSR 352 standard that provides an easy to build, deploy, and run mission-critical batch applications. JSR 352 specifically has a separation of concerns. The batch runtime fulfills common infrastructure concerns, freeing developers to concentrate on the core business logic. This specification fills an important gap in the Java platforms.
1.) The Framework would be installed on Linux and it has some LUW database part to manage/monitor batch? The framework can be installed on Linux or Windows. The database logical unit of work (LUW) would be created on a Db2 instance using the Spring Batch provided metadata Db2 SQL Script. From the metadata, Spring Batch manages the batch jobs. Monitoring batch jobs can be handled by Spring’s TaskScheduler API or externally by any batch schedulers (i.e. Control-M).
2.) Is data read thru JDBC or some kind of connections to/from Db2 Mainframe and in batches or broken down into small chunks? The data is read via JDBC call by Fetch Size. If the Fetch Size is 0, all data from the SQL query result set is read into memory. If you are working with big data, the server may encounter an “out of memory” error. Recommended that the Fetch Size be set to a size > 0 and a reasonable size based on the server load and the application expected performance. In other words, know the data, the application and the expected query result size.
3.) Once framework processes the batch data, can it store results back to Mainframe? Yes, the results can be persisted in a mainframe Db2 database. The java developer has the option to write the output to many file formats that the Spring Framework supports including flat files, JSON, XML, NoSQL, RDBMS (Db2, MySQL, Oracle, MS-SQL, many others).
4.) Is there any limitation on large batch processing? Large batch processing is limited to server capabilities and network bandwidth. Spring Batch can be scaled out using JMS and ActiveMQ across many servers. There was no server environment setup to prototype a very large batch process in our use case.
5.) When job executions get persisted in the Spring Batch metadata repository, what kind of errors, return codes, warnings or other pieces of information get stored? Particularly for jobs that did not get a good end of job status? Error information is really basic in the repository, with COMPLETED or FAILED statuses. Information can be stored in the _CONTEXT tables via provided methods. This is how we handled checkpoint/restart with restart data in the _STEP_CONTEXT table for either key (cursor or file driven) or count (file driven). There are also “Listeners”, which are basically similar to exits in the mainframe world, and there are listeners that can trap errors. You can trap an error and write to the _CONTEXT tables if you wanted to store more detailed information. Keep in mind, as with typical Java applications there is a log file that is produced which would contain detailed information.