The Logical Data Warehouse and IBM Fluid Query
Mike McCarthy, IBM
There is much written about the concept of Data Lake these days, often comparing and contrasting it with the traditional concept of Data Warehouse. Viewing a data lake as a replacement for a data warehouse, or as an evolution of the data warehouse, entirely misses the point of what enterprises need and what modern technology is able to deliver.
A more useful concept and architecture has emerged over the past few years. It’s called by various names, such as Data Reservoir or Information Fabric, but the name that is most descriptive and is easiest to comprehend is Logical Data Warehouse.
Fundamentally, Logical Data Warehouse is a data management architecture for analytics. It involves a collection of data repositories and associated data services for management, governance, and access to data. The data repositories can include traditional data warehouses, Hadoop-based data stores, transaction systems data stores, content management systems, cloud-based data, columnar, in-memory data, noSQL data, and even streaming data-in-motion that has not been landed on storage.
The premise behind Logical Data Warehouse is to provide an agile, robust, self-service analytics capability to the enterprise and do so in a way that takes advantage of data where it is, delivering analytics function to the data, and not requiring rip-and-replace solutions, or monolithic, one-size fits all design.
This article will focus on technology from IBM Corporation that makes it possible to work with data across this diverse data landscape. The technology provides the ability to query data housed on platforms remote from the source platform and also includes functionality to move data between Hadoop storage and the relational data warehouse analytics appliance known as IBM PureData System for Analytics.
Federated Data Access … and more
IBM Fluid Query is a concept introduced by IBM in March 2015 at the same time as IBM announced a new software component called IBM Fluid Query for the IBM PureData System for Analytics, powered by Netezza technology. The Fluid Query concept describes what is frequently known as federated database or federated query capability, but notably extends this concept beyond traditional relational database systems to Big Data. In addition, the concept includes capabilities beyond federated query. Specifically, there are capabilities to define, perform, and control bulk data movement between data repositories.
Based on the above definition, IBM Fluid Query capabilities are available in several IBM data repository systems including: IBM DB2 for z/OS, IBM DB2 for LUW, DB2 for i, Informix, Big SQL, and PureData System for Analytics. Today, only the PureData System for Analytics (PDA) provides the bulk data movement capabilities of the IBM Fluid Query component.
The Fluid Query Data Connector
The implementation of IBM Fluid Query on IBM PDA consists of Data Connector technology and Data Movement technology. First, let’s focus on the Data Connector aspects.
Fluid Query 1.5 was released by IBM in July 2015, enhancing the original March 2015 release. It supports SQL Query access from PDA systems to Hadoop systems and relational data warehouse systems. The Hadoop systems supported are Hortonworks, Cloudera, and IBM BigInsights. On the relational side, Fluid Query supports IBM DB2, Oracle, IBM dashDB, Spark SQL, IBM PureData System for Operational Analytics (PDOA) and other PDA family systems, including the TwinFin, Striper, and Mako generations.
Federated or remote database query connections are implemented through SQL syntax and the use of Netezza table functions. The following steps must be completed to enable Fluid Query Data Connectors on a PDA system.
- Install Fluid Query using the installer script (./fluidquery_install.pl)
- Install JDBC drivers for the remote data sources on PDA
- Configure a connection to the remote data service using fqConfigure.sh script.
- Register the data connector function in your local PDA database using the fqRegister.sh script
- Run the SQL query using the registered data connector.
Here are some examples of what a Fluid Query might look like in an application. In these examples, fqRead is the registered data connector function. This is a default name. The connection (not shown) is to a remote Hadoop system.
SELECT * FROM TABLE WITH FINAL ( fqRead(’DEMO’, ’customers’));
In this basic query, the fqRead table function will issue a connect to the database DEMO located on the Hadoop system. It will run a query of “SELECT * FROM CUSTOMERS”, pulling all records from the CUSTOMERS table, which are then processed by the “SELECT * FROM TABLE” main query on PDA.
SELECT * FROM TABLE WITH FINAL ( fqRead(’’, ’’,’select parts.c1, orders.c1 from parts join orders on parts.c1 = orders.c1’));
In this query, the amount of data being moved from the Hadoop system to PureData for Analytics is limited by the SELECT statement specified in the fqRead function call. The join is processed on Hadoop, with the resulting c1 values being sent to PDA. Note that database name was not specified in this example. The database name is picked up from the value specified in the connection configuration created by fqConfigure.
CREATE OR REPLACE VIEW new_view AS SELECT * FROM hadoop_table WITH FINAL (fqread('', '', 'select * from hadoop_table where month = "may"'));
The use of an explicitly defined view can simplify the application coding in the user application and also provides a level of transparency regarding the actual location of the data. This eliminates a SQL round trip to retrieve the data schema from Hadoop on each execution. Only the data needs to be retrieved.
The PDA application would query against the VIEW as follows:
select * from new_view;
On Hadoop, the following query will be executed against the database specified in the connection configuration.
select * from hadoop_table where month = 'may';
As you can see, nothing in the application query indicates that the data isn’t local. This also means that BI tools can work with Fluid Query without changes to their SQL generation.
Fluid Query Connector Tips:
When using Fluid Query connector, try to move as much of the processing to the remote Hadoop data source as possible in order to limit the size of data sent back to PDA. This is accomplished by putting column selection and row predicates in the fqRead function call, rather than (only) in the local SELECT statement.
Any SQL syntax able to run on the back-end query engine can be included in the View definition or ad-hoc SQL statement.
Function Support: Built-in aggregate functions like count(*) and analytic functions like OVER can be easily used in the fqRead function.
For more details on the setup and use of Fluid Query data connector see the IBM Netezza Fluid Query User Guide, available with the product.
Hot off the Press
A significant, new release of Fluid Query was announced by IBM and became available on November 13. Fluid Query 1.6 includes a generic connector to allow users to connect IBM Fluid Query to any data provider using a JDBC driver. Users needing support for Microsoft SQL Server, MySQL, or Teradata, can now utilize Fluid Query to access those data sources. Other new examples of data sources that can participate in a remote or federated query are Informix, PostgreSQL, memSQL, and MapR.
In addition to the new generic JDBC connector, compressed data that has been moved from PDA to IBM BigInsights can now be read using Big SQL, IBM’s massively parallel processing SQL engine on Hadoop.
Fluid Query Data Movement
In a Logical Data Warehouse, where data is stored on multiple systems, such as Hadoop and data warehouse appliances, there will be times when analytics performance or application simplicity is best served by moving the data to one of the data servers. Fluid Query includes capability on PDA for moving data to and from these Hadoop systems: IBM BigInsights, Hortonworks, and Cloudera.
The Data Movement feature supports two directions of data flow. Import provides data transfer from PDA to Hadoop. Export provides data transfer from Hadoop to PDA.
There are three modes of data transfer that deal with data compression. In Text mode, the NPS tables are transferred in text format and saved to HDFS in text format. In Mixed mode, the NPS tables are transferred in compressed format and saved to HDFS in text format. In Compressed mode, the NPS tables are transferred in compressed format and saved to HDFS in compressed format.
The procedure to setup and use data movement function varies based on whether data movement is initiated from the Hadoop side or the PureData System for Analytics side. To initiate data movement from the Hadoop system, perform the following:
- Install the Fluid Query package on the Hadoop system and run the installer script (./fluidquery_install.pl)
- Customize the XML configuration files for the import and export commands
- Run nzcodec.jar with the relevant import or export XML configuration file parameter.
To initiate data movement from the PDA system, setup is similar to data connector configuration.
- Install the required Hadoop libraries, Hive libraries, and client configuration files to the PDA system.
- Customize the sample fluidquery/conf/fq-remote-conf.xml configuration file.
- Configure a connection using fqConfigure.sh using the data movement service type fqdm and the above configuration file.
- Register the functions for import and export using fqRegister.sh
- Call the registered functions from the application or query.
It is also possible to initiate data movement from any system that runs Java. For details on that, and additional detail about the above Fluid Query data movement configuration, see the IBM Netezza Fluid Query User Guide.
Hot off the Press
Fluid Query 1.6 contains enhancements to data movement. An entire database can now be imported from PDA into Hadoop. In addition, the append mode supports appending data to tables in Hadoop, allowing for incremental import of table data changes. There are several more functional, operational, and usability improvements in 1.6.
The Logical Data Warehouse provides an architecture for integrating a variety of data sources with the goal of serving the need for analytics and reporting in today’s modern enterprise. It provides flexibility to utilize existing data stores side-by-side with noSQL, Hadoop, and Spark, whether in the cloud or on-premises. IBM Fluid Query provides the most robust capability to perform queries across this heterogeneous environment, and includes fast data movement technology to position data on Hadoop or IBM PureData for Analytics as needed. With Fluid Query, data residing in Hadoop distributions from Cloudera, Hortonworks and IBM BigInsights for Apache Hadoop can be combined with the data residing in PureData using standard SQL syntax.
For existing IBM PureData System for Analytics customers, the latest version of Fluid Query is available for download at IBM FixCentral (www.ibm.com/support). Others interested in Fluid Query and the IBM PureData System for Analytics can download the free PDA Emulator virtual image and Fluid Query 1.6 via instructions on this page at IBM developerWorks - https://ibm.biz/PDAEmulatorDownload.
This material has not been submitted to any formal IBM test and is published AS IS. It has not been the subject of rigorous review. IBM assumes no responsibility for its accuracy or completeness. The use of this information or the implementation of any of these techniques is a client responsibility and depends upon the client's ability to evaluate and integrate them into the client's operational environment.