Simple, Highly Scalable and Distributed Query Processing with IBM Queryplex


Data is the cornerstone of business, and it exists everywhere. You can find it in databases and big data clusters, in branch offices and manufacturing devices, in ATMs, cars, trucks and sensors. The typical enterprise may have hundreds of data repositories, each with its own data formats and spread across different lines of business. Data scientists and reporting applications need to query across many of those sources to derive insights for the business.

Queryplex allows you to virtualize many repositories as if they were a single database, and to do so with simplicity, scale and speed. You can query many repositories as one system without sacrificing performance or moving the data.

With current technology, there are two common patterns for performing multi-database analytics and virtualization.

The most common pattern is to pull all the data into one single repository so that the data can be accessed in one place. This paradigm for multi-repository analytics adds complexity to your business and incurs significant cost to orchestrate the movement, storage and security of the data. Beyond those issues, the data is always stale, leaving your results anywhere from minutes to days out of date. Even more insidious, you are reducing the amount of compute available to perform your analytics: the thousands or millions of processors in the data repositories are reduced to the relatively small number in the central repository.

Another typical pattern is an edge computing model with a coordinating server connecting directly with each repository. Each repository only knows about its own data and cannot work with any other source in the system. This paradigm requires a very large coordinating server to manage the multitude of connections to all of the individual data repositories. It also typically suffers in performance as a result of higher data transfer requirements between the sources and the coordinating server.

IBM Queryplex is a new data virtualization technology that disrupts these traditional paradigms and allows real-time distributed analytics across many sources without the need to move or copy the data.

Game-Changing Virtualization

There are three key aspects which Queryplex brings as a data virtualization solution.

First, Queryplex allows you to query data anywhere. Each data repository becomes a member of the Queryplex constellation via a small software agent. These sources can be local to one data center or broadly distributed throughout the world. They can be accessed with remarkable speed due to the fault-tolerant and latency-aware nature of the constellation. Queryplex agents operating in the network find each other automatically and collaborate to perform distributed computation on the full set of data repositories.

Second, many different types of data repositories can participate in the constellation. This allows your data analytics to operate against many different kinds of sources at once, without the need to connect and code specifically for the different databases. But even more powerfully, this allows the analytics to perform work across multiple types of sources without having to explicitly deal with the differing types and formats. The Queryplex beta release has support for many of the major relational data repository types and will soon include various non-relational sources as well.

Third, due to the close network ties between data repositories within the Queryplex constellation, those sources are able to work collaboratively to answer the analytical queries in your applications. Nearby nodes work together to perform aggregations, joins and other operations, instead of sending data back to a coordinating service for processing. Other technologies allow remote processing at a single source, but only Queryplex allows cross-source remote processing at the sources themselves. This collaboration makes full use of the processing capability of each repository, leading to enhanced performance for your analytics.

Queryplex-image1.png

IBM Queryplex processing model in comparison to traditional federation or edge computing.

Simplicity is Fundamental

The Queryplex constellation consists of the service coordinator and a number of agents that are provisioned in the network. The coordinator receives queries and returns results to the application, while the agents work collaboratively to perform the parallel distributed processing for analytics.

You may think it will be difficult to implement Queryplex, but simplicity is one of the key tenets of the system. As the number of data repositories grows, traditional technologies suffer from increasing complexity in the management of configurations, connections and data mappings. Queryplex makes most of the configuration automatic, and any operation that does require direct input takes place via the service coordinator interface.

There are several facets of the configuration that Queryplex makes simple.

Queryplex is a Docker-based deployment. Provisioning the service coordinator only requires pulling the image and running it. Once the container has been provisioned, the only thing left to do is to provide Queryplex with two network ports for the constellation to use and the hostname of one server near the data repositories where a Queryplex agent is going to be installed.

Establishing Queryplex agents in your network, nearby or collocated with the data repositories, is the next critical step in establishing the constellation for distributed processing. The agent can be installed via Docker or with a direct install, and a single command starts, configures and secures it. As each node starts up, it receives a TLS certificate from the service coordinator for secure encrypted communications. It then finds other nodes that are nearby in the network and starts building the constellation automatically.

Once the constellation is established, Queryplex automatically finds data repositories that are running near the agents and gives you a simple wizard to configure and connect to those repositories in bulk. Credentials for the data repositories are always stored encrypted and only within the agent actually making the connection. Queryplex maintains these connections and allows you to add new repositories at any time. For the Queryplex beta, many common relational sources are supported.

Queryplex-image2.jpg

Table data discovery is perhaps the most critical aspect of simplification that Queryplex gives you. Production databases may contain thousands of tables, and adding those tables manually to a virtualization system is often not feasible. Instead, Queryplex finds the tables available in the connected data repositories for you. Beyond simply listing them, it merges tables from different repositories, accounting for differences in schema, column names, data types and the number of columns, to give you a truly unified definition. The unified definition can be used to query all possible variations of the data as it exists in different repositories in the constellation.

For example, you may have table variations in different repositories with different capitalization of the column names, differing data types or extra columns. These variations may be a result of different application versions which have forced changes in the underlying data format or even differences caused by the repositories themselves. Queryplex will account for these differences and provide a unified view of your data.
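To make the idea concrete, here is a minimal Python sketch of how table variations like these could be merged into one definition. It is not Queryplex's actual algorithm: the function name unify_schemas, the type widening order in TYPE_RANK and the sample repositories are all assumptions made for illustration. Column names are matched without regard to case, the wider of two data types wins, and extra columns are kept.

```python
from typing import Dict, List, Tuple

# Hypothetical widening order, for illustration only:
# a type later in the list is treated as "wider" than an earlier one.
TYPE_RANK = ["SMALLINT", "INTEGER", "BIGINT", "DECIMAL", "DOUBLE", "VARCHAR"]

Column = Tuple[str, str]  # (column name, SQL type name)


def unify_schemas(variants: Dict[str, List[Column]]) -> List[Column]:
    """Merge per-repository column lists into one unified definition."""
    unified: Dict[str, str] = {}  # normalized column name -> widest type seen
    for columns in variants.values():
        for name, sql_type in columns:
            key = name.upper()  # cust_id and CUST_ID are the same column
            sql_type = sql_type.upper()
            current = unified.get(key)
            if current is None or TYPE_RANK.index(sql_type) > TYPE_RANK.index(current):
                unified[key] = sql_type
    return list(unified.items())


# Two hypothetical repositories holding variations of the same table.
variants = {
    "repo_a": [("cust_id", "INTEGER"), ("amount", "DECIMAL")],
    "repo_b": [("CUST_ID", "BIGINT"), ("Amount", "DOUBLE"), ("region", "VARCHAR")],
}
unified = unify_schemas(variants)
print(unified)
```

In this sketch the extra region column from repo_b becomes part of the unified definition; a query against the unified table would then need some rule, such as returning NULL, for repositories that lack the column.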

Finally, Queryplex simplifies the queries themselves. Your applications do not need to change the queries that they generate. Queryplex uses the IBM common SQL engine, supporting queries in many common language variants including DB2, Oracle, Sybase and Netezza versions of SQL. Beyond that, the queries do not need to specify the particular data repositories that contain the data, and they do not need to be written in a special way to enable distributed computation. Queryplex handles all of that for you.

Due to the nature of query processing in Queryplex, your custom applications and existing analytical tools, including R, Spark, Cognos, Tableau, Data Science Experience and Plot.ly, can use Queryplex without modification.

Query Performance

A virtualization system cannot help your business if querying across repositories is slow. Queryplex uses a collaborative computation mechanism that enables repositories in the constellation to work together. Many query patterns execute far faster in Queryplex than in traditional multi-source data processing architectures. Queries that contain operations which can execute in parallel, such as aggregations and joins, and those which tend to reduce data volumes at the source are ideal. The geographic distribution of data repositories is not a key factor in performance for Queryplex. The work happens in parallel at the individual repositories and is further processed collaboratively with results from other nearby agents and sources. As a result, only query results ever have to be transmitted across the network and never the raw data.
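The effect of reducing data volumes at the source can be illustrated with a small Python sketch. The repository names, the sample rows and the helper local_group_sum are hypothetical, and the code only models the idea of aggregating before transmitting; it does not reflect how Queryplex plans or executes queries. It compares how many rows would cross the network for a grouped sum when raw rows are shipped versus when each source ships one partial result per group.

```python
# Rows held by three hypothetical repositories: (region, sale amount).
repos = {
    "london": [("EU", 10.0), ("EU", 5.0), ("US", 2.0)],
    "toronto": [("US", 7.0), ("US", 3.0)],
    "san_jose": [("US", 4.0), ("EU", 1.0), ("US", 6.0)],
}


def local_group_sum(rows):
    """Aggregate at the source: one running total per region."""
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
    return totals


# Pattern 1: ship every raw row to a central server.
raw_rows_shipped = sum(len(rows) for rows in repos.values())

# Pattern 2: aggregate locally and ship only the per-group partial sums.
partials = {name: local_group_sum(rows) for name, rows in repos.items()}
partial_rows_shipped = sum(len(p) for p in partials.values())

# Merging the partial sums gives the same answer as summing the raw rows.
final = {}
for partial in partials.values():
    for region, total in partial.items():
        final[region] = final.get(region, 0.0) + total

print(raw_rows_shipped, partial_rows_shipped, final)
```

With only eight sample rows the saving is small, but the number of partial rows depends on the number of groups rather than the number of raw rows, so the gap widens as the tables grow.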

Queryplex-image3.jpg

Demonstration constellation spanning from the United Kingdom to Toronto, Canada and to San Jose, California with a mix of data source types.

As queries are processed in the constellation, results from each of the data repositories are further combined and processed at each agent. The results from the local repository are combined with those of nearby Queryplex agents, making full use of the computational capability of all the data sources. This keeps processing local to the source and results in increased performance.
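One way to picture this agent-level combination is as a tree of agents, each merging its local partial result with the partial results of its neighbors before passing a single combined value onward. The Python sketch below computes an average this way. The Partial and Agent classes, the tree shape and the sample values are assumptions for illustration and are not the Queryplex implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Partial:
    """Mergeable state for AVG: carry the sum and count, not the average."""
    total: float = 0.0
    count: int = 0

    def merge(self, other: "Partial") -> "Partial":
        return Partial(self.total + other.total, self.count + other.count)


@dataclass
class Agent:
    name: str
    local_values: List[float]
    neighbors: List["Agent"] = field(default_factory=list)

    def evaluate(self) -> Partial:
        # Start with the partial result from the local repository...
        state = Partial(sum(self.local_values), len(self.local_values))
        # ...then fold in the combined partial from each nearby agent.
        for neighbor in self.neighbors:
            state = state.merge(neighbor.evaluate())
        return state


# Hypothetical constellation: the UK agent reports through Toronto.
uk = Agent("uk", [10.0, 20.0])
toronto = Agent("toronto", [30.0], neighbors=[uk])
san_jose = Agent("san_jose", [40.0, 50.0, 60.0])
coordinator = Agent("coordinator", [], neighbors=[toronto, san_jose])

result = coordinator.evaluate()
average = result.total / result.count
print(result, average)
```

Carrying the sum and the count separately is a deliberate choice in this sketch: averaging the per-agent averages would weight a repository with one row the same as a repository with thousands, and would give the wrong answer.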

Security

In our world today, security is critical for any data virtualization solution.

Queryplex uses strong encryption for all communications and credentials. Data always stays secure in the original data repository using the security mechanisms you already have in place. Beyond the source, Queryplex provides the ability for custom policy filters to further control access and has independent user authentication and data access permissions for the applications that connect to it. 

Conclusion

In conclusion, Queryplex is an exciting new technology to simplify the analytics in your organization. It can efficiently virtualize anything from a small number of data repositories to thousands of connected sources spanning large geographic distances. Query performance does not suffer as the number of sources grows, and it allows advanced analytics without moving the data.

I invite you to join the next wave of distributed analytics by participating in our open beta. Visit our website at http://www.queryplex.com/ to view demo videos and join the free trial.

We would appreciate your feedback to help guide the future of data virtualization at IBM.
https://www.surveymonkey.com/r/ibm-data-virtualizaiton


Robert Neugebauer
Development Architect for IBM Queryplex and Data Virtualization
neugebau@ca.ibm.com
IBM Corporation

 
