The Challenge in a Nutshell
As I was sitting in on a deep dive session on Netezza at the IBM zIM (z Information Management) Boot Camp in late January 2013, I was having a bit of a déjà vu from way back when I was first introduced to DB2 UDB EEE (Enterprise Extended Edition). My experience was that when it behaved well it was extremely fast, which luckily enough was mostly the case. However, when a query turned bad there were certain fundamentals about MPP (Massive Parallel Processing) and data distribution that one needed to understand in order to be able to address the challenge.
Quick Introduction to IDAA
IBM’s IDAA offering is based on the Netezza technology and as an appliance it provides a turnkey, high-performance data warehouse/advanced analytics engine. The integrated database, server and storage solution is based on the IBM S-Blade architecture using x86 processors.
The Netezza box behind IDAA, (also known as the Accelerator when on the z Platform) is based on two key design elements to address the need for speed. On one side it’s all about dividing and conquering, using an AMPP (Asymmetric Massively Parallel Processing) approach to process workloads. Whereas the other differentiator is to use of FPGAs (Field Programmable Gate Arrays), which are specialized processors separate from normal CPU processing.
The functions that the FPGA performs on the Netezza box are to efficiently decompress data and apply relational operations like project based on the select list (i.e. reduce columns to what is actually needed) and restrict data using the where clause (i.e. return only rows that are relevant). Once data has been reduced to just what is needed it is passed on in uncompressed format to normal CPU cores for more complex processing like joins and aggregation.
The bottom line is that in terms of performance IDAA provides unprecedented response times to enable 'train of thought' type of analyses that otherwise could be prevented by having poor query performance.
In addition, there is deep integration with DB2 for z/OS, which provides transparency to all applications. This transparency may be considered at two levels. Where the query executes (DB2 vs. Accelerator) is handled seamlessly, but obviously performance may be noticeably different.
Quite often data being copied onto the Netezza box is live operational data and the need for data to be continuously maintained (i.e. trickle-feed) has been a high priority requirement for many users. Now that trickle-feed updates have been enabled using log based near real-time replication (new feature added in V3.1), the query results will be consistent regardless of where the query executes.
We are talking about self-managed workloads where queries are executed in the most efficient location. Once the system is enabled and sessions can be established there are three other preconditions that need to be met for a query to be selected to run on the Accelerator:
- All tables referenced in the query need to be on the Accelerator.
- Heuristics (complex vs. transactional queries) will determine if the query is allowed to run on the Accelerator.
- The SQL syntax being used must be supported by the Accelerator (there are still some gaps to be closed, but with each release of IDAA fewer gaps remain).
Finally, the solution includes simplified administration through the application of hands-free operations, eliminating most database tuning tasks. Although, as already alluded there are cases where we need to understand what goes on under the hood.
This combination of traditional mainframe technology and blade technology is also referred to as hybrid architecture and is the way forward when it comes to bringing additional speed to the z/OS platform.
What Can Go Wrong?
In this article we will only focus on query related poor performance, which in essence can be caused by the following factors:
- Inefficient data distribution
- Ineffective data organization/clustering
- Intermediate skew
- Cartesian product joins (aka cross joins)
- Nested loop joins (aka expression joins)
- Disk based hash joins
These challenges lead to two fundamental design objectives; maximizing distribution both from a data and a processing perspective, as well as ensuring that related data is collocated within so-called data slices. From a processing point of view, data slices are also referred to as SPUs (Snippet Processing Units); however, before addressing these design objectives in more detail we will first take a look at data organization and clustering. In the last part of this article we will go into some more join specific challenges.
Managing Data Organization/Clustering
Organization and clustering of data is defined with the ORGANIZE ON clause, specifying a set of columns. The order of the columns is immaterial, since IDAA clusters records inside a chunk using Hilbert space-filling curve in N-dimensional space. Hence, a predicate subset of organizing columns has benefit as well, in terms of reducing the chunks that need to be scanned.
IDAA automatically initiates re-clustering of the table as needed and it is only done for larger tables (i.e. cardinality of millions and above). Up to four columns may be used to perform this multidimensional clustering of data into chunks (i.e. with more than four columns you start to see a diminishing effect).
Netezza processes data in chunks of approximately 600MB (per data slice) at a time and as an end result the min-max range of so-called Zone Map values for each extent (3MB) has been minimized. Such Zone Map values can be exploited to minimize the amount of extents that need to be scanned. Zone Map values are automatically maintained for Date, Timestamp and any size of Integer.
To improve performance of incrementally updating tables, it is recommended to use one or more columns of the primary key as part of the organize clause.
Managing Data Distribution
When defining the table the DISTRIBUTE ON clause can be used to control how data is distributed. Data placement is determined based on a hash algorithm taking into account how many SPUs (aka data slices) are available in the MPP configuration.
Since hashing is deterministic (i.e. same value always results in the same hash) and it is done based on binary values (i.e. hash (0) <> hash (‘0’)) it is important to choose columns that have a high cardinality and support good even data distribution, as well as ensuring alike columns are defined with consistent data types.
Furthermore good candidate columns for distribution keys are those frequently used as join criteria and aggregated on. When combining columns together it only makes sense when you join on all of them.
Finally, one also should avoid columns that are often used in the where clause, since such restriction of data could lead to skewed processing (i.e. even though the data as such has a perfect even distribution we may end up processing on only a subset of the SPUs).
IDAA’s default choice of distribution is RANDOM, which may be convenient, but could lead to disaster. In particular, when trying to join large tables, since the data has in fact been distributed using round-robin (slightly confusing but this is what RANDOM implies on Netezza) and as such all alike rows must first be put together onto the same SPU for join and aggregation processing. If joins are typically only between a single huge fact table and a number of small dimension tables the default distribution would work quite well as a self-contained analytic application.
Managing Data Collocation
When performing joins between a set of large tables it is essential from a design point of view to do everything possible to maintain the divide and conquer principles, which basically means rows that need to be joined together should be collocated within the same SPU/data slice.
For example, this could be from the banking industry where you have client accounts and business transactions. By organizing both tables on the account number being the typical join criteria, this will ensure that the join operation can take place locally within each data slice. As no data needs to be moved around between SPUs performance is maximized.
If data is not collocated on the joining column(s) some level of data movement is required to take place between data slices to complete the query. There are two different methods used for data movement, either redistribute or broadcast. As a general rule of thumb redistribute typically happens when two or more large tables, not already collocated, are being joined together, whereas broadcast is considered more for small tables like dimension tables.
For instance if only one of the tables in our banking example was distributed on account number then rows from the non-distributed table would have to be redistributed based on hashing to collocate with relevant data on other data slices. Such a scenario could also happen with both tables being distributed on account number, but used different data types for storing data.
If both tables are non-distributed (i.e. RANDOM) then we have the worst case scenario as both tables are candidates for redistribution. This would also be the case if both tables are distributed, but completely different from the join criteria being used.
Redistribution takes place automatically on the fly, but there is a performance penalty to this, which has to be paid every time the query is run. It should however be noted that there is optimization going on to efficiently handle data redistribution.
Within each data slice the relevant part of the table is scanned, restricted to only scan the records of interest and projecting only relevant columns, and then only particular resulting rows are send to other data slices as required.
If the optimizer recognized that small dimension tables are being joined with a large fact table then no redistribution will take place, but rather the smaller table will be broadcasted to all SPUs for local processing.
The SPUs will send their individual records from the table to be broadcasted to the host processor, where they will be consolidated and then re-broadcasted to each SPU.
Some Specifics on Joins and Side Effects
Before going into the specifics let’s quickly refresh what various types of join access paths entail:
- Nested loop join is the process of going through each qualifying row in the outer table and use the join criteria to match up with rows in the inner table.
- Cartesian product join is when all qualifying rows in one table are being matched up with all the qualifying rows in another table
- Hash join is where each of the qualifying rows in the inner table is being stored (typically in memory) using a hash key for direct memory access and then qualifying rows from the outer table can quickly be matched up.
Cartesian product joins work well on small tables, but as tables get bigger the cost grows exponentially – a challenge we are familiar with.
The challenge with disk based hash join is typically a result of the optimizer having chosen a wrong course. In a hash join (normally performed in memory), you want the large table to be the outer table. If the inner table is the larger table, you’ll tend to have more data values to hash causing the hash join to have to go disk based (i.e. sequentially scanning to find the hash value, since indexing doesn’t exist on Netezza).
Within IDAA, when referring to the nested loop Join challenge it typically has to do with the cases using range predicates, where collocation of data is not predictable and results in data being sent all over the place. Following are a couple of cases leading to such a join strategy.
- Where t1.ColA > t2.ColB
- Where t1.DateCol Between t2.DateCol And t2.DateCol + 14 Days
- Using ‘OR’ with join conditions containing columns from different tables on both sides of it
When data becomes skewed as a result of data redistribution during query execution it is referred to as intermediate data skew. Table joins can be performed only if matching rows of the tables being joined are physically located on the same SPU. Prior to joining tables, IDAA will have to dynamically ensure that related data is collocated, which as mentioned earlier can be done in one of two ways. Either broadcast one of the tables (i.e. creating a copy on every SPU) or redistribute data based on joining key(s). Such redistribution of data might cause data skew and has two typical types of incarnations:
- During equijoin, where the optimizer decided to redistribute the bigger (and possibly skewed) table instead of broadcasting the smaller of the two tables
- During an outer join that forces a skewed redistribution
To summarize what we have learned, here are some best practice design guidelines for IDAA and Netezza:
- For your DISTRIBUTE ON clause, select columns that ensure good distribution across data slices and always choose data types consistently.
- For your ORGANIZE ON clause, select columns that help reduce the amount of data to be scanned by means of having additional Zone Map values maintained.
- Always keep in mind how tables will be joined together and which queries may be your most performance critical in order to drive good design decisions to maximize colocation of data and therefore optimize performance.
Some Useful Links