Advanced Pattern Matching with Regular Expressions in DB2 11.1 for LUW

Regular expressions are an integral part of the pattern matching features in all manner of sophisticated software, including operating system commands, text editors, programming languages, and even database engines. As of DB2 11.1 for Linux, UNIX, and Windows (and DB2 for i 7.1 TR9), regular expressions are finally available in DB2 as built-in SQL functions that don't rely on XQuery wrappers or external routines (Java or otherwise). Using sample queries from DB2 11.1 Fix Pack 1 for Windows, this article demonstrates DB2's new, built-in regular expression (REGEXP) functions and offers some points to consider when retrofitting your SQL workload to take advantage of them.

All of the regular expression patterns in this article are linked to descriptive SVG diagrams generated by Regexper, a helpful online tool created by Jeff Avallone. I highly recommend Regexper for explaining and documenting the regex patterns in your project.

Functions to match, count, and replace text patterns

The REGEXP functions are scalar SQL functions in the SYSIBM schema that take in an input string, a regex pattern, and other parameters that vary by function. They closely resemble the regular expression functions first introduced in Oracle 10g Release 1.

  • Pattern matching (returns a boolean result)
  • Extracting/transforming a pattern (returns a string result)

IBM's regular expression cheat sheet for DB2

DB2's regular expression syntax is based on POSIX Extended Regular Expressions, a widely implemented IEEE standard. This grammar is quite versatile and puts a tremendous amount of regular expression capability directly into DB2's SQL engine. The IBM Knowledge Center for DB2 11.1 has a detailed list of the control characters that can be used when building regular expression patterns. Control characters and their behavior can vary between regular expression engines, so it's best to refer to the DB2-specific page when working with the REGEXP functions. In case the Knowledge Center URL changes for that page, the list of control characters can also be found by searching that site for "regular expression control characters."

Converting many LIKEs into one REGEXP_LIKE

The LIKE predicate in SQL provides a simple pattern matching grammar with only three options: literal character, single character wildcard, and multi-character wildcard. Consequently, it doesn't take much for a filter/validation rule to become too complex to describe as a single LIKE predicate. REGEXP functions, on the other hand, use regex patterns that are far more capable.

Example: A retail company wants to extract (Query 1) and geocode their customer address list to help identify potential locations for a new store. To reduce the time and expense of the geocoding process, the extract query will ignore post office box addresses, which wouldn't benefit from geocoding. Instead of assembling a series of LIKE predicates to describe the myriad ways in which a PO box address could be formatted, the query calls REGEXP_LIKE just once so it can evaluate the address strings against a more flexible regex pattern.

WITH cust(street) AS (
-- Six unwanted PO BOX address rows, each formatted slightly differently
VALUES ('PO BOX 1'), ('PO  BOX 2'), ('P.O. BOX 3')
,
('P O BOX  4'), ('P. O.BOX 5'), ('po box 6')
-- and one legitimate address row that belongs in the result set
,('POBOXTON CT ROAD NO. 3')
)

SELECT street FROM cust
WHERE
NOT REGEXP_LIKE( street, '^\s*P\.?\s*O\.?\s*BOX\b', 'i' );

Output:

STREET
----------------------
POBOXTON CT ROAD NO. 3

1 record(s) selected.

Query 1. REGEXP_LIKE evaluates street addresses against a case-insensitive search pattern to identify PO box addresses that need to be filtered out of the result set. The flags parameter is the third argument in this example, but may also appear in the fourth argument when the starting character position is provided in the third argument.


Extracting and replacing text patterns

If your SQL statement only needs to validate an input string against a simple search pattern, you might be able to get by with just some basic string-handling functions, a couple LIKE predicates, and perhaps TRANSLATE. That approach, however, generally won't provide any details beyond a boolean result. If your statement also needs to extract specific portions of the input string that match a pattern, a regular expression function may be your only practical option. REGEXP_EXTRACT, REGEXP_SUBSTR, and REGEXP_REPLACE are three built-in functions that make quick work of complex pattern matching tasks that would otherwise be burdensome or even impossible without regular expressions.

Example: The technical support feature of a mobile app launches a new email message on the customer's phone to fill out and send to the company's help desk. The app pre-populates the message body with the string "Robot Food:" and the customer's ID for future reference. After a batch job on the company's server inserts the received emails into a DB2 table, the email messages can be joined to the customer table (Query 2).

SELECT c.custid, c.fname, c.lname, c.emailaddr, m.msgbody
FROM hdesk.rcvemails AS m
LEFT OUTER JOIN core.customers AS c
ON c.custid = REGEXP_SUBSTR(
m.msgbody, 'Robot Food:\s([0-9a-f]{10})', 1, 1, 'c', 1
)
WHERE m.msgstatus = 'RECEIVED';

 Output:

CUSTID     FNAME  LNAME EMAILADDR          MSGBODY
---------- ------ ----- ------------------ -------
3d5f2ac11e Jolene User  juser58@sample.com Hello, I have a question...

1 record(s) selected.

Query 2: REGEXP_SUBSTR captures and returns a string of 10 hexadecimal characters out of an email message body. The sixth argument limits the output to just the portion of the pattern that occurs within the first capturing group's parentheses, and not the Robot Food: that is matched outside of the capturing group.


An OR clause for regular expressions

Instead of using OR to chain multiple REGEXP_LIKE functions together when there's more than one possible search pattern, the alternation operator, which is the pipe symbol |, can combine those search patterns together as options within a single regex. The regular expression engine will work through the options in the alternation from left to right, exiting the alternation as soon as one of its options is fully matched.

Not only can the REGEXP functions evaluate multiple search patterns in one call, but they can also work across multiple input columns if you concatenate the columns together in the first parameter (Query 3).

Example:

SELECT isocountry, postcode
, CASE WHEN REGEXP_LIKE( isocountry || postcode,
'^US\d{5}(?:[- .]?\d{4})?\b|^CA[A-Z]\d[A-Z]\s\d[A-Z]\d\b'
) THEN 'VALIDATED' ELSE 'FAILED' END AS postcodevalid
FROM TABLE( VALUES
-- These three rows will match one of the options in the alternation
('US','90210'), ('US','97201-3339'), ('CA','D2A 4F2')
-- and these three rows will match neither
,('','12345'), ('US','1234o'), ('CA','1A3 C4G')
) AS unfilledorders( isocountry, postcode );

 Output:

ISOCOUNTRY POSTCODE   POSTCODEVALID
---------- ---------- -------------
US         90210      VALIDATED
US         97201-3339 VALIDATED
CA         D2A 4F2    VALIDATED
          12345      FAILED
US         1234o      FAILED
CA         1A3 C4G    FAILED

  6 record(s) selected.

Query 3: REGEXP_LIKE uses alternation to evaluate country-specific postcode validation patterns for Canada and the United States. Concatenating ISOCOUNTRY and POSTCODE in the first argument makes it possible for both column values to be evaluated within a single REGEXP_LIKE call.


Beware the golden hammer

As tempting as it may be to jump in and embrace a powerful new DB2 enhancement, especially one that's as long-awaited as the REGEXP functions, they can be overkill for simpler tasks. The power of regular expressions comes at the expense of a syntax that is unapologetically terse, perhaps even cryptic, making it more of a challenge to troubleshoot SQL statements that use them. Queries that manage to get the job done via modest SQL functions like TRANSLATE and REPLACE without becoming unwieldy may not be worth the effort and risk of rewriting.  

Performance considerations

Just as seemingly trivial adjustments to SQL statements can greatly impact their access paths, choosing the wrong operator or quantifier for even a small part of a regular expression can incur higher CPU costs due to backtracking. The extra CPU cost of a grossly inefficient regular expression inside a REGEXP call can add up fast when the SQL statement is calling that function on millions of rows of input. Fortunately, there are plenty of books and articles about writing well-behaved regular expressions, and the majority of the advice they offer is applicable to DB2's REGEXP functions. Understanding how and why backtracking occurs, the overhead caused by lazy quantifiers, and the benefits of possessive quantifiers will make it easier to write more efficient regular expressions for DB2 or any other engine.

The need for workarounds has passed

When the task at hand calls for regular expressions in DB2 SQL, I recommend the REGEXP functions over any alternative approach. For many years, embedded XQuery expressions were an excellent way to shoehorn regular expressions into SQL, but it's arguably time to rip those XQuery regex wrappers out of your statements and replace them with REGEXP function calls that are more straightforward and just as capable. The same goes for any Java UDFs or stored procedures that were created solely for regex processing. In terms of pathlength and resource overhead, DB2's SQL query engine can call built-in SYSIBM functions like REGEXP_LIKE and REGEXP_REPLACE far more efficiently than it can execute a Java method running in a fenced JVM process. Wherever possible, adopting the REGEXP functions as a replacement for older workarounds will not only make your SQL statements easier to manage, but it could also lead to better access plans from the optimizer.

About Fred Sobotka
Fred Sobotka is a data management consultant with FRS Consulting. He is an award-winning IDUG conference speaker, an IBM Gold Consultant, and an inaugural IBM Champion who writes and speaks regularly about advanced DB2 topics, business application development, and high-availability monitoring. He started his IT career in 1990 and has worked with DB2 for LUW as a developer and DBA since DB2 Common Server V2.1 in 1996.

 

Recent Stories
SQL Tuning for Developers – Part 1

The Key To Mastering Db2 Locking On Row-Store Tables

Lessons Learned of Locking and Latches by Adrian Burke