Introduction to Regular Expressions in OPAL

This month’s topic covers regular expression operations found in OPAL. Regular expressions, or regexes for short, are a way to match text against a user-specified pattern. It’s a huge topic and one that many people struggle with because the syntax can be confusing at times. But when you need it, there are few better tools than regexes.

If you haven’t used regular expressions much, you might not know there isn’t one universal syntax. While the basics generally work with all implementations, each has a slightly different set of options and extensions. Since Observe is built on Snowflake, OPAL’s regular expressions use the same POSIX Extended Regular Expression syntax Snowflake does.

Check out this Snowflake docs page if you use regexes with other tools to see how POSIX ERE differs. And if you need a general refresher on how regular expressions work, Regular-Expressions.info has many beginner tutorials and examples.

Regular Expressions in OPAL

In our last OPAL blog post, we introduced regex filter expressions to filter datasets, but there are also ways to match and extract text from individual fields. We won’t cover everything you can do with regexes in OPAL, but here are a couple of functions (and one verb) that you might find useful in your scripts.

Function: match_regex() and match_regex_all()

Use match_regex() to determine if the contents of a field contains your desired pattern. The expression will return true if the pattern matches and false if it doesn’t. This function is often combined with if() to identify data for further manipulation. For example, look for failed webhook requests and classify them by the type of failure:

// Look for these two error messages:
// The first is a request that failed to send at all
// The second is a request that got a response, but it was an error

filter msg1 ~ /Error sending triggered notification webhook request | Status code not successful when sending triggered notification webhook request/ 

// Identify which type of error by looking for "Status code"
// New field msg2 contains a new, shorter, message

make_col msg2:if(match_regex(msg1,/Status code/),"Response Error", "Sending Error")

On the other hand, match_regex_all() returns an array containing all matches in the target string. Use this if you want the matching values found in a particular field. Here’s an example that extracts a list of IP addresses from the field log:

make_col ip_list:match_regex_all(string(log), /([0-9]{1,3}\.){3}[0-9]{1,3}/)

Function: replace_regex()

Unsurprisingly, replace_regex() replaces matching patterns with new text. This example removes extra whitespace in a log of SQL queries, and updates the existing QUERY_TEXT field in place:

make_col QUERY_TEXT:replace_regex(QUERY_TEXT, /\s+/, ' ')

You don’t have to replace the contents of the field, you can simply create a new field instead. If you want to see which values got updated, compare old and new:

make_col query_text_revised:replace_regex(QUERY_TEXT, /\s+/, ' ')
make_col whitespace_updated:if(QUERY_TEXT = query_text_revised, false, true)

Verb: extract_regex

Our next verb, extract_regex creates new fields from the matched data. Yes, you can combine match_regex_all() with make_col(), but extract_regex allows you to use named capture groups to extract multiple values.

What are named capture groups? Well, most regular expression parsers allow you to reference multiple matches by their position in the regex (e.g. 1, 2, 3, etc.), but some allow assigning names to these matches, called capture groups. In theory, this makes them much easier to work with rather than dealing with indices.

In OPAL, extract_regex creates new fields for each named capture group. This example creates two new fields, jobName and buildId, from build data found in Jenkins logs:

extract_regex path, /\/var\/lib\/jenkins\/jobs\/(?P<jobName>[^\/]+)\/builds\/(?P<buildId>\d+)\/log/

To better understand how named capture groups work with extract_regex, let’s look at the complete example.

The source data (Jenkins Build Logs) contains JSON objects with several properties. However, none of those properties are the jobName or buildId. To get those, you have to extract them from the path.

{
  "_account_id": "12345",
  "_fluentd_host": "ip-n-n-n-n",
  "_instance_id": "i-0e4459876c3ad679c",
  "_tag": "observe.tail.jenkins",
  "_time": "2022-06-30 14:46:41.951628404 -0700",
  "message": "[server] 6/30/2022 - 9:46:41 PM - \u001b[32minfo\u001b[39m: 9kn9uvygxcuwagaj4fc667jluh6mc3e5 - URL - /releases/localdev/mainapp/icons-chrome~835610e3.js",
  "path": "/var/lib/jenkins/jobs/master-periodic-integration-cypress-test/builds/3522/log"
}

Here’s how those values are extracted using OPAL:

// New column for build log path (not strictly required, but useful)
make_col path:string(event.path)

// Extract the job name as jobName and build ID as buildId from the path field
extract_regex path, /\/var\/lib\/jenkins\/jobs\/(?P<jobName>[^\/]+)\/builds\/(?P<buildId>\d+)\/log/

// Filter to eliminate null job names
filter not is_null(jobName)

Learn More

This guide has hopefully given you what you need to start using the power of regexes in OPAL. As always, we can’t cover everything in a single post so for more examples and best practices for regexes, check out the docs page for our regex functions, as well as the page for extract_regex.

If you’d like to know more about regexes in OPAL, or are stuck and need a bit of help, visit our Slack support channel (#opal) and one of our engineers should be able to assist you.

All Blogs
In this article