Example: matching queries with a regular expression

Introduction

Let's assume that we have a request filter that needs to be invoked whenever a query for the purchase_orders table is received, because we want to do something with the notes column.

The first query we see is:

select id, customer_name, status, notes from demo.purchase_orders

So the obvious thing to do is an exact match. If that's the only query we want to match, great -- we're done.

But it's common to need more flexibility. Maybe the queries are more diverse, e.g.:

select id,notes from "demo"."purchase_orders"

select id,notes, status from [demo].[purchase_orders]

SELECT * FROM DEMO.PURCHASE_ORDERS WHERE STATUS <> 'S'

Matching all of these is possible using regular expressions.

Basic regular expression

Regular expressions in Gallium Data start with the string "regex:". Any special characters they contain must be escaped to keep their original meaning.

We can turn the first query into a regular expression with:

regex:select id, customer_name, status, notes from demo\.purchase_orders

The only difference is the backslash before the period. That's because a period is a special character in regular expressions: it stands for any one character.

By turning this into a regular expression, we've already gained case independence: this regular expression will match any of the following:

select id, customer_name, status, notes from demo.purchase_orders

SELECT ID, CUSTOMER_NAME, STATUS, NOTES FROM DEMO.PURCHASE_ORDERS

Select ID, customer_name, status, notes from DEMO.Purchase_Orders

Sometimes that's not desirable -- if you want to be case-dependent, you can use the prefix REGEX: instead.

Matching columns

That's a good start, but we have a number of other queries we need to match. In particular, we don't really care which columns are selected, only whether or not they include the notes column. We can broaden our regular expression with:

regex:select.*\bnotes\b.*from demo\.purchase_orders

This will now match any of the following queries:

select id, customer_name, status, notes from demo.purchase_orders

select Notes from demo.purchase_orders

SELECT notes, status from DEMO.PUCHASE_ORDERS

But that's not enough: we also need to handle the * operator with:

regex:select\s+(\*|(.*\bnotes\b.*))\s+from demo\.purchase_orders

This will now also match:

select * from demo.purchase_orders

Note the expression at the beginning:

select\s+(\*|(.*\bnotes\b.*))\s+from

Let's break it down:

  • \s stands for any white space character, i.e. space, carriage return, tab, etc...

  • therefore \s+ stands for at least one white space character

We then have an expression between parentheses:

(\*|(.*\bnotes\b.*))

which contains two parts separated by a vertical bar:

  • \*

  • (.*\bnotes\b.*)

The vertical bar means "or", and the parentheses form a group, so this expression means either the first part or the second part.

The first part \* is escaped with a backslash because asterisk is a special character, therefore it simply means the asterisk character. It takes care of the asterisk in select * from...

The second part (.*\bnotes\b.*) is surrounded by parentheses, which group everything they contain, which is:

  • .* will match zero or more characters

  • \b will match a word boundary -- this is because we don't want notes to match if it's part of something else, like other_notes

  • notes will match the name of the column we're interested in

  • \b is another word boundary to make sure the word notes is free-standing and not part of another word

  • .* again, will match zero or more characters

Matching the table

The table name may be expressed in a variety of ways:

demo.purchase_orders

"demo.purchase_orders"

[demo].[purchase_orders]

"demo".[purchase_orders]

To match all of these, we need to specify:

["\[]?demo["\]]?\.["\[]?purchase_orders["\]]?

This surrounds the schema name and the table name with:

["\[]? which will match either a double quote, or an open square bracket (which needs to be escaped with a backslash), or nothing (because of the ? operator)

["\]]? which will match either a double quote, or a close square bracket (which needs to be escaped with a backslash), or nothing

Surroundings

So far we haven't taken into account the fact that the query could be more specific, e.g.:

select notes from [demo].[purchase_orders] where id > 3 or notes in ('S', 'T')

The easiest way to match this is to simply accept anything after the table name:

regex:select\s*(\*|(.*\bnotes\b.*))\s+from\s+["[]?demo["\]]?\.["[]?purchase_orders["\]]?.*

We can do the same at the beginning, in case there is anything in front of the query.

Conclusion

Our final expression is:

regex:.*select\s*(\*|(.*\bnotes\b.*))\s+from\s+["[]?demo["\]]?\.["[]?purchase_orders["\]]?.*

which will match a wide variety of select queries.

Whenever you have a regular expression that contains wide open expressions like .*, keep in mind that it will match anything. The more specific the regular expression, the better.