Example: matching queries with a regular expression

Introduction

Let's assume that we have a request filter that needs to be invoked whenever a query for the purchase_orders table is received, because we want to do something with the notes column.

The first query we see is:

select id, customer_name, status, notes from demo.purchase_orders

So the obvious thing to do is an exact match. If that's the only query we want to match, great -- we're done.

But it's common to need more flexibility. Maybe the queries are more diverse, e.g.:

select id,notes from "demo"."purchase_orders"

select id,notes, status from [demo].[purchase_orders]

SELECT * FROM DEMO.PURCHASE_ORDERS WHERE STATUS <> 'S'

Matching all of these is possible using regular expressions.

Note: a SQL parser is also an option

In this article, we're going to cover regular expressions, but there is another option. Regular expressions are fairly easy to put in place, but they are relatively imprecise. If you need to do a complete analysis of the SQL, you can use a SQL parser, as explained in this other example. It's more work, but it's also much more precise.

Basic regular expression

Regular expressions in Gallium Data start with the string "regex:". Any special characters they contain must be escaped to keep their original meaning.

We can turn the first query into a regular expression with:

regex:select id, customer_name, status, notes from demo\.purchase_orders

The only difference is the backslash before the period. That's because a period is a special character in regular expressions: it stands for any one character.

By turning this into a regular expression, we've already gained case independence: this regular expression will match any of the following:

select id, customer_name, status, notes from demo.purchase_orders

SELECT ID, CUSTOMER_NAME, STATUS, NOTES FROM DEMO.PURCHASE_ORDERS

Select ID, customer_name, status, notes from DEMO.Purchase_Orders

Sometimes that's not desirable -- if you want to be case-dependent, you can use the prefix REGEX: instead.

Matching columns

That's a good start, but we have a number of other queries we need to match. In particular, we don't really care which columns are selected, only whether or not they include the notes column. We can broaden our regular expression with:

regex:select.*\bnotes\b.*from demo\.purchase_orders

This will now match any of the following queries:

select id, customer_name, status, notes from demo.purchase_orders

select Notes from demo.purchase_orders

SELECT notes, status from DEMO.PUCHASE_ORDERS

But that's not enough: we also need to handle the * operator with:

regex:select\s+(\*|(.*\bnotes\b.*))\s+from demo\.purchase_orders

This will now also match:

select * from demo.purchase_orders

Note the expression at the beginning:

select\s+(\*|(.*\bnotes\b.*))\s+from

Let's break it down:

\s stands for any white space character, i.e. space, carriage return, tab, etc...
therefore \s+ stands for at least one white space character

We then have an expression between parentheses:

(\*|(.*\bnotes\b.*))

which contains two parts separated by a vertical bar:

\*
(.*\bnotes\b.*)

The vertical bar means "or", and the parentheses form a group, so this expression means either the first part or the second part.

The first part \* is escaped with a backslash because asterisk is a special character, therefore it simply means the asterisk character. It takes care of the asterisk in select * from...

The second part (.*\bnotes\b.*) is surrounded by parentheses, which group everything they contain, which is:

.* will match zero or more characters
\b will match a word boundary -- this is because we don't want notes to match if it's part of something else, like other_notes
notes will match the name of the column we're interested in
\b is another word boundary to make sure the word notes is free-standing and not part of another word
.* again, will match zero or more characters

Matching the table

The table name may be expressed in a variety of ways:

demo.purchase_orders

"demo.purchase_orders"

[demo].[purchase_orders]

"demo".[purchase_orders]

To match all of these, we need to specify:

["\[]?demo["\]]?\.["\[]?purchase_orders["\]]?

This surrounds the schema name and the table name with:

["\[]? which will match either a double quote, or an open square bracket (which needs to be escaped with a backslash), or nothing (because of the ? operator)

["\]]? which will match either a double quote, or a close square bracket (which needs to be escaped with a backslash), or nothing

Surroundings

So far we haven't taken into account the fact that the query could be more specific, e.g.:

select notes from [demo].[purchase_orders] where id > 3 or notes in ('S', 'T')

The easiest way to match this is to simply accept anything after the table name:

regex:select\s*(\*|(.*\bnotes\b.*))\s+from\s+["[]?demo["\]]?\.["[]?purchase_orders["\]]?.*

We can do the same at the beginning, in case there is anything in front of the query.

What if we don't trust the queries?

In some scenarios, you may not know exactly what the SQL queries look like, which means that they could be formulated in a way to evade the regular expressions.

For instance, if you wanted to catch any queries from the table Customers, you might think that a regular expression like:

regex:.*\bselect\b.*\bfrom\b.*\bcustomers\b.*

would

Conclusion

Our final expression is:

regex:.*select\s*(\*|(.*\bnotes\b.*))\s+from\s+["[]?demo["\]]?\.["[]?purchase_orders["\]]?.*

which will match a wide variety of select queries.

Whenever you have a regular expression that contains wide open expressions like .*, keep in mind that it will match anything. The more specific the regular expression, the better.

Google Sites

Report abuse