In SQL Server 2016, Microsoft introduced a new feature called dynamic data masking, which allows you to mask the values of certain columns and keep some data hidden from certain users without having to modify your applications.
Let's take a look at how SQL Server does data masking, and compare it to the way Gallium Data goes about it.
SQL Server data masking has the main advantage of being built-in
Gallium Data has the main advantage of being much more flexible
For this article, we'll use the following trivial database :
create table demo.customers (
id int not null,
name nvarchar(100) not null,
country char(2) not null,
primary key (id)
)
insert into demo.customers values
(1, 'Andrea', 'AR'),
(2, 'Bruno', 'BE'),
(3, 'Charles', 'CA'),
(4, 'Daniella', 'DK'),
(5, 'Eric', 'ES')
Let's assume you want to partially or completely hide the customers' names. With SQL Server, you can define a dynamic mask with:
alter table demo.customers
alter column name
add masked with (function = 'partial(1,"XXXXXXX",0)')
Any user (other than the owner of the table and DBAs) will then see the data as masked, e.g.:
execute as user='test_user'
select * from demo.customers
revert
id name country
1 AXXXXXXX AR
2 BXXXXXXX BE
3 CXXXXXXX CA
4 DXXXXXXX DK
5 EXXXXXXX ES
You may wish to exempt some users from this type of masking with:
grant unmask to test_user
but that's a database-wide permission: that user will now have unmasked access to all masked data.
Overall, this is a nice feature, and it's fairly easy to use, but it does have some limitations. What if we need a mask that depends on the value, or on the user? You can only use a few built-in functions to do masking, and you cannot use your own functions.
Data masking works well if your users are limited to a fixed set of SQL queries, which is often the case. But if database users can create their own SQL queries, they can easily bypass data masking with queries such as:
execute as user='test_user'
select * from demo.customers where name like 'Br%'
revert
id name country
2 BXXXXXXX BE
which allows the user to determine that the second letter of customer 2's name is 'r'. Obviously it's pretty easy to extend that and make an end-run around the masking. That's fine: data masking is not meant to be a security measure at the database level, it's just meant to help you make your applications more secure.
Gallium Data takes a very different approach. It acts as a smart proxy between database clients and database servers, and executes user-defined logic (expressed as filters and JavaScript code) that can modify that network traffic when appropriate.
For data masking, Gallium Data has two options: either modify the SQL command on its way to SQL Server, or filter the result set coming back from SQL Server.
A simple request filter in Gallium Data can look for the statement:
select id, name, country
from demo.customers where name like '?'
and rewrite it to:
select id, substring(name, 1, 1) + 'XXXX' as name, country
from demo.customers where name like '?'
This works well if you know in advance what SQL commands to expect.
You can of course be as fancy as you want in how you rewrite the query:
select id,
case
when country in ('ES', 'DK', 'BE') then substring(name, 1, 1) + '€€€€'
when country in ('US', 'CA') then substring(name, 1, 1) + '$$$$'
else name
end as name,
country
from demo.customers
where name like '?'
which will make the mask depend on the country column. Note that this rewriting can change depending on who the user is, the time of day, or any other conditions.
The other (often complementary) option is to change the result set as it comes back from SQL Server. This is more expensive because we'll potentially execute code for every row in the result set, but modern computers are amazingly fast.
The JavaScript code for the filter might look something like:
let pkt = context.packet;
if (pkt.country == 'BE' || pkt.country == 'DK' || pkt.country == 'ES') {
pkt.name = pkt.name.substring(0, 1) + "€€€€";
}
else if (pkt.country == 'US' || pkt.country == 'CA') {
pkt.name = pkt.name.substring(0, 1) + "$$$$";
}
You now have complete freedom over how or whether to mask the data.
Built-in
Best performance
Requires another system
Not as performant as SQL Server
Masking is completely up to you
Masking can be depend on the value of any columns
Masking can be different depending on the user, the user's IP address, etc...
Masking can be done on computed columns
Masking/unmasking can be done with surgical precision
Masking can only be done using a limited set of functions
Masking is uniform -- all rows get the same mask
There is only one masking option per column -- everyone gets the same
Masking only works on stored columns
Masking/unmasking is all-or-nothing per user
These two approaches have different merits. They are not exclusive: it's possible to use both to implement different aspects of the requirements.
SQL Server has a simple mechanism that is pretty much all-or-nothing. This is appropriate for many applications that require nothing more.
Gallium Data gives you a lot more flexibility and may be preferable for the more complex cases when the masking requirements are not straightforward. Perhaps you need to mask only some rows for some users, or perhaps the masking depends on the data itself. You get to decide, without limitations.
If you're intrigued, you may be interested to see how Gallium Data can also do row-level security, or work with SQL Server's data classification feature to restrict data access, and how Gallium Data can easily restrict queries based on prior behavior.