Data masking: static vs dynamic

The problem of data masking comes up surprisingly often in the world of IT. Any time you need to share some potentially sensitive data, you may need to hide, obfuscate, randomize or otherwise conceal some of that data -- we'll call that the secret data.

In this article, we're going to focus on the mechanics of data masking, and gloss over a massive issue, which is data classification -- knowing who can access what data. Data classification is a whole different problem, especially in organizations that have huge amounts of sensitive data. I'll refer you to a different article that touches on this topic. For the rest of this article, we'll assume that this problem has been solved, and that we do in fact know who can access what data. The question is -- how do we hide the secret data?

Data masking is not just for databases -- it can be applied to documents, spreadsheets and so on, but here we'll focus on databases.

There are many ways to do data masking, but in general they can be divided into two categories, each one with its own upsides and downsides.

Static masking

Static masking is the simplest solution. Given a data set that contains some secret data, you replicate that data set and edit the replica to mask whatever data needs to be masked.

With a database, the process usually consists of copying the database in question into a new database, and then removing or masking the secret data in the copy. You can then give the copy to the client, and they can do whatever they want with it.
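As a rough illustration of that copy-then-mask process, here is a minimal sketch using Python's built-in sqlite3 module. The customers table, its columns and the masking rules are all hypothetical, chosen only for this example; a real database would need rules for every sensitive column.

```python
import sqlite3


def make_masked_copy(source: sqlite3.Connection) -> sqlite3.Connection:
    """Copy `source` into a new in-memory database, then mask the copy."""
    copy = sqlite3.connect(":memory:")
    source.backup(copy)  # step 1: replicate the whole database

    # Step 2: overwrite the secret columns in the copy only.
    # (Hypothetical rules: replace names, keep the last 4 SSN digits.)
    copy.execute("UPDATE customers SET name = 'Customer ' || id")
    copy.execute("UPDATE customers SET ssn = 'XXX-XX-' || substr(ssn, -4)")
    copy.commit()
    return copy


# Build a tiny source database to demonstrate with (made-up data).
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, ssn TEXT)"
)
source.execute("INSERT INTO customers VALUES (1, 'Ada Lovelace', '123-45-6789')")
source.commit()

masked = make_masked_copy(source)
rows = masked.execute("SELECT id, name, ssn FROM customers").fetchall()
print(rows)  # [(1, 'Customer 1', 'XXX-XX-6789')]
```

The source database is never modified: only the copy is edited, and only the copy is handed to the client.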

Of course, for a large data set, this may not be a trivial process. Imagine a relational database with thousands of tables and billions of rows (or more). But there are some (expensive) tools that will help you with that process.

Advantages

It should be obvious that this is a very clean concept. It's the same idea as taking a pair of scissors and cutting out parts of a document. The secret data is simply not present, so there is no risk of leakage. The final user simply does not have the secret data.

For simple databases, you may not even need any tools: a few simple SQL scripts might be enough.

Because the secret data is not present, you can give a physical copy of the masked database to the client and let them run it on their own machines.

Disadvantages

The duplication of the data can be a problem. It requires more storage, and it's one more copy of the database floating around. This is not usually a big problem if, for instance, you are releasing a data set to the public, and therefore there will be only one version of the masked data set.

But if different clients have different requirements, you may need to make many copies of the data set, each one with a potentially different set of rules about which data is masked. And of course, if you have different rules for different clients, you now have to worry about each client getting access only to their own custom version of the data set, and not anyone else's. It can get challenging to track all that.

Another problem is that the copies are usually snapshots of the database, and therefore may need to be refreshed at regular intervals. Every refresh means repeating the copy and the masking, and every repetition is an opportunity for a mistake.

Finally, we live in the era of big data. Some data sets are truly enormous, and making a copy of such data sets can be a daunting proposition.

Dynamic masking

Dynamic masking takes a different approach. Instead of making multiple copies of the data, one for each masking scenario, the data is modified on the fly, as it is accessed, thereby providing each user of the same database with a potentially different view of the data.
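To make the idea concrete, here is a minimal sketch of masking applied at read time, assuming a hypothetical set of roles and per-column masking functions. None of these names come from a real product; the point is only that the stored rows stay intact and each role gets a different view of them.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Masker = Callable[[str], str]


def redact(value: str) -> str:
    """Hide the value entirely."""
    return "****"


def mask_email(value: str) -> str:
    """Keep the first character and the domain, hide the rest."""
    local, _, domain = value.partition("@")
    return local[:1] + "***@" + domain


# Hypothetical rules: which columns are masked, and how, for each role.
MASKING_RULES: Dict[str, Dict[str, Masker]] = {
    "admin": {},
    "support": {"ssn": redact},
    "analyst": {"ssn": redact, "email": mask_email},
}


def read_rows(rows: List[Row], role: str) -> List[Row]:
    """Return a masked view of `rows` for `role`; the stored rows are untouched."""
    rules = MASKING_RULES.get(role)
    if rules is None:
        # Fail closed: an unknown role gets nothing rather than everything.
        raise PermissionError(f"no masking rules defined for role {role!r}")
    return [
        {col: rules[col](val) if col in rules else val for col, val in row.items()}
        for row in rows
    ]


table = [{"name": "Ada", "email": "ada@example.com", "ssn": "123-45-6789"}]
print(read_rows(table, "analyst"))
# [{'name': 'Ada', 'email': 'a***@example.com', 'ssn': '****'}]
```

The same table serves every role; only the view differs, which is the essential contrast with static masking.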

This obviously assumes that the client does not control the database and is accessing it through some sort of network.

Generally speaking, dynamic masking can be done either by the database itself, or by a layer between the database server and the database client.

For instance, Microsoft SQL Server offers some dynamic data masking capabilities, which may be sufficient for many scenarios. I've gone over data masking in SQL Server in a previous article: it's a powerful feature, but it does have some limitations.

There are some third-party solutions that provide data masking outside of the database, but they typically rely on special drivers or special clients. A more generalized approach is based on proxy filtering, which relies on deep packet inspection and editing to mask data before it reaches the client.

Advantages

The biggest advantage of dynamic masking is that, in theory at least, it allows you to use just one database for everyone. This sidesteps the duplication and copy-tracking issues we identified earlier with static masking.

Dynamic data masking also means that you can update the data masking rules, typically on the fly, and restrict or broaden access to certain data for certain clients at any time. Masking can be dependent on more than just who the user is: it can also depend on their IP address, or the time of day, or what DEFCON level we're at -- you get the picture.
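A context-dependent rule of this kind might look like the following sketch. The trusted network range, the business hours and the set of cleared users are all assumptions invented for the example, not defaults of any real tool.

```python
import ipaddress
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class RequestContext:
    user: str
    client_ip: str
    local_time: time


# Hypothetical policy values; in practice these would come from configuration.
TRUSTED_NETWORK = ipaddress.ip_network("10.0.0.0/8")
BUSINESS_HOURS = (time(9, 0), time(17, 0))
CLEARED_USERS = {"alice"}


def may_see_unmasked(ctx: RequestContext) -> bool:
    """All three conditions must hold: cleared user, trusted network, office hours."""
    on_trusted_network = ipaddress.ip_address(ctx.client_ip) in TRUSTED_NETWORK
    in_hours = BUSINESS_HOURS[0] <= ctx.local_time < BUSINESS_HOURS[1]
    return ctx.user in CLEARED_USERS and on_trusted_network and in_hours


def present(value: str, ctx: RequestContext) -> str:
    """Return the real value or a mask, decided per request."""
    return value if may_see_unmasked(ctx) else "****"


office = RequestContext("alice", "10.1.2.3", time(10, 30))
remote = RequestContext("alice", "203.0.113.7", time(10, 30))
print(present("123-45-6789", office))  # 123-45-6789
print(present("123-45-6789", remote))  # ****
```

Because the decision is made on every request, removing a user from CLEARED_USERS takes effect on their very next read, with no copies to regenerate.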

Clients also work with live data rather than a snapshot, so the problem of keeping copies current disappears.

And because dynamic data masking implies that you are controlling the database, you can (and probably should) also monitor what the clients are doing. This is critical for forensic analysis if there is a problem later on (think Cambridge Analytica).

Disadvantages

Dynamic masking is potentially less secure, since users are in fact connecting to a database that contains the secret data. It turns out to be non-trivial to mask data reliably if the client accesses it using a sophisticated query language such as SQL: a client who can write arbitrary queries may be able to infer masked values indirectly, for example by filtering on them. In fact, Microsoft specifically warns about this issue in their SQL Server documentation. The risk can be reduced by restricting which queries clients are allowed to run (query control), if that's an option.
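The following sketch simulates that kind of inference in plain Python. It is a toy model, not a reproduction of any particular database's behavior: it simply assumes that filters are evaluated against the real values while only the returned values are masked. The names and salaries are made up.

```python
from typing import Callable, Dict, List

# The secret data, as stored in the (simulated) database.
SALARIES: Dict[str, int] = {"alice": 83250, "bob": 61000}


def masked_query(predicate: Callable[[int], bool]) -> List[Dict[str, object]]:
    """Filter on the real salary, but return only a masked salary."""
    return [
        {"name": name, "salary": 0}  # the client never sees the real number
        for name, salary in SALARIES.items()
        if predicate(salary)
    ]


def infer_salary(name: str, low: int = 0, high: int = 1_000_000) -> int:
    """Recover a masked salary by binary search over filter results alone."""
    while low < high:
        mid = (low + high) // 2
        rows = masked_query(lambda salary: salary <= mid)
        if any(row["name"] == name for row in rows):
            high = mid  # the real value is at most mid
        else:
            low = mid + 1  # the real value is above mid
    return low


print(infer_salary("alice"))  # 83250
```

About twenty queries are enough to pin down a value in a range of a million, even though every individual result was masked. Restricting which predicates a client may use on masked columns is one way to close this particular hole.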

Dynamic masking is also a more complex solution overall, with more moving parts. The more complex the solution, the more likely it is that something will go wrong.

Conclusion

As is so often the case, there is no perfect solution: there is only a series of trade-offs that need to be weighed against the requirements.

If your data set is of manageable size (which is very much a relative concept here), it may be practical for you to make a copy of your database and do the masking on the copy. If you're OK with the disadvantages we have outlined, that's a great way to do it. Simple solutions are often the most secure.

But if it's impractical to duplicate the data set, especially if you have multiple clients with multiple masking requirements, then dynamic masking may be your only realistic option. In that case, you'll have to consider whether the database can satisfy your requirements, or whether a third-party solution is required. Even if you end up using the data masking provided by your database, you may still benefit from using a third-party tool to manage permissions and data classifications.