Microsoft SQL Server 2012 introduced a feature called data classification, which allows you to mark certain columns with labels, indicating that these columns contain sensitive or special-handling data.
This is a useful way to mark certain columns as having special attributes. For instance, you may want to mark a column containing credit card numbers as "confidential", or sales numbers as "management only".
The problem is that you can mark these columns all you want, but it won't have any immediate effect. SQL Server does not actually do anything with these labels: it's up to you to actually do whatever you need to do in your applications. In fact, data classification can be used for any purpose, not just security.
In this article, we're going to solve this problem by enforcing these classifications, even for database clients that do not support this feature.
1 - Creating data classification
Data classifications can be created either with SQL commands, or with a graphical tool.
For instance you can annotate the first_name and last_name columns of the customers table with:
You can do the same thing using graphical tools such as Microsoft's SQL Server Management Studio:
which allows you to set the classification of whatever columns you want:
That's all fine and dandy, but as mentioned before, these classifications have no effect unless your applications take them into consideration.
This is a situation where Gallium Data can be a big help. We can easily change the traffic between database clients and database servers so that the classifications actually do have an effect, without changing these applications.
2 - How does data classification work?
When your database client executes a SQL command, such as select * from gallium_demo.customers, the response from SQL Server contains some metadata about the data being returned. This metadata includes the names of the columns, their data types, sizes, and so on.
If your client supports it (and some drivers don't, like Node and Python), this metadata also includes the data classifications for the data -- for instance, the fact that column customers.first_name has information type "Personal" with sensitivity "Confidential".
But that's as far as SQL Server goes -- it's up to the database client to then do something with that information.
So there are three main problems here:
This data classification business seems pretty abstract since it has no effect on how the data is handled by the database
The data classification information is not even available to some clients, because their database drivers do not support it.
Even if the database driver supports it, the applications still need to include some special handling to reflect the meaning of the data classification.
Let's fix all that and make data classification actually do something.
3 - What we're going to do (a preview for the impatient)
We're going to start three Docker containers:
one for SQL Server (obviously)
one for SQLPad, which is a generic SQL client (note: anything that can talk to SQL Server would work equally well)
and one for Gallium Data, which will be the proxy between the database client and the database server
We'll then define a filter in Gallium Data to show and hide certain rows and columns, depending on their data classification. By doing so, we'll change how SQL Server seems to work: it will now appear to honor the data classification labels.
The end result will be that the data coming back from SQL queries will change depending on its data classification:
Without a filter
With a filter
4 - Setup
For this tutorial, we'll assume that you have already run the Gallium Data tutorial for SQL Server. If you haven't, we highly recommend that you do so before continuing, to get familiar with the basic operation of Gallium Data.
Also, make sure that Docker is up and running.
Start the Docker containers
⇨ Run the following from a command line:
(we could use e.g. Docker Compose or Kubernetes, but this is the simplest way for a demo)
We now have three Docker containers running:
⇨ Execute a simple query in SQLPad (http://127.0.0.1:3014) to make sure that everything is up and running :
(Note: if it doesn't work right away, give it a few seconds and try again. If it still doesn't work, look at the Docker logs. The most common problem is running out of volume space in Docker, which can usually be solved by running docker system prune --volumes).
5 - Basic level: Enforcing data classification
Let's say you have some application (in our case, SQLPad) that accesses data in SQL Server. You've decided that data classification is a nifty feature that would enhance your application, but you don't want to change that application. After all, it takes a non-trivial amount of work to make these changes, and it may not even be an option if your database driver does not support this feature (as is the case for SQLPad), or if it's a third-party application.
Let's use Gallium Data to do the job.
Step 1: Deploy a response filter
Data classification is often used to restrict access to sensitive data.
In our case, we have a (very) simple table called customers, as shown here. We want only certain people to have access to certain parts of that table.
The database already contains data classification for two columns, as shown in the screenshot below (from Microsoft's SQL Server Management Studio):
For this demo, we're going to enforce the following rule on the customers table:
User "sa" cannot see the contents of columns marked as Confidential if the customer is in Belgium (BE), France (FR) or Spain (ES).
This can easily be done with a simple result set filter.
⇨ In Gallium Data (http://localhost:8089/web/index.html), open the Data classification demo project and open the response filter called Hide customer name
⇨ Look at the Parameters tab
The filter will fire if all the following conditions are met:
the rows are from the customers table
the country column has one of three values (BE, FR, ES)
the current database user is "sa"
⇨ Look at the Code tab
You'll see that, when this filter fires, it will replace the value of the first_name and last_name columns:
⇨ Activate this filter
⇨ Publish to Gallium Data
Step 2: re-run the query
⇨ Re-run your query in SQLPad
The data being returned to the client has now changed, though the database is completely unaffected.
You can play with the code and change how things work, like changing the "hidden" value, or changing the value of other columns.
6 - Intermediate level: using data classification dynamically
So far, we've made assumptions about the data. In particular, our code assumes that there are columns named first_name and last_name. That's fine for simple cases, but it's not really a generalized solution.
For that, we need to consult the metadata. Thankfully, that's really easy.
⇨ Change the code in the filter to:
⇨ Re-run your query in SQLPad
The result is similar to last time, but this time we didn't hard-code the column names: if we add or remove classification information for the columns in this table, they too will automatically get filtered. In addition, this solution works if the SQL query uses column aliases, e.g.:
whereas the first solution did not handle that case.
This filter can now potentially work with any table that might have data classification.
7 - All together now: a generic solution
Let's go from start to finish. We're going to add some data classification to the products table, then we're going to add a filter that hides a column, or even entire rows, depending on that data classification.
Take a look at the products table
⇨ Execute the following in SQLPad
Notice that some products have a high list price -- we may not want everyone to see that.
Add some data classification
⇨ Execute the following in SQLPad
Add a filter
⇨ In Gallium Data, create a new response filter of type MSSQL result set filter
Name it whatever you want, e.g. Filter products according to classification
⇨ Set the parameter Query pattern to: regex:.*select.*gallium_demo.*products.*
This will ensure that this filter will only be executed for the products table. We could make it much more general, but we'll keep it simple for this tutorial.
⇨ Set the code to:
This code does two things:
If the row has a numeric column marked as Sensitive and its value is greater than 1000, the entire row is hidden
If the row has a numeric column marked as Sensitive and its value is greater than 200, its value is set to null and an asterisk is added to the name column, if present
⇨ Re-run the query in SQLPad:
Notice that row id 5 (Elixir of youth) has now been hidden because its list price is over $1,000, and the products with prices over $200 now have null prices, and are highlighted with an asterisk in the name column.
This is a good place to experiment with the filter and make it do whatever you think might make sense here.
I hope you can see by now that there can be much more to data classification than just a cataloging of data -- it can actually inform how your applications work.
You can change the way your applications work without modifying them in any way. This can be a huge boon for third-party applications, or applications that are difficult or impossible to maintain.
If you work in a security-conscious environment, you may also be interested in other things that Gallium Data can do for you, such as restricting queries to a well-known set.
Gallium Data is a very open-ended environment and can be endlessly extended. It's fun to make SQL Server do things that would be almost impossible otherwise, especially when using your own database rather than a simple demo database.
Once you're done with this tutorial, and you want to remove everything that was installed,
⇨ Execute the following commands from a command line:
This will stop all the Docker containers started during this tutorial.
If you also want to remove the Docker images:
This will remove everything downloaded by this tutorial.