Overview

Microsoft SQL Server 2012 introduced a feature called data classification, which allows you to mark certain columns with labels, indicating that these columns contain sensitive or special-handling data. 

This is a useful way to mark certain columns as having special attributes. For instance, you may want to mark a column containing credit card numbers as "confidential", or sales numbers as "management only".

The problem is that you can mark these columns all you want, but it won't have any immediate effect. SQL Server does not actually do anything with these labels: it's up to you to actually do whatever you need to do in your applications. In fact, data classification can be used for any purpose, not just security.

In this article, we're going to solve this problem by enforcing these classifications, even for database clients that do not support this feature.

1 - Creating data classification

Data classifications can be created either with SQL commands, or with a graphical tool.

For instance you can annotate the first_name and last_name columns of the customers table with:

add sensitivity classification to

gallium_demo.customers.first_name, gallium_demo.customers.last_name

with ( label='Confidential', information_type='Personal', rank=critical )

You can do the same thing using graphical tools such as Microsoft's SQL Server Management Studio:

which allows you to set the classification of whatever columns you want:

That's all fine and dandy, but as mentioned before, these classifications have no effect unless your applications take them into consideration.

This is a situation where Gallium Data can be a big help. We can easily change the traffic between database clients and database servers so that the classifications actually do have an effect, without changing these applications.

2 - How does data classification work?

When your database client executes a SQL command, such as select * from gallium_demo.customers, the response from SQL Server contains some metadata about the data being returned. This metadata includes the names of the columns, their data types, sizes, and so on.

If your client supports it (and some drivers don't, like Node and Python), this metadata also includes the data classifications for the data -- for instance, the fact that column customers.first_name has information type "Personal" with sensitivity "Confidential".

But that's as far as SQL Server goes -- it's up to the database client to then do something with that information.

So there are three main problems here:

Let's fix all that and make data classification actually do something.

3 - What we're going to do (a preview for the impatient)

We're going to start three Docker containers: 

We'll then define a filter in Gallium Data to show and hide certain rows and columns, depending on their data classification. By doing so, we'll change how SQL Server seems to work: it will now appear to honor the data classification labels.

The end result will be that the data coming back from SQL queries will change depending on its data classification:

Without a filter

With a filter

4 - Setup

For this tutorial, we'll assume that you have already run the Gallium Data tutorial for SQL Server. If you haven't, we highly recommend that you do so before continuing,  to get familiar with the basic operation of Gallium Data.

Also, make sure that Docker is up and running.

Start the Docker containers

Run the following from a command line:

(we could use e.g. Docker Compose or Kubernetes, but this is the simplest way for a demo)

docker network create gallium-demo-net

docker run -d --name gallium-mssql --network gallium-demo-net galliumdata/gallium-data-demo-mssql:4

docker run -d --name gallium-data --network gallium-demo-net -p 8089:8080 -e repository_location=/galliumdata/repo_classif galliumdata/gallium-data-engine:1.8.3-1851

docker run -d --name gallium-sqlpad --network gallium-demo-net -p 3014:3000 -e SQLPAD_AUTH_DISABLED=true -e SQLPAD_CONNECTIONS__demo__name="Gallium Data demo" -e SQLPAD_CONNECTIONS__demo__driver=sqlserver -e SQLPAD_CONNECTIONS__demo__host=gallium-data -e SQLPAD_CONNECTIONS__demo__database=GalliumDemoDB -e SQLPAD_CONNECTIONS__demo__username=sa -e SQLPAD_CONNECTIONS__demo__password=Password1 sqlpad/sqlpad

We now have three Docker containers running:

Execute a simple query in SQLPad (http://127.0.0.1:3014) to make sure that everything is up and running :

select * from gallium_demo.customers

(Note: if it doesn't work right away, give it a few seconds and try again. If it still doesn't work, look at the Docker logs. The most common problem is running out of volume space in Docker, which can usually be solved by running docker system prune --volumes).

5 - Basic level: Enforcing data classification

Let's say you have some application (in our case, SQLPad) that accesses data in SQL Server. You've decided that data classification is a nifty feature that would enhance your application, but you don't want to change that application. After all, it takes a non-trivial amount of work to make these changes, and it may not even be an option if your database driver does not support this feature (as is the case for SQLPad), or if it's a third-party application. 

Let's use Gallium Data to do the job.

Step 1: Deploy a response filter

Data classification is often used to restrict access to sensitive data.

In our case, we have a (very) simple table called customers, as shown here. We want only certain people to have access to certain parts of that table.

The database already contains data classification for two columns, as shown in the screenshot below (from Microsoft's SQL Server Management Studio):

For this demo, we're going to enforce the following rule on the customers table:

This can easily be done with a simple result set filter.

In Gallium Data (http://localhost:8089/web/index.html), open the Data classification demo project and open the response filter called Hide customer name

Look at the Parameters tab

The filter will fire if all the following conditions are met:

Look at the Code tab

You'll see that, when this filter fires, it will  replace the value of the first_name and last_name columns:

let pkt = context.packet;

let hiddenValue = "<hidden>";

if (pkt.columnHasSensitivity("first_name", "Confidential")) {

    pkt.first_name = hiddenValue;

}

if (pkt.columnHasSensitivity("last_name", "Confidential")) {

    pkt.last_name = hiddenValue;

}

Activate this filter

Publish to Gallium Data

Step 2: re-run the query

Re-run your query in SQLPad

The data being returned to the client has now changed, though the database is completely unaffected.

You can play with the code and change how things work, like changing the "hidden" value, or changing the value of other columns.

6 - Intermediate level: using data classification dynamically

So far, we've made assumptions about the data. In particular, our code assumes that there are columns named first_name and last_name. That's fine for simple cases, but it's not really a generalized solution.

For that, we need to consult the metadata. Thankfully, that's really easy.

Change the code in the filter to:

let pkt = context.packet;

for (let meta of pkt.columnMetadata) {

    if (pkt.columnHasSensitivity(meta.columnName, "Confidential")) {

        pkt[meta.columnName] = "#";

    }

}

⇨ Publish

⇨ Re-run your query in SQLPad

The result is similar to last time, but this time we didn't hard-code the column names: if we add or remove classification information for the columns in this table, they too will automatically get filtered. In addition, this solution works if the SQL query uses column aliases, e.g.:

select first_name as fn, last_name as ln, country from gallium_demo.customers

whereas the first solution did not handle that case.

This filter can now potentially work with any table that might have data classification.

7 - All together now: a generic solution

Let's go from start to finish. We're going to add some data classification to the products table, then we're going to add a filter that hides a column, or even entire rows, depending on that data classification.

Take a look at the products table

Execute the following in SQLPad

select * from gallium_demo.products

Notice that some products have a high list price -- we may not want everyone to see that.

Add some data classification

⇨ Execute the following in SQLPad

add sensitivity classification to

gallium_demo.products.list_price

with ( label='Sensitive', information_type='Commercial', rank=high )

Add a filter

⇨ In Gallium Data, create a new response filter of type MSSQL result set filter

Name it whatever you want, e.g. Filter products according to classification

⇨ Set the parameter Query pattern to: regex:.*select.*gallium_demo.*products.*

This will ensure that this filter will only be executed for the products table. We could make it much more general, but we'll keep it simple for this tutorial.

⇨ Set the code to:

let pkt = context.packet;

for (let meta of pkt.columnMetadata) {

    if (pkt.columnHasSensitivity(meta.columnName, "Sensitive") && 

            meta.typeInfo.typeName === 'numeric') {

        if (pkt[meta.columnName] > 1000) {

            pkt.remove();

        }

        else if (pkt[meta.columnName] > 200) {

            pkt[meta.columnName] = null;

            if (pkt.name) {

                pkt.name += " >>";

            }

        }

    }

}

This code does two things:

⇨ Publish

Re-run the query in SQLPad:

select * from gallium_demo.products

Notice that row id 5 (Elixir of youth) has now been hidden because its list price is over $1,000, and the products with prices over $200 now have null prices, and are highlighted with an asterisk in the name column.


This is a good place to experiment with the filter and make it do whatever you think might make sense here.

Conclusion

I hope you can see by now that there can be much more to data classification than just a cataloging of data -- it can actually inform how your applications work. 

You can change the way your applications work without modifying them in any way. This can be a huge boon for third-party applications, or applications that are difficult or impossible to maintain.

If you work in a security-conscious environment, you may also be interested in other things that Gallium Data can do for you, such as restricting queries to a well-known set

Gallium Data is a very open-ended environment and can be endlessly extended. It's fun to make SQL Server do things that would be almost impossible otherwise, especially when using your own database rather than a simple demo database.

Cleanup

Once you're done with this tutorial, and you want to remove everything that was installed,

⇨ Execute the following commands from a command line:

docker stop gallium-mssql
docker stop gallium-sqlpad
docker stop gallium-data
docker rm gallium-mssql
docker rm gallium-sqlpad
docker rm gallium-data
docker network rm gallium-demo-net

This will stop all the Docker containers started during this tutorial.

If you also want to remove the Docker images:

docker rmi sqlpad/sqlpad
docker rmi galliumdata/gallium-data-engine:1.8.3-1851
docker rmi galliumdata/gallium-data-demo-mssql:4

This will remove everything downloaded by this tutorial.