Hiding data in Cassandra:

 hands-on tutorial

This tutorial is the hands-on companion to the article titled Hiding data in Cassandra

It will make a lot more sense if you read the article first.

This is an advanced tutorial: it assumes some minimal familiarity with Gallium Data. If you have never used Gallium Data before, we strongly recommend you go through one of the basic tutorials first (it only takes 10 minutes)

In this tutorial, we're going to implement some of the data masking techniques described in the article.

We're going to start three Docker containers:

The running system will look like this:

There are many ways of managing multiple containers, such as Docker Compose, Kubernetes or Nomad, but for this tutorial we will simply start them by hand, to make sure every component is visible and clearly understood.

Step 2: Start the three Docker containers

If this is your first time running this tutorial, note that it will download about 1GB of Docker images, which can take a while on slower connections.

1 - Create the Docker network

Run this from a command line:

docker network create gallium-net

2 - Start the Cassandra database 

Run this from a command line:

docker run -d --rm --name gallium-cassandra --network gallium-net galliumdata/cassandra-demo:1

3 - Start Gallium Data

Run this from a command line:

docker run -d --rm --name gallium-data --network gallium-net -p 8089:8080 -e repository_location=/galliumdata/repo_cassandra galliumdata/gallium-data-engine:1.8.3-1851

4 - Start the query parser

Run this from a command line:

docker run -d --rm --name gallium-cqlparser --network gallium-net galliumdata/galliumdata-cqlparser:1.0.0-18

We now have all three containers running and ready to go.


Run a Cassandra query

Let's connect to Cassandra using cqlsh and make sure everything works properly:

Run this from a command line:

docker exec -it gallium-cassandra cqlsh -k gallium_demo gallium-data

Hint: if this fails, try again after a few moments, Cassandra can take a few seconds to warm up.

We are connecting to Cassandra through Gallium Data, not directly -- this will allow us to do all kinds of interesting things.

You will get the Cassandra command line prompt:

Connected to Test Cluster at gallium-data:9042

[cqlsh 6.1.0 | Cassandra 4.1.1 | CQL spec 3.4.6 | Native protocol v5]

Use HELP for help.

cqlsh:gallium_demo>

Make sure you can see some data:

Run this from the Cassandra command line:

select * from customers;

You will see some data:

 id | country | first_name | last_name     | phones

----+---------+------------+---------------+--------------------------------------------------------

 23 |      WF |      Wanda |      Williams | {'home': {country_code: 123, phone_number: '123-45-...

  5 |      ES |       Eric |       Edmunds | {'home': {country_code: 55, phone_number: '123-45-6...

 10 |      JP |     Juliet |     Jefferson | {'home': {country_code: 110, phone_number: '123-45-6...

 16 |      PE |   Patricia |         Pérez | {'home': {country_code: 116, phone_number: '123-45-6...

etc...

Everything is in place, now let's do some data masking.

Rejecting queries

We saw in the article that one approach is simply to reject any queries that try to access data they're not supposed to access. Let's create a filter that does just that.

⇨ Connect to Gallium Data at: http://127.0.0.1:8089

⇨ Create a new request filter of type Query filter - Cassandra (call it e.g. "Reject customer phones queries")

⇨ Set the Query patterns parameter to: regex:.*\bSELECT\b.*\bcustomers\b.*

⇨ Set the code to:

let json = context.utils.doPost("http://gallium-cqlparser:8099", context.packet.query);

let cqlInfo = JSON.parse(json);

if (cqlInfo.columns.length === 0 || cqlInfo.columns.find(e => e === "phones" || e.startsWith("phones."))) {

    context.result.success = false;

    context.result.errorCode = 1234;

    context.result.errorMessage = "You are not allowed to select certain columns in this table.";

}

This code sends the CQL from the query to Gallium Data's CQL parser service, which can parse any CQL command and tell us exactly what that command does, and which tables/columns are involved.

If the query involves the phones column, or one of its attributes, we reject the query with an error message.

Make sure this new filter is active and published.

Run these queries, one at a time:

SELECT * FROM customers;

SELECT id, country, first_name, last_name, phones from customers;

Both will fail with our error message:

ErrorMessageSub: You are not allowed to select certain columns in this table.

In contrast, a query that does not try to access the phones column will work fine:

SELECT id, country, first_name, last_name FROM customers;


 id | country | first_name | last_name

----+---------+------------+---------------

 23 |      WF |      Wanda |      Williams

  5 |      ES |       Eric |       Edmunds

 10 |      JP |     Juliet |     Jefferson

 16 |      PE |   Patricia |         Pérez

etc...

Rewriting queries

Instead of rejecting any query that tries to access the phones column, we can modify those queries  on the fly so that they do not expose the column.

Change the code in the filter to:

let cql = context.packet.query;

let json = context.utils.doPost("http://gallium-cqlparser:8099", cql);

let cqlInfo = JSON.parse(json);


if (cqlInfo.columns.length === 0) {

    context.result.success = false;

    context.result.errorCode = 1234;

    context.result.errorMessage = "You are not allowed to select all columns in this table.";

}

if (cqlInfo.columns.includes("phones")) {

    cql = cql.replaceAll(/\bphones\b/g, " blobAsText(textAsBlob('****')) as phones ");

    context.packet.query = cql;

}

Make sure you've published, and run a query:

SELECT id, country, first_name, last_name, phones FROM customers;


 id | country | first_name | last_name     | phones

----+---------+------------+---------------+--------

 23 |      WF |      Wanda |      Williams |   ****

  5 |      ES |       Eric |       Edmunds |   ****

 10 |      JP |     Juliet |     Jefferson |   ****

 16 |      PE |   Patricia |         Pérez |   ****

etc...

Eagle-eyed readers may notice that this will not work properly if the query uses a subselector like phones['country_code']: this is easily remediated, but we wanted to keep this tutorial as simple as possible.

IMPORTANT: De-activate this filter before proceeding further to avoid confusion.

Modifying result sets 

We can change result sets as they come back from Cassandra -- this gives us enormous flexibility.

In Gallium Data, create a new response filter of type Cassandra result set filter (call it "Mask customer phones")

⇨ Set its parameters to: Keyspace pattern=gallium_demo, Table pattern=customers

⇨ Set the code to:

let phones = context.row.phones;

if (phones && phones.home) {

    phones.home.phone_number = "####";

}

Publish the new response filter

⇨ Re-run the query:

SELECT id, country, first_name, last_name, phones FROM customers;


 id | country | first_name | last_name     | phones

----+---------+------------+---------------+---------------------------------------------------------

 23 |      WF |      Wanda |      Williams | {'home': {country_code: 123, phone_number: '####'), ...

  5 |      ES |       Eric |       Edmunds | {'home': {country_code: 55, phone_number: '####', ...

 10 |      JP |     Juliet |     Jefferson | {'home': {country_code: 110, phone_number: '####', ...

 16 |      PE |   Patricia |         Pérez | {'home': {country_code: 116, phone_number: '####', ...


etc...

Notice how the phones column is now visible, but the phone_number has been hidden.

We can also hide specific rows, based on whatever condition makes sense to us.

Add the following code to the end of the filter code:

// Hide customers whose home phone number is in Japan

if (phones && phones.home && phones.home.country_code === 110) {

    context.row.remove();

}

⇨ Don't forget to publish

Re-run the query -- the customer in Japan has now disappeared.

Nothing has changed in the database -- we're only changing the network traffic between the database client and the database server.


Again, eagle-eyed readers may notice that this will not work as expected if the query sets an alias for the phones column, e.g.:

SELECT id, phones as myPhones FROM customers;

This is easily remediated by using the parsing service again to determine what aliases, if any, were part of the query:

let json = context.utils.doPost("http://gallium-cqlparser:8099", context.connectionContext.currentQuery);

let cqlInfo = JSON.parse(json);


let colName = "phones";

let colIdx = cqlInfo.columns.indexOf(colName);

if (colIdx == -1) {

    return;

}

if (cqlInfo.aliases) {

    let alias = cqlInfo.aliases[colIdx];

    if (alias) {

        colName = alias;

    }

}


if (context.row[colName].home) {

    context.row[colName].home.phone_number = '####';

    if (context.row[colName].home.country_code === 110) {

        context.row.remove();

    }

}

This can be optimized by caching the result of the parsing for the whole request/response cycle.

Conclusion

Hiding data with a proxy is straightforward and does not require any changes to either the database clients or the database servers. The techniques shown here are not exclusive -- they can be combined, nor are they exhaustive -- you may even come up with new ones!

Gallium Data is available for a variety of databases:

and is free for end-users.

Cleanup

To stop all Docker containers started during this tutorial:

Run this from a command line:

docker stop gallium-cassandra

docker stop gallium-cqlparser

docker stop gallium-data

If you also want to remove the Docker images that were downloaded:

Run this from a command line:

docker rmi galliumdata/cassandra-demo:1

docker rmi galliumdata/galliumdata-cqlparser:1.0.0-18

docker rmi galliumdata/gallium-data-engine:1.8.3-1851

We'd love to hear from you -- good or bad! Please drop us an email and let us know if you have any questions or comments.


feedback at gallium data dot com