Transparent encryption and decryption
Most databases have a mechanism to store encrypted data. If you want to encrypt data, using the database's native mechanism is usually the best choice.
But what if you have an existing application that does not use encrypted data, but needs to? What if you can't or don't want to change your database? What if you want to encrypt only certain data items?
This is the type of scenario where Gallium Data excels.
In principle, encrypting data on its way to the database, and decrypting it on its way back to the client, should be fairly simple, but there are several issues to consider:
Storage: encrypted data is usually bigger than the original data, so the database needs to be able to store this larger data. This is not an issue for MongoDB.
Queries: encrypted data cannot be queried by the database, since the database knows nothing about the encryption. In some cases, you can encrypt data partially to keep it somewhat queryable: for instance, you could encrypt most of a credit card number but leave the last 4 digits in the clear.
Performance: decrypting data is fairly fast, but it does add a cost to data retrieval.
Key management: it's usually bad practice to keep the encryption key in code. Key management is a vast subject, and there are many ways to deal with secrets (see for instance Docker secrets or Kubernetes Secrets).
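To illustrate the partial-encryption idea from the Queries point above, here is a minimal sketch that runs under Node.js, outside of Gallium Data. The function names, the "crypt:" marker, the throwaway key and the choice of AES-256-GCM are all assumptions made for this illustration; they are not taken from the tutorial's filter.

```javascript
const crypto = require("crypto");

// Assumption: a throwaway random key, for illustration only.
const demoKey = crypto.randomBytes(32);
const MARKER = "crypt:"; // illustrative marker for encrypted values

// Encrypt all but the last `clearCount` characters; keep the tail readable.
function encryptPartially(value, clearCount = 4) {
  const split = Math.max(0, value.length - clearCount);
  const iv = crypto.randomBytes(12);
  const cipher = crypto.createCipheriv("aes-256-gcm", demoKey, iv);
  const body = Buffer.concat([
    cipher.update(value.slice(0, split), "utf8"),
    cipher.final(),
  ]);
  // Pack IV (12 bytes), auth tag (16 bytes) and ciphertext together.
  const packed = Buffer.concat([iv, cipher.getAuthTag(), body]).toString("base64");
  return MARKER + packed + ":" + value.slice(split);
}

// Read the clear tail without needing any key.
function clearTail(stored) {
  return stored.slice(stored.lastIndexOf(":") + 1);
}

// Recover the full value. Assumes the clear tail contains no ":" character.
function decryptPartially(stored) {
  const last = stored.lastIndexOf(":");
  const raw = Buffer.from(stored.slice(MARKER.length, last), "base64");
  const decipher = crypto.createDecipheriv("aes-256-gcm", demoKey, raw.subarray(0, 12));
  decipher.setAuthTag(raw.subarray(12, 28));
  const hidden = Buffer.concat([decipher.update(raw.subarray(28)), decipher.final()]);
  return hidden.toString("utf8") + clearTail(stored);
}
```

Because the tail is stored in the clear at the end of the string, a database could still match on it, for example with a pattern anchored at the end of the value, while the rest stays opaque.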
In this example, we'll be encrypting (and decrypting) data using the tutorial for MongoDB. If you have not run the tutorial, you'll need to at least start the MongoDB service, and of course you'll need to start Gallium Data. As a MongoDB client, you can either use Mongo Express as described in the tutorial, or you can use your own client if you prefer, as long as that client behaves like Mongo Express. If you have any doubts, using Mongo Express is probably the wise choice at first.
Caution: This is a simple example; it is not intended to be used as is in the real world. Among other things:
encryption uses the DES algorithm, which is considered insecure and easily broken, but we're using it in this example because it's simple
the secret is embedded in the code; in a real deployment, you would retrieve it from an external source
this does not address encrypting pre-existing data, though it could easily be extended to encrypt on read
this does not address a change in the encryption key, or having multiple encryption keys
this does not hide the length of the data, which could be guessed fairly accurately from the encrypted data
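As a rough illustration of retrieving the secret from an external source, the sketch below looks for a base64-encoded key in an environment variable, then in a file (Docker secrets, for instance, are mounted as files under /run/secrets). The loadKey helper and its parameters are hypothetical, and the code is written for Node.js; inside a filter, the same idea would rely on whatever environment or file access the server makes available.

```javascript
const fs = require("fs");

// Hypothetical helper: try an environment variable first, then a file.
// Both are expected to hold the key encoded as base64.
function loadKey(envName, filePath) {
  const fromEnv = process.env[envName];
  if (fromEnv) {
    return Buffer.from(fromEnv, "base64");
  }
  if (filePath && fs.existsSync(filePath)) {
    // Trim the trailing newline that secret files often end with.
    return Buffer.from(fs.readFileSync(filePath, "utf8").trim(), "base64");
  }
  throw new Error("No encryption key found in " + envName + " or " + filePath);
}
```

A call might look like loadKey("COMPANIES_KEY", "/run/secrets/companies_key"), where both names are made up for this example. Failing loudly when no key is found is a deliberate choice here: silently falling back to a default key would defeat the purpose.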
We have a MongoDB collection called companies that contains objects that look like this:
We have a new requirement that new companies that have a category_code of "secret" should have an encrypted twitter_username attribute, but we don't want to affect the various applications that use this database, nor do we want to change the database.
We'll need to encrypt the twitter_username attribute on insert and on update for "secret" companies.
We'll also need to decrypt that attribute when the client runs a query. Note that encrypted values will not be queryable.
Creating the filter
The code for this filter is shown in fragments here, with a description.
The complete code is at the end.
This filter will be called for both requests (of type QUERY) and responses (of type REPLY), so we'll have to treat the two separately.
We'll only be dealing with inserts and updates done with QUERY packets. We won't address inserts and updates done using other means (like INSERT, UPDATE and MSG packets). This is mostly because Mongo Express uses QUERY packets for inserts and updates, but it shouldn't be difficult to extend this example to these other mechanisms. For the equivalent solution using MSG packets, see the bottom of this page.
This filter will need a number of encryption-related objects, which we don't want to create every time the filter runs. We will cache these objects in the filterContext object, that way they'll always be available to this filter once they're created.
The filter code starts with:
In this code, we look up a number of Java classes we'll need later.
We then create cipher objects for the current thread, if necessary.
We could be more naive here and just create the ciphers for every execution, but that would be expensive. If you're dealing with small amounts of data and light traffic, though, that might be a perfectly viable option.
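The create-once-and-reuse pattern described above can be sketched in isolation. In this Node.js illustration, a plain object stands in for the long-lived context supplied by the server, and the cached item is a derived key rather than a Java Cipher; the names fakeContext, getCryptoState and setupCount are invented for the example.

```javascript
const crypto = require("crypto");

// Stand-in for a long-lived context object supplied by the server.
// Assumption: it behaves like a plain object that survives between runs.
const fakeContext = {};
let setupCount = 0; // counts how often the expensive setup actually runs

// Create the expensive objects on first use, then reuse them.
function getCryptoState(ctx) {
  if (!ctx.cryptoState) {
    setupCount++;
    ctx.cryptoState = {
      // Key derivation is deliberately slow, so it is worth caching.
      key: crypto.scryptSync("not-a-real-secret", "example-salt", 32),
      prefix: "crypt:",
    };
  }
  return ctx.cryptoState;
}
```

Calling getCryptoState(fakeContext) repeatedly returns the same object and runs the setup only once. With objects that are not thread-safe, the context used for caching would need to be one that is private to each thread.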
Next, we define two functions, one to do the encryption, one for decryption:
The encryptData function simply replaces the twitter_username attribute's value with "crypt:" followed by the encrypted value.
The decryptData function works in reverse: it looks at the twitter_username attribute and, if it starts with "crypt:", it decrypts it.
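For readers who want to experiment outside of Gallium Data, here is a minimal Node.js sketch of two functions with the same behavior. It is not the tutorial's filter code: it swaps DES and Java's Cipher for AES-256-CBC from Node's crypto module, uses throwaway keys, and derives the IV from the plaintext so that the same input always produces the same output, which matters when an encrypted value has to match what is already stored.

```javascript
const crypto = require("crypto");

const PREFIX = "crypt:";
// Assumption: throwaway keys for the demo; real keys would come from a secret store.
const encKey = crypto.randomBytes(32);
const ivKey = crypto.randomBytes(32);

// Replace twitter_username with "crypt:" followed by the encrypted value.
function encryptData(doc) {
  const value = doc.twitter_username;
  if (typeof value !== "string" || value.startsWith(PREFIX)) {
    return doc; // nothing to encrypt, or already encrypted
  }
  // Derive the IV from the plaintext so equal inputs give equal outputs.
  const iv = crypto.createHmac("sha256", ivKey).update(value, "utf8").digest().subarray(0, 16);
  const cipher = crypto.createCipheriv("aes-256-cbc", encKey, iv);
  const body = Buffer.concat([cipher.update(value, "utf8"), cipher.final()]);
  doc.twitter_username = PREFIX + Buffer.concat([iv, body]).toString("base64");
  return doc;
}

// Reverse of encryptData: only touches values that start with "crypt:".
function decryptData(doc) {
  const value = doc.twitter_username;
  if (typeof value !== "string" || !value.startsWith(PREFIX)) {
    return doc;
  }
  const raw = Buffer.from(value.slice(PREFIX.length), "base64");
  const decipher = crypto.createDecipheriv("aes-256-cbc", encKey, raw.subarray(0, 16));
  doc.twitter_username = Buffer.concat([
    decipher.update(raw.subarray(16)),
    decipher.final(),
  ]).toString("utf8");
  return doc;
}
```

One limitation of this prefix scheme, in the sketch at least: a plain value that happens to start with "crypt:" would be mistaken for an encrypted one.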
Next, we deal with QUERY requests that include an insert or update object:
In this code, we check that the packet is for the companies collection (either as an insert or an update), and that the object being inserted or updated has a category_code of "secret". If it does, we encrypt the twitter_username attribute.
In the case of updates, we have to deal with two documents: one (named q) is the current state of the object, the other (named u) is the new state of the object. If the q object does not match what's in the database, the update will not be applied, so we have to do the encryption in both if relevant.
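The decision logic for inserts and updates can be sketched independently of the packet API. In the Node.js illustration below, the request shape ({ collection, inserts, updates }) is entirely hypothetical and does not reflect how Gallium Data exposes QUERY packets, and fakeEncrypt is a base64 placeholder, not encryption. What it tries to show is that q and u are each checked on their own, so an update that turns a plain company into a "secret" one leaves q untouched.

```javascript
// Placeholder for real encryption: base64 only, so the flow is visible.
function fakeEncrypt(doc) {
  const value = doc && doc.twitter_username;
  if (typeof value === "string" && !value.startsWith("crypt:")) {
    doc.twitter_username = "crypt:" + Buffer.from(value, "utf8").toString("base64");
  }
}

function isSecret(doc) {
  return doc != null && doc.category_code === "secret";
}

// Hypothetical request shape:
//   { collection, inserts: [doc, ...], updates: [{ q, u }, ...] }
function handleRequest(req) {
  if (req.collection !== "companies") {
    return req; // ignore all other collections
  }
  for (const doc of req.inserts || []) {
    if (isSecret(doc)) fakeEncrypt(doc);
  }
  for (const upd of req.updates || []) {
    // q must match what is stored, u is what will be stored:
    // each one is encrypted only if it describes a "secret" company.
    if (isSecret(upd.q)) fakeEncrypt(upd.q);
    if (isSecret(upd.u)) fakeEncrypt(upd.u);
  }
  return req;
}
```

For q to match the stored document after encryption, the real encryption has to be deterministic, that is, the same input must always produce the same output.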
Finally, we deal with REPLY responses:
In this code, we check whether the REPLY packet has a cursor.ns attribute set to "test.companies", since we want to ignore all other collections.
We then go over the objects in the cursor, and if any of them have a twitter_username attribute with a value that starts with "crypt:", we decrypt it.
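The response side can be sketched the same way. This Node.js illustration assumes the reply is available as a plain object with a cursor holding ns and a firstBatch or nextBatch array, as in MongoDB cursor replies; how Gallium Data actually exposes REPLY packets is not shown here, and fakeDecrypt only reverses a base64 placeholder rather than performing real decryption.

```javascript
const PREFIX = "crypt:";

// Placeholder for real decryption: reverses a base64 stand-in encoding.
function fakeDecrypt(doc) {
  const value = doc && doc.twitter_username;
  if (typeof value === "string" && value.startsWith(PREFIX)) {
    doc.twitter_username = Buffer.from(value.slice(PREFIX.length), "base64").toString("utf8");
  }
}

// Assumed reply shape: { cursor: { ns, firstBatch: [...] } } or nextBatch.
function handleReply(reply) {
  const cursor = reply.cursor;
  if (!cursor || cursor.ns !== "test.companies") {
    return reply; // ignore all other collections
  }
  const docs = cursor.firstBatch || cursor.nextBatch || [];
  for (const doc of docs) {
    fakeDecrypt(doc); // leaves documents without the prefix untouched
  }
  return reply;
}
```

Documents that were never encrypted pass through unchanged, so the same code can serve collections that contain a mix of plain and encrypted values.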
In this example, we're using Java's Cipher class, which is not thread-safe. It is therefore critical that each thread in the server have its own Cipher objects. This is easy to do using the context.threadContext object. On execution, we check whether the ciphers have already been created for the current thread, and if they have not, we create them. This is a compromise: we still create quite a few ciphers, but we avoid the complexity of creating pools of ciphers and managing them.
Testing the code
To test this code, you'll need to insert a new company such as:
or update an existing company so that its category_code is "secret".
If you then look at the database after deactivating the filter, you'll see that the twitter_username attribute is encrypted, e.g.:
As soon as you reactivate the filter, you will only see the decrypted data.
A few notes
This filter is about 90 lines of code. That's a fair amount of code for a filter. For filters that are significantly more complex than this, you may consider writing a Java library and calling it from a (much shorter) filter.
The caching strategy employed here is a good pattern: whatever you can do once and reuse afterwards, stick it in filterContext or another long-lived context, and avoid doing it again for every invocation of the filter.
Here's all the code from above in one place.
Equivalent code for MSG packets
This does the same thing for clients that use MSG packets to do inserts, updates and queries. The first half is the same; only the packet handling differs.