Hiding data in DB2

There are many reasons why you might want to hide some of the data in your database from certain users.

In this article, we'll talk about fine-grained access control -- hiding data that is in fact present in the database, but should not be seen by certain users. Static data hiding (physically removing that data from a copy of the database) is a whole different approach (contrasted in this article ).

Fine-grained access control can be done in three places:

Each of these approaches has its advantages and disadvantages. Let's look at each one and contrast them.

DB2's fine-grained access control capabilities

DB2 has excellent fine-grained access control capabilities, among the best in the RDBMS world.

There are two (sometimes complementary) ways to do fine-grained access control in DB2:

There is some overlap between these two approaches. The biggest difference is that RCAC is code-based: it involves the execution of SQL code to determine whether a piece of data is accessible and how, whereas LBAC is data-based: it relies on declarations and labels, without any code.

Row and column access control (RCAC)

RCAC, as its name suggests, consists of two parts:

By combining these, you can achieve most common access control requirements.

Row permissions are expressed as pieces of code that define which rows are visible. For instance:

CREATE PERMISSION CustomersEuropeOnly ON DEMO.CUSTOMERS

    FOR ROWS WHERE

        VERIFY_GROUP_FOR_USER(SESSION_USER, 'SALES_EUROPE') AND

        COUNTRY IN ('BE', 'DK', 'ES', 'FR', 'GR', 'HN', 'IT')

    ENFORCED FOR ALL ACCESS

    ENABLE;

This specifies that database users who belong to group SALES_EUROPE can see the rows in table DEMO.CUSTOMERS where the COUNTRY column has one of the specified values. These values could be looked up or computed from the database, and permissions can call SQL functions including secure user-defined functions, so there is quite a bit of flexibility.

Additional permissions can be added to the same table, and DB2 will automatically apply all of them and show only the rows that satisfy at least one permission.

This type of permission covers all access to the table: select, insert, update and delete. For instance, if the user tries to insert a row that does not satisfy at least one permission, the insert will fail. Same, obviously, for update and delete.


Column masks are also expressed as pieces of code that specify how a column should be presented to the client. For instance:

CREATE MASK CUSTOMERS_LAST_NAME_MASK ON DEMO.CUSTOMERS

  FOR COLUMN LAST_NAME RETURN

    CASE WHEN (VERIFY_ROLE_FOR_USER(SESSION_USER, 'ENGINEER') AND COUNTRY IN ('BE', 'ES'))

    THEN SUBSTR(LAST_NAME, 1, 1) || 'XXXX'

    ELSE LAST_NAME

END

  ENABLE;

This specifies that, for database users who have the role ENGINEER, the column LAST_NAME in table DEMO.CUSTOMERS should be masked, with just the first character showing, followed by XXXX, but only for customers in certain countries.

You can only have one mask per column, so things can get a bit ponderous if you have a lot of logic. Again, though, your logic can call functions, so this is quite flexible.

Label-based access control (LBAC)

Unlike permissions and masks in RCAC, which are defined by code, label-based access control is declarative.

It works by creating a system of label components, which can be organized as unordered sets, e.g.:

CREATE SECURITY LABEL COMPONENT COMPARTMENT

        SET {'RESEARCH', 'MANAGEMENT', 'FINANCE'};

or as ordered arrays, where higher levels subsume lower levels, e.g.:

CREATE SECURITY LABEL COMPONENT CLASSIFICATIONLEVEL

        ARRAY [ 'TOP SECRET', 'SECRET', 'CONFIDENTIAL' ];

or in trees, where again higher levels subsume lower levels:

CREATE SECURITY LABEL COMPONENT REGION

TREE (

    'WORLD' ROOT,

        'AMERICAS' UNDER 'WORLD',

            'CANADA' UNDER 'AMERICAS',

            'USA' UNDER 'AMERICAS',

                'CALIFORNIA' UNDER 'USA',

                'TEXAS' UNDER 'USA',

                etc...

This is quite powerful, though managing all these can quickly become a challenge (of course, there are commercial products to help with that challenge).

Label components can then be assembled into labels, e.g.:

CREATE SECURITY LABEL POLICY2024.RESEARCH_TOPSECRET_USA

   COMPONENT COMPARTMENT 'RESEARCH',

   COMPONENT CLASSIFICATIONLEVEL 'TOP SECRET',

   COMPONENT REGION 'USA';

These labels can then be used to protect individual columns:

CREATE TABLE DEMO.CUSTOMERS (

    ID            INT NOT NULL,

    FIRST_NAME    NVARCHAR(100) NOT NULL,

    LAST_NAME     NVARCHAR(100) NOT NULL,

    COUNTRY       CHAR(2) NOT NULL,

    SALARY        NUMERIC(12,2) NOT NULL SECURED WITH FINANCE_CONFIDENTIAL_AMERICAS,

    SECLABEL      DB2SECURITYLABEL NOT NULL,

    PRIMARY KEY (ID)

)

SECURITY POLICY POLICY2024;

as well as individual rows:

INSERT INTO DEMO.CUSTOMERS(ID, FIRST_NAME, LAST_NAME, COUNTRY, SECLABEL) VALUES

(21, 'Wernher', 'von Braun', 'US', SECLABEL_BY_NAME('POLICY2024', 'RESEARCH_TOPSECRET_USA'));

Finally, users, roles and groups can be given access to the labels, for reading, writing, or both:

GRANT SECURITY LABEL POLICY2024.RESEARCH_TOPSECRET_USA

TO USER TJONES FOR READ ACCESS;

There is even more to LBAC, such as an exemption mechanism allowing 

This is an elaborate system, which has complex rules governing things like security upgrades, default behaviors, and everything else security auditors like to poke around. It is powerful enough to implement the vast majority of data hiding needs, though scaling this system to cover large, complex requirements with many dimensions can be a challenge.

The pros and cons of DB2's fine-grained access control

Pros

Cons

Many of the advantages and disadvantages of this solution come from the fact that it's a centralized solution, which is great for some scenarios, and less great for others.

Proxy-based fine-grained access control

A very different approach consists of hiding data outside of the database, between the database server and the database client, using a programmable proxy.

This is usually a more decentralized approach: some trusted clients may access the database directly, whereas some less trusted clients may have to go through the proxy. And of course, you may have different proxies for different clients.

Because this type of access control is done outside of the database, it is a less powerful approach in some ways, but it does have its advantages.


Restricting or rewriting queries

A proxy can typically intercept any SQL command on its way to the database, and potentially change it, or reject it outright.

The simplest scenario is to have a list of authorized queries, and any time a client issues a query that is not on the list for that client, the query is rejected.

A more complex, but more powerful approach, consists of rewriting the query so that it satisfies our data hiding requirements. This rewriting can of course vary depending on who the user is, where the call is coming from, any other other useful conditions.

For instance, a proxy can easily intercept a query like:

SELECT * FROM DEMO.CUSTOMERS WHERE COUNTRY = 'FR'

and rewrite it to:

SELECT ID, FIRST_NAME, SUBSTR(LAST_NAME, 1, 1) || 'XXXX' AS LAST_NAME, COUNTRY, 0 AS SALARY

FROM DEMO.CUSTOMERS WHERE COUNTRY = 'FR'

before it gets sent to the database. This approach works best for known queries, although it is often possible to analyze the query dynamically and modify it as needed.


Modifying result sets

When the database responds to a query, the result set goes through the proxy, which can manipulate it as needed.

For example, using Gallium Data as a proxy, a trivial result set filter might look like:

if (row.COUNTRY === 'FR') {

    row.LAST_NAME = row.LAST_NAME.substring(0, 1) + "XXXX";

    row.SALARY = 0;

}

This extra processing will add a small amount of latency, but that work needs to be done somewhere, and by doing it in the proxy, you're saving the database from having to do it.

In this context, the proxy has complete control over result sets: it can modify rows, remove entire rows, or even insert new rows into the result sets.

The proxy can do much more than just manipulate queries and result sets: it can also log access to certain data, notify other systems, aggregate data from multiple sources, etc...

The pros and cons of proxy-based data hiding

Pros

Cons

Hiding data in the client

At the opposite end of the spectrum from access control in the database is access control in the database client(s).

This can be done either outside of the application, typically using special database drivers, or in the application itself.

This approach became relevant with the arrival of client-server applications, but it has now fallen out of favor because of its significant disadvantages:

1 - The clients must be trusted

This is perhaps the biggest objection: because the (supposedly) hidden data does in fact find its way to the client, it's much more difficult to ensure that it cannot be seen by a malicious user. The software doing the access control is at the leaf node, and the user presumably has some level of control over the hardware: that's a recipe for shenanigans.

2 - Management is difficult

In addition, the client applications have to be modified when security requirements change, and managing and upgrading the clients can be a challenge.

Side note: if you think hiding data in the client is weird, it turns out that Microsoft SQL Server's data classification system actually relies on the clients to hide classified data (this can be rectified with the judicious use of a proxy).

Conclusion

The easiest way to look at this is to ask yourself what will serve your requirements best: a centralized solution, or a decentralized solution?