The Data Inventory, Part 2

In The Data Inventory, Part 1, we saw how a data inventory can be started with a single-table format in a spreadsheet. In this post we will use relational tables to modify and enhance that design, adding further information about our data: a global view of all our databases, plus a global view of our data items.

In addition to the Oracle schema that was summarized in Part 1, I have added the columns of a table of customers from SQLite (using the sample Chinook schema). This is our result:

Table and column name conventions differ across database products; SQLite names do not default to all-caps, as they do in Oracle

SQLite’s catalog display does not split datatype name, length, and precision into separate columns, as does Oracle

SQLite does not give us a quick-and-easy way to add information like the number of table rows

I have added a new column, DATABASE_ID, as the second column

I have added a new column, COLUMN_GLOBAL_ID, just after COLUMN_NAME

Regarding the first three items, the lesson is that different database products have differing conventions about how their catalog data is arranged. It might be helpful to have information like datatype, column width, and so forth, but maybe we don’t really need it for a data inventory geared to PII. After all, a person’s date of birth may be stored as a DATE datatype (Oracle) or just a formatted string (SQLite). It’s PII either way, so knowing the datatype is not critical.
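To see the difference concretely, we can ask SQLite for its column metadata directly. The sketch below uses Python's standard `sqlite3` module and a stand-in for Chinook's Customer table (with a real Chinook file you would connect to it instead of an in-memory database); note that SQLite hands back the declared type as one string rather than splitting name, length, and precision into separate columns the way Oracle's catalog views do.

```python
import sqlite3

# Stand-in for Chinook's Customer table; column definitions here are
# illustrative, not a full copy of the Chinook schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Customer (
        CustomerId INTEGER PRIMARY KEY,
        FirstName  NVARCHAR(40) NOT NULL,
        LastName   NVARCHAR(20) NOT NULL,
        Phone      NVARCHAR(24)
    )
""")

# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk).
# The "type" field is the whole declared type, e.g. 'NVARCHAR(40)'.
columns = conn.execute("PRAGMA table_info(Customer)").fetchall()
for cid, name, decl_type, notnull, default, pk in columns:
    print(name, decl_type)
```

Also notice there is no row-count column anywhere in this output, which is the third point above.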

Regarding DATABASE_ID, this is an ID-type value, one we make up as a reference to a row in another table, DATABASES. In this case the referenced table looks like this:

We now can capture our source database for each schema. Because databases generally have multiple schemas, this structure allows us to capture information about all the schemas in a particular database without having to duplicate the common information over and over. We are free to add useful information about each database, such as the exact version, its function (dev/test/prod), and so forth.
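A minimal sketch of this arrangement is below. The column names beyond DATABASE_ID (PRODUCT, VERSION, HOST, FUNCTION, and the inventory columns) are my assumptions for illustration, not necessarily those in the screenshots; the point is that two schemas in the same database share a single DATABASES row instead of duplicating the common information.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DATABASES (
        DATABASE_ID INTEGER PRIMARY KEY,
        PRODUCT     TEXT,   -- e.g. 'Oracle'
        VERSION     TEXT,
        HOST        TEXT,
        FUNCTION    TEXT    -- dev / test / prod
    );
    CREATE TABLE DATA_INVENTORY (
        SCHEMA_NAME TEXT,
        DATABASE_ID INTEGER REFERENCES DATABASES(DATABASE_ID),
        TABLE_NAME  TEXT,
        COLUMN_NAME TEXT
    );
    INSERT INTO DATABASES VALUES (1, 'Oracle', 'Express 11g', 'hostA', 'prod');
    -- two schemas in the same database point at one DATABASES row
    INSERT INTO DATA_INVENTORY VALUES ('HR',  1, 'EMPLOYEES', 'LAST_NAME');
    INSERT INTO DATA_INVENTORY VALUES ('CRM', 1, 'CONTACTS',  'PHONE');
""")

# Join back to DATABASES to recover the shared information.
rows = conn.execute("""
    SELECT i.SCHEMA_NAME, d.PRODUCT, d.HOST
    FROM DATA_INVENTORY i JOIN DATABASES d USING (DATABASE_ID)
""").fetchall()
```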

More to the point, we are now able to answer questions like:

What databases do we have on host X?

What Oracle Express 11g databases do we have?

and so forth. Depending on our needs, we can add more tables, such as a table of hosts or installed database software.
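Questions like these become simple SELECTs against the DATABASES table. A sketch, again assuming hypothetical PRODUCT, VERSION, and HOST columns:

```python
import sqlite3

# Minimal DATABASES table with assumed column names and sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DATABASES (
        DATABASE_ID INTEGER PRIMARY KEY,
        PRODUCT TEXT, VERSION TEXT, HOST TEXT
    );
    INSERT INTO DATABASES VALUES
        (1, 'Oracle', 'Express 11g', 'hostX'),
        (2, 'SQLite', '3.x',         'hostY');
""")

# "What databases do we have on host X?"
on_host_x = conn.execute(
    "SELECT PRODUCT, VERSION FROM DATABASES WHERE HOST = 'hostX'"
).fetchall()

# "What Oracle Express 11g databases do we have?"
oracle_11g = conn.execute(
    "SELECT DATABASE_ID, HOST FROM DATABASES "
    "WHERE PRODUCT = 'Oracle' AND VERSION = 'Express 11g'"
).fetchall()
```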

Our second new ID-type column is COLUMN_GLOBAL_ID, which refers to a new table, DATA_INVENTORY_GLOBAL, which looks like this:

With a table like this we can see where we have specific kinds of data, particularly PII or possible PII (the COMBO category). A phone number or a last name is always PII, regardless of technical details (what it’s called in the database, how it’s stored, etc.). We can always find our last-name data elements by searching for the global name “LAST_NAME”, and the same row gives us that element’s privacy classification. (Note that we have to choose the global-element names and assign privacy categories to them ourselves.)
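As a sketch, assuming DATA_INVENTORY_GLOBAL carries a GLOBAL_NAME and a PRIVACY_CATEGORY column (names and category values are my assumptions, since the screenshot is not reproduced here), the lookup is a one-line query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DATA_INVENTORY_GLOBAL (
        COLUMN_GLOBAL_ID INTEGER PRIMARY KEY,
        GLOBAL_NAME      TEXT NOT NULL,
        PRIVACY_CATEGORY TEXT NOT NULL   -- e.g. PII, COMBO, NONE
    );
    INSERT INTO DATA_INVENTORY_GLOBAL VALUES
        (1, 'LAST_NAME',    'PII'),
        (2, 'PHONE_NUMBER', 'PII'),
        (3, 'POSTAL_CODE',  'COMBO');
""")

# Find the privacy classification of the last-name element,
# regardless of what any one database happens to call the column.
row = conn.execute(
    "SELECT PRIVACY_CATEGORY FROM DATA_INVENTORY_GLOBAL "
    "WHERE GLOBAL_NAME = 'LAST_NAME'"
).fetchone()
```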

It’s helpful to be able to retrieve our PII data elements whenever we need to, such as to create reports for the DPA or DPO, to create impact assessments, and so forth.

Our software can also consult such a database before revealing data, in order to know what rules must be applied to the data (whether or not this piece of data has the ‘badge’ proposed in Part 1 to authorize its exposure). In principle, any code that potentially exposes PII can check the applicable badge for each data item every time that it is invoked. In a future post we will see an example of such functionality.
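A minimal sketch of such a pre-exposure check is below. The names here (the RULES dictionary, `may_expose`, the badge strings) are illustrative stand-ins, not the actual mechanism from Part 1; in practice the rules would be read from the DATA_INVENTORY_GLOBAL table rather than hard-coded.

```python
# Illustrative global rules: global element name -> privacy category.
RULES = {
    "LAST_NAME": "PII",
    "PHONE_NUMBER": "PII",
    "FAVOURITE_COLOUR": "NONE",
}

def may_expose(global_name: str, caller_badges: set) -> bool:
    """Allow exposure if the element carries no privacy category,
    or if the caller holds a badge matching that category."""
    # Unknown elements are treated as PII: fail safe.
    category = RULES.get(global_name, "PII")
    if category == "NONE":
        return True
    return category in caller_badges
```

Called at the point of exposure, this gives every application the same answer for the same data element, because the answer comes from the global rules rather than from application code.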

This strategy has several benefits, including:

privacy rules need not be hard-coded, but can be looked up

a data element’s rule can be changed once at the global level and then automatically picked up and followed by every application

if PII has a justification for exposure, this structure can be further augmented to show the justification reasons (e.g., data-subject consent)

the sensitivity of combo elements could be decided by algorithm, based on the actual content of the data (a technique to be illustrated in a future post)

this approach, using a global system to manage and propagate privacy rules, is part of a strategy of privacy by default/design (and will hopefully impress any auditors who come to visit)

We now have a mini-schema that looks like this:

A final caveat is in order. Your data inventory, while not itself constituting PII, must be well secured. If the database falls into the wrong hands it could provide a road map to your most sensitive data, and as such it is itself a security risk.
