Doing more with Drush sql-sanitize

The sites that we build very often end up containing large amounts of user data, much of which allows individual people to be identified. Not just names, but social media accounts, telephone numbers, addresses, occupation details etc. In the UK, user data comes under the Data Protection Act, particularly personal data, which the Act defines as "data which relate to a living individual who can be identified".

So, your site contains lots of personal data, but you want to give a database dump to a Drupal developer so that she can code some new feature for you. How can you handle this without getting on the wrong side of the Information Commissioner's Office (ICO)? You could manually delete all the personal data before you created the database dump, but then your developer won't have any realistic data to work with. Fortunately, there are ways of sanitising the personal data, a process that the ICO refers to as anonymisation. The ICO has a useful code of practice that is worth reading – Anonymisation: managing data protection risk code of practice.

The Drupal command line tool, Drush, comes with a command to sanitise the users table, drush sql-sanitize. While this is better than nothing, it only sanitises user passwords and email addresses. Obviously, it doesn't know about all of the additional fields storing personal data that you added when you built your site.

Note: Drush sql-sanitize will sanitise data in the current database – you do not want to play with this on a live or production server!

With that caveat out of the way, assume I have a site with profile fields set up for a user to enter a LinkedIn url (field_profile_linkedin_page), a telephone number (field_profile_telephone_number), and a twitter username (field_profile_twitter_handle). How can I tell Drush to sanitise these fields too?

Thankfully, Drush is kind enough to provide a hook, hook_drush_sql_sync_sanitize(), which makes this straightforward. Assuming you’ve got a custom module that you’re using already, my_module, you would add a drush include file, /my_module/my_module.drush.inc, containing:

The database table storing the LinkedIn field data is field_data_field_profile_linkedin_page, and the column containing the actual url string is field_profile_linkedin_page_value. The query will step through each entry and replace the url with the output of CONCAT('http://uk.linkedin.com/user-name/', FLOOR(100000 + (RAND() * 900000))).

CONCAT is the SQL command to concatenate strings together, RAND returns a random floating point number between 0 and 1, and FLOOR returns the largest integer value not greater than the value it’s passed. Putting that together, the real LinkedIn urls are replaced with ‘http://uk.linkedin.com/user-name/<random number between 100,000 and 1,000,000>’.

LinkedIn seem to use a few different url formats, and this is just the one that I picked as an example.

Don’t forget that Drupal stores revisions of data, so as well as sanitising the field_data_field_profile_linkedin_page table, we need to repeat the query on the field_revision_field_profile_linkedin_page table too. This is easy to overlook!

The first parameter isn’t actually used for anything at the moment. The second parameter is the text that Drush will display when the the sql-sanitize command is run, and the third parameter is the SQL query that you want to run.

Moving on to the telephone number field, we need a new sanitisation query:

This is very similar to the LinkedIn query above, replacing all user telephone numbers with ‘+44<random number between 1000000000 and 1100000000>’. Again, don’t forget to do the same on the corresponding revision table before calling drush_sql_register_post_sync_op().

The final example is the twitter username field. This one is a little different as instead of a random number, I’ll generate a random 8 character string. One way to do this is to use the SUBSTRING, MD5 and RAND SQL functions.

$twitter_query="UPDATE field_data_field_profile_twitter_handle SET field_profile_twitter_handle_value = SUBSTRING(MD5(RAND()) FROM 1 FOR 8);";

This generates an MD5 hash of a random value, which we then take the first 8 characters from. Yet again, don’t forget to do the same thing on the corresponding revision table!

With our hook_drush_sql_sync_sanitize() implementation complete, we can try running it with Drush sql-sanitize. Don’t forget – you do not want to be running this on a production server!

Not the best twitter usernames in the world, but adequate for testing.

Now that you’ve got a sanitised database, you can export it using Drush sql-dump. Before you do that, though, you should be aware that Drupal’s cache tables in the database can also potentially leak information. To protect against this, you can instruct Drush sql-dump to export just the structure of certain tables, omitting the actual data.

You can specify these structure-only tables directly on the command line, using the --structure-tables-list option (see the sql-dump documentation for more information). Alternatively, you can add a list of tables to the $options['structure-tables'][‘your_ref’] array in your drushrc.php file. That way, you don’t have to type the list each time, just use the --structure-tables-key=your_ref option with sql-dump.

So, with the addition of a small chunk of code, we can generate a completely anonymised database dump very easily. If you keep this in mind during development and add a sanitisation query every time you add a new field which stores personal data, it’s even easier.

We’re Code Enigma

We’re one of the most experienced Drupal teams in Europe, best known for our work on large, technically challenging projects for all kinds of clients.

Our team is passionate about Drupal and open source software. Our whole company spends at least four weeks per year working on Drupal modules or other open source projects. We’re also strongly committed to putting design first, taking a mobile-first, content-out approach to creating websites. This ensures that the sites we build combine the power of Drupal with best practice design and development.