Geolocation with BigQuery: De-identify 76 million IP addresses in 20 seconds

We published our first approach to de-identifying IP addresses four years ago, in "GeoIP geolocation with Google BigQuery", and it’s time for an update that includes the best and latest BigQuery features, like using the latest SQL standards, dealing with nested data, and handling joins much faster.

Using BigQuery lets you explore large datasets to find new and meaningful insights.

To comply with current policies and regulations, you might need to de-identify the IP addresses of your users when analyzing datasets that contain personal data.

For example, under GDPR, an IP address might be considered PII or personal data.

Replacing collected IP addresses with a coarse location is one method to help reduce risk, and BigQuery is ready to help.

Let’s see how.

How to de-identify IP address data

For this example of how you can easily de-identify IP addresses, let’s use:
- 76 million IP addresses collected by Wikipedia from anonymous editors between 2001 and 2010
- MaxMind’s GeoLite2 free geolocation database
- BigQuery’s improved byte and networking functions NET.SAFE_IP_FROM_STRING() and NET.IP_NET_MASK()
- BigQuery’s new superpowers that deal with nested data, generate arrays, and run incredibly fast joins
- The new BigQuery Geo Viz tool that uses Google Maps APIs to chart geopoints around the world

Let’s go straight into the query.

Use the code below to replace IP addresses with the generic location.

Top countries editing Wikipedia

Here’s the list of countries where users are making edits to Wikipedia, followed by the query to use:

#standardSQL
# replace with your source of IP addresses
# here I'm using the same Wikipedia set from the previous article
WITH source_of_ip_addresses AS (
  SELECT REGEXP_REPLACE(contributor_ip, 'xxx', '0') ip, COUNT(*) c
  FROM `publicdata.

14 GB processed)

Top cities editing Wikipedia

These are the top cities where users are making edits to Wikipedia, collected from 2001 to 2010, followed by the query to use:

# replace with your source of IP addresses
# here I'm using the same Wikipedia set from the previous article
WITH source_of_ip_addresses AS (
  SELECT REGEXP_REPLACE(contributor_ip, 'xxx', '0') ip, COUNT(*) c
  FROM `publicdata.

201806_geolite2_city_ipv4_locs` USING (network_bin, mask))
WHERE city_name IS NOT NULL
GROUP BY city_name, geoname_id
ORDER BY c DESC
LIMIT 5000

Exploring some new BigQuery features

These new queries are compliant with the latest SQL standards, enabling a few new tricks that we’ll review here.

IP_NET_MASK(4, 24)

And that gets an answer: this IP address seems to live in Antarctica.
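To make the masking step concrete, here is a minimal Python sketch of what applying a 24-bit network mask to an IPv4 address does. It is a model of the idea behind NET.SAFE_IP_FROM_STRING(ip) & NET.IP_NET_MASK(4, 24), not BigQuery code; the helper name mask_ipv4 and the sample address (taken from the reserved documentation range) are assumptions made for illustration.

```python
import ipaddress


def mask_ipv4(ip: str, prefix_len: int) -> str:
    """Keep the first `prefix_len` bits of an IPv4 address and zero the rest."""
    addr = int(ipaddress.IPv4Address(ip))
    # Build a 32-bit value with `prefix_len` leading 1-bits,
    # similar in spirit to NET.IP_NET_MASK(4, prefix_len).
    mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
    return str(ipaddress.IPv4Address(addr & mask))


# Hypothetical sample address, not one from the Wikipedia dataset.
print(mask_ipv4("192.0.2.77", 24))  # 192.0.2.0
```

With a 24-bit mask the last octet is dropped, so every address in the same /24 block collapses to the same network address, which is what makes the later lookup against a table of networks possible.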

Scaling up

That looked easy enough, but we need a few more steps to figure out the right mask and joins between the GeoLite2 table (more than 3 million rows) and a massive source of IP addresses.

And that’s what the next line in the main query does:

SELECT *, NET.SAFE_IP_FROM_STRING(ip) & NET.IP_NET_MASK(4, mask) network_bin
FROM source_of_ip_addresses, UNNEST(GENERATE_ARRAY(9,32)) mask

This is basically applying a CROSS JOIN with all the possible masks (numbers between 9 and 32) and using these to mask the source IP addresses.
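The effect of that cross join can be sketched in plain Python: each source IP is expanded into one candidate row per mask length. This is only a model of the logic, assuming a hypothetical helper named candidate_networks and a sample address from the documentation range; in BigQuery the network_bin column holds BYTES, while the sketch uses dotted strings for readability.

```python
import ipaddress


def candidate_networks(ip: str, min_mask: int = 9, max_mask: int = 32):
    """Return one (mask, network_address) pair per prefix length.

    Mirrors the idea of joining each IP with GENERATE_ARRAY(9, 32).
    """
    addr = int(ipaddress.IPv4Address(ip))
    rows = []
    for mask in range(min_mask, max_mask + 1):
        bits = (0xFFFFFFFF << (32 - mask)) & 0xFFFFFFFF
        rows.append((mask, str(ipaddress.IPv4Address(addr & bits))))
    return rows


rows = candidate_networks("192.0.2.77")  # hypothetical sample address
print(len(rows))   # 24 candidate rows for a single source IP
print(rows[0])     # (9, '192.0.0.0')
print(rows[-1])    # (32, '192.0.2.77')
```

Each source row is multiplied by 24, which sounds expensive, but it turns a range lookup into an exact-match problem on the pair (network address, mask).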

And then comes the really neat part: BigQuery manages to handle the correct JOIN in a massively fast way:

USING (network_bin, mask)

Here BigQuery picks up only one of the masked IPs: the one where the masked IP and the network with that given mask match.
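As a rough illustration of why an equality join on (network_bin, mask) finds the right network, the sketch below uses a Python dictionary as a stand-in for a hash join. The tiny GEO_TABLE, its place names, and the lookup helper are all hypothetical; it also assumes the networks in the table do not overlap, so at most one candidate can match.

```python
import ipaddress
from collections import Counter

# Hypothetical stand-in for the geolocation table:
# (network address, mask) -> coarse location. The names are made up.
GEO_TABLE = {
    ("192.0.2.0", 24): "Exampleland",
    ("203.0.112.0", 23): "Testonia",
}


def lookup(ip: str):
    """Try every mask from 9 to 32 and return the location of the matching network."""
    addr = int(ipaddress.IPv4Address(ip))
    for mask in range(9, 33):
        bits = (0xFFFFFFFF << (32 - mask)) & 0xFFFFFFFF
        key = (str(ipaddress.IPv4Address(addr & bits)), mask)
        if key in GEO_TABLE:  # exact-match probe, like a hash join
            return GEO_TABLE[key]
    return None  # no network in the table covers this IP


source_ips = ["192.0.2.77", "192.0.2.9", "203.0.113.5", "10.1.2.3"]
counts = Counter(loc for loc in map(lookup, source_ips) if loc is not None)
print(counts.most_common())  # [('Exampleland', 2), ('Testonia', 1)]
```

Only the aggregated counts per location leave the function, so the individual IP addresses never appear in the result.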

If we dig deeper, we’ll find in the execution details tab that BigQuery did an “INNER HASH JOIN EACH WITH EACH ON”, which requires a lot of shuffling resources, while still not requiring a full CROSS JOIN between two massive tables.

Go further with anonymizing data

This is how BigQuery can help you to replace IP addresses with coarse locations and also provide aggregations of individual rows.

This is just one technique that can help you reduce the risk of handling your data.

GCP provides several other tools, including Cloud Data Loss Prevention (DLP), that can help you scan and de-identify data.

You now have several options to explore and use datasets that let you comply with regulations.