For a one-time project, I need to geocode a few thousand addresses. In the past I've used various online resources for this kind of thing (e.g., Google Maps API), but the addresses I'm working with have to be kept confidential, which means no sending them over the Internet unless there's some iron-clad guarantee of privacy. What other options do I have?

I would open a quick discussion with RTD; I know they have a powerful GIS in that area and could likely give you direct support. Otherwise, Geocoder::US is a great option. You can run it internally without risking your data going across the wire.
– D.E.Wright Apr 27 '11 at 19:48

If using the Google Geocoding API or another online source is your preference rather than local options, I'd suggest looking into the Tor Project (easily installed through the bundle called 'the Vidalia Bundle').

Tor protects you by bouncing your communications around a distributed
network of relays run by volunteers all around the world: it prevents ...
the sites you visit from learning your physical location.

Combined with injecting random addresses and using SSL (https) to encrypt communications to the endpoint (make sure you're also doing this), I can't think of a more secure way to geocode remotely. Whatever geocoding service you're using won't be able to identify where the requests ultimately came from, and with https no one else will, either. Note: don't use a geocoding service that requires an API key for this, or you'll no longer be anonymous. (Google doesn't require an API key anymore.)

A side 'benefit' of this procedure is that you'll no longer be restricted to a fixed number of geocoding requests, as your requests will look like they're coming from multiple IP addresses. However, I do not recommend or endorse abusing these lovely free APIs! If the API limits the request rate, your rate will still be limited (and transmission over Tor is quite a bit slower than connecting directly anyway).

Case study in Python -- Once you've installed the Vidalia Bundle and have the proxy running on 127.0.0.1:8118 (the default), in Python 2.7 or higher you can set up an https urllib2 proxy using:
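A minimal sketch of such a proxy setup, written against Python 3's urllib.request (the successor to urllib2) rather than Python 2.7. It assumes the local proxy listens on 127.0.0.1:8118 as described above; the function name build_tor_opener is hypothetical and chosen only for illustration.

```python
import urllib.request

# Assumption: the local proxy that forwards traffic into Tor listens here.
PROXY_ADDRESS = "127.0.0.1:8118"


def build_tor_opener(proxy_address=PROXY_ADDRESS):
    """Return an opener that routes https requests through the local proxy."""
    # The key is 'https' (not 'http') so that encrypted requests are proxied.
    proxy = urllib.request.ProxyHandler({"https": proxy_address})
    return urllib.request.build_opener(proxy)


opener = build_tor_opener()
# Requests made with opener.open("https://...") now go through the proxy.
# Nothing is sent until opener.open() is actually called.
```

Only https URLs are proxied by this opener, so the geocoding URL you request should start with https as well.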

Note that urllib2 proxies don't work with https until at least Python 2.7 or so, so this method only works with recent Python versions. Make sure you've got 'https' (not 'http') in both places in the example above. I've only tested it with Python 2.7.1.

Vidalia changes your identity / apparent IP address origin every 10 minutes, but if you run into slow rates or other problems (quota exceeded errors), or if you are especially paranoid and want to change your identity more frequently, you can change your Tor identity using the Python code here (slightly modified below). You'll need to change the Tor password to a static one (rather than a randomly generated one) in the Vidalia settings. You might also need to restart Vidalia after making these changes.
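A minimal sketch of requesting a new identity through Tor's control port, in Python 3. It is not the code referred to above. It assumes password authentication with the static password you configured and the control port at 127.0.0.1:9051; check your own Tor settings. The function names are hypothetical.

```python
import socket


def control_commands(password):
    """Build the control-protocol lines that authenticate and ask for a new identity."""
    # Escape backslashes and double quotes inside the quoted password.
    escaped = password.replace("\\", "\\\\").replace('"', '\\"')
    return [
        'AUTHENTICATE "{}"\r\n'.format(escaped).encode("ascii"),
        b"SIGNAL NEWNYM\r\n",
    ]


def renew_tor_identity(password, host="127.0.0.1", port=9051, timeout=10.0):
    """Send the commands to the control port; raise if Tor rejects one of them."""
    # Assumption: host and port match the control port configured in Tor.
    with socket.create_connection((host, port), timeout=timeout) as conn:
        for command in control_commands(password):
            conn.sendall(command)
            reply = conn.recv(1024)
            # Tor answers "250 OK" when a command succeeds.
            if not reply.startswith(b"250"):
                raise RuntimeError("Tor control port refused command: %r" % reply)
```

Calling renew_tor_identity("your-static-password") between batches of requests would ask Tor to use fresh circuits for new connections.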

That doesn't keep the addresses confidential, does it? The physical location of the machine sending the query is irrelevant (not confidential) here.
– underdark♦ Jun 7 '11 at 10:43

For most purposes, the physical location of the machine sending the query is very important in protecting the anonymity of the data being sent to a geocoding service. Say that a computer in the Institute for the Study of X sends a geocoding request for 1000 addresses. One could (theoretically at least) identify those addresses as containing individuals with X disease. In contrast, addresses mixed in with thousands of random requests from many users, and coming from multiple IP addresses that don't correspond to any one user (the Tor situation) are not identifiable with respect to purpose.
– Victor Van Hee Jun 7 '11 at 10:56

Sending data to Google (via Tor or anything) is a fundamental privacy problem. Google does not offer the "iron-clad guarantee of privacy".
– Nicolas Raoul May 15 '14 at 3:44

One option is to use Geo-Coder-US, which is an open-source Perl module that uses the US Census' Tiger/Line data to geocode. I haven't used it personally, but it looks excellent. The link above includes a nice overview and a link to a version that already has the necessary Census files assembled.

To preserve privacy, you could spread your queries across several providers, separating them into sets that are less likely to be linked to your activities. You could also inject noise by adding real addresses from an online phone directory to your list. And I suggest you run this script from various places, such as internet cafés, combining the results at the end.
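The noise-injection idea can be sketched in Python roughly as follows. This is an illustration, not a vetted privacy scheme: the function names are hypothetical, and the actual geocoding call is left out because it depends on the provider.

```python
import random


def mix_with_decoys(real_addresses, decoy_addresses, seed=None):
    """Return a shuffled batch of real and decoy addresses, plus the set of real ones."""
    rng = random.Random(seed)
    batch = list(real_addresses) + list(decoy_addresses)
    rng.shuffle(batch)  # hide which addresses belong together
    return batch, set(real_addresses)


def keep_real_results(results, real_set):
    """Drop geocoding results that belong to decoy addresses."""
    return {addr: coords for addr, coords in results.items() if addr in real_set}
```

The batch goes to the geocoder; the set of real addresses stays on your machine and is used afterwards to discard the decoy results.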

The only way to truly preserve your privacy is to download the full set of data and run your script against it. There's the Nominatim system from OpenStreetMap. It is not complete for all the cities, but you could use that to reduce the list of addresses sent to other providers.

I thought that the code behind http://geocoder.us/ was available for download such that you could get it and a TIGER data file and more-or-less set up your own local install. I don't see that immediately upon revisiting that site, but you may want to look around a bit.

Why not use the same geocoders you've used before, just remove all the other meta data?

Don't send over "Secret Location; 123 Main Street, Some City", just send over "123 Main Street, Some City". The addresses are public information anyway. Just don't tell the geocoder that you have a list of nuclear bases or all NSA locations. The results will be in table format, so you can then re-attach all your other secret metadata.
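A rough Python sketch of that split-and-rejoin step, assuming each record is a dict with an "address" key and that the geocoder's results can be keyed by the same row id. The field names and function names are hypothetical.

```python
def split_records(records):
    """Separate the address to send out from the confidential fields kept locally."""
    outgoing = {}
    private = {}
    for record_id, record in enumerate(records):
        outgoing[record_id] = record["address"]
        private[record_id] = {k: v for k, v in record.items() if k != "address"}
    return outgoing, private


def reattach(outgoing, private, geocoded):
    """Join geocoder output (row id -> coordinates) back onto the local fields."""
    merged = []
    for record_id, address in outgoing.items():
        row = dict(private[record_id])
        row["address"] = address
        row["coordinates"] = geocoded.get(record_id)  # None if no match came back
        merged.append(row)
    return merged
```

Only the values in outgoing ever leave your machine; the private fields are joined back on by row id once the results return.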

This is how I feel about the situation. This is not how my employer feels about the situation. To give the benefit of the doubt, if you get a list of addresses from a recognizable IP address, it's not that much of a stretch to imagine that someone could figure out what the addresses relate to.
– Matt Parker Jul 23 '10 at 5:30

@Matt That's one thing consultants are good for :-). Another option is to mix miscellaneous addresses in with the ones you send over. Sure, it increases costs, but they are so low anyway ...
– whuber♦ Apr 29 '11 at 21:10

The search on the OpenStreetMap homepage is a system called Nominatim. You can call it as a geocoding service (if you're gentle), but it's all open source, so you can set it up on your own server too.

It uses OpenStreetMap data loaded into a PostGIS database. It's relatively new and still under development, and the process of setting it up and loading the data is not all that straightforward and is quite resource hungry. ...but it's free and open!
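Once a Nominatim instance is running on your own server, a query is just an HTTP request to its search endpoint. A small Python sketch of building such a request URL, assuming the instance is reachable at http://localhost:8080 (a made-up address; use wherever your install listens). The function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: a self-hosted Nominatim instance reachable at this base URL.
LOCAL_NOMINATIM = "http://localhost:8080"


def nominatim_search_url(address, base_url=LOCAL_NOMINATIM):
    """Build a free-form search URL that asks for a single JSON result."""
    query = urlencode({"q": address, "format": "json", "limit": 1})
    return "{}/search?{}".format(base_url, query)
```

Fetching that URL against a local install keeps the addresses on your own network.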

Most of the answers are steering you toward a local database. While that would certainly work, you must also consider whether geocoding is your core domain. Is that what you are good at? If so, you probably already have the data they are recommending. If not, AND YOU WANT IT TO BE, then you should download the data and just do it locally. However, if you just need to solve a problem and don't want to put in countless hours ramping up for production, there are still options to do it through an API without compromising security.

First, insist on HTTPS, because the data needs to be secure on the way to the API and on the way back to you. Second, make sure you send a POST request rather than a GET request. With POST, the address travels in the request body rather than in the URL, so the only thing that would hit the server log is the fact that an address verification and geocoding request was made at a certain time and from a certain IP. Neither the address submitted nor the address returned would be stored to disk or written to a server log. It doesn't get much more secure than that.
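A hedged Python sketch of those two points: the request is refused unless the URL is HTTPS, and the addresses go in the POST body rather than the query string. The endpoint URL and the JSON payload shape are made up for illustration; a real provider's documentation defines both.

```python
import json
import urllib.request

# Hypothetical endpoint; substitute the provider's documented HTTPS URL.
API_URL = "https://geocoder.example.com/v1/geocode"


def build_geocode_request(addresses, url=API_URL):
    """Build a POST request that carries the addresses in the body, not the URL."""
    if not url.lower().startswith("https://"):
        raise ValueError("refusing to send addresses over a non-HTTPS URL")
    # Assumed payload shape: a JSON object with an "addresses" list.
    body = json.dumps({"addresses": list(addresses)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it. Because the addresses never appear in the URL, they stay out of URL-based access logs.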

So, while a local box would definitely be secure, it could require a lot of development to do what you need. Since the security concerns can be addressed, you might want to consider (again) the option of using an API.

I work for an address verification company that specializes in secure API geocoding -- SmartyStreets.