Setting up a local copy of the OpenStreetMap database, kept up to date with minutely diffs

Here's my experience setting up a local copy of the OpenStreetMap database, kept up to date with minutely diffs. This is the raw data database, not a database suited for generating map tiles (that would be PostGIS).

I started this as part of a school project involving database research, in which I built a robot. I didn't want to hammer the main servers for data, so I wanted to set up a local copy to query against instead.

Note that this should be considered a read-only database. When I submit changes, I submit them directly to the main database. You get into some difficult consistency issues if you apply updates to your local database and then try to extract and apply them to the master database.

Hardware

My hardware is pretty basic. I used my existing Linux machine, which was built from just cheap (but mostly reliable) components.

Intel Celeron dual-core 64-bit E3200 2.4GHz

2GB memory

1x 500GB SATA drive (main system)

1x 250GB SATA drive (database)

Software

Unix/Linux - You'll need some flavor of *nix to work on. Use whatever you're comfortable with... Fedora/RedHat, Ubuntu, etc. I use Gentoo: its package system (portage) installs almost all packages by automatically downloading and compiling source, rather than installing binaries.

PostgreSQL - OSM supports either MySQL or PostgreSQL as its backing database, but PostgreSQL is what's used currently, and so it's probably best to follow their lead to avoid unnecessary complications.

Osmosis - Osmosis is a Java tool for reading and writing XML changesets and data dumps.

The Rails Port - This is a port of the API to Ruby on Rails. It isn't necessary for most of this, except for being the only "official" way to load the schema into the database.

Step-by-step

Step 1: Download planet.osm

planet.osm is a dump of the database to the OSM XML format. These dumps are done about once a week, and we'll use this to load the database with something relatively recent.

Go to the planet.osm page, and pick a mirror to download from. Look for a file with a name like "planet-100915.osm.bz2", where the numbers represent the date in YYMMDD format. I ended up going with one that was only a week or two behind the present.

(Make sure that whichever planet file you use, you can find a state.txt in the minute-replicate/ directory that is close to the date of the dump. The state.txt files only seem to go back about 1 year from current.)

Start this downloading. On a cable connection capped at around 500KB/s, it took me about 8-9 hours.

Screen

You will probably want to use the "screen" command, which creates a virtual terminal that you can disconnect from and leave running. You can then use "screen -r" to reattach to the screen from any other login session. (Otherwise, you'd have to leave a terminal window connected, and if you happen to drop the connection... oops, there goes the shell and the download process. Trust me, screen will be your friend through this.)
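A typical screen workflow for the download looks like this (the session name "planet" is just an example):

```shell
# Start a named screen session
screen -S planet

# ...inside the session, start the long-running download...
# Detach with Ctrl-A d; the download keeps running in the background.

# Later, from any other login session, reattach:
screen -r planet

# List running sessions if you forget the name:
screen -ls
```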

Axel

Axel is a command-line file downloader like wget, but it takes multiple URLs of mirrors and downloads multiple chunks of the file simultaneously. If your download isn't maxing out your Internet pipe, you might consider using this.
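As a sketch (the mirror URLs below are placeholders, not real mirrors), axel takes a chunk count and one or more URLs pointing at the same file:

```shell
# -n 4: download in 4 simultaneous chunks, spread across the mirrors given
axel -n 4 \
  http://mirror-a.example.org/planet-100915.osm.bz2 \
  http://mirror-b.example.org/planet-100915.osm.bz2
```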

MD5

You want to make sure you got the file intact, so download the .md5 checksum and run

md5sum -c *.md5

to have it verify the integrity of the file. (Unfortunately, if it fails, your only choice is to redownload the whole thing, since you don't know which part of the file is bad.)

Step 2: Meanwhile: Install Postgres

Although OpenStreetMap can run on either MySQL or Postgres, currently everything runs on and is primarily targeted for Postgres, so that's what we'll use.

On my system (Gentoo Linux), this was just

emerge dev-db/postgresql-server

This is where I popped in a second hard drive I wasn't using and mounted it under /var/lib/postgresql/. Then I did the database set-up (on Gentoo, "emerge --config dev-db/postgresql-server").
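For reference, a minimal sketch of dedicating a spare drive to Postgres (the device name /dev/sdb1 and the ext4 filesystem are assumptions; adjust for your hardware):

```shell
# Format the spare drive and mount it where Postgres keeps its data
mkfs.ext4 /dev/sdb1                   # assumes the drive shows up as sdb
mount /dev/sdb1 /var/lib/postgresql

# Make the mount permanent (again, adjust device and options to taste)
echo '/dev/sdb1 /var/lib/postgresql ext4 noatime 0 2' >> /etc/fstab
```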

Tuning the server

By default, Postgres is not tuned for such a large database. I changed two settings that vastly improved its performance.

shared_buffers = 512MB # Set this to 25% of your total memory (2GB in my case)
bgwriter_delay = 2000ms # I don't know what a good value is, but since we're loading a lot of data, larger sounds better

You can set these in the usual manner for your distro or Postgres, probably in postgresql.conf in the database directory (e.g. /var/lib/postgresql/8.4/data/).

pg_top

There's a tool called pg_top which is basically "top" for Postgres. Install it with "emerge dev-db/pg_top".

Step 3: Meanwhile: Install Osmosis

Osmosis is a tool that can read and write a variety of OSM-compatible sources, including OSM XML files, databases, and diff mirrors.

Gentoo Portage already has a package for Osmosis, but it's out of date. Fortunately, you can create a local portage tree and add your own packages to it. I did this and created an ebuild for osmosis-0.36.

I also created a tiny patch file to fix a problem in the launcher script which was causing it to look in the wrong directory for some files.

Gentoo users: You may need some Java packages that aren't available in the main portage tree yet. Install layman, which manages portage overlay trees, and add the java overlay with "layman -a java-overlay".

Step 4: Meanwhile: Install Rails port

The Rails port is a port of the OSM API to Rails, a web framework built on Ruby. Although I intended to query the database directly, the Rails port is the only way to create the empty tables in the database. Thus, you have to have it.

Like with Osmosis, I created ebuilds for all the packages I didn't have. However (in the interest of laziness and flexibility) I actually just had the ebuild for the Rails port manage the dependencies, and took care of actually getting the source myself.

The wiki page for the Rails port is pretty complete, so these are largely duplicates of what's on there, with a few modifications and caveats. So, here we go.

Dependencies

According to the wiki page (and a little trial-and-error), the Rails Port requires:

>= Ruby 1.8.6

>= RubyGems 1.3.1

Either of:

>= Postgres 8.3 and bindings (we'll use this one)

>= MySQL 4.1.0 and bindings

RMagick

It also requires the following Gems to be installed before starting:

Rails 2.3.8.*

Rack 1.1.0.*

>= Ruby libxml bindings 1.1.1

Timecop

OAuth

I got complaints if I didn't use certain versions (e.g., Rails 2.3.9 wouldn't work, it wanted 2.3.8 instead), and these may change depending on the version of the Rails port you end up with.
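If you install the gems by hand, pinned versions can be requested with gem's -v flag. (The version numbers below are taken from the list above; they may differ for your checkout of the Rails port.)

```shell
gem install rack -v 1.1.0
gem install rails -v 2.3.8
gem install libxml-ruby         # the Ruby libxml bindings
gem install timecop
gem install oauth
gem install rmagick
```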

Installation

Configuration

In the config/ directory, we have a couple things to do. You need a database.yml file, which doesn't exist yet. Copy postgres.example.database.yml (or mysql.example.database.yml). Likewise, you need to create an application.yml file from example.application.yml (you can probably leave this unmodified).
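Concretely, the copying looks like this, run from the root of the Rails port checkout (the directory name rails_port is an assumption; use wherever you put it):

```shell
cd rails_port   # path to your Rails port checkout

# Database config: copy the Postgres example, then edit in your credentials
cp config/postgres.example.database.yml config/database.yml

# Application config: the example usually works as-is
cp config/example.application.yml config/application.yml
```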

As per the README in the db/ directory, there's some code we need to load in to PostgreSQL.

Note: I'm not sure whether this is necessary if you're only running PostgreSQL, and not PostGIS.

Hacks/Fixes

I had some issues running the Rails port. It always complained about an error in config/i18n.rb. I ended up deleting it (rather, renaming it to "i18n.rb.ignore" just in case I needed it) and everything worked fine.

It also didn't like my languages.yml file. I searched for an older version (add link later) from git and copied it in; that seemed to fix things.

Step 5: Initialize database

From the Rails port, run "rake db:migrate". This will only take a minute.

Vacuuming

Databases require periodic maintenance to ensure that the data is packed in the most efficient format, and that not too much free space is being wasted. They also periodically collect statistics on the tables which guides their decisions about how to deal with data.

It doesn't hurt to do this after you create the tables. Go into Postgres ("psql -h hostname -U username database") and run "VACUUM ANALYZE;". If all you have in Postgres is the OpenStreetMap database you just created, it will be done in under a minute.

Step 6: Recompress database

We're told that Osmosis is much slower at reading bz2 files than gz files. (Also, when I tried to read the bz2 file, it complained that it was malformed.) So, we recompress it. The gzip version will take up 15GB (plus the 12GB if you keep the bzip2 version around).

bunzip2 -c planet-100915.osm.bz2 | gzip -9 > planet-100915.osm.gz

I didn't time this, but I think it took about an hour. At maximum compression, it's CPU-bound on gzip, so it probably doesn't matter if both files are on the same physical drive.

If you have 160GB available, you could just uncompress it and deal with it raw. (If so, you probably want pbzip2, which is multi-threaded and will probably be faster.) I don't have that much space. :-)

Step 7: Excerpt that part of the database you want

This step is strictly optional, but... the whole planet will take up more than 300GB once it's loaded into the Postgres server. And unless you have a pretty beefy machine, it will take you several weeks to import.

Nah, you didn't want the whole planet anyway.

Osmosis has --bounding-box and --bounding-polygon options. You can go to http://downloads.cloudmade.com/ and get a .poly file for the continent/country/state you want, and use that to trim the dataset.

Or, you can just set a bounding box. You'll get a little extra, but who cares. I went with this option because the Cloudmade download site happened to be down when I was doing this. Just drop by Wikipedia and look up the latitude and longitude of the thing you're looking for.

Make sure to include the clipIncompleteEntities=true option. By default this is false, and Osmosis will leave references in ways to nodes that weren't in the bounding box. Several hours into your data import, Postgres will fail with a foreign key constraint violation because of this, and you'll have to start over.
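As a sketch, here's a bounding-box extract (the coordinates and filenames are made up; substitute your own, and note the clipIncompleteEntities=true option):

```shell
# Osmosis infers compression from the file extension, so .gz in and out works
osmosis \
  --read-xml file=planet-100915.osm.gz \
  --bounding-box left=-91 right=-82 bottom=36 top=43 \
                 clipIncompleteEntities=true \
  --write-xml file=excerpt.osm.gz
```

With a .poly file instead, swap the middle task for --bounding-polygon file=region.poly clipIncompleteEntities=true.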

Step 8: Sort and merge

If you're doing more than one bounding box/polygon, you probably want to combine them into a single file to load. This is easy. Merging them requires that you sort them first.
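Osmosis pipelines make this a one-liner; --merge consumes two sorted input streams (filenames here are placeholders):

```shell
# Each input is sorted by type/ID first, then the two streams are merged
osmosis \
  --read-xml file=ohio.osm.gz    --sort \
  --read-xml file=indiana.osm.gz --sort \
  --merge \
  --write-xml file=combined.osm.gz
```

To combine more than two files, chain additional --read-xml/--sort pairs with more --merge tasks, since each --merge joins exactly two streams.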

Step 9: Prepare Osmosis for reading diffs

Create an empty directory somewhere; Osmosis will use this as a working directory. Download a state.txt file that's close to the same vintage as the planet.osm dump you got (it's better to be behind than ahead). Save this file as "state.txt" (remove the date from it).

Set up the directory by running

osmosis --rrii workingDirectory=./

This will create a couple of other files. Edit the configuration.txt file to point to the mirror of your choice; you can point it at minutely or hourly diffs (but make sure the mirror you want actually has them; not all mirrors do). You can also adjust the maxInterval setting to your liking (I'll have more advice about this once I've made it to this point... >_>)
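Putting the step together as a sketch (the URL below is an example based on the minute-replicate layout mentioned earlier; you actually want a dated state.txt close to your planet dump's vintage, not the current one):

```shell
mkdir ~/osmosis-work && cd ~/osmosis-work

# Grab the state.txt closest to (but not after) your planet dump's date,
# and save it under the plain name state.txt
wget -O state.txt http://planet.openstreetmap.org/minute-replicate/state.txt  # example URL

# Initialize the working directory; this writes configuration.txt and friends
osmosis --rrii workingDirectory=./

# Then edit configuration.txt: set baseUrl to your mirror's diff directory,
# and maxInterval (in seconds) to taste.
```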