New York Standup 9/30/2008

What’s the best way to import a million records into a Postgres database via ActiveRecord (which we need in order to apply some application-specific logic)? We anticipate waiting a second or so between inserts to avoid slowing down the production database, which is under load (almost entirely reads). If ActiveRecord has a feature that batches inserts together, no one knew about it. As for how long this will take (estimates range from 9 to 27 hours) and what load it will put on the production database, we plan to answer both with a trial run on a small number of these records.
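A minimal sketch of the throttled import and the trial-run extrapolation, in plain Ruby. The method names, `batch_size`, and `pause` are hypothetical and not taken from our code, and pausing between small batches rather than between individual inserts is an assumption. The block passed in stands for whatever ActiveRecord call performs the insert.

```ruby
# Sketch only: feeds records to an insert step in small batches and pauses
# between batches so the production database can keep serving reads.
# batch_size and pause are hypothetical tuning knobs, not measured values.
def import_in_batches(records, batch_size: 100, pause: 1.0, sleeper: Kernel.method(:sleep))
  inserted = 0
  records.each_slice(batch_size).with_index do |batch, index|
    sleeper.call(pause) if index > 0 # pause between batches, not before the first
    batch.each do |attrs|
      yield attrs # in the real app this would be the ActiveRecord insert
      inserted += 1
    end
  end
  inserted
end

# Extrapolates total hours from a timed trial run of sample_count records.
# Assumes the per-record cost stays constant, which may not hold under load.
def extrapolate_hours(sample_count, sample_seconds, total)
  (sample_seconds * total / sample_count) / 3600.0
end
```

In the application, the block might be something like `{ |attrs| Record.create!(attrs) }`, where `Record` is a stand-in for the real model. Timing a trial run of a few hundred records and passing the result to `extrapolate_hours` gives a first estimate to compare against the 9 to 27 hour guesses.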

We’re thinking of having Capistrano deploy to two demo servers: one aimed at showing the application to prospective users, the other mostly for story acceptance. The former would be hosted at a hosting company; the latter would be an internally run machine. Several people reported having done this on their projects with only minor problems, mostly around whether the deploy location (/u/apps/whatever or some such) differs between the two machines. The solution is to use the Capistrano variables, but tracking down every place that needs them could be an issue.

Erector tip of the day: in a Rails project, you can put a file (named edit.rb or edit.html.rb) in your view directory, and Rails/Erector will find the template implicitly (as it would for ERB, HAML, etc). It is not necessary to explicitly call render from your controller method.

10 Comments

Regarding your 2nd point, I usually see these kinds of instances as new steps in the production chain. That’s why I use the Capistrano multistage extension (gem install capistrano-ext) to define those new steps (possibly with their own environment files).

September 30, 2008 at 7:35 pm

Steve C says:

Is there some non-AR way of loading records into postgres that would meet your needs? I’m thinking of some equivalent of the mysql “load data infile”, that loads mass amounts of data 20x faster than any alternative.
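For reference, the closest Postgres counterpart to “load data infile” is the COPY command (or `\copy` from psql for a client-side file). The sketch below is only an illustration of preparing a file in COPY's default text format from Ruby; the helper names are hypothetical, and because this route bypasses ActiveRecord, any application-specific logic would have to be applied while the rows are generated.

```ruby
# Sketch only: writes rows in PostgreSQL's COPY text format
# (tab-separated columns, one row per line).

# Escapes one value: NULL becomes \N, and backslash, tab, newline, and
# carriage return are backslash-escaped.
def copy_escape(value)
  return '\N' if value.nil?
  value.to_s.gsub(/[\\\t\n\r]/) do |ch|
    { "\\" => '\\\\', "\t" => '\t', "\n" => '\n', "\r" => '\r' }.fetch(ch)
  end
end

# Joins one row's escaped values with tabs.
def copy_line(row)
  row.map { |value| copy_escape(value) }.join("\t")
end

# Writes all rows to path, one COPY-format line per row.
def write_copy_file(path, rows)
  File.open(path, "w") do |file|
    rows.each { |row| file.puts(copy_line(row)) }
  end
end
```

A file written this way could then be loaded from psql with something like `\copy records (name, notes) from 'records.tsv'`, where the table, column, and file names are placeholders.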

September 30, 2008 at 8:06 pm

Steve C says:

re: erector, I’d say “it’s not necessary to use implicit templates, you can just call render directly”. ha ha.

September 30, 2008 at 8:08 pm

Dan Kubb says:

Have you thought about using DataMapper to handle the inserts instead of ActiveRecord? As of the most recent [DataMapper benchmarks](http://gist.github.com/10735) DM is about 2x faster than AR when inserting records and performing most other operations.

DataObjects would likely be faster still, since it is what DM uses under the hood to communicate with Postgres. It should be the fastest Ruby RDBMS driver available at the moment (faster than what AR uses, including the recently released Neverblock drivers), and it works with Ruby 1.8.

RE: #1–
Could you load the records into a copy of your production db on a local machine, and after all is done, do an export/import into the production machine? At least this way, if something goes wrong, there is much less chance of it munging up your production data. Not that that has ever happened to me.

September 30, 2008 at 9:13 pm

Chad Woolley says:

re #2 – yep, Strass is right. That’s what multistage was made for. Put all the differences in config/deploy/.rb
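A rough sketch of what that multistage layout can look like with capistrano-ext. The stage names, application name, host name, and paths are all made up for illustration; only `set`, `role`, and the `capistrano/ext/multistage` require are standard Capistrano 2 usage. The two sections below belong in separate files, as the comments indicate.

```ruby
# --- config/deploy.rb (shared settings) ---
set :stages, %w(demo acceptance)  # hypothetical stage names
set :default_stage, "acceptance"  # stage used when none is given on the command line
require 'capistrano/ext/multistage'

set :application, "whatever"      # placeholder application name

# --- config/deploy/demo.rb (settings for the hypothetical "demo" stage) ---
set :deploy_to, "/u/apps/whatever"  # per-stage deploy location
role :app, "demo.example.com"       # placeholder host
```

With a layout like this, `cap demo deploy` deploys to the demo stage, and anything that refers to the `deploy_to` variable instead of a hard-coded path picks up the per-stage value.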

Also, the story acceptance environment should be deployed after every CI build. On our projects, we already do that for a “local” localhost environment (check out Sandbox), so it should be straightforward to do the same thing for a “demo” (vs staging?) environment.

In response to your large dataset import question, we used the acts_as_importable plugin with great results. The plugin allows you to do pretty much everything as usual (validations, column discovery, SQL-escaping, etc.) except that instead of saving to the db, the plugin creates a SQL bulk import file which you can load into the db at your leisure.

Sooooo much faster.

Now, we were using MySQL. Not sure about the Postgres support, but it might be worth looking into.

October 1, 2008 at 12:47 am

David Stevenson says:

We use ar-extensions extensively to handle our large data imports, with no problems to speak of. :validate => false is a useful option that speeds things up if you are okay with skipping validations.