AllGoodBits.org

Tips for Loading Data into RDBMS

Sometimes people set up a little database, perhaps put some data into it, point an application at it and let it run. Sometimes that's the right balance of pragmatism and perfection. Sometimes however, there are reasons to be a little more diligent. if the situation calls for any of

This article therefore collects a few tips to improve the process of loading data into the RDBMS; some general tips and some suggestions for particular databases. In this article I'm going to be lazy about explcitly stating my assumptions, this is mostly about listing some of the low-hanging fruit of improving database load times, not about squeezing every bit of performance.

General Principles

Since there's no reasonable point in making optimization effort unless you can determine whether it helps or not, you want some visibility into your load process. time(1) and pv(1) are both good tools to give just a little extra visibility into your load process, although if your needs are more heavy duty, you might want to consider instrumenting the process more deeply. Here are some brief hints:

This import can be equivalently done in a mysql client session using LOAD DATA INFILE.

Constraints

Don't forget to turn these back on afterwards!

SET FOREIGN_KEY_CHECKS=0;

SET UNIQUE_CHECKS=0;

Logging

SET sql_log_bin=0;

SET GLOBAL slow_query_log = 0;

Flushing

The rate and method of flushing transactions and logs to disk affect performance. As a simplification, the less demanding mysqld/innodb is about flushing (that is, the more it lets the kernel drivers/drive controllers take control), the better performance you'll likely see. This is not always true. The downside is basically decreased robustness in the face of a process/OS/hardware crash. Since we're loading a backup, I'm assuming that a crash just means we start over and there won't be any data loss anyway, so let's get as much performance as possible.

Other Configuration

PostgreSQL

The PostgreSQL docs themselves about populating a database are very clear & useful, although there is no mention of pg_bulkload.

Briefly,

COPY: copy <table_name> from '/path/to/table.csv' DELIMITERS ',' CSV HEADER, although if you're importing data created by pg_dump(1), the points about COPY, indices and constraints are done automatically by psql(1) or pg_restore(1)