In most CubicWeb projects I've worked on, there comes a time when
I need to import legacy data into the new application. CubicWeb provides Store
and Controller objects in the dataimport module. I won't talk here about the
recommended general procedure described in the module's docstring (I find it a bit
convoluted for simple cases) but will focus on Store objects instead. Store
objects in this module are more or less a thin layer around session objects:
they provide high-level helpers such as create_entity() and relate(),
and keep track of what was inserted, errors that occurred, etc.
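As an illustration of that helper API (a hedged sketch: the 'Person' entity type and 'knows' relation are made-up schema names, and the store stands for any of the dataimport stores), a typical import loop looks like this:

```python
# Illustrative sketch of the store helper API described above.
# 'Person' and 'knows' are hypothetical schema names; 'store' is any
# dataimport store (e.g. RQLObjectStore) passed in by the caller.
def import_people(store, names):
    people = []
    for name in names:
        # create_entity() returns the created entity, carrying its eid
        people.append(store.create_entity('Person', name=name))
    # relate() links two entities through a relation type
    for first, second in zip(people, people[1:]):
        store.relate(first.eid, 'knows', second.eid)
    store.flush()
    return people
```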

In a recent project, I had to create a fairly large number (a few million)
of simple entities (strings, integers, floats and dates) and relations. The default
object store (i.e. cubicweb.dataimport.RQLObjectStore) is painfully slow,
the reason being all the integrity / security / metadata hooks that are constantly
selected and executed. For large imports, dataimport also provides
cubicweb.dataimport.NoHookRQLObjectStore. This store bypasses all hooks
and uses the underlying system source primitives directly, making it around
two times faster than the standard store. The problem is that each SQL query
is still executed sequentially, and we're talking here of millions of
INSERT / UPDATE queries.

My idea was to create my own ObjectStore class, inheriting from NoHookRQLObjectStore,
that would try to use executemany or even copy_from when possible [1]. It
is actually not hard to group similar SQL queries, since create_entity()
generates the same query for a given set of parameters. For instance:
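As a sketch of what this grouping looks like (the SQL text, table and parameter names below are illustrative, not the exact statements CubicWeb generates), two create_entity() calls for the same entity type yield the same parameterized statement, so recorded queries can be bucketed by their SQL string:

```python
# Group recorded SQL queries by their statement text: all parameter sets
# sharing the same statement can later go through a single executemany().
queries = {}  # sql string -> list of parameter dicts

def record_query(sql, params):
    queries.setdefault(sql, []).append(params)

# hypothetical statements, as generated for two entities of the same type
sql = "INSERT INTO cw_Order (cw_title, cw_date) VALUES (%(title)s, %(date)s)"
record_query(sql, {'title': u'order 1', 'date': '2012-01-01'})
record_query(sql, {'title': u'order 2', 'date': '2012-01-02'})

# one statement, two parameter sets: ready for cursor.executemany(sql, queries[sql])
```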

Those extra queries are actually exactly the same for each entity inserted, whatever
the entity type, and thus cry out for executemany or copy_from. Grouping SQL queries
together is not that hard [2], but it has a drawback: since there is no intermediate state
(the data is actually inserted only at the very end of the process),
you lose the ability to query your database to fetch the entities you've just created
during the import.

Now, a few benchmarks ...

To create those benchmarks, I decided to use the workorder cube, which is simple yet complete enough: it provides only two entity types (WorkOrder and Order), a relation between them (Order split_into WorkOrder) and uses different kinds of attributes (String, Date, Float).

Once the cube was instantiated, I ran the following script to populate the database with my 3 different stores:
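The script was along these lines (a hedged sketch, not the original: the attribute names for Order and WorkOrder are guesses at the workorder cube's schema, and the timing uses the standard library's process/wall-clock counters to stand in for the clock/time figures discussed below):

```python
import time
from datetime import date

def populate(store, nb_orders):
    """Create orders, workorders and their split_into relation with the
    given store, returning (clock, wall) elapsed times in seconds."""
    clock_start = time.process_time()   # CPU time, the 'clock' figure
    wall_start = time.perf_counter()    # wall-clock time, the 'time' figure
    for i in range(nb_orders):
        order = store.create_entity('Order', title=u'order %s' % i,
                                    date=date.today())
        workorder = store.create_entity('WorkOrder', title=u'workorder %s' % i,
                                        begin_date=date.today())
        store.relate(order.eid, 'split_into', workorder.eid)
    store.flush()
    return (time.process_time() - clock_start,
            time.perf_counter() - wall_start)
```

The same function is then run once per store (RQLObjectStore, NoHookRQLObjectStore, CopyFromRQLObjectStore), which is what makes the timings directly comparable.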

The clock time indicates the time spent on the CubicWeb server side (i.e. hooks
and data pre/postprocessing around SQL queries). The total time should be
the sum of the clock time and the time spent in PostgreSQL.

RQLObjectStore is slow ;-). Nothing new here, but the clock/time ratio
shows that we're spending a lot of time on the Python side (i.e. in hooks,
as mentioned earlier) and a fair amount of time in PostgreSQL.

NoHookRQLObjectStore really brings down the time spent on the Python side,
while the time in PostgreSQL remains about the same as for RQLObjectStore. This
is not surprising: the queries performed are the same in both cases.

CopyFromRQLObjectStore seems blazingly fast in comparison (inserting
a few thousand elements into PostgreSQL with a COPY FROM statement
is not a problem). And yes, I checked that the data was actually inserted,
and I even ran a cubicweb-ctl db-check on the instance afterwards.

This probably opens new perspectives for massive data imports, since the client API
remains the same as before for the programmer. It's still a bit experimental and can only be
used for "dummy", brute-force import scenarios where you can preprocess your data
in Python before updating the database, but it's probably worth having such
a store in the dataimport module.

[1] The idea is to promote an executemany('INSERT INTO ...', data) statement into
a COPY FROM whenever possible (i.e. for simple data types that are easy enough to escape). In that case,
the underlying database and Python modules have to support this functionality.
For the record, the psycopg2 module exposes a copy_from() method, and soon logilab-database
will provide an additional high-level helper for this functionality (see this ticket).
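A minimal sketch of that promotion, assuming psycopg2 and deliberately ignoring the escaping of tabs and newlines inside values (which real code would have to handle), could look like this:

```python
import io

def rows_to_copy_buffer(rows):
    """Serialize rows into the tab-separated text format expected by
    COPY FROM; None becomes the \\N null marker. Escaping of tabs and
    newlines embedded in values is omitted in this sketch."""
    buf = io.StringIO()
    for row in rows:
        buf.write('\t'.join('\\N' if value is None else str(value)
                            for value in row))
        buf.write('\n')
    buf.seek(0)
    return buf

def copy_rows(cursor, table, columns, rows):
    # psycopg2 cursors expose copy_from(), which issues a single COPY
    # statement instead of one INSERT per row
    cursor.copy_from(rows_to_copy_buffer(rows), table, columns=columns)
```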

[2] The code will be posted later or even integrated into CubicWeb
at some point. For now, it requires a bit of monkey-patching around
one or two methods in the source so that SQL is not executed but just
recorded for later execution.

Comments

With the dataimport features and your work above, the CW community benefits from both advanced reuse through cubes and impressive performance. This is really cool, congrats and many thanks for the great job!