Projects usually do not start from scratch. Most of the time there is
pre-existing data that needs to be imported, and sometimes the amount of data
is significant.

Assume you have an existing application that generates a few hundred thousand
records a day and you are about to migrate to a new stack with CrateDB as the
database backend. You’ll need a way to import the existing millions of records
into CrateDB as quickly as possible.

This best practice example guides you through the process and shows tips and
tricks on how to import your data quickly and safely.

We recommend you choose the number of shards wisely. The right number depends
on the number of nodes in the cluster as well as on the amount of data that
goes into the table.

Note

Be careful: you cannot change the number of shards once the table is created!

Assuming there are 6 nodes in the cluster, we put 2 shards on each node,
giving us a total of 12 shards, which should be enough for millions of records.
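
A minimal sketch of such a table definition (the column names are illustrative
assumptions; only the users table name is taken from the import example further
below):

    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        name TEXT,
        day_joined TIMESTAMP
    ) CLUSTERED INTO 12 SHARDS;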

Shards can be thought of as “virtual nodes” - create enough to meet your
scaling needs, but use as few as possible to keep the resource overhead (such
as file descriptors and memory) as small as possible.

More importantly, set the number of replicas as low as possible, ideally to
zero while importing data. In case the import fails, we can drop the table and
re-import again. When the import succeeds, adjust the number of replicas
according to your availability requirements.
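
A sketch of how this could look for the users table above (the replica count
of 1 after the import is only an example; pick whatever your availability
requirements dictate):

    -- before the import: no replicas
    ALTER TABLE users SET (number_of_replicas = 0);

    -- after a successful import: restore replication
    ALTER TABLE users SET (number_of_replicas = 1);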

Another simple, but very important tweak to speed up importing is to set the
refresh interval of the table to 0. This will disable the periodic refresh of
the table that is needed to minimise the effect of eventual consistency and
therefore also minimise the overhead during import.
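
This can also be done with an ALTER TABLE statement; the value of 1000 ms used
to restore the refresh afterwards is an assumption, adjust it to your setup:

    -- disable the periodic refresh while importing
    ALTER TABLE users SET (refresh_interval = 0);

    -- re-enable it (value in milliseconds) once the import has finished
    ALTER TABLE users SET (refresh_interval = 1000);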

CrateDB has native support for JSON formatted data, where each line is a
JSON string and represents a single record. Empty lines are skipped. The
keys of the JSON objects are mapped to columns when imported - nonexistent
columns will be created if necessary.
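A single line of such a file might look like this (the field names are
illustrative):

    {"id": 1, "name": "foo", "day_joined": 1408312800000}

Such a file can then be imported with a COPY FROM statement, for example:

    COPY users FROM '/tmp/best_practice_data/users.json';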

Upon execution, each node will check locally whether a file exists at the
provided path and import the data it contains. Consequently, this command will
check /tmp/best_practice_data/ on each node in the cluster and import data from
a file called ‘users.json’. Please note that if the file is not found, the
command will still return successfully, reporting COPY OK, 0 rows affected (... sec).

When importing data using COPY FROM, CrateDB does not check whether the types
of the columns match the types in the import file. It does not cast values to
their target types but always imports the data as it appears in the source
file(s).

The COPY FROM statement also accepts a bulk_size setting in its WITH clause
that controls how many records are inserted per batch. In our example it will
not make a difference, but if you have a more complex dataset with a lot of
columns and large values, it makes sense to decrease the bulk_size. Setting
bulk_size too high might consume a lot of node resources, while a low
bulk_size can increase the overhead per request.
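
A sketch of lowering the bulk size (the value of 2000 is only an example):

    COPY users FROM '/tmp/best_practice_data/users.json'
    WITH (bulk_size = 2000);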

Sometimes you want to split your table into partitions to be able to handle
large datasets more efficiently (e.g. for queries to run on a reduced set of
rows). To demonstrate data import into partitioned tables, we create partitions
for every day (in production, this depends on your use case).

Partitions are defined in the CREATE TABLE statement using the PARTITIONED BY
clause.
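
A sketch of how the users table from above could be recreated with one
partition per day (dropping and recreating the table is assumed here; the
column names remain illustrative):

    CREATE TABLE users (
        id INTEGER,
        name TEXT,
        day_joined TIMESTAMP,
        PRIMARY KEY (id, day_joined)
    ) CLUSTERED INTO 12 SHARDS
    PARTITIONED BY (day_joined)
    WITH (number_of_replicas = 0);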

A partition column has to be part of the primary key (if one was explicitly
declared), so in our example this constraint is added to the newly created
partition column.

To import data into partitioned tables efficiently, you should import each
table partition separately. Since the value of the partition column is not
stored in a regular column of the table, the JSON source must not contain a
value for that column.

This way, CrateDB does not need to resolve the partition for each row that is
imported but can store it directly into the correct place resulting in a much
faster import.
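
A sketch of a per-partition import; the timestamp value and the per-day file
name are illustrative assumptions:

    COPY users PARTITION (day_joined = 1408312800000)
    FROM '/tmp/best_practice_data/users_1408312800000.json';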

However, it’s still possible (but not recommended) to import into partitioned
tables without the PARTITION clause and have the column value in the source.

When importing data into a partitioned table with existing partitions, you may
want to apply import optimizations, such as disabling the refresh interval,
only to newly created partitions. This can be done by altering only the table
itself (and thus only future partitions) using the ALTER TABLE ONLY statement.
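
A sketch of such a statement, reusing the refresh interval tweak from above:

    -- changes the setting on the table only, so existing partitions keep their
    -- current refresh interval while newly created partitions inherit the new value
    ALTER TABLE ONLY users SET (refresh_interval = 0);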

Similarly, the number of shards can be adjusted for newly created partitions to
adapt to the increasing data volume! Simply use ALTER TABLE users SET (number_of_shards = X) before creating a new partition.

To sum up the points described above, importing huge datasets is not difficult
if a few things are kept in mind. These are:

Reduce the number of replicas as much as possible, ideally to 0. Replication
slows down the import process significantly.

Use only as many shards as you really need.

Disable the periodic table refresh by setting the refresh interval to 0
during import.

Adjust the bulk size of the import depending on the size of your records.

Import table partitions separately using the PARTITION clause in the
COPY FROM statement.

And last but not least:

Import speed increases significantly with more disk I/O capacity. Using SSDs
for CrateDB is recommended anyway, but having one more disk in the cluster (by
adding another node) can make quite a difference.