The LoadRDF tool is an OFFLINE tool designed for fast loading of large datasets. It cannot be used against a running server. The rationale for an offline tool is to achieve optimal performance when loading large amounts of RDF data, by serializing the data directly into GraphDB’s internal indexes and producing a ready-to-use repository.

The LoadRDF tool resides in the bin/ folder of the GraphDB distribution. It loads data into a new repository, created from the Workbench or from the standard configuration Turtle file found in configs/templates, or into an existing repository. In the latter case, the existing repository data is automatically overwritten.
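A typical invocation might look like the following sketch. The option names used here (-c for the repository configuration file, -m for the serial/parallel load mode, -i for an existing repository ID) and the file names are illustrative assumptions; verify them against the output of bin/loadrdf --help for your GraphDB version:

```shell
# Create a new repository from a configuration file and bulk load into it
# (option names and file names are illustrative; check loadrdf --help):
bin/loadrdf -c repo-config.ttl -m parallel statements.nt.gz

# Or load into an existing repository by its ID, overwriting its data:
bin/loadrdf -i myrepo -m parallel statements.nt.gz
```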

Warning

During the bulk load, the GraphDB plugins are ignored in order to speed up the process. Afterwards, when the server is started, the plugin data can be rebuilt.

Configure the LoadRDF repositories location by setting the graphdb.home.data property in <graphdb_dist>/conf/graphdb.properties. If the property is not set, the default repositories location is <graphdb_dist>/data.
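For example, the following line in <graphdb_dist>/conf/graphdb.properties points the repositories location at a faster disk (the path shown is purely illustrative):

```
graphdb.home.data = /mnt/fast-disk/graphdb-data
```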

Start GraphDB.
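This step can be done from the command line. The graphdb startup script in bin/ and its -d (daemonize) flag are how recent distributions typically behave, but this is an assumption; verify with bin/graphdb --help:

```shell
# Start the GraphDB server in the background (daemon mode; verify the
# -d flag against your distribution's graphdb --help output):
bin/graphdb -d
```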

Start a browser and go to the Workbench web application using a URL of this form: http://localhost:7200, substituting localhost and the 7200 port number as appropriate.

The LoadRDF tool accepts Java command line options passed with -D. To change them, edit the command line script.
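For example, you can edit the java invocation line in bin/loadrdf directly. Alternatively, if your distribution's scripts read the GDB_JAVA_OPTS environment variable (an assumption worth verifying by inspecting the script itself), the options can be set there:

```shell
# Assumption: the loadrdf script honors the GDB_JAVA_OPTS environment
# variable; check the script in bin/ before relying on this.
export GDB_JAVA_OPTS="-Dpool.buffer.size=400000"
```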

The following options can tune the behaviour of the parallel loading:

-Dpool.buffer.size - the buffer size (the number of statements) for each stage. Defaults to 200,000 statements. You can use this parameter to tune the memory usage and the overhead of inserting data:

a smaller buffer size reduces the memory required;

a bigger buffer size reduces the overhead, as the operations performed by the threads are less likely to wait for the operations they depend on, and the CPU is used intensively most of the time.

-Dinfer.pool.size - the number of inference threads in parallel mode. The default value is the number of cores of the machine’s processor, or 4, as set in the command line scripts. A bigger pool theoretically means a faster load, provided there are enough free cores and the inference does not wait for the other load stages to complete.
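Putting the two options together, a tuned parallel load might use -D settings like the following fragment in the edited command line script (the values are illustrative; doubling the buffers trades higher memory use for less thread-coordination overhead):

```
-Dpool.buffer.size=400000 -Dinfer.pool.size=8
```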