The life-cycle of a repository instance typically starts with the
initial loading of datasets, followed by the processing of queries and
updates. Loading a large dataset can take a long time: up to 12
hours for a billion statements with inference. During loading, it is
therefore often helpful to use a different configuration from the one
used for normal operation.

Furthermore, if you frequently reload a certain dataset because it gradually
changes over time, the loading configuration can evolve as you
become more familiar with how GraphDB behaves with this dataset. Many
dataset properties, such as the number of unique entities, only become
apparent after the initial load. This information can be used
to optimise the loading step for the next round or to improve the
configuration for normal operation.

The size of the data structures used to index entities is directly
related to the number of unique entities in the loaded dataset. These
data structures are always kept in memory. To get an upper
bound on the number of unique entities loaded and to find the actual
amount of RAM used to index them, inspect the contents of the
storage folder.

The total amount of memory needed to index entities is equal to the sum
of the sizes of the files entities.index and entities.hash. This
value tells you how much memory the entity index occupies and therefore
how much remains to be divided between the cache-memory and other settings.

An upper bound on the number of unique entities is given by the size of
entities.hash divided by 12 (memory is allocated in pages and
therefore the last page will likely not be full).
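The two calculations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name entity_index_stats and the storage_dir argument are hypothetical, and the path to the storage folder depends on your installation. The 12-byte divisor is the figure given in the text.

```python
import os

# Divisor stated in the text for the size of entities.hash.
BYTES_PER_HASH_ENTRY = 12


def entity_index_stats(storage_dir):
    """Estimate entity index memory and an upper bound on unique entities.

    storage_dir is assumed to be the repository storage folder that
    contains the files entities.index and entities.hash.
    """
    index_size = os.path.getsize(os.path.join(storage_dir, "entities.index"))
    hash_size = os.path.getsize(os.path.join(storage_dir, "entities.hash"))
    return {
        # Memory needed to index entities: sum of both file sizes.
        "index_memory_bytes": index_size + hash_size,
        # Upper bound only: the last allocated page is likely not full.
        "max_unique_entities": hash_size // BYTES_PER_HASH_ENTRY,
    }
```

Calling entity_index_stats with the path to the storage folder returns both figures in one dictionary, which can then be compared with the total memory available to the repository.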

The file entities.index is used to look up entries in the file
entities.hash and its size is equal to the value of the
entity-index-size parameter multiplied by 4. Therefore, the
entity-index-size parameter has less to do with efficient use of
memory and more with the performance of entity indexing and lookup. The
larger this value, the fewer collisions occur in the entities.hash
table. A reasonable size for this parameter is at least half the number
of unique entities. However, the size of this data structure is never
changed once the repository is created, so this knowledge can only be
used to adjust this value for the next clean load of the dataset with a
new (empty) repository.
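As a rough illustration of the sizing rule above, the following Python sketch derives a minimum entity-index-size from a known number of unique entities and the resulting size of entities.index. Both function names are hypothetical helpers, not part of GraphDB; the factor of 4 and the "at least half" guideline come from the text.

```python
def suggest_entity_index_size(unique_entities):
    """Smallest value that is at least half the number of unique entities."""
    return (unique_entities + 1) // 2  # round up for odd counts


def entity_index_file_bytes(entity_index_size):
    """Size of entities.index: the parameter value multiplied by 4."""
    return entity_index_size * 4


# Example: plan the next clean load of a dataset with 100 million entities.
size = suggest_entity_index_size(100_000_000)
print(size, entity_index_file_bytes(size))
```

The suggested value is a lower bound. A larger value reduces collisions further, at the cost of a proportionally larger entities.index file.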

Furthermore, the inference semantics can be adjusted by choosing a
different ruleset. However, this requires reloading the whole
repository; otherwise, some inferences may remain when they should not.

Note

The optional indices can be built at a later time, when the
repository is used for query answering. To decide which ones help,
experiment with typical query patterns from the user environment.

Predicate lists are two indices (SP and OP) that can improve
performance in the following situations:

When loading/querying datasets that have a large number of
predicates;

When executing queries or retrieving statements that use a wildcard
in the predicate position, e.g., the statement pattern:
dbpedia:Human ?predicate dbpedia:Land.

As a rough guideline, a dataset with more than about 1000 predicates
will benefit from using these indices for both loading and query
answering. Predicate list indices are not enabled by default, but can be
switched on using the enablePredicateList configuration parameter.
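To check a dataset against the rough guideline above before loading it, you could count its distinct predicates. The Python sketch below assumes the data is available as N-Triples (one statement per line, so the predicate is always the second whitespace-separated token). The function names are hypothetical, and the threshold of 1000 is the approximate figure from the text, not a hard limit.

```python
PREDICATE_LIST_THRESHOLD = 1000  # rough guideline from the text


def count_predicates(lines):
    """Count distinct predicates in an iterable of N-Triples lines."""
    predicates = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 2)  # subject, predicate, rest of statement
        if len(parts) == 3:
            predicates.add(parts[1])
    return len(predicates)


def predicate_lists_recommended(lines):
    """True if the dataset exceeds the rough predicate-count guideline."""
    return count_predicates(lines) > PREDICATE_LIST_THRESHOLD
```

Passing an open file object works, since files iterate line by line. For other serialisations, a proper RDF parser would be needed instead of this simple split.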

Statistics are kept for the main index data structures and include
information such as cache hits/misses, file reads/writes, etc. This
information can be used to fine-tune GraphDB memory configuration and
can be useful for ‘debugging’ certain situations, such as understanding
why load performance changes over time or with particular data sets.

For each index, a CollectionStatistics MBean is published,
which shows the cache and file I/O values, updated in real time:

Package: com.ontotext

MBean name: CollectionStatistics

The following information is displayed for each MBean/index:

Attribute: Description

CacheHits: The number of operations completed without accessing the storage system.

CacheMisses: The number of completed operations that needed to access the storage system.

FlushInvocations

FlushReadItems

FlushReadTimeAvarage

FlushReadTimeTotal

FlushWriteItems

FlushWriteTimeAvarage

FlushWriteTimeTotal

PageDiscards: The number of times a non-dirty page’s memory was reused to read in another page.

PageSwaps: The number of times a page was written to the disk so that its memory could be used to load another page.

Reads: The total number of times an index was searched for a statement or a range of statements.

Writes: The total number of times a statement was added to a collection.

The following operations are available:

Operation: Description

resetCounters: Resets all the counters for this index.

Ideally, the system should be configured to keep the number of cache
misses to a minimum. If the ratio of hits to misses is low,
consider increasing the memory available to the index (if other factors
permit this).
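A simple way to track this is to turn the CacheHits and CacheMisses counters into a single figure. The Python sketch below is an illustration only: the function name is hypothetical, the counter values are assumed to have been read from the MBean by other means (for example, a JMX console), and it reports hits as a fraction of all accesses rather than the raw hits-to-misses ratio, which avoids dividing by zero when there are no misses.

```python
def cache_hit_ratio(cache_hits, cache_misses):
    """Fraction of operations served from the cache, or None if no data."""
    total = cache_hits + cache_misses
    if total == 0:
        return None  # counters were just reset or the index is unused
    return cache_hits / total


# Example with made-up counter values copied from a monitoring console.
print(cache_hit_ratio(900, 100))
```

What counts as "low" depends on the workload, so it is more useful to compare values for the same index before and after a configuration change than to aim for a fixed number.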

Page swaps tend to occur much more often during large-scale data
loading. Page discards occur more frequently during query evaluation.

GraphDB uses a number of query optimisation techniques by default. They
can be disabled by setting the enable-optimization configuration
parameter to false. However, there is rarely any need to do this.
See GraphDB’s Explain Plan for a way to view query plans and applied
optimisations.

A further optimisation applies when the repository contains a large number of
literals with language tags and it is necessary to execute queries that
filter based on language, e.g., using the following SPARQL query
construct:

FILTER(lang(?name)="ES")

In this situation, the in-memory-literal-properties configuration
parameter can be set to true, causing the data values with language
tags to be cached.

During query answering, all URIs from each equivalence class produced by
the sameAs optimisation are enumerated. You can use the
onto:disable-sameAs pseudo-graph (see
Other special query behaviour) to significantly
reduce these duplicate results (by returning a single
representative from each equivalence class).

Consider these example queries executed against the
FactForge combined dataset. Here, the
default is to enumerate:

The Expand results over equivalent URIs checkbox in the GraphDB
Workbench SPARQL editor plays a similar role, but the meaning is
reversed.

Warning

If the query uses a filter over the textual representation of a URI,
e.g., filter(strstarts(str(?x),"http://dbpedia.org/ontology")),
this may skip some valid solutions as not all URIs within an
equivalence class are matched against the filter.
