FAQ: Database Memory Usage

I get an error 'Unable to allocate N bytes of memory' while building a database, and Sawmill seems to have used all my available memory. What can I do about it?

Short Answer

Use a 64-bit computer and operating system with sufficient RAM, and/or simplify your database.

Long Answer

This error means that Sawmill tried to allocate another chunk of memory (N additional bytes, on top of
whatever it was already using), and the operating system told it that there was no more memory available for it to use.
This error is usually not a bug; it almost always indicates that Sawmill really has exhausted all memory available.
This error typically happens when using the "internal" database with a very large dataset.

The "internal" database is optimized for performance above all, and tends to keep some key data
structures in memory. On 32-bit systems, when processing large datasets, the amount of memory required
may exceed the available address space. Typically, the internal database will work well up to about
10 GB of uncompressed log data on a 32-bit system. Above that, scalability may become an issue.
On 64-bit systems, the address space is not a concern, but if there is not sufficient physical RAM,
this error can still occur.

Itemnum tables (also called normalization tables), especially, can result in heavy memory usage for large datasets.
Sawmill keeps a list of all values seen for each field, e.g., a list of all IP addresses
which appear in a particular field, or a list of all URLs which appear in another field, in the
"itemnum" tables. These tables are kept in memory, or at least mapped to memory, so they use
available memory address space. In the case of an IP address field, for instance the source
IP address of a web server log, each value is about ten bytes long. If there are 10 million unique IPs
accessing the site, this table is 100 million bytes long, or 100 MB. Similarly for a proxy log
analysis, if each unique URL is 100 bytes long and there are 10 million unique URLs in the log data,
the table will be 1 GB. Tables this large can easily exceed the capabilities of a 32-bit system,
which typically allows only 2 GB of memory to be used per process.
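
The arithmetic above can be sketched as a quick estimate. This is only a rough illustration: the function names, the constants, and the simple "unique values times average length" model are assumptions made for this example, not Sawmill's actual internals, and real tables will carry additional overhead.

```python
# Rough estimate of itemnum table sizes, using the figures from the text.
# All names here are hypothetical; the model ignores per-entry overhead.

BYTES_PER_MB = 1_000_000            # decimal megabytes, as used in the text
LIMIT_32BIT_BYTES = 2_000_000_000   # ~2 GB per process on a typical 32-bit system


def estimate_table_bytes(unique_values: int, avg_value_bytes: int) -> int:
    """Approximate size of one itemnum table: one entry per unique value."""
    return unique_values * avg_value_bytes


def fits_in_32bit(*table_sizes: int) -> bool:
    """True if the combined tables stay under the assumed 2 GB process limit."""
    return sum(table_sizes) < LIMIT_32BIT_BYTES


# 10 million unique IP addresses at about 10 bytes each
ip_table = estimate_table_bytes(10_000_000, 10)
# 10 million unique URLs at about 100 bytes each
url_table = estimate_table_bytes(10_000_000, 100)

print(ip_table // BYTES_PER_MB, "MB")    # 100 MB
print(url_table // BYTES_PER_MB, "MB")   # 1000 MB
print(fits_in_32bit(ip_table, url_table))
```

Even when a single table fits, several large fields together (plus everything else Sawmill keeps in memory) can approach the limit, which is why reducing the number of unique values in a field helps.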

One solution is to use a 64-bit computer and operating system, with sufficient RAM; with a 64-bit processor,
Sawmill will be able to allocate as much RAM as it needs, provided the RAM is available on the system
(and it can use virtual memory if it isn't). This is the most complete solution; with a large amount of RAM
on a 64-bit system, it should be possible to build extraordinarily huge databases without running
out of memory.

Another option is to simplify the dataset; see Memory, Disk, and Time Usage for suggestions. In particular,
adding a log filter to simplify or eliminate very complex database fields can not only reduce memory usage, but also
improve performance.