I have a general question regarding memory usage for SciDB. I have close to 260 GB of data on a single-instance system and the SciDB process is consistently using at least 12 GB of system memory. Does this sound right, and how will this change with an increasing data volume?

This is an array of weather stations that I am projecting onto a global grid. I am keeping them in a very sparse array to avoid collisions. I plan on regridding in subsequent steps but would like to keep both arrays (the averaged and the raw) for queries. When I run a regrid (100x coarser in both row and col) on this array, the SciDB process gradually consumes all system memory (32 GB) before crashing.

Also, I dropped some of the larger arrays (in terms of number of values stored) from the current database, and it doesn’t seem to change the memory use.

This is a curious case. I wondered whether you might be suffering from the “many small chunks” problem, but that’s not what the output says. Your array is well organized, and 386+ thousand elements per chunk is good.

We’ve had a bug where very large, very sparse chunks at the edge of the array occupy too much space. For example, in your array the last chunk is at coordinates {50,77599}. The dimensions are 138301 by 100, but the array ends at day=77600, so the system used to create a mask containing 138301 “run lengths” of 1 to denote that there is no data after day=77600. That chunk could occupy a lot of memory, and if it were placed in the cache, it could account for the large memory footprint you are seeing. We’ve fixed that bug in 12.10.
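A back-of-the-envelope sketch of why one such edge chunk gets heavy. The ~16 bytes per run-length entry is my assumption for illustration, not SciDB’s actual in-memory format:

```python
# Rough estimate of the memory cost of the buggy edge-chunk mask.
# ASSUMPTION: each run-length entry costs ~16 bytes in memory; the
# real figure depends on SciDB's internal RLE representation.
BYTES_PER_RUN = 16

def mask_bytes(num_runs, bytes_per_run=BYTES_PER_RUN):
    """Memory used by a run-length mask with `num_runs` entries."""
    return num_runs * bytes_per_run

# The edge chunk described above: 138301 run lengths of 1.
print(f"one edge chunk: {mask_bytes(138301) / 2**20:.1f} MiB")

# If a few thousand such chunks end up in the chunk cache, the
# footprint reaches gigabytes -- the scale of what you are seeing.
print(f"5000 cached chunks: {mask_bytes(138301) * 5000 / 2**30:.1f} GiB")
```

So even a handful of megabytes per degenerate edge chunk, multiplied across cached chunks and versions, plausibly adds up to the footprint you observed.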

We also saw a few similar bugs involving caching and regrid over very sparse arrays. We’ve fixed those in 12.10 too, so it’s likely that many of these issues will go away once you get to 12.10.

Meanwhile, here are some things to try:

It could still be that another array has too many small chunks. Do you have many arrays in the system? If so, run the same kind of query against several of them to check. The numbers “chunks: 776” and “cells/chunk: 386708” are what you want to look at: many chunks with few cells each is the problem pattern.
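A small sketch of the check to run once you have those numbers for each array. The tuples and thresholds below are hypothetical placeholders, not output from your system:

```python
# Sketch: flag arrays showing the "many small chunks" pattern.
# ASSUMPTION: you've collected (array, chunks, cells_per_chunk) from
# the same diagnostic that printed "chunks: 776" / "cells/chunk: 386708".
# The thresholds are illustrative, not SciDB-recommended values.
MAX_CHUNKS = 100_000          # a lot of chunks ...
MIN_CELLS_PER_CHUNK = 10_000  # ... each holding very little data

def is_suspicious(chunks, cells_per_chunk):
    """True when an array has many chunks that each hold few cells."""
    return chunks > MAX_CHUNKS and cells_per_chunk < MIN_CELLS_PER_CHUNK

stats = [
    ("GHCND_sparse", 776, 386708),    # the numbers from your output
    ("hypothetical_array", 250000, 40),  # an invented problem case
]

for name, chunks, cells in stats:
    verdict = "suspicious" if is_suspicious(chunks, cells) else "looks healthy"
    print(f"{name}: {chunks} chunks x {cells} cells/chunk -- {verdict}")
```

By this measure your GHCND_sparse array is healthy; the point is to sweep every array in the system, not just the one you have been querying.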

Have you done a lot of updates or repeated stores? In other words, do you have an array that’s at a high version number? I see that GHCND_sparse is only at version 1; how about the others? In 12.3 the header for each chunk of each version is kept in memory; this improves in 12.10.

Try stopping and restarting the system, i.e. scidb.py stopall / scidb.py startall. What does the memory footprint look like then? Does it grow at startup, or only after you run a query or two? Does startup itself take a long time?

I don’t know what your setup is like, but there is an unreleased, unofficial “12.7” kit you can try. You’d have to build it from source yourself. The source tarball is at scidb.org/tutorial_link/