"Dave Tapley" <dave.a.tapley at gmail.com> writes:
> I've upgraded to bytestring-0.9.0.5 from Darcs, no improvement.
> Also this morning I tried using Data.HashMap with ByteString's readInt
> and HashMap's hashInt. The result was a stack space overflow :(
That's not so good.
>> It works as required, loading K/V pairs into a Data.Map, the concern
>> is the amount of memory used. Pausing (using a getLine) after the
>> readFile one can see (through 'ps v') that the whole file 'f' is
>> loaded in to memory.
>> However once the Map is created the memory footprint swells to around
>> five times the size. Surely this can't be just overhead from Map?
>> My reasoning was that by using ByteString and getting the whole file
>> into memory a small and linear increase would be seen for Map
>> overhead..
What's the average line length? Roughly, the Data.Map will contain
2*#lines nodes, each taking some bytes, and the #lines leaf nodes will
hold pointers to two ByteStrings, each of which carries an overhead of
three pointers (IIRC: a pointer to the char array, an offset, and a
length).
Not sure about the size of Map internal nodes, but I assume it will be
at least two pointers to children, and possibly the size as well?
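For reference, the node type in the containers package's Data.Map looks
roughly like this (modulo UNPACK pragmas), so each internal node carries a
subtree size, a key, a value, and two child pointers -- five fields plus
the constructor header:

```haskell
-- A sketch of Data.Map's internal representation; field names and
-- strictness match my recollection of the containers source, so treat
-- this as an approximation rather than the exact definition.
data Map k a = Bin !Int !k a !(Map k a) !(Map k a)
             | Tip
```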
If we presume three words per node, we get about 6*#lines*wordsize
bytes of overhead, i.e. 24*#lines on a 32-bit architecture and
48*#lines on a 64-bit one. Copying GC means that in reality you need
-- or at least, the program will consume, if available¹ -- a good bit
more.
(And yes, since each string is just a pointer into the original file
contents, you need to retain that as well.)
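As a back-of-envelope check of the estimate above (mapOverhead is just an
illustrative name, not a real function anywhere):

```haskell
-- Rough lower bound on Data.Map overhead per the estimate above:
-- ~6 machine words per key/value pair (tree nodes plus ByteString
-- headers), not counting the retained file contents themselves.
mapOverhead :: Int  -- ^ word size in bytes (4 or 8)
            -> Int  -- ^ number of lines (key/value pairs)
            -> Int  -- ^ estimated overhead in bytes
mapOverhead wordSize nLines = 6 * wordSize * nLines
```

So a million-line file on a 64-bit machine already costs on the order of
48 MB of overhead before GC headroom is accounted for.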
Short answer: Maps are expensive in terms of memory.
One option could be to pack the keys into Ints (if they're short
enough) and use Data.IntMap. Another could be to just sort an Array
of offsets into the file contents and use binary search.
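Both options can be sketched. This is only an illustration, assuming each
line is a space-separated "key value" pair; packKey, buildMap, keyAt and
lookupOff are made-up names, and packKey assumes keys of at most 7 bytes
so they fit in an Int without collisions from truncation:

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.IntMap as IM
import Data.Array (Array, bounds, (!))
import Data.Bits (shiftL, (.|.))
import Data.Char (ord)

-- Option 1: pack short keys into an Int and use Data.IntMap.
-- Keys longer than 7 bytes are truncated here, so this is only safe
-- when keys are known to be short and distinct.
packKey :: B.ByteString -> Int
packKey = B.foldl' step 0 . B.take 7
  where step acc c = (acc `shiftL` 8) .|. ord c

buildMap :: B.ByteString -> IM.IntMap B.ByteString
buildMap = IM.fromList . map parse . B.lines
  where parse l = let (k, v) = B.break (== ' ') l
                  in (packKey k, B.drop 1 v)

-- Option 2: keep only an array of byte offsets into the file contents,
-- sorted by the key each offset points at, and binary-search it.
keyAt :: B.ByteString -> Int -> B.ByteString
keyAt file off = B.takeWhile (/= ' ') (B.drop off file)

lookupOff :: B.ByteString        -- ^ whole file contents
          -> Array Int Int       -- ^ offsets, sorted by key
          -> B.ByteString        -- ^ key to find
          -> Maybe Int           -- ^ offset of the matching line
lookupOff file offs k = go lo hi
  where (lo, hi) = bounds offs
        go l h
          | l > h     = Nothing
          | otherwise =
              let m = (l + h) `div` 2
              in case compare k (keyAt file (offs ! m)) of
                   LT -> go l (m - 1)
                   GT -> go (m + 1) h
                   EQ -> Just (offs ! m)
```

Option 2 adds only one Int of index per line on top of the file contents,
at the cost of O(log n) lookups with ByteString comparisons.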
-k
¹) If your program is thrashing, make sure you limit heap to 80% of
physical memory (use +RTS -Mxxxx).
--
If I haven't seen further, it is by standing in the footprints of giants