Hi Aleksandar,
I thought it'd be educational to do some back-of-the-envelope
calculations to see how much memory we'd expect to use to store words
in a HashMap ByteString Int. First, lets start by looking at how much
memory one ByteString uses. Here's the definition of ByteString [1]:
data ByteString = PS {-# UNPACK #-} !(ForeignPtr Word8)
{-# UNPACK #-} !Int -- offset
{-# UNPACK #-} !Int -- length
The two Int fields are used to support O(1) slicing of ByteStrings.
We also need the definitions of ForeignPtr [2] and Int.
data ForeignPtr a = ForeignPtr Addr# ForeignPtrContents
data ForeignPtrContents
= PlainForeignPtr !(IORef (Finalizers, [IO ()]))
| MallocPtr (MutableByteArray# RealWorld) !(IORef
(Finalizers, [IO ()]))
| PlainPtr (MutableByteArray# RealWorld)
data Int = I# Int#
The UNPACK indicates to the compiler that it should unpack the
contents of a constructor field into the constructor itself, removing
a level of indirection. We'll end up with a structure that looks like
this:
data ByteString = PS Addr# ForeignPtrContents
Int# -- offset
Int# -- length
To compute the size of a ByteString, count the number of constructor
fields and add one (for the constructor itself). This is how many
words the value is going to use. In this case it's 5 words. In
addition we need to add the size of the ForeignPtrContents
constructor, which happens to be a PlainPtr in this case, so we add
two more words.
Finally we need to look at the definition of MutableByteArray# [3],
which is implemented by a C struct named StgArrWords:
typedef struct {
StgHeader header;
StgWord bytes;
StgWord payload[FLEXIBLE_ARRAY];
} StgArrWords;
The StgHeader takes one word (when compiling with profiling turned
off) so StgArrWords takes 2 words, plus the actual payload.
If we add it all up we get 9 words, plus the size of the payload (i.e.
the length of a word encoded using UTF-8 in your case).
Now lets look at the definition of HashMap [4]:
data HashMap k v
= Bin {-# UNPACK #-} !SuffixMask
!(HashMap k v)
!(HashMap k v)
| Tip {-# UNPACK #-} !Hash
{-# UNPACK #-} !(FL.FullList k v)
| Nil
We also need the definition of FullList [5]:
data FullList k v = FL !k v !(List k v)
data List k v = Nil | Cons !k v !(List k v)
For the sake of this discussion lets assume the tree is perfectly
balanced. For a HashMap of size N this means that we have N leaves
(Tip) and N-1 interior nodes (Bin). Each Bin takes 4 words.
The size of the Tip depends on the number of hash collisions. These
are quite rare so lets assume that the FullList only has one element.
Also, the Nil constructor is free as it can be shared by all instances
in the program. After applying the UNPACK pragmas the Tip constructor
looks like:
| Tip Int# !k v !(List k v)
This takes another 5 words.
Now when we know the overhead of both Bin and Tip we can compute the
overhead per key/value pair as: (5N + 4(N-1)) / N = 9 - 4/N ~= 9
words.
Given that an Int (not an Int#) takes 2 words, we can approximate the
memory cost of a key/value pair in a HashMap ByteString Int as
(9 + 9 + 2) * MachineWordSize + AvgByteStringLength
bytes.
For example, the average English word length is 5 characters, if you
include stop words. We'd expect to use
(9 + 9 + 2) * 8 + 5 = 165
bytes per unique word in our input corpus, on a 64-bit machine. Plus
any GC overhead. This is probably more than one would expect.
I'm working on switching HashMap to use another data structure, a Hash
Array Mapped Trie, in its implementation. This will bring the overhead
down from 9 words to about 4 words.
You could try using the lower overhead SmallString type from the
smallstring package [6]. It has an overhead of 4 words per string
(instead of 9 like ByteString). There's some runtime overhead involved
in converting a value (i.e. Text) to a SmallString. I don't know if
this overhead will be noticeable in your program.
1. http://code.haskell.org/bytestring/Data/ByteString/Internal.hs
2. https://github.com/ghc/packages-base/blob/master/GHC/ForeignPtr.hs
3. https://github.com/ghc/ghc/blob/master/includes/rts/storage/Closures.h
4. https://github.com/tibbe/unordered-containers/blob/master/Data/HashMap/Common.hs
5. https://github.com/tibbe/unordered-containers/blob/master/Data/FullList/Lazy.hs
6. http://hackage.haskell.org/package/smallstring
P.S. If your program indeed has a space leak this won't help you, but
it's a good way to figure out if your program uses a reasonable amount
of memory.
Cheers,
Johan