I've been writing an Erlang app (my first one) for the last two and a half
months, and one of its duties is to parse/transform _large_ CSV files. Some
of them are 11+ million lines and 600+ MB in size, and they will only get
larger as time goes on. Luckily I've got 128GB of RAM to work with, so I
decided to do everything in memory via binaries. My initial, naive attempt
was to read the file into one large binary and then split that into its
individual columns. The end result would be a list of lists, each sublist
being a list of binaries representing the columns of one line.
e.g.
[[<<"col1">>,<<"col2">>],
[<<"val1">>,<<"val2">>]]
This chewed up a _lot_ of memory, and my runtimes were nothing to write home
about. A few days ago I sat down and tried to read up more on binaries,
their idiomatic usage, pitfalls, etc. I ended up re-implementing my CSV
functions to build a new binary while parsing the source binary. My memory
usage went down drastically (although it spikes because of ephemeral
garbage, lots of temp binaries) and my runtimes improved greatly!
Anyway, while reading the section on binary handling in the Efficiency Guide
I got the notion that if you match/split a large binary, the resulting
sub-binary will reference the larger, off-heap one. That is, if you pull a
chunk out of a large binary, its storage is essentially free. However, after
running some tests I feel like a lot of heap memory is allocated. I wrote a
more formal test tonight, and honestly, I'm not sure I understand things any
better.
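To make the "sub-binary references the larger binary" idea concrete, here is a minimal sketch (not taken from the attached code). The module name subbin_sketch and the sizes are made up for illustration. It matches a chunk out of a 1 MB binary and compares the chunk's own size with the size of the binary it keeps referenced, using binary:referenced_byte_size/1. The exact number reported can depend on the ERTS version, since the runtime is free to copy small chunks instead of referencing the parent.

```erlang
-module(subbin_sketch).
-export([demo/0]).

%% Hypothetical demo: build a 1,000,000 byte binary, match out a
%% 1000 byte chunk, and return {ChunkSize, ReferencedSize}.
demo() ->
    Big = binary:copy(<<"0123456789">>, 100000),
    %% Chunk is matched out of Big rather than built from scratch.
    <<_:5000/binary, Chunk:1000/binary, _/binary>> = Big,
    {byte_size(Chunk), binary:referenced_byte_size(Chunk)}.
```

If the second element is much larger than the first, the chunk is a sub-binary pointing into the big binary, and only its small descriptor lives on the process heap.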
I created a bunch of test runs on a 27MB CSV file with 597971 lines, each
with 8 columns.
list_of_cols: break each column into its own binary
list_of_lines: break each line into its own binary
list_of_32byte: break into 32 byte chunks
list_of_64byte: break into 64 byte chunks
list_of_128byte: break into 128 byte chunks
...
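For readers without the attachment, the three kinds of splitting could be sketched roughly as follows. This is an illustration only, not the code in binary_test.erl: the module name split_sketch and its function names are hypothetical, it assumes "\n" line endings and "," separators, and it does no CSV quoting or escaping.

```erlang
-module(split_sketch).
-export([lines/1, cols/1, chunks/2]).

%% One binary per line; 'trim' drops the empty part after a final newline.
lines(Bin) ->
    binary:split(Bin, <<"\n">>, [global, trim]).

%% One list of column binaries per line (naive: no quoted fields).
cols(Bin) ->
    [binary:split(Line, <<",">>, [global]) || Line <- lines(Bin)].

%% Fixed-size chunks of N bytes; the last chunk may be shorter.
chunks(Bin, N) when is_integer(N), N > 0 ->
    chunks(Bin, N, []).

chunks(Bin, N, Acc) ->
    case Bin of
        <<Chunk:N/binary, Rest/binary>> -> chunks(Rest, N, [Chunk | Acc]);
        <<>> -> lists:reverse(Acc);
        Last -> lists:reverse([Last | Acc])
    end.
```

For example, split_sketch:cols(<<"a,b\nc,d\n">>) yields a list of two lines with two columns each, and split_sketch:chunks(<<"abcde">>, 2) yields three chunks, the last one a single byte.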
Each test shows certain memory stats for each stage of the test: start,
after send, after run, after GC, and after shutdown. The main thing to
focus on is the difference in process-allocated memory between the "after
send data" and "after run" stages. Here are my results and attached is the code.
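As a rough idea of how figures like these can be collected, here is a sketch built on erlang:memory/1. It is an assumption about the approach, not a copy of the attached code; the module name mem_sketch and the output layout are made up to resemble the tables below.

```erlang
-module(mem_sketch).
-export([snapshot/0, format/1]).

%% Returns [{total, Bytes}, {processes_used, Bytes}, {binary, Bytes}].
snapshot() ->
    erlang:memory([total, processes_used, binary]).

%% Formats a snapshot as "total | processes used | binary" in MB,
%% taking 1 MB = 1024 * 1024 bytes.
format(Stats) ->
    MB = fun(Key) -> proplists:get_value(Key, Stats) / (1024 * 1024) end,
    io_lib:format("~.2fMB | ~.2fMB | ~.2fMB",
                  [MB(total), MB(processes_used), MB(binary)]).
```

Calling io:format("~s~n", [mem_sketch:format(mem_sketch:snapshot())]) at each stage prints one row per stage.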
2> binary_test:run_tests().
*** After read
total | processes used | binary
35.57MB | 0.81MB | 26.78MB
==========RUNNING list_of_cols============================
*** Start
total | processes used | binary
35.53MB | 0.77MB | 26.78MB
*** After send data
total | processes used | binary
35.53MB | 0.77MB | 26.78MB
*** After run
total | processes used | binary
400.45MB | 365.68MB | 26.79MB
*** After GC
total | processes used | binary
400.45MB | 365.68MB | 26.78MB
*** After shutdown
total | processes used | binary
35.55MB | 0.78MB | 26.78MB
==========DONE list_of_cols===============================
==========RUNNING list_of_lines===========================
*** Start
total | processes used | binary
35.55MB | 0.78MB | 26.78MB
*** After send data
total | processes used | binary
35.55MB | 0.78MB | 26.78MB
*** After run
total | processes used | binary
84.53MB | 49.75MB | 26.79MB
*** After GC
total | processes used | binary
84.53MB | 49.75MB | 26.78MB
*** After shutdown
total | processes used | binary
35.55MB | 0.78MB | 26.78MB
==========DONE list_of_lines==============================
==========RUNNING list_of_32byte==========================
*** Start
total | processes used | binary
35.55MB | 0.78MB | 26.78MB
*** After send data
total | processes used | binary
35.56MB | 0.79MB | 26.79MB
*** After run
total | processes used | binary
96.76MB | 61.99MB | 26.78MB
*** After GC
total | processes used | binary
112.07MB | 77.30MB | 26.78MB
*** After shutdown
total | processes used | binary
35.55MB | 0.78MB | 26.78MB
==========DONE list_of_32byte=============================
==========RUNNING list_of_64byte==========================
*** Start
total | processes used | binary
35.55MB | 0.78MB | 26.78MB
*** After send data
total | processes used | binary
35.56MB | 0.79MB | 26.78MB
*** After run
total | processes used | binary
66.90MB | 32.13MB | 26.78MB
*** After GC
total | processes used | binary
66.90MB | 32.13MB | 26.78MB
*** After shutdown
total | processes used | binary
35.56MB | 0.78MB | 26.78MB
==========DONE list_of_64byte=============================
==========RUNNING list_of_128byte=========================
*** Start
total | processes used | binary
35.56MB | 0.78MB | 26.78MB
*** After send data
total | processes used | binary
35.56MB | 0.79MB | 26.78MB
*** After run
total | processes used | binary
51.61MB | 16.83MB | 26.78MB
*** After GC
total | processes used | binary
51.61MB | 16.83MB | 26.79MB
*** After shutdown
total | processes used | binary
35.55MB | 0.77MB | 26.79MB
==========DONE list_of_128byte============================
==========RUNNING list_of_256byte=========================
*** Start
total | processes used | binary
35.55MB | 0.78MB | 26.79MB
*** After send data
total | processes used | binary
35.55MB | 0.78MB | 26.79MB
*** After run
total | processes used | binary
45.82MB | 11.05MB | 26.79MB
*** After GC
total | processes used | binary
45.83MB | 11.06MB | 26.79MB
*** After shutdown
total | processes used | binary
35.56MB | 0.78MB | 26.79MB
==========DONE list_of_256byte============================
==========RUNNING list_of_512byte=========================
*** Start
total | processes used | binary
35.54MB | 0.77MB | 26.79MB
*** After send data
total | processes used | binary
35.55MB | 0.77MB | 26.79MB
*** After run
total | processes used | binary
41.89MB | 7.12MB | 26.78MB
*** After GC
total | processes used | binary
41.89MB | 7.13MB | 26.78MB
*** After shutdown
total | processes used | binary
35.55MB | 0.78MB | 26.78MB
==========DONE list_of_512byte============================
I went ahead and did some _rough_ calculations on how much each sub-binary
costs on the process heap by taking the difference in processes_used memory
and dividing by the number of sub-binaries created. I got the following
numbers:
list_of_cols: 80B
list_of_lines: 86B
list_of_32byte: 73B
list_of_64byte: 75B
list_of_128byte: 76B
list_of_256byte: 99B
list_of_512byte: 122B
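The arithmetic behind these rough figures can be written out as a small helper. The module and function names are hypothetical, and it assumes the MB values in the output are 1024 * 1024 bytes, which is what reproduces the numbers in the list.

```erlang
-module(cost_sketch).
-export([bytes_per_item/3]).

%% BeforeMB and AfterMB are the "processes used" figures (in MB) from the
%% "after send data" and "after run" stages; Count is the number of
%% sub-binaries the run created.
bytes_per_item(BeforeMB, AfterMB, Count) ->
    (AfterMB - BeforeMB) * 1024 * 1024 / Count.
```

With the list_of_cols figures, cost_sketch:bytes_per_item(0.77, 365.68, 597971 * 8) comes out at roughly 80, and with the list_of_lines figures, cost_sketch:bytes_per_item(0.78, 49.75, 597971) comes out at roughly 86. Note that this lumps together everything the run put on the process heap, including the list cells that hold the sub-binaries, not just the sub-binaries themselves.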
So does this mean a Refc/sub-binary costs roughly 80B of memory? Am I
thinking about this too much? Probably.
Just to be clear, I'm now happy with the performance of my CSV routines.
I'm writing this because I want to understand the underlying binary
implementation better and because I spent too much time setting up my tests
not to post this :). If you read this far, thank you.
-Ryan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: binary_test.erl
Type: application/octet-stream
Size: 3443 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20100908/59bbc7df/attachment.obj>