I'm playing with Erlang to build an experimental protocol. I'm trying
to make it saturate a 1 Gbit link, but it won't scale that far, and I
have failed to find a bottleneck in my code, or even anything I could
call a bottleneck.
My software behaves much like a messaging server, but with bigger
packets, many clients (more than 4k) and more complex sub-components,
such as a distributed database. Those components do not block other
parts of the system; it is just the client-server channel that is
IO-heavy and involves some encryption and decryption.
I made a gen_server process for each UDP socket to clients. There is a
central process registry, but it is called only for new clients and
its message queue is usually empty.
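
To make that layout concrete, here is a minimal sketch of one such
per-socket process. It is an illustration only, not the actual code:
the module name `udp_chan`, the echo reply and the batch size of 100
are assumptions, and the real handler would decrypt, process and
encrypt instead of echoing. It uses `{active, N}` mode (available
since OTP 17) so the socket does not need re-arming per packet.

```erlang
-module(udp_chan).
-behaviour(gen_server).

-export([start_link/1, port/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

%% Datagrams delivered as messages before the socket turns passive
%% (hypothetical value, chosen only for illustration).
-define(BATCH, 100).

start_link(Port) ->
    gen_server:start_link(?MODULE, Port, []).

%% Returns {ok, LocalPort}; handy when the socket was opened on port 0.
port(Pid) ->
    gen_server:call(Pid, port).

init(Port) ->
    {ok, Sock} = gen_udp:open(Port, [binary, {active, ?BATCH}]),
    {ok, Sock}.

handle_call(port, _From, Sock) ->
    {reply, inet:port(Sock), Sock}.

handle_cast(_Msg, Sock) ->
    {noreply, Sock}.

%% One message per datagram; the echo stands in for the real work.
handle_info({udp, Sock, Addr, Port, Packet}, Sock) ->
    _ = gen_udp:send(Sock, Addr, Port, Packet),
    {noreply, Sock};
%% The batch is used up: re-arm the socket for another ?BATCH packets.
handle_info({udp_passive, Sock}, Sock) ->
    ok = inet:setopts(Sock, [{active, ?BATCH}]),
    {noreply, Sock};
handle_info(_Other, Sock) ->
    {noreply, Sock}.

terminate(_Reason, Sock) ->
    gen_udp:close(Sock).

code_change(_OldVsn, Sock, _Extra) ->
    {ok, Sock}.
```

Starting it with `udp_chan:start_link(0)` opens an ephemeral port,
which `udp_chan:port(Pid)` then reports.
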
I found there was a bottleneck in `scheduler_wait` when I had few
clients (around 400): it consumed around 50% of total CPU usage. I
found an old patch by Wei Cao [1] which seemed to target the same
issue. But on a modern version of Erlang (18.0), the time spent in
`scheduler_wait` dropped considerably on a more congested network,
specifically to around 10% when my software reached its apparent
limit of around 600 Mbit/s read and write to the network. At this
point my incoming UDP packet rate is around 24K/s. Maybe an
experienced Erlang developer here remembers that problem and can tell
me whether Erlang is now optimized to poll for network packets more
often.
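
One way to cross-check the `scheduler_wait` numbers from inside the VM
is to sample `erlang:statistics(scheduler_wall_time)` twice and
compare. The sketch below is a minimal example under assumptions: the
module name `sched_util` and the sampling interval are made up for
illustration. As I understand the documentation, the active time it
reports excludes time a scheduler spends idle, so it may differ from
what the OS reports as CPU usage.

```erlang
-module(sched_util).
-export([sample/1]).

%% Returns [{SchedulerId, Utilization}] measured over IntervalMs
%% milliseconds. Utilization is active time divided by wall time.
sample(IntervalMs) ->
    %% The counters are off by default; remember the previous setting.
    Old = erlang:system_flag(scheduler_wall_time, true),
    T0 = lists:sort(erlang:statistics(scheduler_wall_time)),
    timer:sleep(IntervalMs),
    T1 = lists:sort(erlang:statistics(scheduler_wall_time)),
    _ = erlang:system_flag(scheduler_wall_time, Old),
    %% Pair up the two samples per scheduler id and take the deltas.
    [{Id, ratio(A1 - A0, W1 - W0)}
     || {{Id, A0, W0}, {Id, A1, W1}} <- lists:zip(T0, T1)].

%% Guard against a zero-length interval.
ratio(_Active, 0) -> 0.0;
ratio(Active, Wall) -> Active / Wall.
```

Calling `sched_util:sample(1000)` while the server is under load gives
one utilization figure per scheduler for that second.
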
I was also concerned about the async pool, since I saw a fairly high
amount of work in pthread functions, but I found those threads are
used just for file IO operations. I didn't find any conclusive
documentation about this; I just saw that the only user of this dirty
IO thing is `io.c` in the OTP source code. I would be very grateful if
anyone could clarify the usage and effect of this pool.
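
For reference, the size of the async pool and the number of
schedulers can be read from a running node. This is a small sketch;
the module name `vm_info` is an assumption, while the
`erlang:system_info/1` items are standard ones. The pool size itself
is set at startup with the `+A` flag to `erl`.

```erlang
-module(vm_info).
-export([threads/0]).

%% Reports the thread-related settings of the running emulator.
threads() ->
    [{async_threads, erlang:system_info(thread_pool_size)},
     {schedulers, erlang:system_info(schedulers)},
     {schedulers_online, erlang:system_info(schedulers_online)}].
```
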
I made flame graphs of function calls inside the VM (using eflame2
[2]); the result is very even and I cannot find any outstanding usage
[3]. I made another flame graph from a perf report outside the VM, in
which some symbols could not be resolved [4]. I am unsure whether
process_main should take that much work itself. Apparently encryption
and decryption (enacl_nif calls) didn't take much time either.
Do you have any suggestions for how I can better analyze my software
and understand how the VM works? Are these the limits I should expect,
with no more room for optimization?
Thanks in advance
1: http://erlang.org/pipermail/erlang-questions/2012-July/067868.html
2: https://github.com/slfritchie/eflame
3: http://file.reith.ir/in-erl-3k.gif
4: http://file.reith.ir/out-erl-perf.svg (interactive, use web browser)