Docker Workers Scale Nicely with Multiple Cores

Disclaimer: The title might be a bit misleading. For this workload, it’s scaling pretty much linearly. Other workloads might scale differently.

I was running a quick-ish test on a Pi to see how long it would take to churn through 50GB of compressed data.

This processing consists of:

Determining the files to be processed — this is via an offset since ultimately there will be 10 workers processing the data. Also I think I might get a slightly more representative set than by simply taking the first N files.

For each file, start a Docker container which uncompresses the file to stdout, where data is extracted from the stream and appended to a file.
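The offset-based selection in the first step can be sketched like this (the file names, worker count, and this worker's offset are all invented for the demo):

```shell
# Each of 10 workers takes every 10th file, starting at its own offset.
WORKERS=10; OFFSET=3            # this worker's slot (made up for the demo)
seq -f 'file%03g.gz' 1 25 |     # stand-in for the real file listing
  awk -v n="$WORKERS" -v off="$OFFSET" '(NR - 1) % n == off'
# -> file004.gz, file014.gz, file024.gz
```

Because each worker strides through the whole listing rather than taking a contiguous chunk, its sample mixes early and late files.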

The input is read over NFS from a server with a single 10/100 NIC and the output is written locally. Why NFS? In this case, it’s easy to configure and works well enough until proven otherwise.
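The per-file step, sketched end to end with plain `gzip` standing in for the Docker container and `grep` standing in for the SDF extractor (file names are made up for the demo):

```shell
# Demo of the per-file pipeline: decompress to stdout, extract from the
# stream, append to an output file.
printf 'keep 1\ndrop 2\nkeep 3\n' | gzip > /tmp/demo.gz   # fake input file
gzip -dc /tmp/demo.gz | grep '^keep' >> /tmp/demo.out     # "container" | "extractor"
```

In the real run the decompression half executes inside a container, something along the lines of `docker run --rm -v /mnt/nfs:/data:ro <image> gunzip -c /data/$FILE`, with the extractor consuming its stdout on the host.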

top output demonstrates that the process doing the extracting is, indeed, working hard:

$ top

top - 21:12:03 up 2 days, 13:33,  2 users,  load average: 1.03, 2.09, 1.80
Tasks:  97 total,   2 running,  95 sleeping,   0 stopped,   0 zombie
%Cpu(s): 25.2 us,  0.2 sy,  0.0 ni, 74.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   947468 total,   554564 used,   392904 free,    48836 buffers
KiB Swap:        0 total,        0 used,        0 free,   438636 cached

  PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
 6358 root  20  0  4736 4288 1752 R 97.8  0.5  3:27.19 extract_prop_sd
 6357 root  20  0  1100   80      S  2.6  0.0  0:05.27 gzip

Eeek! The extractor (from the SDF Toolkit) is pretty much eating a CPU by itself. It may need to be replaced, depending on whether the results are good enough.

However, I am not going to optimize without testing and evaluating. Pegging the CPU might turn out to be ok; when the whole swarm is working, I believe that IO is ultimately the limiting factor. I haven’t tested that yet, so I don’t have much confidence in it. As I write this, I’m beginning to doubt it: even if NFS is stupidly chatty, this extraction should only be a one- or two-time event, and I’d spend more time writing and testing a replacement parser than simply letting this one run. If it were happening on a regular basis, I’d be more concerned. (I just had the glimmer of a fairly easy-to-implement AWK or Ruby streaming parser, so if I find myself performing the extraction more often than I anticipate…)
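For what it’s worth, that streaming-parser glimmer really could be small. In an SD file, each property appears as a `> <NAME>` header line followed by its value, with records terminated by `$$$$`; a sketch that pulls one property per record (the property name `MW` is just an example, not from this dataset):

```shell
# Hedged sketch of a streaming SDF property extractor: match the property
# header line, then print the value line that follows it.
printf '> <MW>\n42.1\n$$$$\n> <MW>\n17.0\n$$$$\n' |
  awk '/^>.*<MW>/ { getline v; print v }'
# -> 42.1
#    17.0
```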

That usage pattern remains consistent with more processors:

$ top

top - 21:15:14 up 2 days, 13:36,  2 users,  load average: 2.12, 1.69, 1.67
Tasks: 118 total,   5 running, 113 sleeping,   0 stopped,   0 zombie
%Cpu(s): 98.9 us,  1.0 sy,  0.0 ni,  0.0 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   947468 total,   500500 used,   446968 free,    48908 buffers
KiB Swap:        0 total,        0 used,        0 free,   369856 cached

  PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
 6534 root  20  0  4748 4272 1736 R 97.5  0.5  0:26.10 extract_prop_sd
 6545 root  20  0  4732 4360 1812 R 97.5  0.5  0:25.72 extract_prop_sd
 6558 root  20  0  4744 4292 1756 R 97.5  0.5  0:25.90 extract_prop_sd
 6554 root  20  0  4716 4292 1756 R 96.5  0.5  0:25.85 extract_prop_sd
 6532 root  20  0  1100   80      S  2.6  0.0  0:00.63 gzip
 6543 root  20  0  1100   80      S  2.3  0.0  0:00.57 gzip
 6557 root  20  0  1100   80      S  2.3  0.0  0:00.62 gzip
 6553 root  20  0  1100   80      S  2.0  0.0  0:00.55 gzip

The following tests were performed on a Pi 2 with increasing amounts of parallelism:

Number of Files / Total Size   Concurrent Containers   Elapsed Time                                    Avg Time/File (rounded)
 3 /  43.6477 MB               1                       real 13m46.731s  user 0m1.420s  sys 0m1.350s   4:35
10 / 135.167 MB                3                       real 17m22.104s  user 0m2.910s  sys 0m1.420s   1:44
12 / 162.307 MB                4                       real 15m45.713s  user 0m3.840s  sys 0m0.920s   1:19

Note: 10 is not evenly divisible by 3, so the last file was running by itself.

Yes, the individual runs are slower (~4:35 per file with one container vs. ~5:15 with four cores in use), but the multiple cores more than make up for it.
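Those per-file numbers fall out of dividing total wall time by the number of sequential "waves" of files, as a quick shell check shows:

```shell
# 1 container: 3 files run back to back (13m46.731s total).
echo "1 container : $(( (13*60 + 47) / 3 )) s/file"        # 275 s = 4:35
# 4 containers: 12 files in 3 waves of 4 (15m45.713s total).
echo "4 containers: $(( (15*60 + 46) / (12/4) )) s/file"   # 315 s = 5:15
```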

The number of concurrent processes was controlled by xargs:

$ cat file_list | xargs -P $procs -n 1 worker.sh

The -n 1 specifies that one argument (file) is sent to each invocation of the worker script. One advantage of doing it this way is that if one file finishes more quickly than another (or is smaller), processing is not held up.
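A quick demo of the fan-out (file names invented): -P caps the number of concurrent workers, and as each invocation exits, xargs immediately starts another with the next argument.

```shell
# xargs runs at most 3 of these at once, one file name per invocation.
# `echo got` stands in for the real worker script.
procs=3
printf '%s\n' a.gz b.gz c.gz d.gz e.gz | xargs -P "$procs" -n 1 echo got
```

Note that with -P the completion (and therefore output) order is not guaranteed.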

The output is, on average, slightly more than 7MB for each of the 12 files. Small enough that I’m not too concerned about compressing them (yet) — I have 16GB MicroSD cards which are less than 25% full.

So… there are 3664 files. Assuming 4 processes per Pi 2 and 1 per Pi B+, that gives me 25 across the 10 worker nodes. If I press additional hosts into service I could get up to ~37, at the expense of more hosts hitting a single NFS server. I think I shall copy the data to another data host and split the reads in half.

So, assuming 25 workers and each file taking about 1.5 minutes of wall-clock time (padding for IO latency), I should be able to churn through the files in approximately 3 hours and 40 minutes. Even at 2 minutes/file, that is just under 5 hours. Not too terribly bad.
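The arithmetic behind that estimate, as a quick shell check:

```shell
# 3664 files split across 25 concurrent workers.
files=3664; workers=25
waves=$(( (files + workers - 1) / workers ))        # ceiling division -> 147 waves
echo "at 1.5 min/file: $(( waves * 90 / 60 )) min"  # 220 min, about 3h40m
echo "at 2.0 min/file: $(( waves * 2 )) min"        # 294 min, just under 5h
```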

I might be able to get a little more performance if I allow the docker containers to use the host’s network stack. That’s a test for another day, however.