Re: How to set Akka frame size

You're receiving this error because the Akka frame size must be a positive Java Integer, i.e., less than 2^31 bytes. The frame size is not meant to approach the size of the job's memory, however -- it is the smallest unit of data transfer that Spark does. In this case, your "task result" size is exceeding the 10MB default, which means the serialized results for a single partition of your data are >10MB.
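For intuition on the 2^31 limit, here is a minimal sketch of the arithmetic. It illustrates why a frame size given in MB overflows once converted to a byte count held in an int; the exact conversion is my assumption about the internals, not quoted from the Spark source:

    // Sketch only: illustrates the arithmetic, not Spark's actual code.
    // A frame size configured in MB becomes a byte count stored as an int,
    // so sizeMb * 1024 * 1024 must stay below Integer.MAX_VALUE (2^31 - 1),
    // which caps sizeMb at roughly 2047.
    public class FrameSizeCheck {
        public static void main(String[] args) {
            int sizeMb = 10 * 1024;  // e.g., trying to pass "10 GB"
            long bytes = (long) sizeMb * 1024L * 1024L;
            if (bytes > Integer.MAX_VALUE) {
                throw new IllegalArgumentException(
                    "Frame size " + sizeMb + "MB does not fit in an int");
            }
        }
    }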

It appears that the default JavaWordCount example uses a minSplits value of 1 (ctx.textFile(args[1], 1)). This really means the number of partitions will be max(1, # HDFS blocks in the file). If you have an HDFS block of ~64MB and all distinct words, the resulting task result may be around the same size, which is >10MB.
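As a quick way to see this, a sketch assuming a JavaSparkContext named ctx and the args array from JavaWordCount (the partition-count accessor may vary slightly across Spark versions):

    // Sketch: with minSplits = 1 and an input that fits in one ~64MB HDFS
    // block, Spark creates a single partition, so one task must serialize
    // and return the word counts for the entire file in a single result.
    JavaRDD<String> lines = ctx.textFile(args[1], 1);
    System.out.println("partitions: " + lines.rdd().partitions().length);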

You have two complementary solutions (a combined sketch follows the list):

1. Increase the value of minSplits to reduce the size of any single task result, e.g., ctx.textFile(args[1], 256).

2. Increase the Akka frame size by a small amount (e.g., to 20-70MB).
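Here is a sketch of both adjustments together. The values 256 and 20 are illustrative; spark.akka.frameSize is specified in MB on the Spark versions that ship task results over Akka, and SparkConf is the 0.9+ way to set it (earlier releases take it as a Java system property instead; imports omitted):

    // Option 2: raise the frame size before the context is created.
    // On pre-0.9 releases: System.setProperty("spark.akka.frameSize", "20");
    SparkConf conf = new SparkConf()
        .setAppName("JavaWordCount")
        .set("spark.akka.frameSize", "20");   // MB; 20-70 is plenty here
    JavaSparkContext ctx = new JavaSparkContext(conf);

    // Option 1: split the input more finely so each task result is smaller.
    JavaRDD<String> lines = ctx.textFile(args[1], 256);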

Please note that this issue, while annoying, is in good part due to the example's lack of realism. In actual usage you very rarely call collect() in Spark, as that pulls all of your output data onto the driver machine. Much more likely you'd save to an HDFS file, or compute just the top 100 words, neither of which has this problem.
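For instance, a sketch of those two alternatives, assuming a JavaPairRDD<String, Integer> named counts from the word-count pipeline (the output path is an example, and I use Java 8 syntax for brevity):

    // Keep the results distributed instead of collecting them:
    counts.saveAsTextFile("hdfs:///user/you/wordcounts");

    // Or pull back only a bounded summary, e.g., the 100 most frequent
    // words. The comparator must be Serializable, hence the cast.
    List<Tuple2<String, Integer>> top100 = counts.top(
        100,
        (Comparator<Tuple2<String, Integer>> & Serializable)
            (a, b) -> Integer.compare(a._2(), b._2()));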

(One final note about your configuration: the Spark Worker is simply responsible for spawning Executors, which do the actual computation. As such, it is typical not to change the Worker memory at all [it needs very little] and instead give the majority of each machine's memory to the Executors. If each machine has 16 GB of RAM and 4 cores, for example, you might set spark.executor.memory between 2 and 3 GB, for a total of 8-12 GB used by Spark.)
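A sketch of that layout; spark.executor.memory is a real setting, while leaving the standalone Worker at its SPARK_WORKER_MEMORY default is an assumption about your deployment mode:

    // Sketch: on a 16 GB / 4-core machine, give each executor 2-3 GB and
    // leave the Worker process itself at its small default heap.
    SparkConf conf = new SparkConf()
        .setAppName("JavaWordCount")
        .set("spark.executor.memory", "2g");
    JavaSparkContext ctx = new JavaSparkContext(conf);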
