Builder

Create a builder, specifying the number of workers (Spark executors * threads per executor) to use.
Note: this should match the configuration of the cluster.

It is also necessary to specify how many examples are in each DataSet that appears in the RDD<DataSet>
or JavaRDD<DataSet> used for training.
The two most common cases:
(a) Preprocessed data pipelines will often load binary DataSet objects with N > 1 examples in each; in this case,
rddDataSetNumExamples should be set to N
(b) "In line" data pipelines (for example, CSV String -> record reader -> DataSet just before training) will
typically have exactly 1 example in each DataSet object. In this case, rddDataSetNumExamples should be set to 1

Parameters:

numWorkers - Number of Spark execution threads in the cluster. May be null. If null: number of workers will
be obtained from JavaSparkContext.defaultParallelism(), which should provide the number of cores
in the cluster.

rddDataSetNumExamples - Number of examples in each DataSet object in the RDD<DataSet>
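For illustration, a minimal sketch of constructing the builder for both cases above; the concrete values (32 examples per DataSet, 8 workers) are assumptions, not recommendations:

    import org.deeplearning4j.spark.impl.paramavg.ParameterAveragingTrainingMaster;

    // Case (a): preprocessed pipeline, N = 32 examples per DataSet (illustrative value);
    // the worker count is taken from JavaSparkContext.defaultParallelism()
    ParameterAveragingTrainingMaster.Builder builderA =
            new ParameterAveragingTrainingMaster.Builder(32);

    // Case (b): "in line" pipeline, 1 example per DataSet, with an explicit
    // worker count of 8 (illustrative value)
    ParameterAveragingTrainingMaster.Builder builderB =
            new ParameterAveragingTrainingMaster.Builder(8, 1);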

Method Detail

trainingHooks

Adds training hooks to the master.
The training master will set up the workers
with the desired hooks for training.
This allows for things like parameter servers
and asynchronous updates, as well as collecting statistics.

Parameters:

trainingHooks - the training hooks to add

Returns:

the builder

trainingHooks

Adds training hooks to the master.
The training master will set up the workers
with the desired hooks for training.
This allows for things like parameter servers
and asynchronous updates, as well as collecting statistics.
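As a sketch, attaching a hook via the varargs overload. MyStatsHook is a hypothetical TrainingHook implementation, not part of the API; the import path is as in DL4J's Spark module:

    import org.deeplearning4j.spark.api.TrainingHook;

    TrainingHook statsHook = new MyStatsHook(); // hypothetical implementation defined elsewhere
    builder.trainingHooks(statsHook);           // a Collection<TrainingHook> overload also exists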

averagingFrequency

Frequency with which to average worker parameters. Note: too high or too low can be bad for different reasons.
- Too low (such as 1) can result in a lot of network traffic
- Too high (>> 20 or so) can result in accuracy issues or problems with network convergence

Parameters:

averagingFrequency - Frequency (in number of minibatches of size 'batchSizePerWorker') to average parameters
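For example, averaging every 5 minibatches, a value in the middle of the range discussed above (illustrative):

    builder.averagingFrequency(5); // average every 5 minibatches of size batchSizePerWorker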

aggregationDepth

The number of levels in the aggregation tree for parameter synchronization. (default: 2)
Note: For large models trained with many partitions, increasing this number
will reduce the load on the driver and help prevent it from becoming a bottleneck.
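For a large model trained with many partitions, one might deepen the tree from the default (the value 3 is illustrative):

    builder.aggregationDepth(3); // deeper aggregation tree; less aggregation work lands on the driver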

saveUpdater

Set whether the updater (i.e., historical state for momentum, AdaGrad, etc.) should be saved.
NOTE: This can double (or more) the amount of network traffic in each direction, but might
improve network training performance (and can be more stable for certain updaters such as adagrad).

This is enabled by default.

Parameters:

saveUpdater - If true: retain the updater state (default). If false, don't retain (updaters will be
reinitialized in each worker after averaging).
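For example:

    builder.saveUpdater(true);     // default: keep momentum/AdaGrad/etc. state across averaging steps
    // builder.saveUpdater(false); // discard it; updaters are reinitialized in each worker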

repartionData

Set if/when repartitioning should be conducted for the training data.
Default value: always repartition (if required to guarantee correct number of partitions and correct number
of examples in each partition).
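A sketch using the Repartition enum (enum values as in DL4J's Spark module):

    import org.deeplearning4j.spark.api.Repartition;

    builder.repartionData(Repartition.Always); // the default; Repartition.Never disables repartitioning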

storageLevel

Set the storage level for RDD<DataSet>s.
Default: StorageLevel.MEMORY_ONLY_SER() - i.e., store in memory, in serialized form.
To disable RDD persistence, use null.

Note: Spark's StorageLevel.MEMORY_ONLY() and StorageLevel.MEMORY_AND_DISK() can be problematic when
it comes to off-heap data (which DL4J/ND4J uses extensively). Spark does not account for off-heap memory
when deciding if/when to drop blocks to ensure enough free memory; consequently, for DataSet RDDs that are
larger than the total amount of (off-heap) memory, this can lead to OOM issues. Put another way: Spark counts
the on-heap size of DataSet and INDArray objects only (which is negligible) resulting in a significant
underestimate of the true DataSet object sizes. More DataSets are thus kept in memory than we can really afford.
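For example (MEMORY_ONLY_SER is a standard Spark StorageLevel; the second line disables persistence):

    import org.apache.spark.storage.StorageLevel;

    builder.storageLevel(StorageLevel.MEMORY_ONLY_SER()); // the default
    // builder.storageLevel(null);                        // no RDD persistence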

exportDirectory

Set the directory to which the training data is exported.
Default: null -> use {hadoop.tmp.dir}/dl4j/. In this case, data is exported to {hadoop.tmp.dir}/dl4j/SOME_UNIQUE_ID/
If you specify a directory, the directory {exportDirectory}/SOME_UNIQUE_ID/ will be used instead.
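Putting the pieces together, a minimal end-to-end sketch. The network configuration, training RDD, minibatch size, and export path are all assumptions supplied by the caller, not prescribed values:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
    import org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer;
    import org.deeplearning4j.spark.impl.paramavg.ParameterAveragingTrainingMaster;
    import org.nd4j.linalg.dataset.DataSet;

    public class SparkTrainingSketch {
        // 'conf' and 'trainingData' are assumed to be constructed elsewhere
        static void train(JavaSparkContext sc, MultiLayerConfiguration conf, JavaRDD<DataSet> trainingData) {
            ParameterAveragingTrainingMaster tm =
                    new ParameterAveragingTrainingMaster.Builder(1) // 1 example per DataSet object
                            .batchSizePerWorker(32)                 // illustrative minibatch size
                            .averagingFrequency(5)
                            .saveUpdater(true)
                            .exportDirectory("/tmp/dl4j-export/")   // hypothetical path
                            .build();

            SparkDl4jMultiLayer sparkNet = new SparkDl4jMultiLayer(sc, conf, tm);
            sparkNet.fit(trainingData); // trains the network using parameter averaging
        }
    }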