Managed state: Samza manages snapshotting and restoration of a stream processor's state. When the processor is restarted, Samza restores its state to a consistent snapshot. Samza is built to handle large amounts of state (many gigabytes per partition).

Fault tolerance: Whenever a machine in the cluster fails, Samza works with YARN to transparently migrate your tasks to another machine.

Durability: Samza uses Kafka to guarantee that messages are processed in the order they were written to a partition, and that no messages are ever lost.

Building Samza

To build Samza from a git checkout, run:

./gradlew clean build

To build Samza from a source release, you first need to download the Gradle wrapper script (gradlew) used above. This bootstrapping step requires Gradle to be installed on the machine where you build. Gradle is available through most package managers or directly from its website. To bootstrap the wrapper, run:

gradle -b bootstrap.gradle

After the bootstrap script has completed, the regular gradlew instructions below are available.

Scala and YARN

By default, Samza builds with Scala 2.11 or 2.12 and YARN 2.6.1. Use the -PscalaSuffix switch to change the Scala version; builds are supported with Scala 2.11 and 2.12.
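
As a rough sketch of how the -PscalaSuffix switch combines with the build command shown earlier, the snippet below assembles the command line for one Scala version. The variable names SCALA_SUFFIX and BUILD_CMD are illustrative only, and 2.12 is simply one of the two supported versions chosen as an example:

```shell
# Illustrative only: SCALA_SUFFIX and BUILD_CMD are hypothetical helper names.
# 2.12 is chosen as an example; 2.11 is the other supported value.
SCALA_SUFFIX=2.12

# Combine the switch with the regular wrapper build command.
BUILD_CMD="./gradlew -PscalaSuffix=${SCALA_SUFFIX} clean build"

# Print the command; run it from the root of a Samza checkout.
echo "${BUILD_CMD}"
```

Running the printed command from the root of the checkout should build Samza against the selected Scala version.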

To modify a job's checkpoint (this assumes that the job is not currently running), provide a file with the new offset for each partition, in the format systems.<system>.streams.<topic>.partitions.<partition>=<offset>: