Jobs

Overview

A job, in contrast to a replication controller, runs a pod with any number of
replicas to completion. A job tracks the overall progress of a task and updates
its status with information about active, succeeded, and failed pods. Deleting a
job cleans up any pod replicas it created. Jobs are part of the Kubernetes API
and can be managed with oc commands like other object types.
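
As a rough illustration of what a job object contains, the following is a minimal sketch of a job definition. The name pi matches the job used in the scaling command later in this topic; the image, command, and field values are assumptions chosen for illustration, not a definition taken from this document.

```yaml
# Minimal job sketch; image, command, and counts are illustrative assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  parallelism: 1        # number of pod replicas running in parallel
  completions: 1        # number of successful pod completions needed
  template:
    metadata:
      name: pi
    spec:
      containers:
      - name: pi
        image: perl     # assumed image
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: OnFailure
```

A definition like this could be saved to a file and created with oc create -f, in the same way as other object types.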

Known Limitations

The restart policy in the job specification applies only to the pods, not to the job controller. The job controller itself is hard-coded to keep retrying jobs to completion.

As such, restartPolicy: Never or --restart=Never results in the same behavior as restartPolicy: OnFailure or --restart=OnFailure. That is, when a job fails, it is restarted automatically until it succeeds (or is manually discarded). The policy determines only which subsystem performs the restart.

With the Never policy, the job controller performs the restart. With each attempt, the job controller increments the number of failures in the job status and creates new pods. This means that with each failed attempt, the number of pods increases.

With the OnFailure policy, the kubelet performs the restart. These attempts do not increment the number of failures in the job status. In addition, the kubelet retries failed jobs by starting pods on the same nodes.
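
The two policies differ only in one field of the pod template. The fragment below is a sketch of where that field sits; the container name, image, and command are hypothetical and exist only to make the fragment complete.

```yaml
# Fragment of a job specification; container details are hypothetical.
spec:
  template:
    spec:
      containers:
      - name: worker                     # hypothetical container name
        image: busybox                   # hypothetical image
        command: ["sh", "-c", "exit 1"]  # always fails, so a restart is needed
      # Never: the job controller performs the restart and creates new pods,
      # incrementing the number of failures in the job status.
      restartPolicy: Never
      # OnFailure (alternative): the kubelet performs the restart on the same
      # node, and the number of failures in the job status is not incremented.
      # restartPolicy: OnFailure
```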

Scaling a Job

A job can be scaled up or down by using the oc scale command with the
--replicas option, which, in the case of jobs, modifies the
spec.parallelism parameter. This changes the number of pod replicas that run
in parallel to execute the job.

The following command sets the parallelism parameter of a job named pi to
three:

$ oc scale job pi --replicas=3
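
Assuming the command succeeds, the relevant part of the job specification would then look roughly like the following sketch. All other fields are omitted here.

```yaml
# Fragment of the pi job after scaling; other fields omitted.
spec:
  parallelism: 3   # changed by: oc scale job pi --replicas=3
```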

Scaling replication controllers also uses the oc scale command with the
--replicas option, but instead changes the replicas parameter of a
replication controller configuration.

Setting Maximum Duration

When defining a job, you can set its maximum duration with the
activeDeadlineSeconds field. The value is specified in seconds and is not
set by default. When it is not set, no maximum duration is enforced.

The maximum duration is counted from the time when the first pod is scheduled in
the system, and defines how long a job can be active. It tracks the overall time
of an execution and is independent of the number of completions (the number of
pod replicas needed to execute a task). After reaching the specified timeout,
the job is terminated by OpenShift Container Platform.

The following example shows the part of a job specification that sets the
activeDeadlineSeconds field to 30 minutes:

spec:
  activeDeadlineSeconds: 1800

Job Backoff Failure Policy

A job can be considered failed after a set number of retries, for example
because of a logical error in its configuration or for similar reasons. To
specify the number of retries for a job, set the .spec.backoffLimit property.
This field defaults to six. Failed pods associated with the job are recreated by
the controller with an exponential backoff delay (10s, 20s, 40s …) capped at six
minutes. The limit is reset if no new failed pods appear between controller
checks.
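
The following fragment is a sketch of how the retry limit might be set. The value 4 is an arbitrary illustration; the comments spell out the delay sequence implied by the doubling and the six-minute cap described above.

```yaml
# Fragment of a job specification; the value 4 is illustrative.
spec:
  backoffLimit: 4   # number of retries before the job is considered failed
  # Delays between retries double from 10s:
  #   10s, 20s, 40s, 80s, 160s, 320s
  # Any longer delay is capped at 360s (six minutes).
```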