Scenario Setup

To reproduce this scenario, modify a single batch of a previous job run so that RUN_STATUS=-1 and WHEN_COMPLETED=NULL.

Problem Statement

An issue is reported that archiving is failing. But what does this mean?

1. Are the Archive Process and Instance both running?

2. Will an Archive Job start but fail shortly after?

3. Can an Archive Job be created and saved?

Whenever the answer to #2 is Yes, the answer to #3 will be No. Per the Administration Guide, "If a job fails, no other scheduled job can run in the system till the failure of the job is resolved and the failed job is restarted manually." Creating a new Archive Job won't help: the failed job run must be addressed and allowed to complete first.

Why Does the Job Fail?

There isn't much that can be done from the UI. You can enable DEBUG-level logs, but they typically contain many benign exceptions, errors, and failures that are irrelevant to the problem at hand. For this scenario, query the eGActiveDB directly.

1. Start by finding the JOB_ID from the Job table.

select * from egpl_arch_job

2. Next, determine which JOB_RUN_ID corresponds to the failed Job Run.

select * from egpl_arch_job_run where job_id=<JOB_ID_FROM_1>

Notice the JOB_RUN with RUN_STATUS=-1. RUN_STATUS should always be 0 when a job is not running. We will focus on this JOB_RUN_ID.
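
If the job has many runs, the failed one can be isolated directly instead of scanning the full result set. The following is a minimal sketch that uses only the columns already named here (JOB_ID, JOB_RUN_ID, RUN_STATUS); the placeholder follows the same convention as the other queries and must be replaced with the real value.

```sql
-- Sketch: return only the failed runs (RUN_STATUS = -1) for the job found above.
select job_run_id, run_status
from egpl_arch_job_run
where job_id = <JOB_ID_FROM_1>
  and run_status = -1
```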

3. List out the Batches from that particular Job Run.

select * from egpl_arch_job_run_batch where job_run_id=<JOB_RUN_ID_FROM_2>

Notice that most of the batches have a value of NULL for WHEN_COMPLETED, and one in particular has a RUN_STATUS of -1. RUN_STATUS should always be 0 when a job is not running.

RUN_STATUS 1 = RUNNING

RUN_STATUS -1 = FAILED

RUN_STATUS 0 = COMPLETED
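
To make the batch list easier to read, the status codes above can be decoded in the query itself. This is only a sketch: it assumes the three codes listed are the only ones of interest and labels anything else as UNKNOWN, and the column alias run_status_text is made up for illustration.

```sql
-- Sketch: list the batches of the failed run with RUN_STATUS decoded per the legend above.
select batch_id,
       run_status,
       case run_status
           when 1  then 'RUNNING'
           when -1 then 'FAILED'
           when 0  then 'COMPLETED'
           else 'UNKNOWN'          -- any value not covered by the legend
       end as run_status_text,     -- illustrative alias, not a real column
       when_completed
from egpl_arch_job_run_batch
where job_run_id = <JOB_RUN_ID_FROM_2>
order by batch_id
```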

4. Dig deeper into the failed batch's steps.

select * from egpl_arch_batch_step where batch_id=<BATCH_ID_FROM_3>

All of the Steps in the Batch have RUN_STATUS=0, meaning they all completed successfully. For some reason, the Batch did not reflect this, which caused the whole Job Run to fail.
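
When a batch has many steps, a quick count per status can confirm this without reading every row. A minimal sketch, using only the table and columns referenced above:

```sql
-- Sketch: count the steps of the failed batch by RUN_STATUS.
select run_status, count(*) as step_count
from egpl_arch_batch_step
where batch_id = <BATCH_ID_FROM_3>
group by run_status
```

If the result is a single row with run_status = 0, every step completed and only the batch-level status is out of sync. Any row with a different run_status means this scenario does not apply as described.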

How do we make the job Complete?

We can manually update the Batch and Job Run to take care of this. We must also purge the list of objects tied to that problematic job run.

update egpl_arch_job_run_batch set run_status=0 where batch_id=<BATCH_ID_FROM_3> and job_run_id=<JOB_RUN_ID_FROM_2>
update egpl_arch_job_run set run_status=0 where job_run_id=<JOB_RUN_ID_FROM_2> and job_id=<JOB_ID_FROM_1>
delete from egpl_arch_object_list where job_run_id=<JOB_RUN_ID_FROM_2>
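
Because these statements modify the database directly, it may be safer to run them inside a transaction and check the result before committing. The sketch below assumes Microsoft SQL Server (T-SQL) transaction syntax, which is an assumption about the environment; other databases use different keywords. The statements themselves are the same three shown above.

```sql
-- Sketch, assuming T-SQL: apply the fix in one transaction and verify before committing.
begin transaction;

update egpl_arch_job_run_batch set run_status=0
where batch_id=<BATCH_ID_FROM_3> and job_run_id=<JOB_RUN_ID_FROM_2>;

update egpl_arch_job_run set run_status=0
where job_run_id=<JOB_RUN_ID_FROM_2> and job_id=<JOB_ID_FROM_1>;

delete from egpl_arch_object_list
where job_run_id=<JOB_RUN_ID_FROM_2>;

-- Verify: neither query should return a row with RUN_STATUS = -1.
select batch_id, run_status from egpl_arch_job_run_batch
where job_run_id=<JOB_RUN_ID_FROM_2>;

select job_run_id, run_status from egpl_arch_job_run
where job_run_id=<JOB_RUN_ID_FROM_2>;

-- If the results look right:
commit transaction;
-- Otherwise, undo everything instead:
-- rollback transaction;
```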

Restart (cycle) the Archive Process and Instance, then confirm that the job run is listed as "Archive Completed".

Resolution

For the job to have reached this state, something is clearly wrong with it, so it should not be trusted again. Set this job to Not Active and create a new job to take its place.