Monitoring scheduled jobs

TL;DR: If you have been struggling to find a solution that keeps an eye on the scheduled jobs in your application, just go and grab the latest Plumbr Agent and start monitoring the job activity.

In case you are wondering what such jobs are in the first place and whether or not it is even important to monitor them, bear with me for the following pages and you will understand the concept, related problems and our solution.

What are scheduled jobs?

You can think of a job as a single encapsulated function call. A job can, for example, check whether there are any new images in a particular folder in the file system and resize or crop all the images found to fit the layout used on your website.

If we schedule the image resizing function to run once a minute, we have an example of a scheduled job. The schedule can be set to anything: jobs can be scheduled to run hourly, daily, or only on 29th February in leap years.
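As a minimal sketch of such a schedule, the standard java.util.concurrent.ScheduledExecutorService can run a function at a fixed rate. The names ScheduledResizeSketch and resizeNewImages are hypothetical, the job body is a placeholder that only counts its runs, and the period is shortened to milliseconds so the example finishes quickly.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ScheduledResizeSketch {
    // Counts how many times the job has run, so the effect of the schedule is visible.
    private final AtomicInteger runs = new AtomicInteger();

    // Hypothetical job body: a real one would scan a folder and resize new images.
    void resizeNewImages() {
        runs.incrementAndGet();
    }

    // Schedules the job at a fixed rate and waits until it has run at least `times` times.
    int runScheduled(int times, long periodMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch done = new CountDownLatch(times);
        try {
            scheduler.scheduleAtFixedRate(() -> {
                resizeNewImages();
                done.countDown();
            }, 0, periodMillis, TimeUnit.MILLISECONDS);
            done.await(5, TimeUnit.SECONDS); // give up waiting after five seconds
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            scheduler.shutdownNow(); // stop the schedule once we have seen enough runs
        }
        return runs.get();
    }

    public static void main(String[] args) {
        int count = new ScheduledResizeSketch().runScheduled(3, 10);
        System.out.println("Job ran " + count + " times");
    }
}
```

In a real application the period would be one minute (TimeUnit.MINUTES) and the scheduler would live as long as the application does.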

The aspect to notice in the example above is the way the resize function is called. To give you a counter-example, let’s redesign the same application to call the resize function in the context of a user interaction. If the very same thread that uploads the image called the image resize function directly, the call would not be scheduled but instead triggered by the user interacting with the application.

But why on earth would you make your app more complex by including job schedulers? Why couldn’t you do all the work during the user-initiated transaction? Indeed, if you can, you should do just that. Scheduled jobs are usually worth considering when either of the following patterns is detected in the application:

Certain operations are simply not linked to a particular user interaction. Such operations typically include archiving old data, extracting data for OLAP processes, running eventual consistency checks on transactions, etc.

Some operations are too expensive to be run in the context of an individual user interaction. When you have a high-traffic website where similar user interactions all include an expensive operation (such as an image resize), it might make sense to offload this particular functionality from the user interaction and instead batch process all the images uploaded during a certain period at once. Doing so will improve the perceived latency for the end user, as the image upload now finishes faster.
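The offloading pattern can be sketched as follows, assuming a simple in-memory queue between the upload handler and the job. BatchResizeSketch, onUpload and resizeBatch are hypothetical names, and the resize itself is replaced by a placeholder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class BatchResizeSketch {
    // Uploads wait here until the next scheduled run picks them up.
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
    private final List<String> resized = new ArrayList<>();

    // Called on the user's thread: only records the upload and returns immediately.
    void onUpload(String imageName) {
        pending.add(imageName);
    }

    // Called by the scheduler: processes everything uploaded since the last run.
    int resizeBatch() {
        List<String> batch = new ArrayList<>();
        pending.drainTo(batch);
        for (String image : batch) {
            resized.add(image + "@resized"); // placeholder for the expensive resize
        }
        return batch.size();
    }

    public static void main(String[] args) {
        BatchResizeSketch sketch = new BatchResizeSketch();
        sketch.onUpload("cat.png");
        sketch.onUpload("dog.png");
        System.out.println("Resized in this run: " + sketch.resizeBatch()); // prints 2
    }
}
```

The user-facing call does nothing but enqueue, which is why the upload finishes faster; the cost moves into the periodically executed resizeBatch().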

So I guess you are now convinced that jobs have their role to fulfill in many of the applications out there. The rise of microservices has decoupled some of the jobs from the monoliths, but the nature of these microservice-embedded jobs has not changed. They are still just functions launched periodically.

The guesstimate is also confirmed by our statistics: before building the feature, we analyzed the data and discovered that 26% of the applications in our sample did indeed have scheduled jobs embedded.

Why would I need to monitor such jobs?

Many of the scheduled jobs perform truly important functions without which the application would start malfunctioning. To understand this, let’s take a look at the following examples:

As a first example, let us assume we have a log file rolling job that runs every midnight. The job copies the log file for the previous day to a separate file and creates a new log file for the next day. The same process checks whether the number of log files exceeds seven, since logs are kept for seven days. If more log files are present, the oldest ones are deleted. If this job fails to run successfully for a few days, it is likely that nothing too bad happens. The storage is likely provisioned to accommodate a few more days of logs, and appending to the current log file is unlikely to suffer from its increased size. So in this case the failing job might not impact the end users, at least not immediately.

As a second example, take a job responsible for batch-processing the invoices generated throughout the day. The job sends the batch to an external service provider responsible for delivering each invoice to the recipient through the recipient’s channel of choice. If this job fails to deliver some or all of the invoices, those invoices do not reach their recipients, resulting in a direct business impact.

As a third example, consider a scheduled job designed to aggregate raw time-series data by minute and by hour for faster time range queries. If such aggregations are too slow, for example if the by-minute-aggregation takes more than a minute to complete, the aggregation starts to fall behind, resulting in inconsistencies in the user experience.
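To put a number on how quickly the third example falls behind, here is a small arithmetic sketch. It assumes that runs never overlap, that each run takes the same time, and that a late run starts as soon as the previous one finishes; the class and method names are hypothetical.

```java
class AggregationLagSketch {
    // Seconds by which the next run starts late after `completedRuns` back-to-back runs.
    static long startDelaySeconds(long periodSeconds, long durationSeconds, long completedRuns) {
        // Each run that takes longer than its period pushes the next start back by the overrun.
        long overrunPerRun = Math.max(0, durationSeconds - periodSeconds);
        return overrunPerRun * completedRuns;
    }

    public static void main(String[] args) {
        // A by-minute aggregation that takes 75 seconds, observed over 60 runs.
        System.out.println(startDelaySeconds(60, 75, 60)); // prints 900
    }
}
```

Under these assumptions, a by-minute aggregation that needs 75 seconds is already 15 minutes behind after an hour's worth of runs, while any duration at or below the period keeps the delay at zero.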

So it is clear that jobs play an important role in many architectures. It is also clear that it makes sense to keep an eye on all the jobs, making sure they neither fail nor complete too slowly.

Each Runnable or Callable instance submitted to a scheduler is a separate job that Plumbr tracks. Plumbr monitors the payload method (run() or call()) of the job for both correctness and duration. Based on this, a job instance can end up in one of three statuses:

Failed: If any uncaught exception is thrown out of the payload method, the exception is recorded as the root cause and the job instance is flagged as Failed.

Slow: If the payload method of the job instance completed slower than the threshold set for the particular job, the instance is flagged as Slow.

Success: If the payload method completed without any exceptions and faster than the threshold set, the instance is flagged as Success.
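The three statuses can be illustrated with a small wrapper around a Callable. This is only a sketch of the classification rule described above, not Plumbr's actual implementation; JobStatusSketch, runAndClassify and the millisecond threshold parameter are assumptions made for the example.

```java
import java.util.concurrent.Callable;

class JobStatusSketch {
    enum Status { SUCCESS, SLOW, FAILED }

    // Runs the payload once and classifies the outcome by exception and duration.
    static Status runAndClassify(Callable<?> payload, long thresholdMillis) {
        long start = System.nanoTime();
        try {
            payload.call();
        } catch (Exception e) {
            // Any exception escaping the payload marks the instance as failed.
            return Status.FAILED;
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return elapsedMillis > thresholdMillis ? Status.SLOW : Status.SUCCESS;
    }

    public static void main(String[] args) {
        System.out.println(runAndClassify(() -> "done", 10_000));
        System.out.println(runAndClassify(() -> { Thread.sleep(50); return null; }, 10));
        System.out.println(runAndClassify(() -> { throw new IllegalStateException("boom"); }, 10_000));
    }
}
```

In this sketch a failure takes precedence over slowness: a payload that throws is reported as failed regardless of how long it ran.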

Each job is given a name so you can immediately distinguish between different jobs. With the Spring framework, if the @Scheduled annotation is used on a method, the method name together with the name of its containing class is used. In all other cases, the name of the class implementing Runnable or Callable is used. To distinguish jobs from services of a transactional nature, the job identifier is prefixed with ‘JOB:’.

When this bean is deployed to an application monitored by Plumbr, the scheduled method invocations to saveMinuteStats() and purgeDeadJvms() are detected as jobs by Plumbr:

Opening the details of a job reveals a familiar picture: the time-series data about the job invocations, a latency distribution chart representing how long the jobs took to complete, and root causes for slow and failed job instances.

So now that you are familiar with how we can expose even more insight from your applications, just go ahead and either upgrade the Plumbr Agent if you are an existing user of Plumbr or grab your free trial to start really improving the performance of your applications.