JobMonitor can now deploy multiple threads for faster job-status polling. Use 'gridmix.job-monitor.thread-count' to set the number of threads. Stress mode now relies on updates from the job monitor instead of polling for job status itself. Failures in job submission are now reported to the statistics module and, ultimately, to the user via the summary.

Tags:

gridmix3 stress

Description

Gridmix STRESS mode can be improved as follows:
1. Reduce the sleep time in JobMonitor and/or make it configurable.
2. Calculate the map and reduce load in StressJobFactory in a single loop.
3. Update the overload status inline, from the job submitter thread.
4. Avoid unnecessary progress checks, which in turn cause delays.

Activity

Attaching a patch that improves the Gridmix STRESS logic. Changes are as follows:

StressJobFactory:

Update and submission now happen in one thread. Earlier, job submission and status polling/load calculation happened in separate threads, so slower status polling throttled job submission.

No RPC calls are made in stress mode. Stress mode relies on the status cached by JobMonitor.

Load calculation logic optimized to return early if the load is considerably lower or considerably higher.

If the total number of tasks submitted (without considering their progress) is lower than the max permissible limit, the cluster is considered underloaded.

If at any time the actual load (considering the task progress) is higher than the permissible limit, the cluster is considered overloaded without iterating over the remaining jobs.
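
The two early-return checks described above could be sketched roughly as follows. This is only an illustration, not the code in the patch: the class and method names are hypothetical, the "unfinished fraction" load formula is an assumption, and treating a load exactly at the limit as underloaded is also an assumption.

```java
import java.util.List;

/** Illustrative sketch; class, field, and method names are hypothetical. */
final class LoadCheckSketch {
    enum Load { UNDERLOADED, OVERLOADED }

    /** Cached view of one job: its task count and progress in [0, 1]. */
    static final class CachedJob {
        final int tasks;
        final float progress;

        CachedJob(int tasks, float progress) {
            this.tasks = tasks;
            this.progress = progress;
        }
    }

    static Load classify(List<CachedJob> jobs, int maxTasks) {
        // Cheap check: count every submitted task, ignoring progress.
        int submitted = 0;
        for (CachedJob job : jobs) {
            submitted += job.tasks;
        }
        if (submitted < maxTasks) {
            return Load.UNDERLOADED;
        }

        // Costlier check: weigh each job by its unfinished fraction (an assumed
        // formula) and stop as soon as the limit is exceeded.
        float pending = 0f;
        for (CachedJob job : jobs) {
            pending += job.tasks * (1.0f - job.progress);
            if (pending > maxTasks) {
                return Load.OVERLOADED; // remaining jobs are not inspected
            }
        }
        // Assumption: a load at or below the limit counts as underloaded.
        return Load.UNDERLOADED;
    }
}
```

With two cached jobs of 10 tasks each, one at 0% and one at 50% progress, a limit of 30 returns early as underloaded, while a limit of 12 is reported as overloaded once the second job is weighed.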

Maintains a list of blacklisted jobs that will be ignored.

JobMonitor:

This is now multi-threaded. The total number of threads can be specified using the 'gridmix.polling.threads' config key.

The sleep interval is now configurable. The sleep time can be set using 'gridmix.polling.delay'.

Invokes statistics for lost jobs. Not doing this results in missing status-update events.
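
As a rough illustration of the JobMonitor behaviour described above, the sketch below polls queued jobs from a configurable number of threads with a configurable sleep. It is not the actual JobMonitor: java.util.Properties stands in for the Hadoop configuration, the default values and the millisecond unit are assumptions, the status strings are made up, and the fetchStatus function is a hypothetical stand-in for the real status call. The key names are the ones quoted in this comment; the release note names 'gridmix.job-monitor.thread-count' for the thread count.

```java
import java.util.Map;
import java.util.Properties;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Illustrative sketch; not the actual JobMonitor. */
final class PollingMonitorSketch {
    private final Queue<String> pending = new ConcurrentLinkedQueue<>();
    private final Map<String, String> cachedStatus = new ConcurrentHashMap<>();
    private final Function<String, String> fetchStatus; // stand-in for the status call
    private final int threads;
    private final long delayMillis;
    private ExecutorService pool;

    PollingMonitorSketch(Properties conf, Function<String, String> fetchStatus) {
        // Defaults ("1", "500") and the millisecond unit are assumptions.
        this.threads = Integer.parseInt(conf.getProperty("gridmix.polling.threads", "1"));
        this.delayMillis = Long.parseLong(conf.getProperty("gridmix.polling.delay", "500"));
        this.fetchStatus = fetchStatus;
    }

    void add(String jobId) {
        pending.add(jobId);
    }

    /** Polls one queued job; returns false when the queue is empty. */
    boolean pollOnce() {
        String jobId = pending.poll();
        if (jobId == null) {
            return false;
        }
        String status = fetchStatus.apply(jobId);
        if (status == null) {
            status = "LOST"; // still publish an update so consumers are not left waiting
        }
        cachedStatus.put(jobId, status);
        if (!"DONE".equals(status) && !"LOST".equals(status)) {
            pending.add(jobId); // still running: poll again later
        }
        return true;
    }

    /** Starts the configured number of polling threads. */
    void start() {
        pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        pollOnce();
                        Thread.sleep(delayMillis);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }

    void stop() {
        if (pool != null) {
            pool.shutdownNow();
        }
    }

    String cached(String jobId) {
        return cachedStatus.get(jobId);
    }

    int threads() {
        return threads;
    }
}
```

Other modules would read cached(jobId) instead of issuing their own status calls, which is the pattern stress mode relies on.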

Statistics:

JobStats now caches the job status along with a) the running job handle, b) the total number of map tasks and c) the total number of reduce tasks. Other modules should rely on these cached status objects instead of polling for job status individually.

Statistics maintains a count of the total number of submitted maps and reduces.
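
A minimal sketch of the caching arrangement described above might look like the following. The class and method names are hypothetical, a plain string stands in for both the job handle and the job status, and the real Statistics and JobStats classes may differ in structure.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative sketch; not the actual Statistics class. */
final class StatisticsSketch {
    /** Per-job record: fixed task counts plus the most recent status. */
    static final class JobStatsSketch {
        final String jobHandle; // stand-in for the running job handle
        final int maps;
        final int reduces;
        private volatile String status; // null until a status is reported

        JobStatsSketch(String jobHandle, int maps, int reduces) {
            this.jobHandle = jobHandle;
            this.maps = maps;
            this.reduces = reduces;
        }

        void updateStatus(String newStatus) {
            status = newStatus;
        }

        String status() {
            return status;
        }
    }

    private final Map<String, JobStatsSketch> jobs = new ConcurrentHashMap<>();
    private final AtomicInteger submittedMaps = new AtomicInteger();
    private final AtomicInteger submittedReduces = new AtomicInteger();

    /** Registers a submitted job and adds its tasks to the running totals. */
    JobStatsSketch addJob(String jobHandle, int maps, int reduces) {
        JobStatsSketch stats = new JobStatsSketch(jobHandle, maps, reduces);
        jobs.put(jobHandle, stats);
        submittedMaps.addAndGet(maps);
        submittedReduces.addAndGet(reduces);
        return stats;
    }

    JobStatsSketch get(String jobHandle) {
        return jobs.get(jobHandle);
    }

    int submittedMaps() {
        return submittedMaps.get();
    }

    int submittedReduces() {
        return submittedReduces.get();
    }
}
```

Keeping the totals as counters means a cheap check on the number of submitted tasks does not need to walk every job.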

Execution Summarizer:

Now reports "Lost Jobs". These are the jobs for which the status is unavailable.

JobFactory:

Ignores reduce-only jobs. Currently, Gridmix is incapable of handling reduce-only jobs.

Modified TestGridmixSummary to test summary on lost jobs. Added TestGridmixStatistics to test the Statistics.java changes.

With this patch, the following are timed and reported:
1. Time taken to submit a job
2. Time taken to build job splits
3. Time taken to get the job's status

Amar Kamat
added a comment - 09/Feb/12 04:03
test-patch and ant-tests passed.