Ability to build on configurable subset of matrix configurations

Details

Description

It is sometimes useful to run a build on only a predefined subset of the slave
builders assigned to a multi-configuration job. For example, if the builds for
some operating systems failed, we may wish to rebuild only those once the code is fixed.

Activity

mdonohue
added a comment - 05/Apr/09 1:30 AM This problem isn't exclusive to slave selection, as I understand the matrix
build feature, so I've updated the summary.
Also see issue 3332, which would allow a canary configuration to run first, and
only run the rest of the matrix configurations if the canary succeeds. This
would cut down on massive failure notification.

Lars Kruse
added a comment - 11/Apr/11 2:47 PM I've assigned this case to myself.
Christian Wolfgang and I are researching this, and we will present a design proposal for a simple and elegant solution to this problem within a few days.
We hope to get feedback on the design from the community before we start implementing.
Stay tuned!

Christian Wolfgang
added a comment - 14/Apr/11 2:09 PM Modifications to Jenkins-core
-----------------------------
The general idea of the design is to be able to differentiate between matrix configurations that reuse (copy) older builds (a basis build) and configurations that are actually being rebuilt, without disturbing the configurations of older projects that do not use reuse.
For this to work we need to alter the Jenkins core at the very lowest level, where the job is run. This is done in the abstract class Run, in the method run(). This method has two points of interest:
job.run(listener)
job.post(listener)
By cutting out these two calls, we can achieve a build that looks like it was run correctly, setting the build result as we want (i.e. the same result as the basis build). Once we can create a Run object whose values are set, we only need to copy the artifacts from the basis builds for the configurations that are not rebuilt. This can be achieved by adding some extra functionality to the AbstractBuild class. Currently, as can be seen in the code, the build directory of the basis build is hard-copied to the source build, leaving out only build.xml. The build.xml file is created by Jenkins after the job is run, and copying it would cause an exception.
To distinguish between builds that should be rebuilt and builds that should not, we need to add a property to the Run class. We have added a small class called ReuseRun that only holds an integer referencing the rebuild build. If this object is null or the integer is 0, the new functionality is left untouched. This object is null for all projects not using the rebuild functionality.
Though this is a prototype, we consider it a functional starting point, because Run objects will, by default, initialize the ReuseRun object to null. Thus only explicit calls to setReuse can change that, and the added code only takes effect when the object is not null.
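
The control flow described in this comment can be sketched roughly as follows. This is a minimal illustration, not the actual Jenkins core patch: ReuseRun and setReuse are the names used in the comment, while SketchRun, execute(), isReusing() and the log list are hypothetical stand-ins for the real Run class and its run() method. It also assumes that the integer identifies the basis build to copy from.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the ReuseRun class from the comment: it only holds a build number. */
final class ReuseRun {
    private final int basisBuildNumber; // assumption: 0 means "no reuse requested"

    ReuseRun(int basisBuildNumber) {
        this.basisBuildNumber = basisBuildNumber;
    }

    int getBasisBuildNumber() {
        return basisBuildNumber;
    }
}

/** Simplified stand-in for a run; this is NOT the real hudson.model.Run. */
class SketchRun {
    private ReuseRun reuse; // null by default, so projects that never call setReuse are unaffected
    final List<String> log = new ArrayList<>();

    void setReuse(ReuseRun reuse) {
        this.reuse = reuse;
    }

    boolean isReusing() {
        return reuse != null && reuse.getBasisBuildNumber() != 0;
    }

    void execute() {
        if (isReusing()) {
            // Skip the two "points of interest" and copy the basis build instead.
            log.add("copy basis build #" + reuse.getBasisBuildNumber() + " (except build.xml)");
            return;
        }
        log.add("run");  // corresponds to job.run(listener)
        log.add("post"); // corresponds to job.post(listener)
    }
}
```

With this shape, a run whose ReuseRun is null or 0 behaves exactly as before, which is the property the comment relies on for backward compatibility.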

Dean Yu
added a comment - 14/Apr/11 7:57 PM In the situation where the build you are re-using does not exist, you are failing the matrix build. Why is failing the build better than rebuilding that configuration?
I also see that you haven't provided UI changes yet. How do you think this functionality would be exposed? I imagine you could add a Rebuild action to the left side navigation, which would take the user to a page where they can pick the configuration subset they want to rebuild. Maybe there's another way to do this that I didn't consider.
A fundamental problem I see with this capability is that it requires a human to know that it's safe to only rebuild parts of the matrix. Going back to the original issue description, if the person doing the rebuild doesn't realize that the code change also affects another OS that doesn't get rebuilt, then you have failed to test a necessary configuration. This also then bubbles up into release management. You probably want to tag your version control at the point of release, but if some of your configurations were built with older versions and not rebuilt, you've lost that information.

Lars Kruse
added a comment - 14/Apr/11 9:45 PM Hi Dean, thanks for the feedback. Good questions.
Why fail rather than rebuild if the configuration is missing? We think of it like this: in the case where we need to make the choice, the user obviously asked for a reuse, not a rebuild. A rebuild can potentially take a long time, so it's likely that our guess will not always capture the user's intent. Therefore we choose to fail the build and print what happened to the log, simply because it's faster, and this way we're not assuming anything. Then the user (yes, there IS a human involved in this feature) can schedule another rebuild, only this time explicitly asking for a rebuild, not a reuse.
You are right about your assumptions regarding the UI (we will present a mock-up of what we have in mind very soon). In broad terms: a rebuild action takes you to a new form listing all the results in the matrix from the previous build, where each cell in the matrix has a small checkbox attached to it. The user checks the builds in the matrix he/she wants to rebuild; the other ones are reused. There will be shortcuts like "select all failed", "select all unstable", etc. When the user hits submit, the choices are traversed and all the builds that are to be reused will have the ReuseRun attribute set.
Our plan is to extend the core with the ability to reuse a build (as presented here), and then release the form that utilizes it as a separate (new) plugin.
...leading to your third question, or concern: it requires a knowing human! Yes! The scenario we see for this reuse/rebuild feature is:
In a large matrix, some (most) configurations have succeeded and a few have failed or are unstable. Going through the logs, it appears that something went wrong because of the slave's configuration or abilities (say a process hung, a network connection was lost, or it failed to get a license for a tool it depended on). Now you don't want to rebuild the entire matrix again (imagine that it takes several hours to do that). You just pick the failed builds and ask to reuse the successful ones. We imagine that the configurations that are rerun will still honour all the various other settings of the job; thus if the job picks the latest commit on a branch, and there is a new commit, then you are right that it's a different build than before.
We imagine that users must take the necessary steps in their job configurations to keep mistakes like this from happening. In our setup we're building a named commit, so it's safe for us.
The thing is, we realize that the behaviour we're looking for might be special, but the approach we took in implementing it (in AbstractBuild and Run) is generic. Other plugins could use the same feature to produce different behaviour, either by contributing to our plugin or by creating a new one. The design should ensure that no further extensions to the core are needed to make either work.
Hmmm, but to avoid future changes to the core on behalf of others who would rather rebuild than fail the build, I guess we could extend the ReuseRun object to carry such information (rebuildIfMissing) and act accordingly. We could set it to FALSE by default and get the same behaviour. Would that work?
Again, thanks for the feedback, it's very welcome!
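
As a rough illustration of the form handling and the proposed rebuildIfMissing flag, the decision for each configuration could look like the sketch below. Everything here is hypothetical (PreviousResult, PlannedAction, MatrixRebuildPlan and both method names are made up for this example); only the "select all failed" shortcut and the rebuildIfMissing idea come from the comment above.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical result of a configuration in the previous matrix build. */
enum PreviousResult { SUCCESS, UNSTABLE, FAILURE }

/** Hypothetical decision for a configuration in the new matrix build. */
enum PlannedAction { REBUILD, REUSE, FAIL }

final class MatrixRebuildPlan {
    private MatrixRebuildPlan() {}

    /** Shortcut such as "select all failed": every configuration with the given previous result. */
    static Set<String> selectAllWith(Map<String, PreviousResult> previous, PreviousResult wanted) {
        Set<String> selected = new LinkedHashSet<>();
        for (Map.Entry<String, PreviousResult> entry : previous.entrySet()) {
            if (entry.getValue() == wanted) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }

    /**
     * Checked configurations are rebuilt and the rest are reused. A configuration with no
     * previous build to reuse fails, unless rebuildIfMissing asks for a rebuild instead.
     */
    static Map<String, PlannedAction> plan(List<String> configurations,
                                           Map<String, PreviousResult> previous,
                                           Set<String> checked,
                                           boolean rebuildIfMissing) {
        Map<String, PlannedAction> plan = new LinkedHashMap<>();
        for (String configuration : configurations) {
            if (checked.contains(configuration)) {
                plan.put(configuration, PlannedAction.REBUILD);
            } else if (previous.containsKey(configuration)) {
                plan.put(configuration, PlannedAction.REUSE);
            } else {
                plan.put(configuration, rebuildIfMissing ? PlannedAction.REBUILD : PlannedAction.FAIL);
            }
        }
        return plan;
    }
}
```

Defaulting rebuildIfMissing to false keeps the fail-fast behaviour argued for above, while a caller that prefers a rebuild can opt in without further core changes.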

Christian Wolfgang
added a comment - 27/May/11 9:37 AM The problems that arise from rebuilt matrices composed of several different matrix builds turn out to be too big.
Thus it must be possible to rebuild a matrix from one specific matrix build.
This requirement needs to be built into the core.

Brian Murrell
added a comment - 14/Jun/11 11:58 AM
thus if the job picks latest commit on a branch, and there is a new commit - then you are right it's a different build then before.
Doesn't every job (that builds from an SCM) have some sort of "bookmark" (i.e. the commit hash in git) of the point in the SCM's history where a build was done? Couldn't this "rebuild" feature ensure that the SCM checkout is done at that "bookmark" rather than the latest as it usually is supposed to do?
That would then ensure that you don't have a different build than the one you are rebuilding.
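
The "bookmark" idea could be sketched as below. This is only an illustration of the suggestion, not an existing Jenkins or SCM plugin API: RecordedBuild, RebuildCheckout and revisionFor are hypothetical names, and it assumes the revision of the basis build is available as a plain string.

```java
import java.util.Optional;

/** Hypothetical record of a finished build; not a Jenkins class. */
final class RecordedBuild {
    final int number;
    final String revision; // e.g. a git commit hash; null if the SCM recorded nothing

    RecordedBuild(int number, String revision) {
        this.number = number;
        this.revision = revision;
    }
}

final class RebuildCheckout {
    private RebuildCheckout() {}

    /** Prefer the revision recorded for the basis build; fall back to the current branch head. */
    static String revisionFor(RecordedBuild basis, String branchHead) {
        return Optional.ofNullable(basis.revision).orElse(branchHead);
    }
}
```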

Brian Murrell
added a comment - 14/Jun/11 3:56 PM
Yes, the core changes are in 1.416, so my guess is this upcoming Monday.
Oh. I thought this went into 1.413. From the Changelog:
What's new in 1.413 (2011/05/22)
...
Support rebuilding a subset of matrix configurations (issue 1613)
...
Was the commit to 1.413 incomplete perhaps?

salimfadhley
added a comment - 25/Nov/11 3:14 PM One problem with the current implementation is its effect on the build history graph. Suppose you incrementally re-run parts of an unreliable matrix build: the graph gives the impression that the number of tests in each run is diminishing, when in fact the number of tests remains constant and we are just running a smaller subset.
Would it be possible to alter this so that results from previous runs which are not re-executed by Matrix Reloaded are carried forward? That is, I'd like to see the height of the bar graph remain constant, with the red proportion gradually diminishing with each additional build.
Here's a slightly exaggerated screenshot which illustrates the problem:
http://imgur.com/HEV9J
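
The carry-forward behaviour requested here can be sketched as a merge of per-configuration test counts. This is a hypothetical illustration and not how Matrix Reloaded or the Jenkins test report actually aggregates results; TestCounts and CarriedForwardResults are made-up names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical per-configuration test summary. */
final class TestCounts {
    final int total;
    final int failed;

    TestCounts(int total, int failed) {
        this.total = total;
        this.failed = failed;
    }
}

final class CarriedForwardResults {
    private CarriedForwardResults() {}

    /** Results of re-run configurations replace the old ones; all others are carried forward. */
    static Map<String, TestCounts> merge(Map<String, TestCounts> previous,
                                         Map<String, TestCounts> rerun) {
        Map<String, TestCounts> merged = new LinkedHashMap<>(previous);
        merged.putAll(rerun);
        return merged;
    }

    /** Totals across all configurations, i.e. the height and red part of one bar in the graph. */
    static TestCounts sum(Map<String, TestCounts> perConfiguration) {
        int total = 0;
        int failed = 0;
        for (TestCounts counts : perConfiguration.values()) {
            total += counts.total;
            failed += counts.failed;
        }
        return new TestCounts(total, failed);
    }
}
```

Summing the merged map rather than only the re-run configurations keeps the total constant from build to build, so only the failed portion shrinks as configurations are fixed.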

Rayees Namathponnan
added a comment - 19/Jun/13 7:13 PM Rebuilding a subset starts a job with a new run number, which generates a separate report.
Is it possible to add an option to append the result to its parent job number?
For example, if job 100 was executed with 5 failures and you rebuild the subset, the new job starts as 101; could the result instead be appended to 100 itself?