Software engineering in theory and practice


I love the use of semantic versioning. It provides clear guidelines on when an API version number should be bumped. In particular, if a change to an API breaks backward compatibility, the major version identifier must be updated. Thus, when I, as a client of such an API, see a minor or patch update, I can safely upgrade, since these will never break my build.

But to what extent are people actually using the guidelines of semantic versioning? Has the adoption of semantic versioning increased over time? Does the use of semantic versioning speed up library upgrades? And if there are breaking changes, are these properly announced by means of deprecation tags?

To better understand semantic versioning, we studied seven years of versioning history in the Maven Central Repository. Here is what we found.

Versioning Policies

Semantic Versioning is a systematic approach to give version numbers to API releases. The approach has been developed by Tom Preston-Werner, and is advocated by GitHub. It is “based on but not necessarily limited to pre-existing widespread common practices in use in both closed and open-source software.”

In a nutshell, semantic versioning is based on MAJOR.MINOR.PATCH identifiers with the following meaning:

MAJOR: increment when you make incompatible API changes

MINOR: increment when you add functionality in a backward-compatible manner

PATCH: increment when you make backward-compatible bug fixes
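
For example, starting from version 2.3.1: a backward-compatible bug fix yields 2.3.2, new backward-compatible functionality yields 2.4.0, and removing a public method (a breaking change) requires 3.0.0.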

Versioning in Maven Central

To understand how developers actually use versioning policies, we studied seven years of history of Maven Central. Our study comprises around 130,000 released jar files, of which 100,000 come with source code. On average, we found around 7 versions per library in our data set.

The data set runs from 2006 to 2011, and thus predates the semantic versioning manifesto. Since the manifesto claims to be based on existing practices, it is interesting to investigate to what extent API developers releasing via Maven have adhered to these practices.

As shown below, in Maven Central, the major.minor.patch scheme is adopted by the majority of projects.

Breaking Changes

To understand to what extent version increments correspond to breaking changes, we identified all “update pairs”: a version of a given Maven package together with its immediate successor.

To assess semantic versioning compliance, we need to determine backward compatibility for such update pairs. Since in general this is undecidable, we use binary compatibility as a proxy. Two library versions are binary compatible if one can replace the other without forcing clients to recompile. Examples of incompatible changes are removing a public method or class, or adding parameters to a public method.
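
As an illustration (a hypothetical library, not one from our data set), consider a public method whose signature changes between two releases:

// MessageClient.java as released in version 1.2.0 of a hypothetical library
public class MessageClient {
    public void send(String message) {
        // ... deliver the message
    }
}

// Suppose release 1.3.0 changes the signature to
//     public void send(String message, int retries)
// Client code compiled against 1.2.0 still refers to send(String), which no
// longer exists in 1.3.0, so the client fails at runtime with a
// NoSuchMethodError. The change is binary incompatible and, under semantic
// versioning, belongs in release 2.0.0.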

We used the Clirr tool to detect such binary incompatibilities automatically for Java. For those interested, there is also a SonarQube Clirr plugin for compatibility checking in a continuous integration setting; an alternative tool is japi.

Major versus Minor/Patch Releases

For major releases, breaking changes are permitted in semantic versioning. This is also what we see in our data set: a little over one third of the major releases introduce binary incompatibilities.

Interestingly, we see a very similar figure for minor releases: a little over one third of the minor releases introduce binary incompatibilities too. This, however, violates the semantic versioning guidelines.

The situation is somewhat better for patch releases, which introduce breaking changes in around a quarter of the cases. This nevertheless conflicts with semantic versioning, which insists that patches are backward compatible.

Overall, we see that minor and major releases are equally likely to introduce breaking changes.

The graph below shows that these trends are fairly stable over time. Minor (~40%) and patch releases (~45%) are most common, and major releases (~15%) are least common.

The orange line indicates releases with breaking changes: Around 30% of the releases introduce breaking changes — a number that is slightly decreasing over time, but still at 29%.

Of these releases with breaking changes, the vast majority is non-major (the crossed green line at ~25%). Thus, about a quarter of all releases do not comply with semantic versioning.

Deprecation to the Rescue?

Apart from requiring that such changes take place in major releases,
semantic versioning insists on using deprecation to announce such changes. In particular:

Before you completely remove the functionality in a new major release there should be at least one minor release that contains the deprecation so that users can smoothly transition to the new API.
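
In Java terms, following this guideline could look like the sketch below (a hypothetical class, not taken from our data set):

// Release 1.4.0 (minor): the old method keeps working, but is marked deprecated.
public class TemperatureSensor {

    /** @deprecated use {@link #readCelsius()} instead; to be removed in 2.0.0 */
    @Deprecated
    public double read() {
        return readCelsius();
    }

    public double readCelsius() {
        return 21.0; // placeholder reading
    }
}

// Release 2.0.0 (major): read() is removed, after clients have had at least
// one minor release to migrate to readCelsius().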

Given the many breaking changes we saw, how many deprecation tags would there be?

At the library level, 1,000 out of the 20,000 libraries (5%) we studied include at least one deprecation tag.

When we look at the individual method level, we find the following:

Over 86,000 public methods are removed in a minor release, without any deprecation tag.

Almost 800 public methods receive a deprecation tag in their life span: These methods, however, are never removed.

16 public methods receive a deprecation tag that is subsequently undone in a later release (the method is “undeprecated”).

That is all we found. In other words, we did not find a single method that was deprecated according to the guidelines of semantic versioning (deprecate in minor, then delete in major).

Furthermore, the worst behavior (deletion in a minor release without deprecation, over 86,000 methods) occurs a hundred times more often than the best behavior witnessed (deprecation without deletion, almost 800 methods).

Discussion

Our findings are based on a somewhat older data set obtained from Maven Central. It might therefore be that adherence to semantic versioning is higher at the moment than what we report.

Our findings also take breaking changes to any public method or class into account. In practice, certain packages will be considered as API only, whereas others are internal only — yet declared public due to the limitations of the Java modularization system. We are working on a further analysis in which we explicitly measure to what extent the changed public methods are used. Initial findings within the Maven data set suggest that there are hundreds of thousands of compilation problems in clients caused by these breaking changes — suggesting that it is not just internal modules that contain these changes.

Some non-Java package managers have explicitly adopted semantic versioning. A notable example is npm, and NuGet has initiated a discussion as well. While npm has a dedicated library to determine version orderings according to the Semantic Versioning standard, I am not aware of JavaScript compatibility tools that are used to check for breaking changes. For .NET (NuGet), tools like Clirr exist, such as LibCheck, ApiChange, and the NDepend build comparison tools. I do not know, however, how widespread their use is.

Last but not least, binary incompatibility, as measured by Clirr, is just one form of backward incompatibility. An API may change the behavior of methods as well, in breaking ways. A way to detect those might be by applying old test suites to new code — which we have not done yet.

Conclusion

Breaking API changes are a fact of life. Therefore, meaningful version identifiers are a cause worth fighting for. Thus:

If you adopt semantic versioning for your API, double check that your minor and patch releases are indeed backward compatible. Consider using tools such as Clirr, or a dedicated test suite.

If you rely on an API that claims to be following semantic versioning, don’t put too much faith in backward compatibility promises.

If you want to follow semantic versioning, be prepared to bump the major version identifier often.

Should you need to introduce breaking changes, do not forget to announce these with deprecation tags in minor releases first.

If you are in research, note that the problems involved in preventing, encapsulating, detecting, and handling breaking changes are tough and real. Backward compatibility deserves your research attention.

The full details of our study are published in our paper presented (slides) at the IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM), held in Victoria, BC, September 2014.

EDIT, November 2014

If you are a Java developer and serious about semantic versioning, then check out this open source semantic versioning checking tool. It is available on Maven Central and on GitHub, and it can check differences, validate version numbers, and suggest proper version numbers for your new jar release.

At first sight, this sounds like a great success of knowledge transfer from academic research to industry practice. Upon closer inspection, the Maintainability Index turns out to be problematic.

The Original Index

The Maintainability Index was introduced in 1992 by Paul Oman and Jack Hagemeister, originally presented at the International Conference on Software Maintenance (ICSM 1992) and later refined in a paper that appeared in IEEE Computer. It is a blend of several metrics, including Halstead’s Volume (HV), McCabe’s cyclomatic complexity (CC), lines of code (LOC), and percentage of comments (COM). For these metrics, the average per module is taken, and the results are combined into a single formula.
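
In its commonly cited form, the formula reads:

MI = 171 - 5.2 \ln(HV) - 0.23 \, CC - 16.2 \ln(LOC) + 50 \sin\left(\sqrt{2.4 \cdot COM}\right)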

To arrive at this formula, Oman and Hagemeister started with a number of systems from Hewlett-Packard (written in C and Pascal in the late 80s, “ranging in size from 1000 to 10,000 lines of code”). For each system, engineers provided a rating (between 1 and 100) of its maintainability. Subsequently, 40 different metrics were calculated for these systems. Finally, statistical regression analysis was applied to find the best way to combine (a selection of) these metrics to fit the experts’ opinion. This eventually resulted in the given formula. The higher its value, the more maintainable a system is deemed to be.

The maintainability index attracted quite some attention, also because the Software Engineering Institute (SEI) promoted it, for example in their 1997 C4 Software Technology Reference Guide. This report describes the Maintainability Index as “good and sufficient predictors of maintainability”, and “potentially very useful for operational Department of Defense systems”. Furthermore, they suggest that “it is advisable to test the coefficients for proper fit with each major system to which the MI is applied.”

Use in Visual Studio

Visual Studio Code Metrics were announced in February 2007. A November 2007 blogpost clarifies the specifics of the maintainability index included in it. The formula Visual Studio uses is slightly different, based on the 1994 version:
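
As commonly reproduced, it reads:

MI = \max\left(0, \; \frac{171 - 5.2 \ln(HV) - 0.23 \, CC - 16.2 \ln(LOC)}{171} \times 100\right)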

As you can see, the constants are literally the same as in the original formula. The new definition merely transforms the index to a number between 0 and 100. Also, the comment metric has been removed.

Furthermore, Visual Studio provides an interpretation:

MI >= 20: High Maintainability

10 <= MI < 20: Moderate Maintainability

MI < 10: Low Maintainability

I have not been able to find a justification for these thresholds. The 1994 IEEE Computer paper used 85 and 65 (instead of 20 and 10) as thresholds, describing them as a good “rule of thumb”.

The metrics are available within Visual Studio, and are part of the code metrics power tools, which can also be used in a continuous integration server.

Concerns

I encountered the Maintainability Index myself in 2003, when working on Software Risk Assessments in collaboration with SIG. Later, researchers from SIG published a thorough analysis of the Maintainability Index (first when introducing their practical model for assessing maintainability and later as section 6.1 of their paper on technical quality and issue resolution).

Based on this, my key concerns about the Maintainability Index are:

There is no clear explanation for the specific derived formula.

The only explanation that can be given is that all underlying metrics (Halstead, Cyclomatic Complexity, Lines of Code) are directly correlated with size (lines of code). But then just measuring lines of code and taking the average per module is a much simpler metric.

The Maintainability Index is based on the average per file of, e.g., cyclomatic complexity. However, as emphasized by Heitlager et al., these metrics follow a power law, and taking the average tends to mask the presence of high-risk parts. For example, a code base with 200 methods of cyclomatic complexity 1 and a single monster method of complexity 200 has an average complexity of just under 2, even though virtually all of the maintenance risk sits in that one method.

The set of programs used to derive the metric and evaluate it was small, and contained small programs only.

Furthermore, the programs were written in C and Pascal, which may have rather different maintainability characteristics than current object-oriented languages such as C#, Java, or JavaScript.

For the experiments conducted, only a few programs were analyzed, and no statistical significance was reported. Thus, the results might as well be due to chance.

Tool smiths and vendors used the exact same formula and coefficients as the 1994 experiments, without any recalibration.

One could argue that any of these concerns is reason enough not to use the Maintainability Index.

These concerns are consistent with a recent (2012) empirical study, in which one application was independently built by four different companies. The researchers used these systems to compare maintainability and several metrics, including the Maintainability Index. Their findings include that size as a measure of maintainability has been underrated, and that the “sophisticated” maintenance metrics are overrated.

Think Twice

In summary, if you are a researcher, think twice before using the maintainability index in your experiments. Make sure you study and fully understand the original papers published about it.

If you are a tool smith or tool vendor, there is not much point in having several metrics that are all confounded by size. Check correlations between the metrics you offer, and if any of them are strongly correlated, pick the one with the clearest and simplest explanation.

Last but not least, if you are a developer wondering whether to use the Maintainability Index: most likely, you’ll be better off looking at lines of code, as it gives easier-to-understand information on maintainability than a formula computed over averaged metrics confounded by size.

To any software engineer, the two consecutive goto fail lines will be suspicious. They are, and more than that. To quote Adam Langley:

The code will always jump to the end from that second goto, err will contain a successful value because the SHA1 update operation was successful and so the signature verification will never fail.

Not verifying a signature is exploitable. In the upcoming months, there will remain plenty of devices running older versions of iOS and OS X. These will remain vulnerable, and exploitable.

Brittle Software Engineering

When first seeing this code, I was once again struck by how incredibly brittle programming is. Just adding a single line of code can bring a system to its knees.

For seasoned software engineers this will not be a real surprise. But students and aspiring software engineers will have a hard time believing it. Therefore, sharing problems like this is essential, in order to create sufficient awareness among students that code quality matters.

Code Formatting is a Security Feature

When reviewing code, I try to be picky on details, including white spaces, tabs, and new lines. Not everyone likes me for that. I often wondered whether I was just annoyingly pedantic, or whether it was the right thing to do.

The case at hand shows that white space is a security concern. With correct indentation, it is immediately clear that something fishy is going on: the final check has become unreachable.

Indentation Must be Automatic

Because code formatting is a security feature, we must not do it by hand, but use tools to get the indentation right automatically.

A quick inspection of the sslKeyExchange.c source code reveals that it is not routinely formatted automatically: there are plenty of inconsistent spaces, tabs, and chunks of code in comments. With modern tools, such as Eclipse’s format-on-save, one would not be able to save code like this.

Yet just forcing developers to use formatting tools may not be enough. We must also invest in improving the quality of such tools. In some cases, hand-made layout can make a substantial difference in the understandability of code. Perhaps current tools do not sufficiently acknowledge such needs, leading to under-use of today’s formatting tools.

Code Review to the Rescue?

Code review can be effective against these sorts of bug. Not just auditing, but review of each change as it goes in. I’ve no idea what the code review culture is like at Apple but I strongly believe that my colleagues, Wan-Teh or Ryan Sleevi, would have caught it had I slipped up like this. Although not everyone can be blessed with folks like them.

There is a mismatch between the expectations and the actual outcomes of code reviews. From our study, review does not result in identifying defects as often as project members would like and even more rarely detects deep, subtle, or “macro” level issues. Relying on code review in this way for quality assurance may be fraught.

Automated Checkers to the Rescue?

If manual reviews won’t find the problem, perhaps tools can find it? Indeed, the present mistake is a simple example of a problem caused by unreachable code. Any computer science student will be able to write a (basic) “unreachable code detector” that would warn about the unguarded goto fail followed by more (unreachable) code (assuming parsing C is a ‘solved problem’).

Therefore, it is no surprise that plenty of commercial and open source tools exist to check for such problems automatically: Even using the compiler with the right options (presumably -Weverything for Clang) would warn about this problem.

Here, again, the key question is why such tools are not applied. The big problem with tools like these is their lack of precision, leading to too many false alarms. Forcing developers to wade through long lists of irrelevant warnings will do little to prevent bugs like these.

Unfortunately, this lack of precision is a direct consequence of the fact that unreachable code detection (like many other program properties of interest) is a fundamentally undecidable problem. As such, an analysis always needs to make a tradeoff between completeness (covering all suspicious cases) and precision (covering only cases that are certainly incorrect).

To understand the sweet spot in this trade off, more research is needed, both concerning the types of errors that are likely to occur, and concerning techniques to discover them automatically.

Testing is a Security Concern

As an advocate of unit testing, I wonder how the code could have passed a unit test suite.

Unfortunately, the testability of the problematic source code is very poor. In the current code, functions are long, and they cover many cases in different conditional branches. This makes it hard to invoke specific behavior, and to bring the functions into a state in which the given behavior can be tested. Furthermore, observability is low, especially since parts of the code deliberately obfuscate results to protect against certain types of attacks.

Thus, given the current code structure, unit testing will be difficult. Nevertheless, the full outcome can be tested, albeit at the system (not unit) level. Quoting Adam Langley again:

In other words, while the code may be hard to unit test, the system luckily has well defined behavior that can be tested.

Coverage Analysis

Once there is a set of (system or unit level) tests, the coverage of these tests can be used as an indicator for (the lack of) completeness of the test suite.

For the bug at hand, even the simplest form of coverage analysis, namely line coverage, would have helped to spot the problem: since the bug results in unreachable code, there is no way to achieve 100% line coverage.

Therefore, any serious attempt to achieve full statement coverage should have revealed this bug.

Note that trying to achieve full statement coverage, especially for security or safety critical code, is not a strange thing to require. For aviation software, statement coverage is required for criticality “level C”:

Level C:

Software whose anomalous behavior, as shown by the system safety assessment process, would cause or contribute to a failure of system function resulting in a major failure condition for the aircraft.

This is one of the weaker categories: For the catastrophic ‘level A’, even stronger test coverage criteria are required. Thus, achieving substantial line coverage is neither impossible nor uncommon.

The Return-Error-Code Idiom

Finally, looking at the full sources of the affected file, one of the key things to notice is the use of the return code idiom to mimic exception handling.

Since C has no built-in exception handling support, a common idiom is to insist that every function returns an error code. Subsequently, every caller must check this returned code, and include an explicit jump to the error handling functionality if the result is not OK.

This is exactly what is done in the present code: the global err variable is set, checked, and eventually returned for every function call; if the result is not OK, the check is followed by (hopefully exactly one) goto fail.

Almost 10 years ago, together with Magiel Bruntink and Tom Tourwe, we conducted a detailed empirical analysis of this idiom in a large code base of an embedded system.

One of the key findings of our paper is that in the code we analyzed we found a defect density of 2.1 deviations from the return code idiom per 1000 lines of code. Thus, in spite of very strict guidelines at the organization involved, we found many examples of not just unchecked calls, but also incorrectly propagated return codes, or incorrectly handled error conditions.

Based on that, we concluded:

The idiom is particularly error prone, due to the fact that it is omnipresent as well as highly tangled, and requires focused and well-thought programming.

EDIT February 24, 2014: Fixed incorrect use of ‘dead code detection’ terminology into correct ‘unreachable code detection’, and weakened the claim that any computer science student can write such a detector based on the reddit discussion.

When teaching software architecture it is hard to strike the right balance between practice (learning how to work with real systems and painful trade-offs) and theory (general solutions that any architect needs to thoroughly understand).

To address this, we decided to try something radically different at Delft University of Technology:

To make the course as realistic as possible, we based the entire course on GitHub. Thus, teams of 3-4 students had to adopt an active open source GitHub project and contribute. And, in the style of “How GitHub uses GitHub to Build GitHub”, all communication within the course, among team members, and with external stakeholders took place through GitHub where possible.

Furthermore, to ensure that students actually digested architectural theory, we made them apply the theory to the actual project they picked. As sources we used literature available on the web, as well as the software architecture textbook written by Nick Rozanski and Eoin Woods.

Course Contents

GitHub Project Selection

At the start of the course, we let teams of 3-4 students adopt a project of choice on GitHub. We placed a few constraints, such as that the project should be sufficiently complex, and that it should provide software that is actually used by others — there should be something at stake.

Based on our own analysis of GitHub, we also provided a preselection of around 50 projects, which contained at least 200 pull requests (active involvement from a community) and an automated test suite (state of the art development practices).

The open source projects generally welcomed these contributions, as illustrated by the HornetQ reaction:

Thanks for the contribution… I can find you more if you need to.. thanks a lot! :+1:

And concerning the Netty tutorial:

Great screencast, fantastic project.

Students do not usually get this type of feedback, and were proud to have made an impact. Or, as the CakePHP team put it:

All in all we are very happy with the reactions we got, and we really feel our contribution mattered and was appreciated by the CakePHP community.

To make it easier to establish contact with such projects, we set up a dedicated @delftswa Twitter account. Many of the projects our students analyzed have their own Twitter accounts, and just mentioning a handle like @jenkinsci or @cakephp helped receive the project’s attention and establish the first contact.

Identifying Stakeholders

One of the first assignments was to conduct a stakeholder analysis:
understanding who had an interest in the project, what their interest was, and which possibly conflicting needs existed.

To do this, the students followed the approach to identify and engage stakeholders from Rozanski and Woods. They distinguish various stakeholder classes, and recommend looking for stakeholders who are most affected by architectural decisions, such as those who have to use the system, operate or manage it, develop or test it, or pay for it.

Aurelius Stakeholders

To find the stakeholders and their architectural concerns, the student teams analyzed any information they could find on the web about their project. Besides documentation and mailing lists, this also included an analysis of recent issues and pull requests as posted on GitHub, in order to see what the most pressing concerns of the project at the moment are and which stakeholders played a role in these discussions.

Architectural Views

Since Kruchten’s classical “4+1 views” paper, it has been commonly accepted that there is no such thing as the architecture. Instead, a single system will have multiple architectural views.

Following the terminology from Rozanski and Woods, stakeholders have different concerns, and these concerns are addressed through views.
Therefore, with the stakeholder analysis in place, we instructed students to create relevant views, using the viewpoint collection from Rozanski and Woods as starting point.

Rozanski & Woods Viewpoints

Some views the students worked on include:

A context view, describing the relationships, dependencies, and interactions between the system and its environment. For example, for Diaspora, the students highlighted the relations with users, ‘podmins’, other social networks, and the user database.

A development view, describing the system for those involved in building, testing, maintaining, and enhancing the system. For example, for neo4j, students explained the packaging structure, the build process, test suite design, and so on.

A deployment view, describing the environment into which the system will be deployed, including the dependencies the system has on its runtime environment. For GitLab, for example, the students described the installations on the clients as well as on the server.

Besides these views, students could cover alternative ones taken from other software architecture literature, or they could provide an analysis of the perspectives from Rozanski and Woods, to cover such aspects as security, performance, or maintainability.

Again, students were encouraged not just to produce architectural documentation, but to challenge the usefulness of their ideas by sharing them with the projects they analyzed. This also forced them to focus on what was needed by these projects.

For example, the jQuery Mobile team was working on a new version aimed at realizing substantial performance improvements. Therefore, our students conducted a series of performance measurements comparing two versions as part of their assignment. Likewise, for CakePHP, our students found the event system of most interest, and contributed a well-received description of the new CakePHP Events System.

Software Metrics

For an architect, metrics are an important tool to discuss (desired) quality attributes (performance, scalability, maintainability, …) of a system quantitatively. To equip students with the necessary skills to use metrics adequately, we included two lectures on software metrics (based on our ICSE 2013 tutorial).

Students were asked to think of what measurements could be relevant for their projects. To that end, they first of all had to think of an architecturally relevant goal, in line with the Goal/Question/Metric paradigm. Goals picked by the students included analyzing maintainability, velocity in responding to bug reports, and improving application responsiveness.

Subsequently, students had to turn their project goal into 3 questions, and identify 2 metrics per question.
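
For illustration (a hypothetical example, not an actual student submission): for the goal of improving application responsiveness, questions could include “which screens take longest to render?” and “does responsiveness regress between releases?”, with metrics such as the median and 95th-percentile render time per screen, and the change in these times between consecutive releases.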

Brackets Analyzability

Many teams not only came up with metrics, but also used some scripting to actually conduct measurements. For example, the Brackets team decided to build RequireJS-Analyzer, a tool that crawls JavaScript code connected using RequireJS and analyzes the code, to report on metrics and quality.

Architectural Sketches

Being an architect is not just about views and metrics — it is also about communication. Through some luck, this year we were able to include an important yet often overlooked take on communication: The use of design sketches in software development teams.

We were able to attract Andre van der Hoek (UC Irvine) as guest speaker. To see his work on design sketches in action, have a look at the Calico tool from his group, in which software engineers can easily share design ideas they write at a sketch board.

In one of the first lectures, Andre presented Calico to the students. One of the early assignments was to draw any free format sketches that helped to understand the system under study.

The results were very creative. The students freed themselves from the obligation to draw UML-like diagrams, and used colored pencils and paper to draw whatever they felt like.

For example, the above picture illustrates the HornetQ API; below you see the Jenkins workflow, including dollars flowing from customers to the contributing developers.

Lectures From Industry

Last but not least, we wanted students to meet and learn from real software architects. Thus, we invited presentations from the trenches — architects working in industry, sharing their experience.

For example, in 2013 we had TU Delft alumnus Maikel Lobbezoo, who presented the architectural challenges involved in building the payment platform offered by Adyen. The Adyen platform is used to process gazillions of payments per day, across the world. Adyen has been one of the fastest growing tech companies in Europe, and its success can be directly attributed to its software architecture.

Maikel explained the importance of having a stateless architecture that is linearly scalable, redundant, and based on push notifications and idempotency. His key message was that the successful architect possesses a (rare) combination of development skills, communicative skills, and strategic business insight.

Our other guest lecture covered cyber-physical systems: addressing the key role software plays in safety-critical infrastructure systems such as road tunnels — a lecture provided by Eric Burgers from Soltegro. Here the core message revolved around traceability to regulations, the use of model-based techniques such as SysML, and the need to communicate with stakeholders with little software engineering experience.

Course Structure

Constraints

Some practical constraints of the course included:

The course was worth 5 credit points (ECTS), corresponding to 5 * 27 = 135 hours of work per student.

The course ran for half a semester, which is 10 consecutive weeks.

Each week, there was time for two lectures of 90 minutes each

We had a total of over 50 students, eventually resulting in 14 groups of 3-4 students.

The course was intended for 4th-year students (median age 22), who were typically following the TU Delft master in Computer Science, Information Architecture, or Embedded Systems.

GitHub-in-the-Course

All communication within the course took place through GitHub, in the spirit of the inspirational video How GitHub uses GitHub to Build GitHub. As a side effect, this eliminated the need to use the (not so popular) Blackboard system for intra-course communication.

From GitHub, we obtained a (free) delftswa organization. We used it to host various repositories, for example for the actual assignments, for the work delivered by each of the teams, as well as for the reading material.

Each student team obtained one repository for themselves, which they could use to collaboratively work on their assignments. Since each group worked on a different system, we decided to make these repositories visible to the full group: In this way, students could see the work of all groups, and learn from their peers how to tackle a stakeholder analysis or how to assess security issues in a concrete system.

Having the assignment itself on a GitHub repository not only helped us to iteratively release and improve the assignment — it also gave students the possibility to propose changes to the assignment, simply by posting a pull request. Students actually did this, for example to fix typos or broken links, to pose questions, to provide answers, or to start a discussion to change the proposed submission date.

We considered whether we could make all student results public. However, we first of all wanted to provide students with a safe environment, in which failure could be part of the learning process. Hence it was up to the students to decide which parts of their work (if any) they wanted to share with the rest of the world (through blogs, tweets, or open GitHub repositories).

Time Keeping

One of our guiding principles in this course was that an architect is always eager to learn more. Thus, we expected all students to spend the full 135 hours that the course was supposed to take, whatever their starting point in terms of knowledge and experience.

This implies that it is not just the end result that primarily counts, but the progress the students have made (although there is of course, a “minimum” result that should be met at least).

To obtain insight in the learning process, we asked students to keep track of the hours they spent, and to maintain a journal of what they did each week.

For reasons of accountability such time keeping is a good habit for an architect to adopt. This time keeping was not popular with students, however.

Student Presentations

We spent only part of the lecturing time doing class room teaching.
Since communication skills are essential for the successful architect,
we explicitly allocated time for presentations by the students:

Halfway through, all teams presented a 3-minute pitch about their plans and challenges.

At the end of the course we organized a series of 15 minute presentations in which students presented their end results.

Each presentation was followed by a Q&A round, with questions coming from the students as well as an expert panel. The feedback concerned the architectural decisions themselves, as well as the presentation style.

Grading

Grading was done per team, based on the following items:

Final report: the main deliverable of each team, providing the relevant architectural documentation created by the team.

A series of intermediate (‘weekly’) reports corresponding to dedicated assignments on, e.g., metrics, particular views, or design sketches. These assignments took place in the first half of the course, and formed input for the final report;

Team presentations

Furthermore, there was one individual assignment, in which each student had to write a review report for the work of one of the other teams. Students did a remarkably good job at this, being critical as well as constructive. Besides allowing for individual grading, this also ensured each team received valuable feedback from 3-4 fellow students.

Evaluation

All in all, teaching this course was a wonderful experience. We gave the students considerable freedom, which they used to come up with remarkable results.

A key success factor was peer learning. Groups could learn from each other, and gave feedback to each other. Thus, students not only made an in-depth study of the single system they were working on, but also learned about 13 different architectures as presented by the other groups. Likewise, they not only learned about the architectural views and perspectives they needed themselves, but also learned from their fellow students how they used different views in different ways.

The use of GitHub clearly contributed to this peer learning. Any group could see, study, and learn from the work of any other group. Furthermore, anyone, teachers and students alike, could post issues, give feedback, and propose changes through GitHub’s facilities.

On the negative side, while most students were familiar with git, not all were. The course could be followed using just the bare minimum of git knowledge (add, commit, checkout, push, pull), yet the underlying git magic sometimes resulted in frustration among the students when pull requests could not be merged due to conflicts.

An aspect the students liked least was the time keeping. We are searching for alternative ways of ensuring that time is spent early on in the project, and ways in which we can assess the knowledge gain instead of the mere result.

One of the parameters of a course like this is the actual theory that is discussed. This year, we included views, metrics, sketches, as well as software product lines and service oriented architectures. Topics that were less explicit this year were architectural patterns, or specific perspectives such as security or concurrency. Some of the students indicated that they liked the mix, but that they would have preferred a little more theory.

Conclusion

In this course, we aimed at getting our students ready to take up a leading role in software development projects. To that end, the course put a strong focus on:

Close involvement with open source projects where something was at stake, i.e., which were actually used and under active development;

The use of peer learning, leveraging GitHub’s social coding facilities for delivering and discussing results

A sound theoretical basis, acknowledging that architectural concepts can only be grasped and deeply understood when trying to put them into practice.

Next year, we will certainly follow a similar approach, naturally with some modifications. If you have suggestions for a course like this, or are aware of similar courses, please let us know!


One of the software testing events in The Netherlands I like a lot is the Test Automation Day. It is a one-day event, which in 2014 will take place on June 19 in Rotterdam. The day will be packed with (international) speakers, with presentations targeting developers, testers, as well as management.

Test automation is a broad topic: it aims at offering tool support for software testing processes wherever this is possible. This raises a number of challenging questions:

What are the costs and benefits of software test automation? When and why should I apply a given technique?

What tools and techniques are available? What are their limitations and how can they be improved?

The Test Automation Day was initiated four years ago by Squerist, and organized by CKC Seminars. In the past years, it has brought together over 200 test experts each year. During those years, we have built up some good traditions:

Presentations from industry (testers, developers, client organizations) and universities (new research results, tools and techniques), with a focus on practical applicability.

A mixture of invited (keynote) presentations, open contributions that can be submitted by anyone working in the area of software test automation, and sponsored presentations.

Pre-conference events, which this year will be in the form of a masterclass on Wednesday June 18.

This year’s call for contributions is still open. If you do something cool in the area of software test automation, please submit a proposal for a presentation! Deadline Monday December 2, 2013!

You are very much welcome to submit your proposal. Your proposal will be evaluated by the international TADNL program committee, who will look at the timeliness, innovation, technical content, and inspiration that the audience will draw from your presentation. Also, if you have suggestions for a keynote you would like to hear at TADNL2014, please send me an email!

This year’s special theme is test automation innovation: so if you work with or are working on some exciting new test automation method or technique, make sure you submit a paper!

We look forward to your contributions, and to seeing you in Rotterdam in June 2014!


Last week I was discussing how one of my students could use test coverage analysis in his daily work. His observation was that his manager had little idea about coverage. Do managers need coverage information? Can they do harm with coverage numbers? What good can project leads do with coverage data?

For me, coverage analysis is first and foremost a code reviewing tool. Test coverage information helps me to see which code has not been tested. This I can use to rethink:

The test suite: How can I extend the test suite to cover the corner cases that I missed?

The test strategy: Is there a need for me to extend my suite to cover the untested code? Why did I miss it?

The system design: How can I refactor my code so that it becomes easier to test the missing cases?

An incoming change (pull request): Is the contributed code well tested?

Coverage is not a management tool. While coverage information is essential to have good discussions on testing, using test coverage percentages to manage a project calls for caution.

To appreciate why managing by coverage numbers is problematic, let’s look at four common pitfalls of using software metrics in general, and see how they apply to test coverage as well.

Coverage Without Context

A coverage number gives a percentage of statements (or classes, methods, branches, blocks, …) in your program hit by the test suite. But what does such a number mean? Is 80% good or not? When is it good? Why is it good? Should I be alarmed at anything below 50%?

The key observation is that coverage numbers should not be leading: coverage targets (if used at all) should be derived from an overall project goal. For example:

We want a well-designed system. Therefore we use coverage analysis to discover classes that are too hard to test and hence call for refactoring;

We want to make sure we have meaningful tests in place for all critical behavior. Therefore, we analyze for which parts of this behavior no such tests are in place.

We have observed class Foo is bug prone. Hence we monitor its coverage, and no change that decreases its coverage is allowed.

Treating The Numbers

Coverage numbers are easily gamed. All one needs to do is invoke a few methods in the test cases. No need for assertions, or any meaningful verification of the outcomes, and the coverage numbers are likely to go up.
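
As a sketch (hypothetical Invoice class and JUnit 4 test, for illustration only), such a coverage-only test might look like this:

// Invoice.java -- hypothetical class under test
public class Invoice {
    private double total = 0;

    public void addLine(String item, int quantity, double price) {
        total += quantity * price;
    }

    public void applyDiscount(double fraction) {
        total *= (1 - fraction);
    }

    public double total() {
        return total;
    }
}

// InvoiceTest.java -- a test that drives up coverage without checking anything
import org.junit.Test;

public class InvoiceTest {
    @Test
    public void touchAllTheCode() {
        Invoice invoice = new Invoice();
        invoice.addLine("book", 2, 15.00);
        invoice.applyDiscount(0.10);
        invoice.total();
        // No assertions: statement coverage goes up, but the test can only
        // fail if an unexpected exception is thrown.
    }
}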

Test cases to game coverage numbers are dangerous:

The resulting test cases are extremely weak. The only information they provide is that a certain sequence of calls does not lead to an unexpected exception.

The faked high coverage may give a false sense of confidence;

The presence of poor tests undermines the testing culture in the team.

Once again: A code coverage percentage is not a goal in itself: It is a means to an end.

One-Track Coverage

There is little information that can be derived from the mere knowledge that a test suite covers 100% of the statements. Which data values went through the statements? Was my if-then conditional exercised with a true as well as a false value? What happens if I invoke my methods in a different order?
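
As a small sketch (hypothetical code, for illustration): a single test call can execute every statement while still missing an entire behavior:

// Discount.java -- hypothetical example
public class Discount {
    public static double apply(double price, boolean isMember) {
        double discount = 0;
        if (isMember) {
            discount = 0.1;
        }
        return price * (1 - discount);
    }
}

// A single call, apply(100.0, true), executes every statement of this method
// (100% statement coverage), yet the isMember == false path is never
// exercised: the non-member behavior remains completely untested.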

A lot more information can be derived from the fact that the test suite achieves, say, just 50% test coverage. This means that half of your code is not even touched by the test suite. The number tells you that you cannot use your test suite to say anything about the quality of half of your code base.

As such, statement coverage can tell us when a test suite is inadequate, but it can never tell us whether a test suite is adequate.

For this reason, a narrow focus on a single type of coverage (such as statement coverage) is undesirable. Instead, a team should be aware of a range of different coverage criteria, and use them where appropriate. Example criteria include decision coverage, state transition coverage, data-flow coverage, or mutation coverage — and the only limitation is our imagination, as can be seen from Cem Kaner’s list of 101 coverage metrics, compiled in 1996.

The latter technique, mutation coverage, may help to fight the gaming of coverage numbers. The idea is to generate mutations of the code under test, such as the negation of Boolean conditions or integer outcomes, or modifications of arithmetic expressions. A good test suite should trigger a failure for any such ‘erroneous’ mutation. If you want to experiment with mutation testing in Java (or Scala), you might want to take a look at the pitest tool.
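
A small sketch (hypothetical code and JUnit 4 test) of how a mutant is killed:

// AgeCheck.java -- hypothetical code under test
public class AgeCheck {
    public static boolean isEligible(int age) {
        return age >= 18;
    }
}

// A mutation tool such as pitest can change the boundary condition, producing
// the mutant:
//     return age > 18;
// A test that only checks a non-boundary value, say isEligible(30), passes on
// both the original and the mutant, so the mutant survives. The test below
// kills it:

// AgeCheckTest.java
import static org.junit.Assert.assertTrue;
import org.junit.Test;

public class AgeCheckTest {
    @Test
    public void eighteenYearOldsAreEligible() {
        assertTrue(AgeCheck.isEligible(18)); // fails on the mutant, passes on the original
    }
}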

Coverage Galore

A final pitfall in using coverage metrics is to become over-enthusiastic, measuring coverage in many (possibly overlapping) ways.

Even a (relatively straightforward) coverage tool like EclEmma for Java can measure instruction, branch, line, method, and type coverage, and as a bonus can compute the cyclomatic complexity. Per metric it can show covered, missed, and total, as well as the percentage of covered items. Which of those should be used? Is it a good idea to monitor all forms of coverage at all levels?

Again, what you measure should depend on your goal.

If your project is just starting to test systematically, a useful goal will be to identify classes where a lot can be gained by adding tests. Then it makes sense to start measuring at the coarse class or method level, in order to identify the classes most in need of testing. If, on the other hand, most code is reasonably well tested, looking at missed branches may provide the most insight.

The general rule is to start as simple as possible (just line coverage will do), and then to add alternative metrics when the simpler metric does not lead to new insights any more.

Additional forms of coverage should be different from (orthogonal to) the existing form of coverage used. For example, line coverage and branch coverage are relatively similar. Thus, measuring branch coverage besides line coverage may not lead to many additional cases to test (depending on your code). Mutation coverage, dataflow-based coverage, or state-machine based coverage, on the other hand, are much more different from line coverage. Thus, using one of those may be more effective in increasing the diversity of your test cases.

Putting Coverage to Good Use

“Code without corresponding tests can be more challenging to understand, and is harder to modify safely. Therefore, knowing whether code has been tested, and seeing the actual test coverage numbers, can allow developers and managers to more accurately predict the time needed to modify existing code.” — I can only agree.

“Monitoring coverage reports helps development teams quickly spot code that is growing without corresponding tests” — This happens: see our paper on the co-evolution of test and production code.

“Given that a code coverage report is most effective at demonstrating sections of code without adequate testing, quality assurance personnel can use this data to assess areas of concern with respect to functional testing.” — A good reason to combine user story acceptance testing with coverage analysis.

Summary

Test coverage analysis is an important tool that any development team taking testing seriously should use.

More than anything else, coverage analysis is a reviewing tool that you can use to improve your own (test) code, and to evaluate someone else’s code and test cases.

Coverage numbers cannot be used as a key performance indicator (KPI) for project success. If your manager uses coverage as a KPI, make sure he is aware of the pitfalls.

Coverage analysis is useful to managers: If you are a team lead, make sure you understand test coverage. Then use coverage analysis as a way to engage in discussion with your developers about their (test) code and the team’s test strategy.

This requires not just good research, but also success in innovation: ensuring that research ideas are adopted in society. Why is innovation so hard? How can we increase innovation in software engineering and computer science alike? What can universities do to increase their innovation success rate?

I was made to rethink these questions when Informatics Europe contacted me to co-organize this year’s European Computer Science Summit (ECSS 2013) with special theme Informatics and Innovation.

When Informatics Europe asked me as program chair of their two day summit, my only answer could be “yes”. I have a long term interest in innovation, and here I had the opportunity and freedom to compile a full featured program on innovation as I saw fit, with various keynotes, panels, and presentations by participants — a wonderful assignment.

In compiling the program I took the words of Peter Denning as starting point: “an idea that changes no one’s behavior is only an invention, not an innovation.” In other words, innovation is about changing people, which is much harder than coming up with a clever idea.

In the end, I came up with the following “Dimensions of Innovations” that guided the composition of the program.

Startups

Innovation needs optimists who believe they can change the world. One of the best ways to bring a crazy new idea to sustainable success is by starting a new company dedicated to creating and conquering markets that had not existed before.

Many of the activities at ECSS 2013 relate to startups. The opening keynote is by François Bancilhon, serial entrepreneur currently working in the field of open data. Furthermore, we have Heleen Kist, strategy consultant in the area of access to finance. Last but not least, the first day includes a panel specifically devoted to entrepreneurship, and one of the pre-summit workshops is entirely devoted to entrepreneurship for faculty.

Patents

Patents are traditionally used to protect (possibly large) investments that may be required for successful innovation. In computer science, patents are more than problematic, as evidenced by patent trolls, fights between giants such as Oracle and Google, and the differences in regulations in the US and in Europe. Yet at the same time (software) patents can be crucial, for example to attract investors for a startup.

Several of the ECSS keynote speakers have concrete experience with patents — Pamela Zave at AT&T, and Erik Meijer from his time at Microsoft, when he co-authored hundreds of patents. Furthermore, Yannis Skulikaris of the European Patent Office will survey patenting of software-related inventions.

Open Source, Open Data

An often overlooked dimension of innovation is open source and open data. How much money can be made by giving away software, or by allowing others to freely use your data? Yet, many enterprises are immensely successful based on open source and open data.

How can universities use education to strengthen innovation? What should students learn so that they can become ambassadors of change? How innovative should students be so that they can become successful in society? At the same time, how can outreach and education be broadened so that new groups of students are reached, for example via online learning?

To address these questions, at ECSS we have Anka Mulder, member of the executive board of Delft University of Technology, and former president of the OpenCourseWare Consortium. She is responsible for the TU Delft strategy on Massive On-Line Open Courses (MOOC), and she will share TU Delft experiences in setting up their MOOCs.

Furthermore, ECSS will host a panel discussion, in which experienced yet non-conformist teachers and managers will share their experience in innovative teaching to foster innovation.

Fostering Innovation

Policy makers and university management are often at a loss on how to encourage their stubborn academics to contribute to innovation, the “third pillar” of academia.

Last but not least, we have Carl-Cristian Buhr, member of the cabinet of Digital Agenda Commissioner and EU Commission Vice-President Neelie Kroes, who will speak about the EU Horizon 2020 programme and its relevance to computer science research, education, and innovation.

Inspirational Content

All talk about innovation is void without inspirational content. Therefore, throughout the conference, exciting research insights and new course examples will be interwoven in the presentations.

For example, we have Pamela Zave speaking on The Power of Abstraction, Frank van Harmelen addressing progress in his semantic web work at the Network Institute, and Felienne Hermans on how to reach thousands of people through social media. Last but not least we have Erik Meijer, who is never scared to throw both math and source code in his presentation.

The summit will take place October 7–9, 2013, in Amsterdam. You are all welcome to join!