Saturday, October 18, 2014

Embracing Technology Change In Scientific Software

True in 500 BC, Heraclitus's observation that everything changes certainly holds in today's software world. Whether at the micro scale, when replacing a third-party library or continuously refactoring, or at the macro scale, when adopting a new language or framework, or making a major paradigm shift such as going parallel: evolution is part of the development cycle.

Change can be planned, when a prototype paves the way for production software, or unexpected, when a technology breakthrough emerges during the lifespan of a project.

Although every domain is concerned by this phenomenon, scientific software raises a couple of peculiarities. Often serving research, the purpose of the code itself will pivot in unforeseen directions. Meanwhile, developers, as talented as they might be, rarely come from a software engineering background and sometimes (that might be an understatement) do not consider the software building process a relevant priority.

Embracing evolution raises challenges in multiple dimensions. We will try to cover both the technical perspective and the leadership one.

Build to Change

For most projects, the days of long-term planning are over, and the Agile philosophy of delivering small incremental versions of running software has become mainstream. Whether used as a source of inspiration or implemented by strictly following one book or another, Agile carries code evolution deep in its DNA.

Test

Although writing (and executing) tests in parallel with creating new code is obvious to many developer teams, experience shows that it is not a thoroughly adopted practice in the scientific software community, at least for small and medium-sized projects. Even though most languages come with mature testing frameworks, the barrier is more cultural.

However, testing at the unit, system or integration level brings stability, trust and speed to the development process. Continuously refactoring the code to adapt it to the new problem at hand is the most frequent evolution exercise. Testing frees the developer’s mind (“is my program still running?”) and allows her to focus on the creative process itself.

Finally, one of the biggest advantages of tests is that a project carries both its purpose and its validation for the next developer. Skip testing only if a) you never write bugs, b) your project will be over in less than two weeks, and c) no one else will ever look at your code.

"Whenever you are tempted to type something into a print statement or a debugger expression, write it as a test instead." Martin Fowler

Test More With Data

Data in and data out are the heartbeat of scientific software. And testing my method with this 1000x1000 matrix, or reproducing this bug with that 200 MB data file, is neither the fastest nor the most convenient way to go.

Reducing each test data size to the smallest meaningful set is a worthwhile time investment. The bug or feature at hand will be clearer to a human mind and the test will run faster.

Building a minimal relevant dataset can be time-consuming. You might, for example, write scripts to reduce a mass spectrometry peak list to the smallest one producing a given exception. Fortunately, libraries exist to produce constrained random data and reduce failing cases to a minimum. Inherited from Haskell's QuickCheck, ScalaCheck fits exactly this purpose. Among other features, ScalaCheck can generate random datasets, execute tests on them and, in case of failure, try to report a minimal failing set. The same idea has been ported to a large variety of languages, from Java to JavaScript, Python, C and more.
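In Python, this idea is available through libraries such as Hypothesis; the dependency-free sketch below (the buggy function and the property are invented for the example) illustrates the generate-then-shrink principle those tools automate:

```python
import random

def buggy_sum(xs):
    # Hypothetical bug: negative values are silently dropped
    return sum(x for x in xs if x >= 0)

def holds(xs):
    # Property under test: buggy_sum should agree with the plain sum
    return buggy_sum(xs) == sum(xs)

def find_failure(trials=200):
    # Generate random datasets until one violates the property
    for _ in range(trials):
        xs = [random.randint(-10, 10) for _ in range(random.randint(1, 50))]
        if not holds(xs):
            return xs
    return None

def shrink(xs):
    # Greedily drop elements while the smaller list still fails the property
    changed = True
    while changed:
        changed = False
        for i in range(len(xs)):
            candidate = xs[:i] + xs[i + 1:]
            if candidate and not holds(candidate):
                xs = candidate
                changed = True
                break
    return xs

failing = find_failure()
minimal = shrink(failing) if failing else None  # typically a single negative number
```

The shrunk counterexample, often a one-element list here, is far friendlier to a human mind than the fifty-element dataset that first triggered the failure.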

Storing test input and output in files may also be considered. These data are the real value of the tests, whereas the code itself is only some lightweight machinery. The main advantages are smaller test files and the possibility to use another language on the same data. This becomes an obvious benefit when you transition from a prototype to production code, when you want to test a Python implementation against an R gold standard, or when you undertake a major refactoring.
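Such a file-driven ("golden") test can be sketched in a few lines of Python; the peak data and the filtering step are invented for the example:

```python
import json
import tempfile
from pathlib import Path

def peak_filter(peaks, threshold=100.0):
    # Hypothetical analysis step: keep peaks above an intensity threshold
    return [p for p in peaks if p["intensity"] >= threshold]

def check_against_golden(input_file, golden_file):
    """Run the analysis on a stored input and compare with the stored
    expected output; the same file pair can validate a rewrite in R or Java."""
    peaks = json.loads(Path(input_file).read_text())
    expected = json.loads(Path(golden_file).read_text())
    return peak_filter(peaks) == expected

# Build a tiny input/golden pair on the fly for the demonstration
tmp = Path(tempfile.mkdtemp())
(tmp / "input.json").write_text(json.dumps(
    [{"mz": 500.1, "intensity": 120.0}, {"mz": 501.3, "intensity": 5.0}]))
(tmp / "golden.json").write_text(json.dumps(
    [{"mz": 500.1, "intensity": 120.0}]))
```

Because the value lives in the JSON pair rather than in the test code, porting the pipeline to another language means rewriting only the lightweight machinery around the same files.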

Enough testing for now, let’s jump back to writing some actual code.

A Couple of Design Patterns

Any pattern, whether from the Gang of Four or from planet Mars, may certainly be useful in a scientific software project. However, a couple of them have a special flavor here. Although the next two subsections (as certainly the whole post itself) will appear rather obvious to a seasoned developer, personal experience and observation have shown them to be quite relevant for many scientific programmers.

Facade

Imagine you want a function to compute the Bingo random distribution. To your great pleasure, com.mouse.mickey.jar offers an open source implementation of the Bingo function: bingoDist() and bingoRnd(). You write a few tests to check that you have understood how to use the package and that it gives you reasonable answers.

Everything goes well until the day someone discovers that some extreme negative values are not correctly computed. You could become a contributor to the downloaded project, but your boss is not super enthusiastic about watching you spend your time on an open source project. Meanwhile, you discover that org.apache.commons.maths4 has also implemented the Bingo function, in a more correct way. You decide to drop the com.mickey dependency and join the org.apache one.

Fortunately, you have proxied the bingoDist() and bingoRnd() calls with a wrapper around the first library. Therefore, you only need to add one test (the negative long-tail value) and change the wrapper implementation.

Modularize

A typical application will have different concerns (database access, computations, web frontend, R libraries etc.). However, it is not uncommon, in a medium-scale project, for the same person to be in charge of all those aspects. This is especially true when the domain is complex to understand.

The temptation to mix the different concerns together is often strong, and can be motivated by a short-term sense of urgency. However, clearly decoupling each aspect of a project into smaller entities will greatly reduce the overall complexity and simplify changes.
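One way to express such a decoupling in Python (all names here are illustrative) is to make the computation depend on an interface rather than on any particular storage technology:

```python
from typing import Protocol

class VariantStore(Protocol):
    # The computation layer depends only on this interface
    def fetch(self, sample_id: str) -> list: ...

class InMemoryStore:
    """Trivial backend; a relational or graph-database backend would
    implement the same one-method interface."""
    def __init__(self, data):
        self._data = data

    def fetch(self, sample_id):
        return self._data.get(sample_id, [])

def count_variants(store: VariantStore, sample_id: str) -> int:
    # Pure computation: knows nothing about where the data lives
    return len(store.fetch(sample_id))

store = InMemoryStore({"sampleA": ["chr1:123A>T", "chr2:456G>C"]})
```

When the data model later moves from relational to graph, only a new VariantStore implementation is written; the computation and its tests are untouched.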

A typical example is data growing. You wrote a genomic pipeline for a couple of exomes and it now has to handle hundreds of full genomes. Or that brand new mass spectrometer the lab just bought is to be used in profile mode, generating hundreds of times more data than the previous one. Or your number of users explodes. Or batch computations must become interactive. Or the data model must go from relational to a mix of relational and graph. Or PHP reaches its limits and the project must move to Java 8. Or…

All these changes are common. One typical answer is to blame the original architect who did not plan for all of this. But one typical strength of the Agile approach is precisely not to plan everything up front but to be at ease with evolution. And a clear separation of concerns between project components will greatly help to pivot smoothly.

And let’s not forget that the companion of code modularity is continuous integration. With minimal effort, developers can work steadily, relying on the code being tested and assembled in the background, knowing they will be notified quickly if anything breaks.
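Such a setup can be as small as a short configuration file; the fragment below uses the syntax of Travis CI, one popular hosted service, with illustrative file names and commands:

```yaml
# .travis.yml (illustrative): run the test suite on every push
language: python
python:
  - "2.7"
install:
  - pip install -r requirements.txt
script:
  - python -m unittest discover
```

Once in place, every push triggers the full test suite in the background, and the team hears about a breakage within minutes instead of weeks.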

Where (and When) To Go?

Twitter's jump from Ruby on Rails to a Scala/Java stack is an impressive and inspiring story.

When should you make a big jump? When should you decide that the web frontend has grown too entangled, that adding this new feature to the jQuery spaghetti bowl will be too painful and out of control (not to mention that the developer left six months ago)? When should you decide that a PHP server is reaching its limits and more performance hacking will not solve the customer load problems? We have previously discussed some technical concepts for getting ready for such changes; we shall now discuss the decision process itself.

Looking at the map

The pace of technology change is mind-blowing. If we only consider mainstream database systems, the decision was not very hard a few years ago: if you had a lot of cash, go Oracle; if you didn’t, go MySQL or PostgreSQL. A few RDBMS were ruling the world. But launching a project nowadays means facing a larger scope of choices: is my problem suited to an RDBMS? To a graph database like Neo4j, or an RDF store? A document-oriented one like MongoDB? A large-data-oriented one like Vertica? A Hadoop system, basic or richer like BDAS? A mix of any of these?

The previous example is only scratching the surface, and the same holds true for languages, frameworks, libraries and core algorithm parts. As Heraclitus of Ephesus would remind us, no one can be certain that any original architecture choice will stand in the long term.

Decisions to pick a technology are always made with a degree of subjectivity: the skills within your team, their good and bad experiences and, let’s face it, their enthusiasm (or lack thereof) for novelty.

However, a few more objective information sources should also play a role.

Finally, one should consider two rules of thumb before making any technical decision. The first is to seriously consider two options: if a second option has not been put on the table, chances are the research has not been very serious. The second rule is to also research the reasons not to pick a technology: disillusioned blog posts or StackOverflow threads will be of great help in making an educated choice.

Making the Jump

The decision to change has been made, the goals have been clearly stated and part of the team is dedicated to scouting. A toy project will come in handy to discover the technology but will not be enough to validate the choice.

If a small set of data will make anyone amazed by the flexibility of MongoDB, the technology should not be adopted before real-size data (and even bigger) have been challenged. Scaling up will often reveal the real limits of a technology and will demand more engagement from the scouting team. However, having data-driven tests ready, as mentioned above, will save a lot of time.

An SCM system with deep branching capabilities, like git, will also make the exploration expedition easier.

Turning Back (Eventually)

It is never an easy decision to turn back. Emotions can be at stake. However, if a technology does not live up to its promises, it is better to acknowledge the failure, understand together why, and roll back. Scientific quests are paved with failures, acknowledgments, steps back and pushes forward. Although many teams and individuals lack the self-confidence to admit it, this is our daily bread.

Pushing a new technology without a clear win and general adoption among the team can both generate huge, uncontrolled technical debt and create tension between supporters and opponents. Meanwhile, turning back, without pointing fingers at anyone, can build a team.

"Nothing great is easy." Matthew Webb

Foster Evolution in Your Team

Managing code evolution is not only a technical matter; it is also about leadership. This is often less obvious in scientific software than in the general industry, as most developers do not come from a software engineering background but from the domain at hand (biology, chemistry, statistics, astronomy, physics etc.). However, being able to evolve is a key to success in many endeavors.

The leadership has a key role in fostering an environment where technology evolution is encouraged.

Encouraging Exploration

There is very little chance that one of your team members will pick up the next technology you need for your project out of the blue. You need them to have a certain level of technology culture. Even if their resume was shiny when they joined, our world moves fast and they need to keep learning new things (which is quite different from "being on top of everything").

No single technique works for everyone or in every environment. Here are a few I have organized or participated in and found worthwhile:

give easy access to books, either via a seamless purchase system or an online library subscription like Safari;

set up offsite hacking retreats for a week, or shorter hackathons; both can unleash spirits to try new things and build deep links within the team.

Leading Change

Any change comes with a risk. Well, not changing can also carry a risk, indeed. Too often, failure is blamed on the developer leading the change. While the manager should clearly assess the goals and a time frame, the leader (often the same person) should build an environment of trust where the scout will neither be pointed at as a loser if the change fails nor acclaimed as a hero if it succeeds. The scout is part of the team; she is neither more nor less important than the other members focusing on production at that time.
Moreover, change should not be the sole responsibility of a small subset of the team. The leader can often balance technology evolution across the crew. Even junior members can be in charge of important scouting (for example, adopting continuous deployment on a JavaScript application). In this latter situation, it may be wise to have two junior developers scout in pairs.

Career Development for Software Folks

No management structure comes without some "career development" buzzwords. Career development for a computational scientist can mean managing a team, architecting projects or taking on more responsibility as a PI. But for folks spending most of their time writing code, it can also mean keeping up to date with technology. Except for the very few experts in niche domains, who can for example make a career of building the best available genome aligner in C (and proving it to the world), pressuring people into writing only PHP or Perl code because they already know how to do it is a short-term bet. On the other hand, encouraging them to learn and to try new things will make them better developers, strengthen their confidence and ultimately increase the chances that top contributors will stay on your team.

Unleash your team

Experience also brings good news. Fostering an environment prone to learning will soon turn into a virtuous circle. More adventurous team members will show the way to their peers. Leadership then turns into channeling all this energy while keeping the business goals in mind. And with such teams, the sky is the limit.