Saturday, June 1, 2013

My PhD thesis propositions and some discussion

Apart from the contents of my PhD thesis, theses at our university are usually accompanied by a list of propositions. According to our university's doctorate regulations, they must be defendable and opposable; at least six of the propositions must not be directly related to the research subject, and two of them may be slightly playful. Besides the contents of my thesis, committee members are also allowed to ask questions about the propositions.

A colleague of mine, Felienne Hermans, has covered her propositions on her blog, elaborating on each of them. I have decided to do the same thing, although I'm not planning to create a separate blog post for each individual proposition. Instead, I cover all of them in a single blog post.

Propositions

Many of the non-functional requirements of a deployed service-oriented system can be realized by selecting an appropriate software deployment system.

As I have explained earlier in a blog post about software deployment complexity, systems are rarely self-contained but composed of components. An important property of a component is that it can be deployed independently, which significantly complicates the software deployment process.

Components of service-oriented systems are called "services". What exactly they are is a bit of an open debate: people from industry often think in terms of web services (things that use SOAP, WSDL, UDDI), while in the academic literature I have also seen the description "autonomous platform-independent entities that can be loosely coupled".

Although web services are platform-independent entities of some sort, behind their interfaces are implementations that depend on specific technologies, and they can be deployed to various machines in a network. We have seen that deployment on a single machine is hard and that deploying components into networks of machines is even more complicated, time consuming and error prone.

Besides deployment activities, there are many important non-functional requirements a system has to meet. Many of them can be achieved by designing an architecture, i.e. the components, the connectors, and the way they interact through architectural patterns/styles. Architectural patterns (e.g. layers, pipes and filters, blackboard) help realize certain quality attributes.

For service-oriented systems, components must be properly deployed into a network of machines in order to compose systems. In other words: we have to design and implement a proper deployment architecture. This brings several technical challenges: we have to deploy components in such a way that they exactly match the deployment architecture, we have to perform the deployment activities themselves, and we have to determine whether a machine is capable of running a certain component (e.g. a service using Windows technology cannot run on a Linux machine, or vice versa).

In addition to technical constraints, there are also many non-functional issues related to deployment that require attention, e.g. where to place components and how to combine them to achieve certain non-functional requirements. For example, privacy could be achieved by placing services that provide access to privacy-sensitive data in a restricted zone, and robustness by deploying multiple redundant instances of the same service.

It can also be hard to manually find a deployment architecture that satisfies all non-functional requirements. In such cases, deployment planning algorithms are very helpful. In some cases it is even computationally too hard, or impossible, to find an optimal solution.
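To make the idea of deployment planning a bit more concrete, here is a minimal hypothetical sketch: the service names, machine attributes, and the greedy first-fit strategy below are my own illustration, not an actual planner from the thesis. It maps each service to the first machine that satisfies its technical constraints (platform), non-functional constraints (zone), and capacity:

```python
# Hypothetical sketch of a greedy deployment planner.
# All names and attributes are illustrative only.

services = [
    {"name": "webapp",   "platform": "linux",   "zone": "public",     "mem": 2},
    {"name": "payments", "platform": "windows", "zone": "restricted", "mem": 4},
    {"name": "database", "platform": "linux",   "zone": "restricted", "mem": 8},
]

machines = [
    {"name": "m1", "platform": "linux",   "zone": "public",     "mem": 8},
    {"name": "m2", "platform": "windows", "zone": "restricted", "mem": 8},
    {"name": "m3", "platform": "linux",   "zone": "restricted", "mem": 16},
]

def plan(services, machines):
    """Return a service -> machine mapping, or raise if no machine fits."""
    free = {m["name"]: m["mem"] for m in machines}
    mapping = {}
    for s in services:
        for m in machines:
            if (m["platform"] == s["platform"]          # technical constraint
                    and m["zone"] == s["zone"]          # privacy constraint
                    and free[m["name"]] >= s["mem"]):   # capacity constraint
                mapping[s["name"]] = m["name"]
                free[m["name"]] -= s["mem"]
                break
        else:
            raise ValueError("no suitable machine for " + s["name"])
    return mapping

print(plan(services, machines))
# → {'webapp': 'm1', 'payments': 'm2', 'database': 'm3'}
```

A greedy first-fit strategy like this is fast but generally not optimal; finding an optimal placement is closely related to bin packing, which is why, as noted above, an optimal solution can be too hard or impossible to compute.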

Because of all these cases, an automated deployment solution taking all relevant issues into account is very helpful in achieving many non-functional requirements of a deployed service-oriented system. This is what I have been trying to do in my PhD thesis.

The intention of the Filesystem Hierarchy Standard (FHS) is to provide portability among Linux systems.
However, due to ambiguity, over-specification, and legacy support, this standard limits adoption and innovation. This phenomenon applies to several other software standards as well.

There are many standards in the software engineering domain. In fact, it's dominated by them. One standard that is particularly important in my research is the Filesystem Hierarchy Standard (FHS) defining the overall filesystem structure of Linux systems. I have written a blog post on the FHS some time ago.

In short: the FHS defines the purposes of directories; it makes a distinction between static and variable parts of a system; and it defines hierarchies (e.g. / is for boot/recovery, /usr is for user software, and nobody really knows what /usr/local is for (ambiguity)). Moreover, it also defines the contents of certain directories, e.g. /bin should contain /bin/sh (over-specification).

I have problems with the latter two aspects. The hierarchies provide no support for isolation, allowing side effects to easily manifest themselves during deployment and enabling destructive upgrades. I also have a minor problem with strict requirements on the contents of directories, as they allow builds to trigger side effects by assuming that certain tools are always present.

For all these reasons, we deviate from some aspects of the FHS in NixOS. Some people consider this unacceptable, and therefore they will not be able to incorporate most of our techniques to improve the quality of deployment processes.

Moreover, because the FHS itself has these issues, we observe that although the filesystem structure is standardized, the filesystem layouts of many Linux distributions still differ slightly, and portability issues still arise.

In other domains, I have also observed various issues with standards:

Operating systems: nearly every operating system is more or less forced to implement POSIX and/or the Single UNIX Specification, which takes a lot of effort. Furthermore, by implementing these standards, UNIX is basically reimplemented. These standards impose many strict requirements on how certain library calls should be implemented, while at the same time leaving some behaviour unspecified or undefined. Apart from the fact that it's difficult and time consuming to implement these standards, there is little room to implement an operating system that is conceptually different from UNIX, as that conflicts with portability.

The Web (e.g. HTML, CSS, DOM etc.): first, a draft is written in a natural language (which is inherently ambiguous) and is sometimes underspecified. Then vendors start implementing these drafts. Because the drafts are initially ambiguous and underspecified, the implementations behave very differently. Slowly, by collaborating with other vendors and the W3C to improve the draft, the implementations converge into something uniform. Some vendors intentionally or accidentally introduce conformance bugs, which don't get fixed for quite some time.

Such buggy implementations may become the de facto standard, which has happened in the past, e.g. with Internet Explorer 6, requiring web developers to write browser-specific quirks code. After the release of Internet Explorer 6 in 2001, Microsoft had a market share of about 95% and did not release a new version until 2006. This seriously hindered innovation in web technology. It also took many years before other implementations with better web standards conformance and more features gained acceptance.

So are standards bad? Not necessarily, but I think we have to critically evaluate them and not consider them as holy books. Moreover, standards need to be developed with some formality and elegance in mind. If junk gets standardized, it will remain junk and requires everybody to cope with junk for quite some time.

One of the things that may help is using good formalisms. A good example I can think of is BNF (Backus-Naur Form), which was used in the ALGOL 60 specification.
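To illustrate why BNF is such a good formalism, here is a small grammar fragment in the style of the ALGOL 60 report (a simplified paraphrase of its integer syntax, not a literal quote); each production rule defines a syntactic category precisely in terms of others:

```
<digit>            ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<unsigned integer> ::= <digit> | <unsigned integer> <digit>
<integer>          ::= <unsigned integer> | +<unsigned integer> | -<unsigned integer>
```

A grammar like this leaves no doubt about what a valid integer looks like, in a way that natural-language prose rarely achieves.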

To move the software engineering community as a whole forward, industry and academia should collaborate. Unfortunately, their Key Performance Indicators (KPIs) drive them apart, resulting in a prisoner's dilemma.

It's obvious that if both factions would loosen up a bit on their KPIs, i.e. academia does a bit more in engineering tools and transferring knowledge, while industry spends some of its effort on experimenting and paying attention to "secondary tasks", then both parties would benefit. However, in practice often the opposite happens (although there are exceptions, of course).

This is analogous to a prisoner's dilemma, which is a peculiar phenomenon. Visualize the following situation: two prisoners have jointly committed a crime and got busted. If both remain silent, each spends only one year in prison on a minor charge. If one prisoner confesses while the other remains silent, the confessor goes free while the other goes to jail for ten years. If both confess, each spends five years in prison.

In this kind of situation, the (obvious) win-win outcome for both criminals is that they both remain silent. However, because they both give priority to their self-interests, each of them confesses, hoping to go free. Instead, the situation ends up worse for both: each has to spend five years in prison.
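Using a standard, hypothetical set of sentence lengths, a tiny payoff table makes the dilemma explicit: confessing is the dominant strategy for each prisoner individually, yet mutual silence is better for both:

```python
# Years in prison for (my_choice, other_choice); lower is better.
# The sentence lengths are the usual illustrative textbook values.
years = {
    ("silent",  "silent"):  1,   # both stay silent: minor charge only
    ("silent",  "confess"): 10,  # I stay silent, the other betrays me
    ("confess", "silent"):  0,   # I betray, the other stays silent
    ("confess", "confess"): 5,   # both confess
}

# Whatever the other prisoner does, confessing gives me fewer years:
for other in ("silent", "confess"):
    assert years[("confess", other)] < years[("silent", other)]

# Yet if both follow this dominant strategy, both end up worse off
# than if both had stayed silent:
assert years[("confess", "confess")] > years[("silent", "silent")]
print("confessing dominates, but mutual silence is better for both")
```

The same structure applies to the KPI situation above: each party optimizing its own indicators leads to an outcome that is worse for the community as a whole.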

In software engineering, the use of social media, such as blogging and Twitter, is an effective and efficient way to strengthen collaboration between industry and academia.

This proposition is related to the previous one. How can we create a win-win situation? Often I hear people say: "Well, collaboration is interesting, but it costs time and money, which we don't have right now, and we have other stuff to do".

I don't think the barrier has to be that high. Social media, such as blogging and Twitter, can be used for free and allow one to easily share stories, thoughts, results and so on. Moreover, recipients can also share these with people they know.

My blog, for example, has attracted many more readers and has given me much more feedback than all my research papers combined. Moreover, I'm not limited by all kinds of constraints that program committee members impose on me.

However, these observations are not unique to me. Many years ago, the famous Dutch computer scientist Edsger Dijkstra wrote many manuscripts, about subjects he found relevant, that he sent to his peers directly. His peers spread these manuscripts among their colleagues, allowing them to circulate widely and eventually reach thousands of people.

While the vision behind the definition of free software as described by the Free Software Foundation
to promote freedom is compelling, the actual definition is ambiguous and inadequately promoted.

The free software definition defines four freedoms. I can rephrase them in one sentence: "Software that can be used, studied, adapted and shared for any purpose". An essential precondition for this is the availability of the source code. I think this definition is clear and makes sense.

However, there is a minor issue with the definition: the word 'free' is ambiguous in English. In the definition, it refers to free as in freedom, not free as in price (gratis). In Dutch or French this is not a problem: in these languages, free software translates to 'vrije software' and 'logiciel libre', respectively.

Moreover, although (almost) all free software is gratis, it's also allowed to sell free software for any price, which is often misunderstood.

I have often seen the ambiguity of the word 'free' used as an argument for why the definition does not attract a general audience.

Although I don't want to say that they are wrong, or that we should tolerate such bad practices, I think it would also help to pay more attention to the good aspects of free software. The open source definition takes much better care of this, for example by emphasizing the ability to improve the quality of software. That is something I think would attract people from the other side, whereas negative campaigning does not.

Compared to the definition of free software provided by the Free Software Foundation, the definition of open source provided by the Open Source Initiative fails to improve on freedom. While it has been more effectively promoted, it lacks a vision and does not resolve the ambiguity.

The open source definition lists ten pragmatic points with the intention of having software that is free (as in freedom), e.g. availability of source code, means to share modified versions, and so on. However, it does not explain why it's desired for others to respect these ten pragmatic points and what their rationale is (although there is an annotated definition that does).

For these reasons, I have seen software being incorrectly advertised as open source, while in reality it is not. For example, there is software available with source code for which commercial redistribution is not allowed, such as the LCC compiler. That's not open source (nor free software). Another prominent example is Microsoft's Shared Source initiative, which only allows someone to look at code, not to modify or redistribute it.

A very useful aspect of open source is the way it's advertised. It pays a lot of attention to selling its good points, for example that everyone is able to improve its quality and is allowed to collaborate. Companies (even those that sell proprietary software) acknowledge these positive aspects and are sometimes willing to work with open-source people on certain aspects, or to "open-source" pieces of their proprietary products. Examples of this are the Eclipse platform and the Quake 3 Arena video game.

Just like making music is more than translating notes and rests into tones and pauses with specific durations, developing software systems is more than implementing functional requirements. In both domains, details, collaboration and listening to others are most important.

I have observed that in both domains we make estimations. In software development, we try to estimate how much effort it takes to develop something and in music we try to estimate how much effort it takes to practice and master a composition.

In software development, we often look at functional requirements (describing what a system should do) to estimate. I have seen that sometimes functional requirements may look ridiculously simple, such as displaying tabular data on a web page. Nearly every software developer would say: "that's easy".

But even if functional requirements are simple, certain non-functional requirements (describing how and where) can make implementation very difficult: for example, properly implementing security facilities, conforming to a certain quality standard (such as ISO 9126), or providing scalability. These kinds of aspects may be much more complicated than the features of a system itself.

Moreover, software is often developed in a team, so good communication and being able to divide work properly are important. In practice, you will almost always see something go wrong there, because people assume that others know all the details, or there is no clear architecture of the system by which work can be properly divided among team members.

These are all reasons why development may take significantly longer than expected and why projects may fail to properly deliver what clients have asked for.

In my spare time I'm a musician and in music I have observed similar things. People make effort estimations by looking at the notes written on paper. Essentially, you could see those as functional requirements as they tell you what to play.

However, besides playing notes and pausing, there are many more aspects that are important, such as tempo, dynamics (sudden and gradual loudness) and articulation. You could compare these aspects to non-functional requirements in software, as they tell somebody how to play (series of) notes.

Moreover, making music can also be a group effort, such as a band or an orchestra, requiring people to properly interact with each other. If others make mistakes they may confuse you as well.

I vividly remember a classical composition from 15 years ago. I had just joined an orchestra, and we were practising "Land of the Long White Cloud" by Philip Sparke. In the middle of the composition there is a snare drum passage consisting of only sixteenth notes. I had already learned to play patterns like these in my first drum lesson, so I thought it would be easy.

However, I had to play these notes at a very fast tempo and very quietly, which are usually conflicting constraints for percussionists. Furthermore, I had to keep up the right tempo and not let the other members distract me. Unfortunately, I couldn't cope with all these additional constraints, and that particular passage had to be performed by somebody else. I felt like an idiot and was very disappointed. However, we did win the contest in which we had to perform that particular composition.

Multilingualism is a good quality for a software engineer as it raises awareness that in natural languages as well as in software languages and techniques, there are things that cannot be translated literally.

It's well known that some words or sentences cannot be literally translated from one natural language to another. In such cases, we have to reformulate a sentence into something that has an equivalent meaning, which is not always trivial.

For example, non-native English speakers, like Dutch people, tend to make (subtle) mistakes now and then, which sometimes have hilarious outcomes. Make that the cat wise is a famous website that elaborates on this, calling the Dutch variant of English: Dunglish.

Although we are aware that we cannot always translate things literally between natural languages, I have observed that the same phenomenon occurs in the software domain. One particular programming language may be more useful for a certain goal than another. Eventually, code written in a programming language gets compiled into machine code or into another programming language (an equivalent reformulation in a different language), or gets interpreted by an interpreter.

However, I have also observed that there is a lot of programming language conservatism in the software engineering domain. Most conventional programming languages used nowadays (Python, C++, Java, C#, etc.) use structured and imperative programming concepts in combination with class-based OO techniques. Unconventional languages, such as purely functional languages (Haskell) or declarative languages (Prolog, Erlang), get little mainstream acceptance, although they have very powerful features. For example, programs implemented in a purely functional language can more easily be scaled across multiple cores/processors.

Instead, many developers use conventional languages to achieve the same, which imposes many additional problems to solve and more chances of errors. Our research based on the purely functional deployment model also suffers from this conservatism. Therefore, I think multilingualism is a very powerful asset for an engineer, who is then not limited by a solution set that is too narrow.

Stubbornness is both a positive and a negative trait of a researcher.

I think that when you do research and discover something uncommon, others may reject it or tell you to do something they consider more relevant. Some researchers choose to comply and give up the things they think are relevant. If every scientist did that, then I think certain things would never have been discovered. I think it's a scientist's duty to properly defend important discoveries.

In earlier centuries it was even worse. For example, the famous scientist Galileo Galilei defended the view that the Sun, not the Earth, is the centre of our solar system. He was sentenced by the Catholic Church to house arrest for the rest of his life.

However, stubbornness also has a negative aspect: it often comes with ignorance, and sometimes that's bad. For example, I once "ignored" some advice about properly studying related work and taking evaluation seriously, resulting in a paper that was badly rejected.

Conclusion

In this blog post, I have described some thoughts on my PhD propositions. The main reason for writing this down is to prepare myself for my defence. I know this blog post is lengthy, but that's good: it will probably prevent my committee members from reading all the details, so that they cannot use everything I have just written against me :-) (I'm very curious to see whether anyone notices that I just said this :P).