QCon 2013: Wednesday Papers

The first talk of the conference came from one of the veterans of computer science: Barbara Liskov, Turing Award winner, who began her computing journey before 1970. In it, she gave a historical overview of the development of modern ideas of data abstraction, the less talked-about counterpart to code abstraction, and of her own role in it.

The basis of modern thinking in language, design and architecture can be traced back to academic work from around 1970. Dijkstra's seminal paper Go To Statement Considered Harmful is well known, and can be considered the source of the current control-structure-based approach to code structure. Barbara mentioned two other papers, from 1971, which introduced the concepts of top-down design and modularity (Wirth's Program Development by Stepwise Refinement and Parnas's Information Distribution Aspects of Design Methodology respectively).

The concept of modularity is familiar to us today: a module can be specified purely by its external interface. Liskov is the author of a 1972 paper introducing the idea of 'partitions', the first time technical abstractions were available on a similar scale to those used in the design process – we know this concept today by the term 'component'. The concepts of encapsulation and data hiding were also first considered around this time.

Barbara then went on to discuss the concept of data abstraction, using her academic language CLU to illustrate it. Data abstraction means that data types define operations, but the user of the data type doesn't need to know how those operations are implemented. This means that seemingly modern concepts like generics, type constraints and type safety of general utility code become easy, as the behaviour is defined by the type. My own interpretation of this section is that we have reached a similar point from the object-oriented perspective by allowing the creation of object types which function as data types, particularly with the value types (structs) in C#. Type constraints by exposed interface is also a current approach in dynamic languages like Ruby.
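To illustrate the idea in a modern mainstream language rather than CLU, here is a minimal Java sketch (the `MaxTracker` class is my own invented example, not from the talk): the generic constraint guarantees that the operation exists, without the caller knowing how any particular element type implements it.

```java
// A generic accumulator whose element type must supply a compareTo
// operation. The caller relies only on the declared constraint
// (Comparable), never on any concrete implementation.
class MaxTracker<T extends Comparable<T>> {
    private T best;

    void add(T value) {
        // The behaviour comes from the type parameter's declared interface.
        if (best == null || value.compareTo(best) > 0) {
            best = value;
        }
    }

    T max() {
        return best;
    }
}
```

Any `Comparable` type works here – `MaxTracker<String>` compiles just as well – while a type with no ordering is rejected at compile time, which is exactly the type safety of general utility code described above.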

During the 1970s and 80s, the development of the object-oriented paradigm was taking place in parallel with the work on data abstraction; the story of Smalltalk is quite well known. OO had developed the concept of inheritance by this time, but it was used without considering the semantic meaning embodied in Liskov's Substitution Principle: when you call a method on a subclass instance in a superclass context, it should behave like the supertype method. This is so obvious to modern coders that it almost goes unsaid, but it was an important insight, and it brought data abstraction and object orientation together by ensuring that object classes acted like data types.
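A minimal Java sketch of the principle (the shape classes are my own illustration, not from the talk): code written purely against the supertype's contract must keep working, unsurprised, for any substituted subtype.

```java
// The supertype states a contract; every subtype must honour it.
abstract class Shape {
    // Contract: returns a non-negative area.
    abstract double area();
}

class Rectangle extends Shape {
    final double width, height;
    Rectangle(double width, double height) {
        this.width = width;
        this.height = height;
    }
    double area() { return width * height; }
}

class Circle extends Shape {
    final double radius;
    Circle(double radius) { this.radius = radius; }
    double area() { return Math.PI * radius * radius; }
}

// Written against the supertype only; substitution must not surprise it.
class Measurer {
    static double totalArea(Shape[] shapes) {
        double total = 0;
        for (Shape s : shapes) {
            total += s.area();
        }
        return total;
    }
}
```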

She finished the talk by discussing the state of languages today, expressing the opinion that we have an unhelpful dichotomy between professional languages like C# and Java, which are good for experts but too complex for beginners to get into, and 'easy' dynamic languages like Python, which are easy to get into but lack abstractions necessary for moving into professional development. Can we find a language which can do both? In my opinion dynamic languages are improving their high-level professional feature set, so this should become less of an issue. There is also the looming question of distributed computing; conventional languages lack the data and design abstractions needed to take advantage of a distributed (across chips or machines) architecture.

This entertaining talk told us about some common problems that web applications have from a usability perspective, and why the frameworks and processes that we typically use to create them lead to such issues.

Some common problems caused by how we write web applications today, and which are widely visible, are: disabled forward/back buttons; being unable to refresh a page, or to do so without resubmitting POST data; not being able to open multiple windows or tabs on the same data; content loading last (i.e. after ads, menu decoration etc); being unable to bookmark or link directly to pages; and sites being unusable without script enabled. All of these issues come from frameworks that try to 'fix' how the web works, instead of working with it:

HTTP is a stateless protocol, for good reasons (in particular scaling considerations). Web frameworks typically store state, and often in unhelpful ways that cause the issues above, like storing user progress through a multi-step process in session state.

"We need to go beyond the page model to serve interactive content", a mindset that is responsible for not being able to bookmark particular views of a web app which manages everything within one page. The page model is very adequate for interactive content, and it's been used for it from the beginning. Remember CGI scripts?

Similarly, hiding the writing of HTML/CSS/JavaScript, technologies from the last century, can result in developers being too insulated from what actually happens, and cause rendering problems or unexpected requirements on scripts (or even platforms, with some ASP.net controls).

Allowing or simulating server-push, which is just another way of storing state.

If we think about what we actually want in an ideal world for web applications, we'd need a defined protocol so that we can serve content; a declarative data format so that services can share data; a separation of content and layout in the content served to end users; etc. And then we'd have to get it standardised, get consensus among interested parties, and so on – clearly not a sensible approach. But if we take a step back then it turns out that the existing web infrastructure does a pretty good job – not perfect for our needs, but pretty good. It makes more sense to appreciate that and to work with the technology, not against it.

The two most common approaches to web development don't use the web as it's intended to be used: using server side components to generate content depending on state (ASP, JSP etc), or 'single page apps' where an entire application is embedded within one URL. The right way uses the existing infrastructure, of the internet and of the browser. Stefan used the term "Resource-Oriented Client Architecture", or ROCA, to describe a set of principles for this, which he goes into in more depth on his site at roca-style.org:

Be web-centric, i.e. use the mechanics of HTTP, instead of fighting them.

Sending semantic, content-rich HTML directly from server code is okay. It loads fast, it is accessible (both in terms of disabled access and API access), self-describing and readable, and through HTML 5 semantic attributes it can also carry meaning to automated readers.

Use unobtrusive script, i.e. make sure that every piece of content on your site is available to readers that don't have script. Additional or alternative behaviour can be applied by script if it is present. (I don't fully agree with the details of the unobtrusive script approach but the principle is good.)

Use meaningful URIs, i.e. a URI should represent a concept that the user will be interested in. Extra content (sidebars etc) can be loaded by script if wanted – again, I don't agree entirely with the scripted loading but the idea of URIs being user relevant is a good one.

Work with the browser, not against it: let multiple tabs, bookmarks, history and so on provide the functionality that users expect of them.

All application logic should take place on the server; this is often required anyway for security reasons so overly thick clients result in duplication of effort.

Avoid session state. This point is harder to achieve in the real world, but doing as much as possible to avoid dependence on state makes an application easier to scale and easier to use.
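As a concrete illustration of that last principle (the `CheckoutWizard` class is a hypothetical sketch of my own, not from the talk), a multi-step process can carry its accumulated state in each request – as hidden form fields or query-string parameters – so that any server node can handle any step, and the back button and multiple tabs just work:

```java
import java.util.Map;

// Each request carries the whole state of the multi-step process in its
// parameters, so the server holds no per-user session between steps.
class CheckoutWizard {
    // Renders the next step purely from the state the client sent back.
    static String handle(Map<String, String> params) {
        String step = params.getOrDefault("step", "1");
        switch (step) {
            case "1":
                return "Ask for address (carry forward: nothing)";
            case "2":
                return "Ask for payment (carry forward: address="
                        + params.get("address") + ")";
            default:
                return "Confirm order for " + params.get("address")
                        + " paid by " + params.get("payment");
        }
    }
}
```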

This doesn't happen because we are often too specific about what we are trying to write, and because people take the 'easy' answer (doing what they can do now, with the tools in front of them, with the least effort) instead of investing time in a simple solution (one which is easy to understand and use in the end).

This talk about garbage collection in the Java virtual machine was an excuse for some gratuitous geekery. Most of us won't use information about how garbage collection works directly, although people in industries where performance and latency are critical (the bankers who use Java, for example) will, but it is interesting background knowledge.

The first point he made is that, although the name would suggest otherwise, garbage collection is not really about the garbage (dead objects); it is about live objects. Garbage collection traces the tree of object references from each root node (an entry point or currently active stack reference) on each collection; what is available for collection is simply those objects that have not been reached, and the Java GC doesn't need to know anything about them. (The .Net GC is a little different as it supports finalisers, so it needs to know what has died, but the main point remains the same.)

The rest of this presentation refers specifically to the default Hotspot JVM, although similar structures exist in other garbage collection systems, like the .Net one or memory management approaches used by dynamic languages.

Java splits its memory space into three separate areas, known as memory pools:

Younggen space, which is further subdivided into 'Eden', the space for new objects, and 'Survivor 0' and 'Survivor 1'. Short-lifespan objects never leave these pools. Almost all new objects are created in the Eden pool, and move to the current Survivor pool if they are still alive when younggen GC operates; that happens whenever Eden reaches a threshold size. The two Survivor pools alternate being the one in use; during a GC, live objects are moved into the current Survivor pool (either from the previous Survivor pool or from Eden), and their previous locations reclaimed. (Similar things happen in .Net, which is why objects that need a fixed memory location must be given a pinned GCHandle.) Younggen collection typically happens multiple times per second in a running application.

Oldgen or tenured space. Once objects reach a certain lifetime (4 collections by default), they are instead moved into this memory pool. Objects too large to fit into the appropriate younggen space are also moved here, so rapid allocation of large amounts of memory can pollute tenured space with young objects, leading to inefficient collection. Collection in this space is done by a concurrent mark-and-sweep algorithm, which keeps execution pauses to a minimum, until the space becomes too full or too fragmented, at which point a full compacting, stop-the-world collection is run.

Perma-Gen space, for objects which have an expected lifetime of the entire process duration, e.g. Spring configuration and services. Garbage collection doesn't run at all here, but references from perma-gen objects can act as GC roots (see below) into other pools.
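These pools are visible from running Java code via the standard management API, which can be a useful sanity check of the description above. A small sketch (pool names vary by JVM version and collector; on the default Hotspot collector of that era you would typically see Eden, Survivor, old-gen and perm-gen spaces):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

// Prints the JVM's memory pools at runtime, e.g. "PS Eden Space",
// "PS Survivor Space", "PS Old Gen" – exact names depend on the
// collector in use.
class PoolLister {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            System.out.println(pool.getName() + " (" + pool.getType() + ")");
        }
    }
}
```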

Why split up the memory we manage into pools like this? The answer comes from the Weak Generational Hypothesis – really more of a theory, as it is empirically backed – which states that most objects have a very short lifespan, a few objects have a long lifespan, and few objects are in between.

Garbage collection algorithms all require a stage to mark which objects are live. This starts from GC roots, which are the objects that are referenced from outside and which mark the root of the reference tree: objects referenced from the currently active stack frame in each running application thread, objects used as synchronisation roots (i.e. for a lock or a SyncRoot property), and references from more long term memory (i.e. references from tenured into younggen, or references from perma-gen). Tracking the live status through the whole reference tree can be time consuming, particularly as memory size increases.

Some experimental work is taking place on non-contiguous memory models for the Java VM, building on the work done by JRockit before they were taken over. Martijn said that it is too early to say much about how that will work with the mainstream VM.

Finally, we learnt what an OutOfMemoryError actually means in Java. It means that one of the following is true: 98% of execution time has been spent in garbage collection; a full GC execution freed up less than 2% of the heap; a single allocation is larger than the available heap memory (the case we all understand); or a thread failed to spawn.

Although Simon is best known for his work on and evangelism for Haskell and functional programming, this talk contained none of that; in fact, it contained no technical material at all. Instead, he was promoting another cause of which he is a leading member: the Computing at School working group (CAS), and the wider concern of how to improve computing teaching in schools.

It is widely known that current ICT teaching is not fit for purpose. In some schools it is not taught at all, and in most others it is considered boring to learn and dispiriting to teach. The problem that Simon and others identified is that ICT is typically taught as a skill: applied knowledge which is directly useful in business, but which ages quickly and involves no deep understanding, and thus is not interesting to intelligent students. The theory and knowledge that underpin skills like this run deeper, have a longer lifetime, and constitute a discipline, such as physics or English language. In the case of ICT the associated discipline is computer science, which is hardly taught at all. And yet, like the other disciplines taught to children from primary school onwards, an understanding of computing principles is essential to understanding the world we live in today. CAS aims to bring this mismatch to light and to have elementary computer science taught to all children, like elementary maths and science.

Before that could happen, though, they had to define what computer science is. Like other scientific disciplines, it contains knowledge (languages, data structures, history), but also patterns of thought: abstract and computational thinking. It should not be tied to a particular technology; in fact it should be possible to teach computer science without a computer!

In the last year, significant progress has been made. In the spring of 2012, the Royal Society issued a report on computing provision in education, and declared that far more attention should be paid to computer science (as opposed to ICT), and the Education Secretary Michael Gove (for whom I have little time in most cases) made similar points in a speech he made while shaking up the National Curriculum here in England; drafts of the new curriculum have CS as well as IT from Key Stage 1: primary school age. Several exam boards also now offer computer science as a GCSE option. There has also been progress on the hardware front, with cheap portable hardware: first the .Net Gadgeteer and then the Raspberry Pi, which as well as being newsworthy and a household name may also have made computing a little bit cool again.

Computer science is now recognised as being important in principle. However, there is still a major logistical effort required to get it integrated into school education. In the medium term new teachers need to be offered PGCE courses in computer science; this is now happening. Existing ICT teachers also need to be trained, so that they can be comfortable teaching CS too. To this end CAS has set up a Network of Excellence, which links together university computer science departments and schools which wish to acquire CS knowledge – so far over 70 and 500 respectively – and which encourages skilled teachers to pass on that knowledge locally. Individuals who are qualified or knowledgeable in CS can also sign up to be members of CAS and help to pass that knowledge to schools in their area.

I personally agree with Simon that this is an important issue, and if you are qualified to give schools some support in introducing computer science to their curriculum, I'd encourage you to do so.

Chris comes from Basho, a company that develops a distributed, fault-tolerant and high-availability database (Riak). I wasn't originally going to go to this talk, but after talking to Chris on the Basho exhibitor stand in the morning, I wanted to hear about this area in more depth. This talk describes the inevitable trade-offs intrinsic to distributed data solutions, and how Riak has chosen to manage those.

The core idea of his talk is eventual consistency: the idea that a distributed database will eventually contain the same data on every node, and that temporary inconsistency is okay. The CAP Theorem proves that you cannot have all three of Consistent, Available and Partition-tolerant data; a distributed case is by definition partitioned, so a distributed solution must trade consistency off against availability. (Availability in this context means the degree to which the system is able to accept select and update requests on any node.) If we choose consistency, then we have 'eventual availability' – and in reality, if a system is not highly available (i.e. if any node of the distributed system we connect to cannot accept our request), the system is not usable, so we must prioritise availability and sacrifice some consistency. That is, for an available distributed data solution, eventual consistency is inevitable. Fortunately, eventual consistency is usually okay in the real world, and it is already used in well-known distributed systems like Facebook.

Eventual consistency works by allowing any node to accept an update at any time, and pushing that update to a number of replicating nodes. Eventually the update will propagate through the entire distributed database. However, this means that updates can happen in a different order on different nodes. The ideal solution is a data model where providing the same updates in a different order results in the same final state; Chris calls this outcome "Strong Eventual Consistency" and it is currently the subject of Basho's research effort. However, until this is perfected, accepting eventual consistency also means accepting data divergence or conflict from differently ordered updates.

The question then becomes: how best to resolve these conflicts? Riak uses a concept they call a vector clock, which acts rather like version control revision numbers. When submitting an update, the update is tagged with the identifier of the last known state for that record. If two different nodes try to update the same record with two different updates that have the same parent state, that means that two clients have tried to update the record at the same time, and conflict resolution will be triggered. (Conflicting updates can also occur if a network has been split due to infrastructure failure, and then connection is re-established; this situation, and the state mismatches it can cause, is well known to IRC users as a 'netsplit'.) This 'sibling' resolution is context dependent so Riak's API lets the application layer define how to resolve conflicts in different cases.
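A minimal sketch of the comparison logic (my own simplification for illustration; Riak's real vector clocks carry rather more machinery): each update is stamped with per-node counters, and two updates where neither clock dominates the other are concurrent siblings needing resolution.

```java
import java.util.HashMap;
import java.util.Map;

// Toy vector clock: a map of node id -> update counter. If neither of
// two clocks dominates the other, the updates were concurrent and
// sibling resolution is triggered.
class VectorClock {
    final Map<String, Integer> counters = new HashMap<>();

    VectorClock increment(String node) {
        counters.merge(node, 1, Integer::sum);
        return this;
    }

    // True if every counter in this clock is >= the other's counter.
    boolean dominates(VectorClock other) {
        for (Map.Entry<String, Integer> e : other.counters.entrySet()) {
            if (counters.getOrDefault(e.getKey(), 0) < e.getValue()) {
                return false;
            }
        }
        return true;
    }

    boolean concurrentWith(VectorClock other) {
        return !this.dominates(other) && !other.dominates(this);
    }
}
```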

The overall message is that eventual consistency is acceptable, it is in fact unavoidable for a distributed database, and we should use the appropriate methods to get the best out of it.

QCon includes several papers devoted to the use of Java and other technologies in the financial industry, which typically pushes the boundaries of the technology further and faster than most other fields. This one was about how relational databases don't fit the bill for storing information about financial transactions, which come in a very non-tabular format.

The SWIFT message formats used for recording inter-bank end user transactions (e.g. sending money to a business abroad to pay for services) are tree structured. The message contains several blocks; the main content block contains sections, some of which may be optional or repeated; they contain fields which similarly may not always be present or which may be present several times; and so on. The format is several levels deep, and at no point is there a guarantee that a particular field will be present in every message. It's also versioned, as the formats are updated frequently. This is obviously a poor fit for a traditional RDB and its idea of records containing values for a fixed set of fields (columns in a tabular view).

Several solutions have been tried in order to manage this data. Firstly, some consumers simply extract only the fields that they are interested in, and store those in an RDB. This works, but data is permanently lost, which may destroy value from future re-analysis of stored data with new ideas. Others tried storing the message records in XML, but although this preserves the tree nature of the data, it is not easily indexable and it is not the format that the consuming code will want to use. What would be ideal would be an object tree in the target environment (i.e. Java objects for a Java program, populated structs for a C++ program, etc), held in an indexable, database-like container to permit querying.

Naturally the presenter's company sells just such a solution, particularly targeted at financial messaging! Their system has predefined type mappings for the financial message types used in the talk. The general concept is that a grammar is specified for the input format, and incoming messages are parsed and translated into an object tree, which is then stored in the data source.

Andrew then moved on to the advantages of using an event-driven data processing paradigm. In a typical flow-based architecture, scalability only happens in one place (the number of streamed processing units you have), even though some parts of the process are probably more resource intensive than others. By having each processing unit small and self-contained, taking data from one storage pool, processing it in some way and returning it to another, and having the data storage pools dispatch events to request processing to occur, each step of the processing becomes scalable and potentially distributable. This is very similar in idea to the pub-sub (publish/subscribe) model of multiple worker processes typically seen in the Q world, and in my time at First Derivatives I saw how easy that makes it to scale and distribute the processing units.
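The worker idea can be sketched in a few lines (the names here are my own illustration, not from the talk): each stage takes work from one queue and puts results on the next, so a stage is scaled simply by running more consumers of its input queue.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// A small, self-contained processing stage between two storage pools
// (modelled here as in-memory queues).
class Pipeline {
    // The actual unit of work, deliberately small and stateless.
    static String parse(String rawMessage) {
        return "parsed:" + rawMessage;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> incoming = new ArrayBlockingQueue<>(10);
        BlockingQueue<String> parsed = new ArrayBlockingQueue<>(10);

        // One parsing worker; any number of these could consume 'incoming'
        // concurrently, which is where the scalability comes from.
        Thread worker = new Thread(() -> {
            try {
                parsed.put(parse(incoming.take()));
            } catch (InterruptedException ignored) { }
        });
        worker.start();

        incoming.put("MT103");          // a message event arrives
        System.out.println(parsed.take());
        worker.join();
    }
}
```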

Storing objects in your database has some big advantages. You can use any object properties in indexing and querying, and you can store exception objects or other state-holding rich failure objects in it to get the most possible information about unsuccessful execution cases. But it can also result in very large record sizes (and of course the data objects must be serialisable in order to distribute the database).

Distributing an object database also raises the questions of how to manage notification (i.e. dispatching events to request that a processing agent take on a data object), and how to deal with exceptions when the system is spread across multiple instances. These problems have already been looked at in solutions that use conventional RDBs, but they are more immediately obvious when the object tree is integrated with the data storage and when the database and application are distributed together.

This slot, after a first round of free beer had been handed out and before the conference party, is not a particularly serious one. Damian's presentation was full of jokes and the audience obliged with plenty of laughs. However, it did contain some valuable points, too. The main point was that a monoculture is bad, in software engineering just as it is in biology. Having a diversity of thought patterns prepares us better for life in the problem wilderness; if you only know one language (or, in my opinion, only one type of language; knowing C# or even Delphi or C++ as well as Java doesn't add much) then you only know one way of thinking, and you miss out on other perspectives. You may even miss out on knowing that you are missing out on other perspectives!

Where Damian went to find new thought patterns was a dark place: "the language morgue, where programming languages go to die". First he brought back to life PostScript, the purely stack-based reverse-Polish language for drawing graphics, implementing Buffon's Needle (a recreation of the 18th century Comte de Buffon's experiment of dropping needles onto lines to determine π) in it. Although many jokes were cracked at PostScript's expense, and it is a rather impractical language for serious coding, he pointed out that it has things to teach us about parameter passing, results and variable redefinition. All languages are really stack based; it's just that PostScript is honest about it.
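For comparison with the PostScript version, here is Buffon's Needle as a Monte Carlo simulation in a conventional language (my own sketch, not Damian's code): with needle length equal to the line spacing, the crossing probability is 2/π, so π can be estimated as 2 × drops / crossings.

```java
import java.util.Random;

// Drop unit-length needles on lines spaced one unit apart and count
// crossings; a needle crosses when its centre's distance to the
// nearest line is at most (1/2) * sin(angle).
class Buffon {
    static double estimatePi(int drops, long seed) {
        Random rnd = new Random(seed);
        int crossings = 0;
        for (int i = 0; i < drops; i++) {
            double centre = rnd.nextDouble() / 2.0;        // distance to nearest line, [0, 0.5)
            double angle = rnd.nextDouble() * Math.PI / 2; // needle angle to the lines
            if (centre <= Math.sin(angle) / 2.0) {
                crossings++;
            }
        }
        return 2.0 * drops / crossings;
    }
}
```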

Next came some C++ code unlike anything most of us in the audience had ever seen before. Damian had implemented a state machine and declarative state transitions by creating classes to represent them, complete with operator overloads allowing him to write wire-up code like:

state(0) ------------------> QUOTE() --------------> state(1);

... and execution code that traversed the state tree to parse input or even resolve cyclic dependency chains. This received the appropriate level of 'amused and bemused' reaction from the floor.

Finally we were treated to a flight of fancy into the deadest language of all: Latin. Because Latin has cases and verb declensions that define the meaning of each word precisely, word order doesn't matter. Can't we write a programming language like that, too? It turns out we can: procedures become verbs, value-returning functions gerunds and variables nouns (singular or plural as appropriate); values used as lvalues (the target of an action) become dative and those used as parameters to an action become accusative; functions are defined in the infinitive and executed in the imperative. Even code blocks and control structures can be translated, with the brace markers becoming sic and cis! Damian demonstrated the Sieve of Eratosthenes algorithm for finding prime numbers (which you may have seen in my discussions about Rowan) in Latin, to a good round of applause.
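For reference, the same sieve in a living language (my own sketch, not the Latin version from the talk): mark every multiple of each prime as composite, and whatever remains unmarked is prime.

```java
import java.util.ArrayList;
import java.util.List;

// Sieve of Eratosthenes: cross off multiples of each prime found;
// any number never crossed off is itself prime.
class Sieve {
    static List<Integer> primesUpTo(int limit) {
        boolean[] composite = new boolean[limit + 1];
        List<Integer> primes = new ArrayList<>();
        for (int candidate = 2; candidate <= limit; candidate++) {
            if (!composite[candidate]) {
                primes.add(candidate);
                for (int multiple = 2 * candidate; multiple <= limit; multiple += candidate) {
                    composite[multiple] = true;
                }
            }
        }
        return primes;
    }
}
```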

Even this had a point, though: getting arguments in the correct order can be hard for newcomers, and there are places where we can remove that problem from coding: named arguments and associative arrays, for example. C# now allows instantiation and assignment of properties by name for data objects (as well as for attributes, where it has been available from the beginning). Even 2000-year-old languages, which aren't even computing languages, can teach us new ideas. All the entertainment aside (and I recommend you watch this talk when it becomes available), I definitely agree with the core point: knowing several languages and several approaches gives you more tools in your mental toolbox, so when you're presented with a problem, you can apply the right one. OO is not always the right one.