Archive for August, 2011

I recently worked on adding the meta data section for each of the different document types that our application serves, which involved showing 15-20 pieces of data for each document type.

There are around 4-5 document types and although the meta data for each document type is similar, it's not exactly the same!

When we got to the second document type it wasn’t obvious where the abstraction was so we went for the copy/paste approach to see if it would be any easier to see the commonality if we put the two templates side by side.

We saw some duplication in the way that we were building up each individual piece of meta data but couldn’t see any higher abstraction.

We eventually got through all the document types and hadn’t really found a clean solution to the problem.

I wanted to spend some time playing around with the code to see if I could find one but Duncan pointed out that it was important to consider that refactoring in the bigger context of the application.

Even if we did find a really nice design it's probably not going to give us any benefit since we've covered most of the document types and there may be just one more that we have to add the meta data section for.

The return on investment for finding a clean generic abstraction won’t be very high in this case.

In another part of our application we need to make it possible for the user to do faceted search but it hasn't been decided what the final list of facets to search on will be.

It therefore needs to be very easy to make it possible to search on a new facet and include details about that facet in all search results.

We spent a couple of days about 5/6 weeks ago working out how to model that bit of code so that it would be really easy to add a new facet since we knew that there would be more coming in the future.

When that time eventually came last week it took just 2 or 3 lines of code to get the new facet up and running.

In this case spending the time to find the generic abstraction had a good return on investment.
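As an illustration of the sort of abstraction that makes this possible, here is a minimal Scala sketch with hypothetical names, not our actual code:

```scala
// A facet knows its name and how to pull its value out of a document.
// Adding a new facet to the search is then just one more entry in a list.
trait Facet {
  def name: String
  def valueOf(document: Map[String, String]): Option[String]
}

object Facets {
  private def field(facetName: String, key: String): Facet = new Facet {
    val name = facetName
    def valueOf(document: Map[String, String]) = document.get(key)
  }

  // both the search code and the search results iterate over this list
  val all: Seq[Facet] = Seq(
    field("author", "author"),
    field("year", "publicationYear")
  )
}
```

With something along these lines in place, adding a facet really does come down to a 2 or 3 line change to the list.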

I sometimes find it difficult to know exactly which bits of code we should invest a lot of time in because there are always loads of places where improvements can be made.

Analysing whether there’s going to be a future return on investment from cleaning it up/finding the abstraction seems to be a useful thing to do.

Of course the return on investment I’m talking about here relates to the speed at which we can add future functionality.

I guess another return on investment could be reducing the time it takes to understand a piece of code if it’s likely to be read frequently.

The last couple of applications I’ve worked on have had almost completely read only databases where we had to populate the database in an offline process and then provide various ways for users to access the data.

This creates an interesting situation with respect to how we should setup our development environment.

Our normal setup would probably have an individual version of that database on every development machine and we would populate and then truncate the database during various test scenarios.

This actually means that our tests are interacting with the database in a different way than we would see during the running of the application.

It also means that we have more infrastructure to take care of and more software updates to do although using tools like Chef or Puppet can reduce the pain this causes once the initial setup of those scripts has been done.

On the project I worked on last year we started off with the individual database approach but eventually moved to having a shared database used by all the developers.

We only made the move once we had the real production data and the script which would populate that data into our database ready.

The disadvantage of having this shared database is that our tests become more indirect.

We wrote our tests against data which we knew would be in our production data set, which meant that if anything failed there was a bit more investigation to do since the data setup had been done elsewhere.

On the other hand no one had to worry about getting it set up on their machine, which had proved to be tricky to totally automate.

We have a similar situation on the application I'm currently working on and have noticed that, as a result of adding data to the database in each test, we run into problems that don't usually exist.

For example, in one test the database takes a bit of time to sort out its indexes, which means that some tests intermittently fail.

We found a bit of a hacky way around this by forcing the database to reindex in the test and waiting until it has done so but we’ve now solved a problem which doesn’t actually exist in production.

This approach wouldn't work as well if we had a read/write database since we'd end up with tests failing because another developer's machine had mutated the data they relied on.

My colleague Pat Fornasier has been using an interesting spin on the idea of making decisions at the last responsible moment by encouraging our team to ‘feel the pain’ before introducing any constraint in our application.

These are some of the decisions which we’ve been delaying/are still delaying:

Dependency Injection

Everyone in our team comes from a Java/C# background and one of the first technical decisions that gets made on applications in those languages is which dependency injection container to use.

We decided to just create a trait where we wired up the dependencies ourselves and then inject that trait into the entry point of our application. Effectively it acts as the ApplicationContext that a framework like Spring would provide.

I was fairly sure that we’d need to introduce a container fairly quickly but it’s been 10 weeks and we still haven’t felt the need to do that and our application is simpler as a result.
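A minimal sketch of what a wiring trait like that can look like (the class names here are made up for illustration):

```scala
// Hypothetical sketch of wiring dependencies by hand in a trait,
// instead of using a dependency injection container.
trait ApplicationWiring {
  // concrete dependencies are created in one place...
  lazy val documentRepository = new DocumentRepository
  lazy val searchService = new SearchService(documentRepository)
}

class DocumentRepository {
  def find(id: String): String = "document " + id
}

class SearchService(repository: DocumentRepository) {
  def search(id: String): String = repository.find(id)
}

// ...and the trait is mixed into the entry point of the application
object Application extends ApplicationWiring {
  def main(args: Array[String]): Unit = {
    println(searchService.search("1"))
  }
}
```

Swapping a dependency for the whole application means changing one line in the trait, which covers a lot of what we'd otherwise lean on a container for.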

Data Ingestion

As I mentioned in an earlier post we have to import around 5 million documents into our database by the time the application goes live.

Our initial attempt at writing this code was single threaded and it was clear that there were many places where performance optimisations could be made.

Since we were only ingesting a few thousand documents at that stage it still ran pretty quickly so Pat encouraged us to wait until we felt the pain before making any changes.

That duly happened once the number of documents increased and it started taking 3/4 hours to run the job in our QA environment.

Complex markup in documents

We decided to incrementally add different types of documents into the database.

This meant that initially all our transformations involved just getting a text representation of XML nodes even though we knew that eventually we’d need to do more processing on the data depending on which tags appeared.

These data transformations actually turned out to be more complicated than we’d imagined so we might have delayed the pain here a little bit too long.

On the other hand we were able to show early progress to our business stakeholders which probably wouldn’t have been the case if we’d tried to take on the complex markup all at once.
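The early 'text only' transformations amounted to something like this sketch, using Scala's built-in XML support (the document structure here is made up):

```scala
import scala.xml.XML

// a made-up document with markup nested inside a node
val document = XML.loadString(
  "<document><title>On <i>immutability</i></title></document>")

// \ selects child nodes and .text flattens everything to plain text,
// silently dropping the <i> tag along the way
val title = (document \ "title").text
```

The pain shows up later, when tags like `<i>` actually need their own handling rather than being flattened away.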

N.B.

One thing to note with this approach is that we need to make sure there is a feedback mechanism to recognise when we are feeling pain, otherwise we'll end up going beyond the last responsible moment more frequently.

There will probably also be more complaints about things not being done 'properly' since we're waiting longer before we actually do so.

We have a code review that the whole team attends for an hour each week which acts as the feedback mechanism, and we recently started using Fabio's effort/pain wall to work out which things were causing us the most pain.

We just do some simple subtraction between the start and end build times and then filter out any results which have an end time before the start time. I'm not entirely sure why we end up with entries like that, but having those in the graph totally ruins it!
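The formatBytes function under discussion lives on a Formatting trait along these lines (a hypothetical reconstruction of it):

```scala
trait Formatting {
  // a pure function: the output depends only on the bytes value passed in
  def formatBytes(bytes: Long): Long =
    math.round(bytes.toDouble / 1024)
}
```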

The formatBytes function is mixed into various objects which need to display the size of a file in kB like this:

class SomeObject extends Formatting {}

By mixing that function into SomeObject, any of the clients of SomeObject would now be able to call that function and transform a bytes value of their own!

The public API of SomeObject is now cluttered with this extra method although it can’t actually do any damage to the state of SomeObject because it’s a pure function whose output depends only on the input given to it.

There are a couple of ways I can think of to solve the modifier ‘problem’:

Make formatBytes a private method on SomeObject

Put formatBytes on a singleton object and call it from SomeObject

The problem with the first approach is that it means we have to test the formatBytes function within the context of SomeObject, which makes our tests much more difficult than if we could test it on its own.

It also makes the discoverability of that function more difficult for someone else who has the same problem to solve elsewhere.

With the second approach we’ll have a dependency on that singleton object in our object which we wouldn’t be able to replace in a test context even if we wanted to.
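The second approach would look something like this (again a hypothetical sketch), with the hard-wired dependency visible inside SomeObject:

```scala
// formatBytes moved onto a standalone singleton object
object ByteFormatting {
  def formatBytes(bytes: Long): Long =
    math.round(bytes.toDouble / 1024)
}

class SomeObject {
  // the reference to the singleton is hard wired here, so there's no way
  // to swap it out in a test context even if we wanted to
  def fileSizeInKb(bytes: Long): Long = ByteFormatting.formatBytes(bytes)
}
```

On the plus side formatBytes is now easy to find and easy to test on its own; the trade-off is that hard-wired reference.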

While thinking about this afterwards I realised that it was quite similar to something that I used to notice when I was learning F# – the modifiers on functions don't seem to matter if the data they operate on is immutable.

I often used to go back over bits of code I’d written and make all the helper functions private before realising that it made more sense to keep them public but group them with similar functions in a module.

I’m moving towards the opinion that if the data is immutable then it doesn’t actually matter that much who it’s accessible to because they can’t change the original version of that data.

private only seems to make sense if it's a function mutating a specific bit of data in an object but I'd be interested in hearing where else my opinion doesn't make sense.