All Unkept - Programming, web development, and other stuff
Luke Plant, http://lukeplant.me.uk/blog
You can't compare language features, only languages (2014-11-11)

A lot of programming language debate is of the form “feature X is really good,
every language needs it”, or “feature X is much better than its opposite feature
Y”. The classic example is static vs dynamic typing, but there are many others,
such as different types of meta-programming etc.

I often find myself pulled in both directions by these debates, as I’m rather
partial to both Haskell and Python. But I’d like to suggest that doing this kind
of comparison in the abstract, without talking about specific languages, is
misguided, for the following reasons:

Language features can take extremely different forms in different languages

In my experience, static typing in Haskell is almost entirely unlike static
typing in C, and different again from C# 1.0, and, from what I can tell, very
different from static typing in C# 5.0. Does it really make sense to lump all
these together?

Similarly, dynamic typing in shell script, PHP, Python and Lisp takes forms that
are perhaps more different than they are alike. You can't even put them on a
spectrum — for example, Python is not simply a ‘tighter’ type system than PHP
(in not treating strings as numbers etc.), because it also has features that
allow far greater flexibility and power (such as dynamic subclassing, made
possible by first-class classes).

Combination of features is what matters

One of my favourite features of Python, for example, is keyword arguments. They
often increase the clarity of calling code, and give functions the ability to
grow new features in a backwards-compatible way. However, this feature only
makes sense in combination with other features. If you had keyword arguments
without the **kwargs syntax for passing and receiving an unknown set of
keyword arguments, writing decorators would be extremely difficult.

If you are thinking of how great Python is, I don’t think it helps to talk about
keyword arguments in general as a killer feature. It is keyword arguments in
Python that work particularly well.
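A small sketch (the function names are invented for illustration) of how keyword arguments and the **kwargs syntax combine to make generic decorators possible:

```python
import functools

def log_calls(func):
    # The wrapper forwards an *unknown* set of positional and keyword
    # arguments unchanged - this is what makes a generic decorator
    # possible at all.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print("calling", func.__name__)
        return func(*args, **kwargs)
    return wrapper

@log_calls
def greet(name, greeting="Hello"):
    return "%s, %s!" % (greeting, name)
```

Without **kwargs, `wrapper` would have to know the signature of every function it might wrap.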

Comparing language features opens up lots of opportunities for bad arguments

For example:

Attacking the worst implementation

So, a dynamic typing advocate might say that static typing means lots of
repetitive and verbose boilerplate to indicate types. That criticism might apply
to Java, but it doesn't apply to Haskell and many other modern languages, where
type inference handles 95% of the cases where you would otherwise need to
specify types.
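As a small illustration (the function and its name are mine, not from any particular codebase), GHC infers the full polymorphic type of a definition like this without a single annotation:

```haskell
-- GHC infers the type
--     pairSums :: Num a => [a] -> [a] -> [a]
-- entirely on its own; nothing here is annotated.
pairSums xs ys = map (\(a, b) -> a + b) (zip xs ys)
```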

Defending the best implementation

The corollary to the above fallacy is that if you are only debating language
features in the abstract, you can pick whichever implementation you want in
order to refute a claim. Someone claims that dynamic typing makes IDE support
for refactoring very difficult, and a dynamic typing advocate retorts that this
isn’t the case with Smalltalk — ignoring the fact that they don’t use Smalltalk,
they have never used Smalltalk, and their dynamically-typed language of
choice does indeed present much greater or even insurmountable problems to
automated refactoring.

Defending a hypothetical implementation

Defending the best implementation goes further when you actually defend one that
doesn’t exist yet.

The mythical “smart enough compiler” is an example of this; another is when
dynamic typing advocates talk about “improving” dynamic analysis.

Hypothetical implementations are always great for winning arguments, especially
as they can combine all the best features of all the languages, without worrying
about whether those features will actually fit together, and produce something
that people would actually want to use. Sometimes a hybrid turns out like Hercules, and sometimes like the
Africanized bee.

Ignoring everything else

In choosing a programming language, it's not only the features of the language
that you have to consider — there is a long list of other factors, such as the
maturity of the language, the community, the libraries, the documentation, the
tooling, the availability (and quality) of programmers etc.

Sometimes the quality of these things is determined by accidents of history
(which language became popular, and when), and sometimes it can be traced back
to features of the language design.

Many language-war debates ignore all these things. But it’s even easier if you
are not actually comparing real languages — just language features, abstracted
from everything else.

I understand that comparing everything at once is difficult, and we will always
attempt to break things down into smaller pieces for analysis. But I doubt that
this gets us very far with programming languages, because the different features
interact with each other, and also exert a huge influence on the way that
everything else develops, e.g. libraries.

Conclusion

Language features exist within the context of a language and everything
surrounding that language. It seems to me that attempts to analyse them outside
that context simply lead to false generalisations.

Of course, being really concrete and talking about specific languages often ends
up even more personal, which has its own pitfalls! Is there a good way forward?

The problem

Many programs build up sentences from pieces — often a template into which
different things are substituted. However, the things you substitute into a
sentence can change the rest of the sentence, and vice versa, in ways that the
programmer did not anticipate.

For example, plurals. In English, you might try code like this:

    if n == 1:
        return "I have 1 pig"
    else:
        return "I have %s pigs" % n

Localising these strings gives problems, because the rules for creating plural
forms are different in every language.

This specific problem is generally considered 'solved' by the use of gettext,
but many more exist.
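For reference, gettext's handling of plurals in Python looks roughly like this (using the null catalogue, which simply falls back to English; a real application would load a compiled .mo file):

```python
import gettext

# NullTranslations returns the untranslated strings; with a real
# catalogue loaded, the translation file - not the programmer -
# decides which plural form to use for a given n.
t = gettext.NullTranslations()

def pig_message(n):
    return t.ngettext("I have %d pig", "I have %d pigs", n) % n
```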

For example, we have another problem as soon as we start substituting nouns:

    "Delete selected %s?" % object_name

Various attributes of the noun can affect the sentence. In French, the
adjective "selected" needs to agree in gender with the noun being substituted
in, so you cannot look up the translations for "Delete selected %s" and for
object_name separately. (This is a real example taken from the Django source
code.)

Further, depending on how the sentence uses the noun, the form of the noun might
need to change. For example, the noun might appear in the accusative position
for a given sentence and language, which requires a different form of the noun
to be used compared to the nominative form.

Several other examples of this appeared in Django ticket 11688. One proposed solution on that
ticket would require a huge amount of knowledge and effort on the part of Django
programmers, and almost certainly would not work anyway.

This post is an attempt to come up with a better solution, or at least kick
start discussion. I haven't been able to find any solutions to this problem
online, and most people seem to be just using gettext, which is a 95% solution —
and maybe that is good enough for most people.

[Update 2013-02-19 - ‘Richard’ pointed me to the Locale::Maketext article,
which takes essentially the same approach as the one I've arrived at here.]

Assumptions and simplifications

We will assume that a sentence is a composable unit of meaning, such that
sentences can be translated independently. So, if in language A we have
sentences 1 and 2, in that order, we can translate these into language B by
translating sentence 1 and sentence 2 independently, and putting them together
in the same order.

This is, no doubt, a simplification. In some languages, the two sentences might
make more sense re-ordered, or combined, or split in various ways. Indeed,
some languages may not have a truly equivalent concept of 'sentence' at all.

However, we have to do something, and this is a reasonable approximation.

Requirements

We need a powerful way of defining sentences in a given human language. It must
be powerful enough that the person doing the translation can do anything they
need, without the programmer needing to be aware of all the things in the
language that will cause difficulty.

So, we'll start with a full programming language, and chop out the things we
shouldn't need.

We shouldn't need side effects - translation should be a pure function. So we'll
use a purely functional programming language without side effects.

We need something fairly readable, because translators are going to have to use
it. It should be as close as possible to declarative in style.

Pattern matching seems like a great fit for some of our needs.

Possible solution

Given the above requirements, let's start with a Haskell-like pure functional
language, whose pattern matching will be extremely helpful. It will obviously
have IO removed, and no type signatures (but that won't stop us inferring them
and being able to statically type-check the code). Everything else will be
borrowed directly from Haskell, so that I can avoid having to make up my own
syntax and semantics.

If the concept works, we can argue about better or simpler syntax for some
constructs, or helper functions that aren't part of the Haskell prelude.

Hopefully, we will find a relatively small subset of Haskell that gives us all
the power we need to solve this problem - a subset small enough that, ideally,
we could guarantee termination, to avoid problems with translations created by
malicious agents.

This will be an example based exploration.

Let's assume that every sentence can be generated by a function. The function
will take as parameters all the substitutions that are needed, and return the
translated string.

So, suppose we have the English sentence "I have some pigs". For every different
language we need, we would have a translation file which contains the function
iHaveSomePigs, which in this case takes zero parameters. So for French:

    iHaveSomePigs = "J'ai des cochons"

(The mapping between the English sentence "I have some pigs" and the function
name iHaveSomePigs hasn't been defined, and we'll skate over that detail for
now).
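For a sentence with a count in it, such as "I have %d pigs", the French translation function might look something like this (a sketch; the wording is mine):

```haskell
-- The first matching definition wins: the literal 1 catches the
-- singular, and the variable n catches everything else.
iHaveNPigs 1 = "J'ai un cochon"
iHaveNPigs n = "J'ai " ++ show n ++ " cochons"
```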

(For those unfamiliar with Haskell, the way that pattern matching works is that
the first definition that matches the arguments is used. Since n is not a
literal, but a variable, it can match any argument.)

We can cope with more complicated rules, such as those used in Polish, perhaps
something like this:
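A sketch of what the Polish version could look like. The plural-selection rule is the standard one for Polish; the Polish word forms themselves are illustrative only and may not be the ones a translator would choose:

```haskell
-- Which of the three Polish plural forms to use for a count n
-- (the standard gettext rule for Polish).
pluralForm n
  | n == 1 = 1
  | n `mod` 10 >= 2 && n `mod` 10 <= 4
      && (n `mod` 100 < 10 || n `mod` 100 >= 20) = 2
  | otherwise = 3

-- One 'plurals' line per word: (form 1, form 2, form 3).
plurals "pig" = ("świnia", "świnie", "świń")

pluralize n word =
    case pluralForm n of
        1 -> one
        2 -> few
        _ -> many
  where (one, few, many) = plurals word

iHaveNPigs n = "Mam " ++ show n ++ " " ++ pluralize n "pig"
```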

Note that the complex logic in pluralForm and pluralize only has to be
defined once. Adding more words simply requires additional plurals
lines. It's not the nicest syntax, but it could probably be improved, and it's
pretty easy to copy.

Let's add in gender, using the sentences "Delete this %s?" (singular) and "Delete
selected %s?" (plural). We can use guards:

Note that the only thing required by this system is that the functions
deleteThisThing and deleteSelectedThings exist. Everything else is at the
freedom of the translator, and better ways of defining any of these functions
are possible.

Of course, it isn't expected that a translator would be able to produce this by
himself/herself. However, once the basic logic has been set up, this syntax is
readable enough that a translator could easily add more of the same. Lines like:

    pluralForm 1 "pig" = "cochon"

are actually pretty readable. The lack of parentheses in Haskell function calls
is also a bonus (though, as I said earlier, the exact syntax could be debated).
This is not really much harder than editing a .po file if you just want to add
more of the same.

Also, we've got flexibility. If we really don't care about getting the gender
right, we can just write "sélectionné(e)s" and be done with it.

Let's make it harder - we'll add case. I'll use NT Greek as an example,
because it has nouns that decline with case (and I don't know any similar modern
languages well enough). I'm going to introduce an enum for the different cases,
using data for now, and for the different genders. I could also do the same
for number ("Singular" and "Plural"), but just using 1 and 2 seems
easier.

Our sentence will be "You like the %s.". For this in Greek, we need to choose
the accusative singular form of the thing we pass in. We also need to pick the
word for "the" (the definite article) which matches the gender and number of
the noun, and it has to match the accusative case too. So, if we pass in a
masculine word, we need the singular accusative masculine definite article
(having fun yet?):
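A sketch of how this could be set up. The table layout and vocabulary are mine (only the accusative singular rows are filled in), but the shape follows the description above, with enums for case and gender and plain numbers for singular/plural:

```haskell
data Case   = Nominative | Accusative | Genitive | Dative
data Gender = Masculine | Feminine | Neuter

gender "book" = Neuter
gender "word" = Masculine

-- Definite article, by case, number and gender
-- (only the accusative singular shown here).
definiteArticle Accusative 1 Masculine = "τον"
definiteArticle Accusative 1 Feminine  = "την"
definiteArticle Accusative 1 Neuter    = "το"

-- The noun itself, by case and number.
greek Accusative 1 "book" = "βιβλιον"
greek Accusative 1 "word" = "λογον"

-- "You like the %s."
youLikeTheThing thing =
    "φιλεις " ++ definiteArticle Accusative 1 g ++ " " ++ greek Accusative 1 thing
  where g = gender thing
```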

Of course, you can easily define shorter aliases to avoid some typing here, and
there may be better ways to generate the tables, though as written above they
are pretty readable, and should be familiar to anyone who knows Greek.

The function youLikeTheThing here is no longer very readable, although it
could be much worse. Some kind of substitution syntax/function could be used.

The code above actually works, BTW, and it ran first time I tried it - the only
correction needed to make its output correct was adding a space after the
definite article. You just need to put it in a file test.hs, add the
following line:

    main = putStrLn $ youLikeTheThing "book"

and do:

    $ runhaskell test.hs

There is not a type signature in sight, but you have compile time
guarantees. This is all a testimony to the clarity of Haskell's syntax.

The features of Haskell we've used are:

functions

simple pattern matching on numbers and strings

guards

data statements, limited to union types of nullary constructors
i.e. effectively enumerated values. We could use a keyword enum for
clarity.

string concatenation

lists

a few arithmetic and logical operators

We haven't used recursion. I can imagine circumstances where it might be useful,
but if it is deemed too risky, you could add rules that disallow it
(e.g. by requiring that a function must not call itself directly, and must only
call functions that appear before it in the source code, to rule out mutual
recursion). This would help to ensure termination.

You might also want a module system, to be able to pull in some common
definitions and functions for a given language, for consistency across different
projects.

This whole approach has the advantage of being able to refine and special case
as much as you want. Take the sentence "you like the %s": suppose that if the
thing is a human being e.g. "man" or "woman", you need to use a completely
different verb. Then you just add a special case first:

    isAPerson "man" = True
    isAPerson "woman" = True
    isAPerson n = False

    youLikeTheThing thing
      | isAPerson thing = ...
    -- fall through to the normal case here

In the other direction, if you just don't have the time to care about any of
this, you can just use a really simple (and often wrong) formula:

    youLikeTheThing thing = "φιλεις τον " ++ greek thing

    greek "book" = "βιβλιον"

Notice that the programmer of the main project does not know anything about
plural forms, gender, case etc., or put any of that into the source code. The
only thing he/she would do is call a function with all the things to be
substituted. We could have some mapping from English strings to function names,
or we could just use the function name as a string, e.g. from a Python project
we might call the function like so:

    prompt = translate("doYouWantToDelete", n, object_name)

This would call the translation function doYouWantToDelete with the parameters
n and object_name.

As a refinement, we can provide a version which will work when the whole
localisation machinery is turned off i.e. we allow the programmer to provide
their own version of the translation function which returns the default language:

    prompt = translate("doYouWantToDelete", n, object_name,
                       lambda n, object_name: "Do you want to delete these %s %s(s)" % (n, object_name))

As before, the provided function can be correct or simplistic as desired for
English.

Feedback

There are a few questions in my mind:

Would a solution like this work for the languages you know? What additional
features would be needed to cope with other human languages?

Is this vaguely practical? Could you get translators to be able to edit code
like this? If not, and only programmers would be able to do this, are there
enough programmer-translators to make it a viable solution, at least for some
big projects?

I'm aware that the string concatenation gets ugly fairly quickly, and some
kind of interpolation might be needed (including the ability to call
functions within that interpolation). With that in place, I think you could
achieve a reasonable level of readability.

A translation tool could also have language-specific templates to quickly
insert the code for common forms.

Is it possible to have a simpler language that would still be able to cope
with the examples here?

The examples I've come up with suggest to me that you need a full programming
language, and that attempting to start from the other direction (e.g. build
up from the current gettext approach) will produce a monstrosity.

gettext already does a 95% job, and we are at the point of diminishing
returns. So if we are going to try to tackle the final bit, we need to err on
the side of having enough power to get all of that remaining 5%, rather than
put in a lot of effort and discover we've only arrived at 96%.

It also covers the case of having a client who insists that the program
should output "cet homme" and not "ce homme" - while it might make your
translation file ugly, you've got the power to do it if you want.

Theoretically there's nothing you can't do in either, as long as the
languages are Turing Complete. The more interesting question to me is what's
easy or natural in one vs. the other.

This post is about providing an example to back that up, and to respond to
people who claim that, since you can implement dynamic types in a statically
typed language, statically typed languages give you all the benefits of
dynamically typed languages.

[Edit: to those who think I'm being a language or dynamic typing advocate or engaging in any kind of bashing, please read that last paragraph again, and note especially the use of word 'all'.]

Let's set up a problem. It's made up, but it illustrates the point I want to make:

Given a file, 'invoices.yaml', take the first document in it, extract the
'bill-to' field, and save the data in it as JSON in an output file
'address.json'. You can take it for granted that the contents of that field
can be serialised as JSON (e.g. doesn't contain dates), although that might
not be true for the rest of the document. To keep the example focussed and
simple, everything will be ASCII.

The particular YAML file I used was taken from an example YAML document I found
on the web, and then expanded for the sake of illustration:

I'll use Python and Haskell as representatives of dynamic typing and static
typing, because I know them and many would consider them to be very good
representatives of their camps, and I'm a big fan of both languages.

I also think that examining any programming problem in the abstract, or with
respect to ideas like ‘dynamic typing’ or ‘static typing’, is not very relevant,
because in the real world we have to use real, concrete languages, and they come
with a whole set of properties (in terms of the language definition, tool sets,
communities and libraries) that make a massive impact on how you actually use
them.

So I'm going to try to use real libraries that actually exist, ignore solutions
that could theoretically exist but don't, and ignore problems that could
theoretically exist but don't.

Python
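A minimal sketch of the Python solution, assuming PyYAML is installed (the post mentions yaml.load_all; safe_load_all is the equivalent that refuses arbitrary object construction). The stand-in YAML content is mine, since the original file isn't reproduced here:

```python
import json
import yaml  # PyYAML

# Stand-in for the invoices.yaml file used in the post.
with open("invoices.yaml", "w") as f:
    f.write("invoice: 34843\nbill-to:\n    given: Chris\n    city: Royal Oak\n")

with open("invoices.yaml") as f:
    # load_all/safe_load_all yield one document at a time;
    # we only want the first.
    doc = next(yaml.safe_load_all(f))

with open("address.json", "w") as f:
    json.dump(doc["bill-to"], f)
```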

Notes: I didn't have to consult docs once. This isn't just due to my
familiarity with Python — it's also the fact that I can fire up IPython and
go:

In [1]: import yaml
In [2]: yaml.<TAB>

and get a list of likely functions. I can then go:

In [3]: yaml.load_all?

and get help, or go:

In [4]: yaml.load_all??

and get the complete source code of the function/method/class/module, in case I
need it.

Haskell

Now for the Haskell version. First, a disclaimer: I'm much less experienced
in Haskell than in Python. I did manage to write my blog software in Haskell at one point, but I
don't use Haskell on anything like a daily basis, and I do use Python that much.

I first need to parse YAML. I've got a choice of packages. Unlike in Python, for
a library like this, the choice you make is likely to have a big impact on the
code you write — switching to a different (perhaps faster) package won't be just
a case of changing an import, as we will see. The choice of packages represents
the fact that even designing how this thing should work in terms of API and data
structures is not straightforward in Haskell, and represents a much bigger
commitment, and therefore problem, for the library user. In Python, while there
are a few API choices (like supporting streaming or not, potentially), mostly
it's pretty obvious how the library should work.

Looking on Hackage, I first find the 'yaml' package. The first line of the
Data.Yaml API docs
reads:

A JSON value represented as a Haskell value.

(Yes, you read that right.) This doesn't look good. The whole file has stuff
about JSON, not YAML, with no indication of why I would want to be using JSON
values rather than YAML. But I had a go anyway - perhaps it was deliberate.

When trying to use the decodeFile function, I get an error about needing a type
signature, due to the way decodeFile is defined:

    decodeFile :: FromJSON a => FilePath -> IO (Maybe a)

There are lots of instances of FromJSON to choose from, but I have to know the
type of the data in advance. And it looks like I've got data that isn't going to
fit into any of those types, because it involves heterogeneous collections.
[Correction in comments, see below].

I gave up and tried another package - Data.Yaml.Syck.

First try:

    import Data.Yaml.Syck

    main = do
        d <- parseYamlFile "invoices.yaml"
        print d

This works - well, I've got some kind of parsing going on, at least. It looks
like I've got some YamlNode data structure, and the top thing is an EMap (it
looks like it has only parsed the first document, which is worrying, but doesn't
matter given my requirements, so I'll ignore it). But how do I get the data out?

OK, let's try yaml-light - it wraps HsSyck and has some easier utility
functions, like lookupYL:

    lookupYL :: YamlLight -> YamlLight -> Maybe YamlLight

That expects the lookup key to be a YamlLight, so I need to create one from
a string, somehow. The docs show how to turn a ByteString into a
YamlLight node, and I need to pass in a String, which from previous
experience requires doing something like pack from Data.ByteString.

Now I have to dump to JSON. From a Python perspective, all I want is a function
that can take some ‘native values’ and dump them to JSON, like the Python
json.dump function. But every piece of data in my data structure is wrapped
in things like YStr and YMap.

In addition, though I can see the structure of my data in front of me, the
requirements I've been given don't make guarantees that it will stay the same,
just that it can be converted to JSON. I need a routine that will convert
anything YAML to the equivalent in JSON, where that is possible.

It looks like I could create a JSON instance for YamlLight, so that the
encode function I want to use (which dumps JSON to a string) could take
YamlLight as an input directly. I end up with this:

This works, and I'm sure there are other solutions. If I were cleverer, and knew
Haskell better, I could perhaps write a cleverer, shorter solution, which would
also be proportionately more difficult for someone else to understand, so I'm
not particularly interested in making this code shorter, as it does the job.

But this illustrates why some people like dynamically typed languages. The fact
that you can implement a variant data type in Haskell (such as YamlLight or
JSValue) doesn't mean much, because these data types are not used
everywhere, and therefore you have multiple competing ones that you've got to
convert between. If you did have a single variant datatype that was used
everywhere... you'd have a dynamically typed language, in effect.

The strictness of the type system gave rise to a choice of libraries and APIs
that made my life harder, not easier. I then had to write glue code to marshal
between the dynamic types used by the two libraries I needed.
[Edit: or, as it turned out, I needed to know where to find that glue, possibly in the form of already-written type class instances, or how to get the compiler to write it for me.]

Some people might still prefer the Haskell version. It has some nice properties,
like the fact that the compiler has checked that it can indeed convert any YAML
object into JSON — you'd get a warning if you missed a case. One response to
that might be that if the two types didn't happen to match so well — for
instance, if the YAML library started supporting date/time objects — this
benefit would disappear. If you need to avoid all possible problems up front,
Haskell will help you out more. Python, on the other hand, will allow you to
avoid spending time thinking about theoretical problems which may never happen
in reality.

But there are always runtime errors that you could come across, even in Haskell
— for example, if you want to convert this to cope with non-ASCII documents, the
compiler can't point out all the places you need to fix, and if you forget one
you could still get a runtime exception, or worse, silent data corruption.

So, in my opinion, this is a case where dynamic typing shines, and the ability
to implement dynamic typing on top of static typing simply doesn't give you the
benefits you get in a language that embraces dynamic typing to its core.

There are, incidentally, some interesting developments in Haskell that might
allow the possibility of running programs that aren't quite typed correctly, as
long as you don't encounter the type errors in practice. This could counter some
of the points I've raised — see this interview with Simon Peyton Jones,
from 27:45 onwards.

To some extent I agree with this, but I want to give some reasons why a strong
and powerful static type checker really does eliminate the need for
automated tests in some cases — that is to say, there are instances where the
static type checking makes the automated tests redundant, not the other way
around, and does a better job.

I have very few tests in my Haskell blog software. There are significantly more in the Ella library which I wrote alongside it, but still far from complete coverage. While I like test driven development, and did it for some parts of this project, many times it felt like a waste of time. In some cases it was perhaps misdirected laziness, but I'm not convinced it always was. So what are the characteristics of code that doesn't benefit from automated/unit tests?

Trivial code

If code is extremely simple, it can actually be worse to have tests than to not
have them.

In defending that statement, the first thing to remember is that tests can have
bugs in them too. Many bugs in the tests will be caught, as long as you
follow the rule of making sure the test fails, then writing the code, then
making sure it passes. However, many bugs of omission, which are also very
common, will not be caught - that is, cases where the test fails to test
something it ought to.

Second, there is always a cost to writing tests. So, as the probability of
making a mistake in your code tends to zero, the usefulness of tests against
that code also tends to zero—and not just to zero, it can go negative. You
spent x minutes writing a test for something that didn't need testing, which is
lost time and money already, and you also have extra (test) code to maintain in
the future, and a longer test suite to run.

Third, you can write an infinite number of tests, and still have bugs. You can
have 100% code coverage, and still have bugs. (I'll leave you to do the research
on code coverage if you don't believe me). So, you have to stop somewhere, and therefore you need to know *when* to stop.

So suppose you write a utility function that is used to sanitise phone numbers
that people might enter. It removes '-' and ' ' characters. (The result will of
course be validated separately, but we want to allow people to enter phone
numbers in a convenient way). In Python:

    def sanitise_phone_number(s):
        return s.replace("-", "").replace(" ", "")

The testing fanatics might stop to write a unit test, but not the rest of us,
because:

You would mainly be testing that the built-in string library works.

If you think of the ways that the function is likely to be wrong, the test
is just as likely to fail to catch it. For example, the function above
might really need to strip newline chars as well, but that's not going to be
tested unless I think to write a test for that.

If there actually is a bug here, or the implementation gets more complex so
that it merits a test, I can cross that bridge when I come to it, and it
won't cost me extra.

It's more likely that I'll forget to use this function than that I get it
wrong. Therefore, an integration test would be far more useful. But in some
cases, integration tests can be extremely expensive, both to write and to
run, especially when testing javascript based web frontends, or GUIs that
are not very testable. I'm almost certainly going to test this code by at
least one manual integration test, and after that, do I really need to write
an automatic one?

However, if I was writing the function in a language that was less capable than
Python, I might well write a test for the above.

Declarative code

(You could argue that this is an extension of trivial code, but it feels slightly different, and the case is even stronger).

Imagine your spec says that you should have 5 news items on the front page of
your web site. You are using a library that has utility code for getting the
first n items, or page x of n items each. And of course you are going to use a
constant for that 5, rather than code it right in. So somewhere you are going
to write (assuming Python):

    NEWS_ITEMS_ON_HOME_PAGE = 5

Are you going to write a test that ensures that this value stays at 5, and
doesn't accidentally get changed? Then your code base violates DRY—you now have
two places where you are specifying the number of news items on the home
page. That is, to some extent, the nature of all tests, but it's worse in this
case. With non-declarative code and tests, one instance specifies behaviour,
the other implementation, and it's usually obvious which is correct. But with
declarative code, if one instance is different, how do you know which is
correct?

Or are you going to write a test for the actual home page having 5 items? That
would be pointless, because it's just testing that you are capable of calling a
trivial API, which itself belongs to thoroughly tested code. You might want a
sanity check that you haven't made a typo, but checking that the page returns
anything with a 200 code will often be enough.

What about something like a Django model? Your spec says that a 'restaurant'
needs to have a 'name' which is a maximum of 100 chars. You write the following
code:

Are you going to write code to test that you've typed this in correctly? It
would again be violating DRY. Are you going to check that this interfaces with
the database correctly? There are already hundreds of tests in Django which
cover this. Are you going to write tests that are effectively checking for
typos? Well, if you use this model at all, it's going to be very obvious if
you've made a mistake, and some other simple integration test is going to catch
it.

Haskell

Now, coming to Haskell. You can guess the point I'm going to make.

In Haskell, a lot of code is either trivial or declarative.

Further, many of the kinds of error you could make are caught by the compiler.
Typos and missing imports etc. are always caught, and many other errors besides.

Functional programming languages, especially pure ones, eliminate a lot of the
kinds of mistakes that are easy to make in imperative languages. Everything being
an expression helps a lot—it forces you to think about every branch and return a
value. In monadic code it becomes possible to avoid this, but a lot of your code
is purely functional.

Example 1

Imagine a more complex function than our sanitise_phone_number above. It's
going to take a list of 'transformation' functions and an input value and apply
each function to the value in turn, returning the final value. In some
languages, that would be just about worth writing a test for. You might have to
worry about iterating over the list, boundary conditions, etc. But in Haskell
it looks like this:

apply = foldl' (flip ($))

In the above definition, there is basically nothing that can go wrong. We
already know that foldl' works, and isn't going to miss anything, or fail
with an empty list. You can't forget to return the return value, like you can
in Python. The compiler will catch any type errors. If the function doesn't do
anything approaching what it's supposed to then you'll know as soon as you try
to use it. I've used point-free style, so there isn't any chance of doing
something silly with the input variables, because they don't even appear in the
function definition!

For something like the above, you would often write your type signature first:

apply :: a -> [a -> a] -> a

Once you've done that, it's even harder to make a mistake. It's almost possible
to try vaguely relevant code at random and see if it compiles. For something
like this, if it compiles, and it looks very simple, it's probably
correct. (There are obviously times when that will fail you, but it's amazing
how often it doesn't. You often feel like you just have to keep doing what the
compiler tells you and you'll get working code.)

Is the above code 'trivial' or 'declarative'? Well, that's a tough call. A lot of code in Haskell quickly becomes very declarative in style, especially when written point free.
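For contrast, a rough Python equivalent (a hypothetical sketch, not from the original post) has more that can silently go wrong: you can forget the return statement, or swap the arguments to reduce:

```python
# Hypothetical Python equivalent of the Haskell apply above.
from functools import reduce

def apply_all(value, transformations):
    """Apply each transformation function to the value in turn."""
    return reduce(lambda acc, f: f(acc), transformations, value)
```

For example, apply_all(1, [lambda x: x + 2, lambda x: x * 3]) evaluates to 9, and an empty list of transformations returns the value unchanged.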

Example 2

But what about something much bigger—say the generation of an Atom feed? With a
library that makes use of a strong static type system, this can be actually quite hard to get wrong.

In my blog software, I use the feed library for Atom feeds. The code I've
had to write is extremely simple—a matter of creating some data structures
corresponding to Atom feeds. The data structures are defined to force you to
supply all required elements. Where there is a choice of data type, it forces
you to choose — for example the 'content' field has to be set with either
HTMLContent "<h1>your content</h1>" or TextContent "Your content". (For those who don't know Haskell, it should also be pointed out that there is no equivalent to 'null'. Optional values are made explicit using the Maybe type).
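For a sense of what that looks like, here is a hedged sketch using the feed library's Text.Atom.Feed module (written from memory; constructor and field names may differ between versions of the library, and the URL and date are made up):

```haskell
import Text.Atom.Feed

-- A sketch, not the author's code. nullEntry fills in the required
-- fields (id, title, updated date); optional fields are Maybe values,
-- so nothing can be silently left null.
entry :: Entry
entry =
  (nullEntry "http://example.com/posts/1"     -- entry id (assumed URL)
             (TextString "Post title")        -- title as plain text
             "2009-01-01T00:00:00Z")          -- updated date
    { entryContent = Just (HTMLContent "<h1>your content</h1>") }
```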

After filling in all the values for these feeds, I wrote some very simple 'glue'
functions that fed in the data and returned the result as an HTTP response. I
created 4 different feeds, all of which worked perfectly first time, as soon as
I got them to compile. I cannot see any value, and only cost, in adding tests
for this. A check for a 200 response code and non-empty content might be worth
it, but would be much easier to write as a bash script that uses 'curl' on a few
known URLs.

Had I written this in Python, I might have wanted tests to ensure that the HTML in the Atom feed content was escaped properly and various other things, in addition to a simple check for status 200. But the API of the feed library, combined with the type checking that the compiler has done, has made that redundant, and has tested it far more easily and thoroughly than I could have done with tests.

And it's not in general true that the simple functional test will catch any type errors, because often it will only exercise one route through the code, ignoring the fact that in many places dynamically typed code can return values of different types, which can cause type failures etc.

Example 3

One final example of reducing the need for automated tests is the routing system
I've used in Ella. OK, it's really a chance to show off the only slightly
clever bit of code that I wrote, but hopefully it will explain something of the
power of a strong type system :-)

Consider the following bits of code/configuration in a Django project, which are responsible for matching a URL, pulling out some bits from it and dispatching it to a view function.
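Those bits are missing from this copy of the post; based on the discussion below, they would have looked roughly like this (a hypothetical reconstruction in the old-style Django URLconf idiom; get_member is assumed to be a data-access helper):

```python
# Hypothetical reconstruction, not the author's actual code.
# Note how "this parameter is an integer" is specified in several places.

# urls.py:
from django.conf.urls.defaults import patterns

urlpatterns = patterns('',
    (r'^members/(\d+)/$', 'myapp.views.member_detail'),  # regex allows digits only
)

# views.py:
from django.shortcuts import render_to_response

def member_detail(request, member_id):
    member = get_member(int(member_id))  # URL captures arrive as strings
    return render_to_response('member_detail.html', {'member': member})
```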

Now, there are a number of possible failure points in this code that you might
want some regression tests for. For example, if in the future we change it so
that the URL uses a string such as a user name, rather than an integer, we will need
to change the URLconf, the line in member_detail that calls int, and the
definition of get_member (or use a different function).

There is a DRY or OAOO failure here—the fact that we are expecting an integer is specified multiple times, either implicitly or explicitly. This is one of the causes of fragility in this chunk of code — if one is changed, the others might not be updated, introducing bugs of different kinds. Now, there are things you can do about this, with some small or large changes to how URLconfs work. But they are not complete solutions, and one solution not open to Python developers is the one I coded in Ella.

The equivalent bits of code, with type signatures and explanations of them for
those who don't know any Haskell, would look like this in my system.
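The snippet itself is missing from this copy of the post; based on the description that follows, it would have looked roughly like this (a hypothetical reconstruction: the operators, intParam and the empty decorator list come from the text below, everything else is guessed):

```haskell
-- Hypothetical reconstruction, not the author's actual Ella code.
routes = [ "members/" <+/> intParam //-> memberDetail $ [] ]

memberDetail :: Int -> Request -> IO Response
memberDetail memberId request = ...   -- view body elided in this sketch
```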

You should read <+/> as ‘followed by’ and //-> as ‘routes to’. Just
ignore the $ [] bit for now (it exists to allow decorators to be applied
easily in the routing configuration, but we are applying no decorators, hence the empty list).

intParam is a ‘matcher’: it attempts to pull off the next chunk of the URL
(ending in a '/'), match it and parse it as an integer. If it can do so, it
passes the parsed value on to memberDetail as a parameter i.e. it partially
applies memberDetail with an integer.

The beauty of this system is that nothing can go wrong any more. We still have DRY violations at the moment, but they don't cause a problem, because the
compiler checks for consistency.

In fact, we can even remove the DRY violation. We could change the code like
this:
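The changed snippet is missing from this copy; based on the description below, it would have been along these lines (a hypothetical reconstruction):

```haskell
-- Hypothetical: anyParam replaces intParam, and memberDetail's
-- explicit type signature is dropped.
routes = [ "members/" <+/> anyParam //-> memberDetail $ [] ]

memberDetail memberId request = ...   -- type now inferred via getMember
```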

We've replaced intParam with anyParam, which is a polymorphic version
that can match any parameter of type class Param. You can define your own
Param instances, so this is completely extensible (and you can also define
your own matchers, for complete power). We've also removed the type signature
from memberDetail. So how can anyParam know what type of thing to
match?

This is where type inference comes in. The function getMember will probably
have a type signature, or it will use its parameter in such a way that its type
signature can be inferred. From that, the type of memberId can be inferred.
From that, the type of value that anyParam must return can be inferred. And
from that, finally, the instance of Param can be chosen. The compiler is
using the type system to pick which method should be used to match and parse the
URL parameters based on how those parameters are eventually used.

This is very nice. (At least I think so :-). We've removed the DRY violation,
or, if we choose to use type signatures or explicitly specify types in
routes, DRY violations don't matter because the compiler will catch them for
us.

Would unit or functional tests have caught any problems? Well, they might. If
they checked the happy case, they would prove whether that still works. But
they're unlikely to check whether the URLconf is too permissive, whereas the
compiler can do that kind of consistency check.

The end result is that there are just fewer things that can possibly go wrong.
I'm not saying that you wouldn't bother to write any tests. But in this case,
if memberDetail was really just glue, you might decide to only test its
component parts (for example, by testing the template that it relies on). Since
most of the glue has been constructed so that it can't go wrong, you can focus
tests on what can go wrong. And some sections of the code sink below the threshold at which tests provide positive value.

There are many other ways in which static type checking can make automated tests
redundant. Parsers are a great example — a spec might define a syntax in BNF
notation. In Haskell, you might well implement that using parsec. But if you
look at the code, it will have pretty much a one-to-one correspondence with the
BNF definitions. Any tests you write will simply check that a few examples
happen to be parsed correctly, as you cannot begin to cover the input space.
It's therefore far better to spend your time manually checking that the code
matches the BNF spec than writing lots of tests. Unit tests often will not catch the type of errors that a compiler can if there is any polymorphism in the code paths.
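As a tiny, hypothetical illustration of that one-to-one correspondence (not from the original post):

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- BNF:  integer ::= [ "-" ] digit { digit }
integer :: Parser Integer
integer = do
  sign <- option "" (string "-")   -- the optional "-"
  ds   <- many1 digit              -- digit { digit }
  return (read (sign ++ ds))
```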

Conclusion

Before you flame me, don't think that I'm attacking other languages. This
experience with Haskell has actually proved to me that Python is still easily my
favourite language for web development, especially in combination with
Django. (I could do a follow up on why that is—I have a growing list of things I
dislike about Haskell, some of which are fixable). But I often hear the Python
crowd saying things about static typing and testing that come from ignorance,
and the way you would imagine things to be (often based on experience of
Java/C++/C#), and not from experience of something like Haskell.

I finally finished the Haskell blog project that I've been doing for a long time! You're looking at it now (unless you are reading this a few months/years after I wrote it, in which case I will probably have again re-implemented my blog software in my new language-du-jour...) [EDIT: I switched to blogofile in June 2012]

The blog software itself is not particularly interesting — fairly standard features, Atom feeds etc. It uses HDBC Sqlite for storage, and HStringTemplate for rendering (a nice library, BTW). For framework stuff, it uses my own Ella library. I didn't find a forms/validation library I could use, and ended up just using a few adhoc bits and pieces. I've used the lovely pandoc to allow reStructuredText both for my own posts and for comments, which is a nice feature IMO.

The main interest for me has been the learning process. You get a much better, rounded understanding of a language from a project like this than you do from the small code samples that people knock around.

The project nearly failed at the last hurdle. Everything was working, but when I uploaded to my server, it failed on some URLs. I realised it was a memory problem — the CGI program must have been killed for using too much memory.

At first, I thought the limits on the server must be unreasonably small. Understanding the output of +RTS -s -RTS is kind of difficult. When I eventually found out that GHC compiled programs never release any memory back to the operating system, I realised that it's the first figure—the total amount of memory allocated in the heap—that was killing me. On the bigger pages, this was over 160 Mb. At that point I stopped complaining to my web host!

By changing to ByteString instead of Data.Text for StringTemplate, and using ByteString in a few other places, I achieved a 4-5 fold reduction in memory usage, along with a significant speed up. Most pages now only use about 10-15 Mb to render, which is OK for a short running process I think. It's not ideal, especially when an additional 1k comment on a page seems to require at least 300k extra memory to render, but it's good enough for now. Profiling further will be very hard, as I suspect it will mainly be to do with the guts of HStringTemplate.

I'll be blogging about the experience of developing this over the next few days/weeks, and what I've learnt. It's certainly been enjoyable overall, although it's definitely had its pain points too!

I've put redirection in for all the old, crufty URLs, so there shouldn't be any broken links. Feed readers will likely be confused, sorry!

If you have problems getting through my spam protection, please let me know. It enforces a 10 second wait before it accepts submissions, which serves to prevent thoughtless comments as well as spam :-)

So, I rewrote my blog software in Haskell, for kicks. I've finally finished,
after a long time developing, trying out different ideas, learning Haskell etc.

I had already confirmed that I could build a binary for my target machine. That
was a long process, which involved installing GHC 6.4 from binaries, and using that to build GHC 6.8.3. I have to build from source because of bug #2211.

However, in the process of developing, things have moved on, and it was much
easier to develop with GHC 6.10 and newer libraries than the 6.8.* series.
Which means that I now need GHC 6.10.* on the VM that I'm using to build
binaries.

I tried 6.10.4, but due to bug #3179, I found I had to downgrade
to 6.10.1.

Trying to build that, however, produced bug #3639 — it won't build with GHC
6.10.4. I switched to using my GHC 6.8.3 install to try to build it, but it still
isn't happy:

Now, GHC 6.8.3 comes with base = 3.0.2.0, which might be the problem here. If
that's right, then you can't build GHC 6.10.1 with 6.8.3. So, it sounds like
I'm going to have to build GHC 6.6.1 in order to build 6.10.1.

This seems pretty crazy! It wouldn't be so bad if GHC was quick to build, but
every build takes many hours.
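The example program the following post refers to is missing from this copy; it would have been a round-trip check roughly like this (a hypothetical reconstruction):

```haskell
-- Hypothetical reconstruction: write a string containing a character
-- above U+00FF to a file and read it back. On the GHC versions under
-- discussion, output is silently truncated to the low 8 bits of each
-- character, so the comparison fails.
main :: IO ()
main = do
  let s = "smiley: \x263A"
  writeFile "test.txt" s
  s' <- readFile "test.txt"
  print (s == s')
```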

No prizes for guessing that the output of this program is not "True". It
highlights an essential problem with the Haskell standard library — many of
the functions provided by the Prelude, System.IO, System.Posix and many
others are completely broken (by design) and silently corrupt your data,
unless it is composed only of ASCII characters.

The problem is that these APIs use Strings for operating system calls (such
as reading/writing files, reading environment variables etc). A String is a
list of unicode Chars, but none of the operating system calls have a clue
what unicode chars are — they work entirely with bytes, which are a
completely different kind of thing. Result: your program breaks without
warning if you don't happen to be using ASCII.

And even worse, many libraries are built on the use of Strings and standard
library functions, and they inherit these same problems, so as a user of
those libraries, you can end up with problems that you can't even work
around. For the library developer, too, it can be a very nasty problem — you
start developing code using Strings, which works fine for ages, but a long
time later you realise you can't support just ASCII, and really you need
Data.ByteString, which requires changing function signatures or duplicating
existing code if you don't want to break compatibility.

This is a rather embarrassing situation for the standard library of a modern
language. What's worse is that even if you include the Haskell Platform as it
currently stands, as far as I can see there is no solution to this bug — no
correct way to simply write a string out to disk and read it back! I presume
this is because there is no universally accepted library for dealing with
encodings. Personally, I'd like to see the standard library change to remove
the pretence that you can talk Unicode to the operating system, but at the
very least we need a standardised way of doing the right thing, so that
developers (of both programs and libraries) don't have to use those broken
functions, and know what the correct alternatives are.

What do you do when you are dealing with what seems like a bizarre compiler bug,
with the compiler being nothing less than GHC? First, pinch yourself — check;
then try again, 3 times to be sure — check; clear out 'dist/' and any temporary
build files — check; sleep on it — check.

And it's still happening.

I'm trying to use HStringTemplate for my personal blog
software, in particular the renderf function. I was getting tricky
compilation errors, and in the course of messing around I found the following:

GHC cannot compile a certain function, call it func1 for now, which uses
renderf. But it compiles and works just fine if another function func2
(which doesn't use renderf, but does use a related HStringTemplate function
render) is present in the module, even though func2 is not used
anywhere in the project. Changing some of the details of what func2 does
causes compilation to fail again, though other details can be changed.

That has to be impossible, right? Am I losing my mind?

Ideally I'd create a nice simple test case, but that might take hours, and
changing small things about the voodoo function func2 seems to destroy its
magical properties, and I'm suspecting the problem is in me. So I'll just
post all my code. The bad news is there are lots of dependencies. The good
news is I have used cabal, so the following instructions should suffice if you
have cabal installed.

I don't know whether that compilation error is correct or not, but either way,
it seems crazy that it could depend on the existence and implementation of a
completely unused function.

For reference, I'm using GHC 6.10.1.

Any ideas?

http://lukeplant.me.uk/blog/posts/haskell-regex-problem-help-needed/ (Luke Plant, 2008-11-21)

I need some help! I did someone a good deed on a blog the other day, so I'm swallowing my pride and asking for a random kind deed from someone who knows something.

If I revert to the bytestring that comes with my system, 0.9.0.1, the
error goes away. Having finally looked at the differences between 0.9.0.1 and 0.9.0.2, which are tiny, and do not include any differences in the definition of typeclass instances, it seems clear that this isn't really the problem, but something else very funny is going on. But I do not have the first clue what.

I was just coping with it by sticking with bytestring-0.9.0.1, but I won't be able to do that forever...

Do I have to rebuild all the packages in my system or something evil? Any ideas?

Thanks in advance!

http://lukeplant.me.uk/blog/posts/ella/ (Luke Plant, 2008-11-04)

I have been continuing very slowly with my Haskell blog. Yesterday I properly pulled out the Django-inspired framework I am writing alongside it, and called it Ella, after another jazz genius (though a vocalist -- I much prefer vocal jazz).

There were a number of reasons I didn't like the existing Haskell CGI package, but one of the biggest was the lack of explicit request and response objects. Instead, it did everything inside a CGI monad, which makes it impossible to do things like reusable pre-processing of the request and post-processing of the response, both of which I will want to be able to do. I wanted something much more in the style of Django, with explicit request and response objects, something, ironically, much more functional instead of imperative -- it is a surprise that I got this from a Python web framework. There were also things about the CGI API itself I really didn't like (e.g. it didn't differentiate between GET and POST inputs). Plus, I wanted to have a go at some real Haskell, so I rolled my own.

It is very early days at the moment, and many big things are missing from the API (like proper access to GET and POST parameters, handlers for file uploads, any kind of HTML helpers for form handling etc). I realised that the API for all of that should only be implemented as I needed it in my actual software, otherwise I would just get it wrong. What is implemented so far is a strongly-typed routing mechanism, and not much more. It is enough, though, to implement a useful app -- I wrote a very simple 80-line script that handles (via clickable URLs in emails) subscription to a personal mailing list I'm organising for myself. It also acts as an example at the moment -- see ConfirmCgi.hs.

Currently there is no home page, though there is some good documentation, and you can get the source. All of the API is subject to change at any time, but I think what I've done so far is a reasonable basis.