Jurjan-Paul's blog

Monday, January 19, 2009

Numeroustutorials on Python metaclasses contain this famous quote by Python master Tim Peters:

“Metaclasses are deeper magic than 99% of users should ever worry about. If you wonder whether you need them, you don't (the people who actually need them know with certainty that they need them, and don't need an explanation about why).”

I suppose the quantified part of the statement has long invalidated itself, simply by making a significant portion of already inquisitive Python users aware of the elitist feature... Once you've learned about the existence of this powerful magic weapon it is hard not to stay on the lookout for the 'need' to apply it. :-)

My use case arrived when I was looking for a way to express sentences of first order predicate logic in Python. Just for the fun of the puzzle I set out to build my own logic inference engine based on some text book algorithms. I've certainly enjoyed exploiting Python's expressiveness and got the engine working to some extent, but not far enough to be completely comfortable publishing all of its code... So, in case you're looking for working code I suggest you take a look at some pre-existing projects like AIMA-Python, PyLog and Pyke.

As predicate logic expressions clearly don't share much with a dynamic OO language in the way of syntax and semantics, it makes most sense to define a separate grammar for these expressions and then let Python parse them into an internal representation. Apart from importing text files full of knowledge at once, this should allow you to feed and query a knowledge base as follows:

With a number of excellent Python parser tools readily available it should be straightforward enough to go this 'external domain specific language' route. Even so, I didn't want to start off being side tracked by solving a secondary puzzle that would involve studying parser tools, selecting one (or none?) and then learning its specifics. I totally look forward to doing that in the future :-), but decided to examine whether I could kick-start things by facilitating an 'internal domain specific language' first. (Actually, AIMA-Python's logic.py reveals that it doesn't need to take much code at all to parse an 'external DSL' for this purpose, but when I started I didn't want to spoil the fun by studying existing solutions).

Ideally, the code to build up the internal representation of a predicate logic sentence should closely resemble the conventional way of expressing such a sentence. Python is not considered as well suited for internal DSLs as for instance Ruby. (That language doesn't require braces in method calls, allows for more flexible whitespace, has 'symbols' and supports blocks). Python fortunately supports operator overloading and that helps (as is demonstrated by logic.py and the 'Pythologic' Python cookbook recipe that I'll come back to below).

An interesting hurdle to being able to express the one paradigm inside the other comes from a fundamental difference regarding the use and meaning of 'names'. A predicate logic name (either a variable, a constant, a predicate or a function) doesn't need a meaning other than its respective type and the context in which it is used. Python raises a NameError when a name (variable, class or function/method) is referenced before it was assigned a value. (Of course it's easy to side step this difference by just sticking to Python's model and require all names used in logic expressions to be assigned values first).

The 'Pythologic' Python cookbook recipe however provides a solution that is both very clever and rather hard to understand (I'd say). It involves a @logic decorator that 'defines' undefined names within the function to which it is applied on the fly, by inspecting that function's func_code and func_global attributes and following a naming convention to distinguish between constants and variables. (The idea is incorporated in a symbols module by Leonardo Maffi).

I could have embraced that trick, but felt there should be a more explicit (i.e. more pythonic??) way to introduce new predicate logic names within an expression and not rely on a convention to distinguish between different types of 'symbols'. The natural OO representation for a predicate named 'Human' would of course be an object of type Predicate with an attribute called 'name' that contains the value 'Human'. In Python such an object is normally 'introduced' by invoking Predicate('Human'). It is clear however that this construct wouldn't help making our DSL expressions any more readable, if only because the number of required braces and quotes gets out of hand quickly:

Also, having to re-instantiate the same symbol every time it is used doesn't come across as very efficient. We should not optimize prematurely, but it is likely that the logic inference algorithm's performance would benefit greatly if symbol (in)equality could be established 'instantaneously' through the is operator (by verifying whether two symbol references imply the same memory address) as is 'incidentally' the case with Ruby's symbols. It would therefore be great if a construct could be engineered that not only takes away the need to instantiate the same symbol (of a given type) more than once, but that also makes it (virtually) impossible to do so.

Realizing that Python's __getattr__ hook allows you to control what happens when an attribute is dereferenced on a an object for which that attribute is not yet defined, I started to wonder whether it would be possible to make the syntax look like:

>>> kb.tell(Predicate.Human(Var.X) >> Predicate.Mortal(Var.X))

Basically, this would turn these Predicate and Var classes into infinitely large namespaces that contain any instance of those classes anybody can think of naming. And I believe there's definitely a pythonic feel to that... (which doesn't say much, as anybody can say that about anything).

Because it makes sense for Predicate and Var (along with Const and Function) to inherit from a Symbol base class, I will express the sketched design goals as a unittest TestCase for that Symbol class:

As is often the case I'm afraid, I first developed a solution and then went looking for some 'prior art' (somebody else having solved the same problem already)... In this case I didn't find anything that would satisfy all of the above test assertions, but you might still want to have a look at the SymbolType package by esteemed Python contributor Phillip J. Eby that he mentions in a blog post on Symbols and DSLs in Python. It appears to me that the symbol concept has gone somewhat mainstream, but the idea of subclassing a Symbol type certainly hasn't... :-) Still, that's exactly what I'm looking for because for instance a Predicate object's behaviour and semantics differ fundamentally from a Const object's.

It took me some time to figure out that I had to implement __call__ instead of __new__ on the metaclass, but in retrospect this is obvious. Largely due to the seeming complexity of using super in Python 2.5 it was also a bit of a challenge to find out how I was going to actually construct a new instance, now that __call__ was overriding the standard behaviour. But the resulting code shows clearly just how easy it really is to get what you want in this expressive language.

With a number of other interesting classes added, I have indeed succeeded in making the following work:

This might still look slightly verbose and even somewhat repetitive, but that should be considered a small pain compared to the enormous gain in explicitness ;-). You could of course cut on the verbosity by first 'importing' a name into the local or global namespace: Human = Predicate.Human, but once you start going down that path, the use case for my metaclass quickly loses all legitimacy...

Monday, December 15, 2008

[1] It can apparently be tempting for programmers to introduce home-baked enumeration types in situations where modelling with separate booleans would make the resulting code more readable and maintainable. From what I have seen this applies equally to all types of data constructs that developers have the freedom to design, be it fields in a Java class, XML attributes, SQL table columns, etc. [2]

There is a dose of irony at play here, as it is more common to suggest [3] readability will be improved by replacing uses of booleans by enumeration types (obviously deemed more advanced than that most primitive of types). Particularly with respect to otherwise unclear method arguments (lacking support for named parameters) I think the advise to prefer enumeration types is sound. What I think are unfortunate uses of enumeration types though are really special cases of the general pitfall of not building enough structure into your data by throwing together different 'aspects of state' into a single field.

The problem with home-baked enumeration types ('home-baked' indicating that you're not merely formalizing a law of nature or widely held real-world convention, like 'days of the week' or 'suits in a deck of cards') is that what seems like a fixed set of mutually exclusive options now, might easily fail to facilitate desired flexibility in the future. But well... YAGNI and we can always refactor later, right? And it seems easy enough to add one or more values to the enumeration. Unfortunately, the effort involved in adapting the code's existing logic to the new situation is often underestimated and error prone. This is so because (non-)equality tests involving enumeration values often carry implicit context-specific assumptions that get overlooked easily. Even if you've been disciplined enough to have written a proper amount of unit tests, there's a fair chance these will keep succeeding spectacularly despite now faulty logic, because your test assertions are based on the same old assumptions regarding which values could be expected and what they mean.

The problem manifests itself most clearly when modelling process states. Consider for instance the (simple) enumeration [DRAFT, SENT, RECEIVED] as representations of the possible states of a message inside yet another social networking application's internal e-mail-like messaging subsystem (as, for instance, Facebook and LinkedIn have). Next to fields for sender, receiver, subject and body, every message in the system would have a 'state' attribute that contains one of these three values. Naive? Probably, but: 'worse is better', 'time to market', etc.

At some point the site's user community started to protest that the fact that a message was received didn't necessarily mean or convey that the message was actually read. No problem: a READ value was added to the enumeration...

"Ah, that gives it away! You're arguing a straw man situation. Nobody would do that. Of course read would always be a boolean..." Well, fair enough, but are you really sure about that? Cause I know I've seen some designs by experienced developers... (and who's the straw man here anyway?)

With the READ value added, the RECEIVED state got a more specific meaning: 'the addressed user has been notified of the incoming message, but has not yet read it'. Obviously, some (display) code had to be adapted to deal with the new model. Unfortu­nately, it then took a while to figure out why everybody was suddenly being notified that their messages couldn't be delivered, while in fact these had already been read. This was of course blamed on the junior programmer who had failed to write the following expression to simply check for the SENT state (although he defended himself by saying that all the unit tests succeeded and the QA department had fully approved the previous version):

The State Pattern (moving all state depending logic into polymorph state implementing classes) could have been applied here and I believe this would take out some of the risk of introducing bugs like the above during refactoring. Even so, the State Pattern is obviously only available inside a General Purpose Language and not in a persisted or for-transport form of the data. Also, I am afraid the pattern is not well-known enough that I'd be confident that all future maintainers of the code base will consistently stick to it. Frankly, I have never seen it applied in production code... Had it been applied, the (now surprisingly bug free) code snippet would probably look like this:

message.state.check(message);

or (obviously):

message.stateCheck();

but I remain unconvinced that this is a good idea as the check is only relevant in one particular situation and being forced to build such a 'hook' into every state class suggests we are coupling things too tightly already... So probably this (contrived) example is not the best suited for the State Patter or am I missing something?

After the bug was fixed (with no State Pattern in sight), all users (5 of them by raw estimate) seemed content for a while, until some of them started to complain again. This was great news, because it meant the product was still being used. Actually, it was being used so actively that the users' mailboxes got pretty full, even with lots of messages they'd rather forget about as soon as possible. The team got together and decided that the system needed to be extended with a 'Delete' function (destroying all traces of the message's existence). This turned out not to be completely satisfactory either, so eventually an ARCHIVED state was proposed and added to the enumeration.

I hope the problem with that design is obvious... I grant that this example is a bit far fetched and perhaps not disastrous enough to really bring home my point. I wish I could have used the cases that I actually encountered, but I would have had a lot of explaining to do with respect to the domain context and we can't afford to give away our valuable trade secrets either. :-)

Just to be sure: while a READ state at least implicitly conveys that the message was previously RECEIVED, you can't deduce whether a message was READ or not from the fact that the message is currently in the ARCHIVED state. (Supposedly) relevant information was lost in the state transition. It will not come as a surprise now that I suggest the message 'state' should have been modelled with boolean flags sent and received from the beginning, to be extended with flags read and archived when those features were introduced. The code snippet would simply read:

This approach of modelling 'aspects' of state with separate flags [4] increases flexibility in a big way. Code becomes less interdependent. New aspects can be added later without pain regarding existing code reasoning over existing properties. Aspects of state that were originally modelled to be mutually exclusive (on purpose or by accident), like DRAFT and ARCHIVED, can effortlessly be made to play together nicely as system requirements develop... ("Hey, why shouldn't we allow draft messages to be archived?")

This touches on one argument that might be put forward in defense of using enumeration types: "they enforce data integrity much better, because with separate booleans an object/record might be allowed to contain values that represent an illegal/impossible state....!!!" This is true of course (e.g. sent=false, received=true). However, in all real-world cases that I've reviewed, a programmer would have to put in a lot of extra ill-conceived effort at the business logic level to make that happen. So, unless you're really worried about attackers breaking in at this level (which seems an unlikely proposition), I don't think you need to be focussing on this kind of safety net. And if programmers find a way to get to an supposedly illegal state after all, that state might actually turn out to represent a valid use case that was overlooked in the analysis phase.

Also in complex scenario's with multiple processes influencing an object/record's state, it is very helpful to be able to split up the state representation into separate fields. Depending on their given responsibility each process can then take sole ownership of a particular flag or field, taking away the danger of a 'newer state' (e.g. RECEIVED) getting overwritten with an 'older state' (like SENT), as the result of e.g. unexpected thread scheduling or network latency.

In case somebody is still tempted to protest that the Boolean type is just a special case of an enumeration type... ;-), it might be helpful to observe in summary that the Boolean is superior in its predictability. Never will a boolean need to facilitate for other values than just True or False [5] and never will the possible values of a boolean cease to be mutually exclusive. A programmer dealing with a Boolean field will of course never even give those nightmarish suggestions any thought and can therefore write clear, concise and safe code to reason over its potential values.

Next time you think you know beforehand which (3, 4, 5, etc.) states will only ever apply in a process, ask yourself the question whether those states could be expressed as a combination of meaningful booleans. This may require a bit of creative thinking in some cases, but is often easy to do. If you find this is indeed feasible, it seems obvious to me that you should do exactly that. I only wrote this down because apparently some other guys didn't see it that way... What do you think?

Notes[1] The title of this post was modelled after some of the items in Joshua Bloch's educational ¨Effective Java (2nd edition)¨ that I enjoyed reading recently. The structure and authoritative, yet balanced style of the material alone made it a joy to read. I wouldn't dare to suggest that I think I could match it.

[2] Note that where booleans share pretty universal syntax and semantics across environments, an enumeration type can take many forms: a special language construct, a set of named integer constants, an agreed set of strings, a foreign key to a dedicated code table, a set of classes that implement the same interface (as in the State Pattern), etc.

[3] e.g. in ¨Effective Java (2nd ed.)¨ on page 190 and for instance in [http://c2.com/cgi/wiki?UseEnumsNotBooleans], but note the discussion following the original argument there (no guarantees of course that that wiki entry will stay how I found it).

[4] Boolean flags don't necessarily need to be stored as such 'physically', as they can take an implicit form: a 'virtual' flag is considered 'set' (or 'unset' if that makes more sense in the context) when a corresponding optional field has a value (like a timestamp that indicates when the message was received; a value in the 'due' timestamp field referenced above might similarly imply that the message was sent). Depending on the environment, 'optional' here usually either means 'can have a null value' or 'can be omitted' (XML). And when the interface to the object is separate from its implementation (which is clearly not the case in plain XML and SQL) a client of that interface doesn't even need to know the flag is not actually stored as a separate boolean field. This type of redundancy minimization serves as a pro-boolean argument in itself: it is natural to take this approach when working from a mental model that focuses on boolean reasoning, but it doesn't work coming from an enumeration type perspective.

[5] I know: it is possible, fun and useful in some contexts to define a Boolean-like type that allows for one or more fuzzy values between True and False (e.g. Unknown), but as you write your code dealing with those instances there can never be any doubt that this is not your regular Boolean.

Monday, September 29, 2008

Though the typical mathematics-heavy curriculum for 'Computer Science' studies seems to suggest otherwise, even in its simplest form with only one dimension this is probably the most advanced mathematical problem many programmers ever encounter in their jobs: "Do two intervals overlap?" Naturally, a whole range of applications need to have such a check in place with respect to the scheduling of resources within a time frame.

Over time I've stumbled on a number of implementations in production code by different programmers who undoubtedly set out to take the task very seriously. I imagine sheets of paper with pairs of bars in all mutual configurations possible... (No, my imagination is not usually that creative... Let's say I know someone... And no, I am not mocking anybody.) Still, the resulting code was either overly complicated or flat out wrong, possibly because one possible configuration of two intervals was overlooked. The possibility of open ended intervals, sometimes in combination with using separate fields for date and time, didn't help in keeping the intent of the code immediately obvious, let alone easy to review for correctness.

Of course the solution to the 'interval overlap' question is very simple and the code can be very short and understandable. In mathematical terms: two intervals [a1, a2] and [b1, b2] overlap if and only if a1 < b2 ∧ a2 > b1.

The details of writing that in your language of choice, dealing with open ended intervals and separate fields for date and time are left as an exercise to the reader.

"But wait...", you say. "That looks so simple it can't be true. That expression can't possibly cover all the bar configurations I drew on these sheets of paper... Oh well, I can check with them of course. Hmm, it seems it might be true after all..."

That was my voice (perhaps without the drama added here for effect). My colleague challenged me to proof that the expression above holds after all. Only months later, after encountering yet another piece of complex code that was written with the exact same goal, did I sit down to take up the challenge. Trying to remember from 'Analytics' class how one goes about proving stuff, not much more came up than just the phrases 'Complete Induction' and 'Complete Intimidation'. The latter proving technique is pretty attractive and often rather effective as well, but it's really frowned upon by the elite (of which I obviously want to be part desperately). As the Induction technique didn't seem suitable either... well, I came up with the following, which I think is conclusive enough:

Note that when (a1 = b2) ∨ (b1 = a2)the overlap has a length of 0, which you wouldn't consider an actual overlap. So, for an overlap with length>0 the condition remains: a1 < b2 ∧ a2 > b1

Q.E.D.

Please let me know if I'm oversimplifying things... although you may be forgiven for thinking I just 'overcomplicated' the issue. And in case you were really looking for something slightly more challenging, please have a go at this. ;-)

Update:An anonymous commenter points out that I was indeed oversimplifying and provides the remainder of the proof. Thanks for that! ;-)

I really like it when a little thinking helps to keep our code base maintainable by expressing the logic as compactly (yet readable) as possible! I do hope however that the above is hardly my most important contribution towards that goal...

Saturday, August 23, 2008

At the start of this year I resolved to seriously invest time in writing online about (among other things) the coding puzzles I encounter. A number of the reasons that people generally give for blogging would apply (basically: they say it makes you smarter...). Unfortunately, I never even finished the post outlining those reasons (just as well) and got stuck writing another highly profound and insightful post... ;-)

At least Twitter helped me over the threshold to express some thoughts in public, but of those 140 character observations a few could 'easily' have been expanded upon, if only I had the discipline... :-)

B.t.w. I regret bringing up the 'code ownership' question in that comment. It is beside the point and says more about me (being somewhat intimidated by the hairy legal issues involved in reusing utility code between organisations) than about my former colleague, who did nothing wrong.

Friday, March 14, 2008

Though having applied JavaScript as a 'necessary evil' ;-) in the web applications I work on, I have rarely delved into its powerful OO and meta-programming capabilities. Then, last week, I built a small HTML component that could be opened from (or embedded in) any other HTML page. Values would be retrieved and returned through a small number of callback functions. Consequently, these callbacks became the contract between my little component and any page that would open/embed it.

Not being able to enforce such an 'interface contract' in the dynamically typed JavaScript posed us with a maintainability challenge. So far we've not been able to automate testing of the mostly context specific JavaScript within our dynamically generated HTML and therefore we have always needed to rely on the intrinsically unreliable manual labor of clicking through the application to see if everything works the way it should. In this modus operandi the addition of a new feature can easily break an existing feature, and that mistake can go unnoticed for a long time if the original feature isn't used much, even though it might be vital for the overall quality of the product. Fortunately, our sharp test team would catch it in Beta most of the time, but that of course is not nearly as efficient as finding and dealing with it in the development stage. Therefore we now asked ourselves the question how we could help ourselves noticing breakage as soon as possible in case the 'contract' to the component would need to change in the future and some dependent HTML pages would get overlooked in that refactoring.

I was aware that others had run into similar challenges with the also dynamically typed Python language and came up with tools like Zope interfaces and PyProtocols, neither of which I have used or studied, because... in my small spare time projects the need just hasn't arisen. :-) Even though so far I have not been able to find a similar solution in the land of JavaScript (filled with many powerful libraries/toolkits of which I only know the name, if at all), I find it hard to believe that nobody has picked up this challenge and tackled it, so I probably did not search thoroughly enough. My apologies for reinventing that wheel, though I admit to greatly enjoying the exercise. It certainly helped me grow a better understanding of JavaScript's OO mechanics.

Can you tell that I attempted to design this to somewhat resemble a Java interface definition? ;-)

The idea is to link to this code (should be in a separate .js file) in all pages that intend to implement the contract and put the "fail faster" check after the functions that implement the interface, like so:

The FooEmbedderContract object simply tests whether all functions that should be in window are there, each with the same number of arguments as specified in the interface definition. The signalIfNotCompliant function is meant to give immediate feedback in an HTML document context (although that feedback looks really raw in this implementation, it simply is not supposed to ever occur in a production situation), while the assertCompliance and testCompliance functions can be useful in a pure JavaScript context.

It is a good idea to also make the interface contract explicit within the component itself (increasing readability of the code), by testing the opening/embedding window against the interface at the start of the component's HTML (effectively carrying out the same test twice, but that should not really be a problem performance wise).

This approach doesn't of course take away the need to test the application by clicking through it, but it can signal problems with features behind links and buttons that are not yet clicked (and might not get clicked in a hurried test). As we're optimizing for failing as fast as possible in case of an error, this utility should at least help in 'failing faster' than would have been the case without it. Now, for maximal benefit, we have some work to do in starting to introduce these interface declarations and assertions in more places in our existing code base...

Saturday, January 12, 2008

As I finally started to really learn to decipher Ruby by reading the Pickaxe (2nd edition), I came across one example that I knew must contain a typo of some sort. I fired up irb just to make sure... but came away horrified:

A single name within a single scope that refers to two different things!

It took me quite a while to realise that in many languages (including Java which I use almost exclusively at work) this isn't so bizarre at all, because those grammars make a very clear distinction between variables on the one hand and functions/methods on the other.

When studying Ruby, I am apparently reasoning from a Python and JavaScript mindset, because I expect Ruby to share with those languages the dynamic typing nature and the combination of OO and functional programming influences. In Python and JavaScript the above is simply unimaginable.

It is obvious that I have quite a number of obstacles to overcome in getting used to Ruby...

Friday, January 04, 2008

Adding features to an old code base is hard and often not much fun. It is especially hard when not enough time could be budgetted to really get to understand what that particular piece of code is trying to accomplish, let alone to figure out by which coincidence it is actually almost accomplishing that original goal most of the time.

But of course, if you want to stand a chance of successfully shoveling all the 'accidental features' that you and your teammates introduced in the previous release candidate under the carpet, at the end of the day you need to get pretty familiar with whatever logic apparently appeared most intuitive to a bunch of other clowns.

So this week, after spending hours doing my very best to avoid making 'new major changes', which many of us claiming experience with such matters consider risky at the project stage we're in, I had to concede that I didn't really understand much of what some esteemed predecessor had thought was a good solution to the problem at hand. And if we wanted our users to enjoy working with an application that behaves in a somewhat predictable fashion, the last programmer leaving his mark should at least be able to predict that behaviour rather precisely from what he'd observed. As it was, all of us unlucky enough to have witnessed that particular string of characters were manically trying our best to eventually forget that painful sight.

Coming to realise that there would not be any other way out apparently was the hardest part. Finally, yet another unexpected 'accidental feature' side effect broke what was left of my resistance and I set out to do what we had all agreed by oath to never do... With a next release candidate scheduled for yesterday (or the day before), I was going to seriously touch untouchable code... I started to slowly replace all kinds of magic numbers with self-describing semi-constants (say what?) and flattened deeply nested if-then structures by introducing just enough well defined boolean variables. Eventually a satisfying sense of logic started to emerge... Even the readable and predictable kind of logic that I was looking for! But then, with the spaghetti straigthened out and all alligned, and some initial testing confirming that the program was actually doing what I now believed it should, the scary thing was that quite a number of lines appeared to be missing in action...

Not to worry though! Let's just say "less is more" and 'what he says' etc. etc. and hope nobody is going to actually, you know, miss those lines ;-).

And if you think that I am completely full of myself because of my little far fetched success story... well, you might be right, but I tend to consider this more of a humbling experience really. And don't forget that I now have many boring hours ahead of testing just about all the different scenario's the code attempts to deal with... which was probably the real reason I was putting off this refactoring for so long in the first place. And, err, no, unit tests were not exactly available or straightfoward to introduce in this 'JSP heavy' environment.... :-(

Anyway, what I enjoy about this is that dealing with legacy code can actually become rather satisfying and good fun as soon as you dare fixing things up, even in small ways. That, and that it gave me an excuse to link to a great blog post that advocates keeping code bases small, but loses all credibility by being way too long itself... :-)

I might write another blog post when... I am not enjoying time with my wife, son and daughters, at work developing software in Python, enjoying the company of friends and family, worshipping in church, reading any number of books, commuting, programming just for fun, wasting time reading other people's blogs (getting smarter for it), playing some sports (rather unlikely) or musical instrument (quite impossible). Oh... and sleep is valuable too.