URL design: Inspired both by ”Cool URIs don’t change” and the general REST
emphasis on sensible URL design, I designed a URL scheme for
lagen.nu that maps closely to how Swedish laws are identified, and
one that is future-proof in that it hides implementation
details. Since I hope that many people will find it useful to link
to individual pages on lagen.nu, it’s very important that those
links don’t break.

To illustrate: the generated HTML files are placed in directories
named html/year/id.html. Suppose that I were just
to let Apache (or any other web server) serve content straight
from that directory. For example, the copyright law has SFS id
1960:729, and so with this scheme the URL would be
http://lagen.nu/html/1960/729.html (or
http://lagen.nu/1960/729.html, depending on what I would
set DocumentRoot to). This would work fine, the URLs would be
easy to understand, and people would link to all those individual pages from
all over the web.

But now suppose I wake up one day and decide to stick all the data in a big
database, and build a PHP frontend? The URLs would change, and probably be
of the form http://lagen.nu/showlaw.php?sfs=1960:729. Again, a nice
short URL, easy to understand… but now the links from all over the web
are broken.

With some URL rewriting, the resource found at the URL
http://lagen.nu/html/1960/729.html is now also available
at http://lagen.nu/1960:729. This is even nicer to read,
future-proof, and enables anyone who knows the ID of a law to
go straight to the page for that law quickly.

As an added bonus, it makes the text and XML versions of the laws
easily available too: during generation, I put these versions in
sibling directories to html/, named text/ and xml/, respectively, and then use the following rules:
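The rules amount to mapping the clean URL space onto the generated
directories. An Apache mod_rewrite sketch of the idea (my
reconstruction for illustration, not the actual lagen.nu configuration):

```apache
# Sketch: map http://lagen.nu/1960:729 (and .txt/.xml variants)
# onto the generated flat files. Patterns are illustrative only.
RewriteEngine on
RewriteRule ^/(\d+):([\w\. ]+)\.txt$ /text/$1/$2.txt [L]
RewriteRule ^/(\d+):([\w\. ]+)\.xml$ /xml/$1/$2.xml  [L]
RewriteRule ^/(\d+):([\w\. ]+)$      /html/$1/$2.html [L]
```

The nice property is that the rewritten path is an internal detail: if the
backend changes, only the right-hand sides of these rules need to change.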

There are other parts of the site, such as the index pages and the about
pages, where the underlying flat-file nature of the site shines
through (for example http://lagen.nu/om/me.html), but those
pages are not as likely to be linked to. Still, if I should decide
to change them at some point, I’ll probably make some mod_rewrite-based
backwards-compatibility hack. Also, if you want to do this
on a Win32 platform, you’re out of luck. See my previous
post for alternatives.

Update functionality: As the body of Swedish law is always
changing, I had to plan ahead for how to keep the site up to
date. New laws are usually published every Wednesday, and it’s out
of the question to download every page from Regeringskansliets rättsdatabaser
once a week. So instead, I store the highest previously-fetched
ID, and in the update routine I try fetching laws with increasing
IDs until I finally get a ”law not found” error. The laws can
be either ”base constitutions”, i.e. new laws, or ”change
constitutions”, laws that specify that some other, older law
should change (think ”source code file” and ”patch”,
respectively).
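The fetch loop can be sketched like this (all names here are hypothetical;
the real routine also has to handle year rollover and distinguish base
from change constitutions):

```python
def update_laws(fetch, last_id, year):
    """Fetch newly published laws by increasing the SFS number until
    we hit a "law not found". `fetch` returns the law text, or None
    when the law doesn't exist. Returns the new highest known number
    and the newly fetched texts."""
    fetched = []
    n = last_id + 1
    while True:
        text = fetch('%d:%d' % (year, n))
        if text is None:        # "law not found" -- we're up to date
            break
        fetched.append(text)
        n += 1
    return n - 1, fetched
```

The stored high-water mark means a weekly run only touches the handful of
new documents, instead of re-crawling the whole database.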

If it’s a base constitution it’s pretty simple: just
download it and process it from start to finish. If it’s a change
constitution, however, we find out which base constitution it
changes, fetch that, see if it has been updated (”patched”), and
if so, store the old version of that law somewhere, then do the
normal regeneration process. In this way, I can, over time, build
up an archive of old law revisions, so that I can tell how the law
looked at a particular date. For now, I have only a few months’
worth of history, but the value of this will grow as time goes
on. In particular, it would be cool to be able to do CVS-style
diffs between arbitrary revisions of a law, or to be able to link
to a law as it looked at a particular moment in time.

Front page: With the information we gather during the
update process, we can build a list of recently changed laws and
put it on the front page. Similarly, we build a list of recent new
court verdicts, and also one with site news. All these are
published in a sidebar on the front page, while the main content
area is filled with a static listing of some of the most important
laws. The different parts of the sidebar lead to different news
pages, which cover site, law and verdict news in greater detail.

RSS feeds: Hey, it’s 2004, how could any self-respecting
news site not have an RSS feed? I generate feeds for all three news
types (site, law and verdict) using PyRSS2Gen,
a nice little lib for creating RSS 2.0 feeds. I haven’t tried them
out much, but feedvalidator says they’re OK, and they work fine in
all RSS readers I’ve tested with (although Opera tends to show the
raw HTML instead of rendering it, which probably indicates that I
should include it in some other way than just escaping it and
showing it in the description tag). Maybe it would be
useful to use a service like Feedburner for
this.
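For the curious, the core of RSS 2.0 generation is small enough to sketch
with nothing but the standard library (PyRSS2Gen wraps this up more
conveniently; the code below is an illustration, not what lagen.nu uses):

```python
import xml.etree.ElementTree as ET

def make_feed(title, link, items):
    """Minimal RSS 2.0 feed; items is a list of (title, link) pairs."""
    rss = ET.Element('rss', version='2.0')
    channel = ET.SubElement(rss, 'channel')
    ET.SubElement(channel, 'title').text = title
    ET.SubElement(channel, 'link').text = link
    ET.SubElement(channel, 'description').text = title
    for item_title, item_link in items:
        item = ET.SubElement(channel, 'item')
        ET.SubElement(item, 'title').text = item_title
        ET.SubElement(item, 'link').text = item_link
    return ET.tostring(rss, encoding='unicode')

feed = make_feed('lagen.nu: nya lagar', 'http://lagen.nu/',
                 [('SFS 1960:729', 'http://lagen.nu/1960:729')])
```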

Conclusion: This marks the end of this eight-part series. I
hope that you’ve picked up a trick or two. As should be apparent,
I am no Python wizard or even a programming guru in general, but
over the years I’ve found a style of development that works for me
in a single-developer context (doing multiple-developer projects
is a totally different thing), mostly centered around the XP tenet
”Do the simplest thing that could possibly work”, or in my own
words: ”Don’t try to find the right way of doing something — if
you just work away, the right way will find you”.

Like most developers that have been Test infected, I
try to create regression tests whenever I can. A project like
lagen.nu, which has no GUI, no event handling, no databases, just
operations on text files, is really well suited for automated
regression testing. However, when I started out, I didn’t do
test-first programming, since I didn’t really have any idea of what
I was doing. As things solidified, I encountered a particular
section of the code that lent itself very nicely to regression
testing.

Now, when I say regression testing, I don’t necessarily mean unit
testing. I’m not so concerned with testing classes at the method
level, as my ”API” is really document oriented; a particular text
document sent into the program should result in the return of a
particular XML document. Basically, there are only two methods
that I’m testing:

The lawparser.parse() method: given a section of law text, it
returns the same section with all references marked up, as
described in part 5.

Law._txt_to_xml(), which, given an entire law as plaintext,
returns the XML version, as described in part 4.

Since both these tests operate in the fashion ”send in a really big
string, compare the result with an expected, even bigger
string”, I found that pyunit didn’t work that well for me, as it’s
more centered around testing lots of methods in lots of classes,
where the test data is small enough that it’s comfortable keeping
it inside the test code.

Instead, I created my tests in the form of a bunch of text
files. For lawparser.parse, each file is just two paragraphs, the
first being the input, and the second being the expected output:

Similarly, the code to test Law._txt_to_xml() is also
pretty trivial. There are two differences: since the input is
larger and already split up in paragraphs, the input and expected
result for a particular test are stored in separate files. This
also lets me edit the expected-results file using nXML mode in
Emacs.

Comparing two XML documents is also a little trickier, in that
they can be equivalent but still not match byte-for-byte (since
there can be semantically insignificant whitespace and similar
stuff). To avoid false alarms, I put both the expected
result file and the actual result through tidy. This
ensures that their whitespacing will be equivalent, as well as
easy to read. It’s also a good example of piping things to and from a
command in Python:

If the two documents still don’t match, it can be difficult to
pinpoint the exact place where they differ. I could dump the
results to file and run command-line diff on them, but since there
is a perfectly good diff implementation in the Python standard
library, I used that one instead:
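That implementation lives in difflib; a minimal sketch of using it (not
the exact code used here):

```python
import difflib

def xml_diff(expected, actual):
    """Unified diff of two documents, like command-line diff -u."""
    return '\n'.join(difflib.unified_diff(
        expected.splitlines(), actual.splitlines(),
        fromfile='expected', tofile='actual', lineterm=''))
```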

So, that’s basically my entire test setup for now. I need to build
more infrastructure for testing the XSLT transform and the HTML
parsing code, but these two areas are the trickiest.

Since I can run these test methods without having an expected
return value, they are very useful as the main way of developing
new functionality: I specify the input, and let the test function
just print the output. I can then work on new functionality
without having to manually specify exactly how I want the
output to look (because this is actually somewhat difficult for
large documents); I just hack away until it sort of looks like I
want, and then cut’n’paste the output to the ”expected
result” file.

Now, after having downloaded, analysed and XML-ified the law text,
there’s just one step left: converting it to HTML. Enter XSLT.

XSLT is a complex beast, and I find that it’s better to approach
it as a programming language, albeit one with an unusual syntax, than
as a stylesheet language. I have one master stylesheet that I use for
the transformation (now over 600 lines of XSLT code), and during
development I found out quite a few tricks. By the way, if you want
to see the original XML code for any law on lagen.nu, just add the
suffix ”.xml” to the URL, e.g. http://lagen.nu/1960:729.xml.

Creating useful tooltips: An important aspect of the work
done with lagen.nu is that I’m trying not only to produce more
aesthetically pleasing layouts of the text; I try to
cross-reference as much as possible, to really take advantage of
the medium.

As one example: the internal references that were discussed in the
last post should of course be transformed to ordinary hypertext
links, of the form <a href="#P26c">26 c
§</a>, so that you can quickly jump around in the
document and follow the references. But we can do better than
that, by extracting the first 20 or so words from section 26 c
and sticking them into the title attribute of the a tag:
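The result is something along these lines (the title text here is a
placeholder, not the real first words of 26 c §):

```html
<a href="#P26c" title="[first ~20 words of section 26 c]">26 c §</a>
```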

Now, if you hover with the mouse over the link (assuming you have
browser hard/software where these things are possible), the
title text will be shown as a tooltip. This makes it even
easier to make sense of a section, since you get an instant
reminder of what the referenced section says — in many cases you
don’t even have to click on the link.

So, initially, my plan was to have things like these tooltip
strings prepared in the XML document, and just do a very simple
transform into HTML. But as the work progressed, I found that I
was almost always able to do it in XSLT instead. This is the
relevant part of the link template:
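A simplified guess at its shape (the real thing is buried in a 600-line
stylesheet; the element names and the function namespace below are
invented, and only the variable and function names come from the
discussion that follows):

```xml
<xsl:variable name="text"
    select="//chapter[@id=$chapter]/section[@id=$section]"/>
<xsl:attribute name="title">
  <xsl:value-of select="func:tune-width(
      substring(normalize-space($text), 1, 160))"/>
</xsl:attribute>
```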

The variables $chapter, $section and $piece get their values
earlier in the template, and are set to the place the link goes
to. $hasChapters and $sectionOneCount are set globally for the
document and are used to tell what kind of section numbering this
particular law text is using. There are three variants commonly
used:

No chapters, just simple ascending section numbering

Chapters with restarting section numbering (i.e., regardless of
the number of sections in chapter 1, the first section in chapter
2 will be numbered ”1 §” — so we need to know the chapter as
well as the section; just ”17 §” is ambiguous)

Chapters with continuous section numbering (i.e., if the last
section in chapter 1 is ”16 §”, the first section of chapter 2
will be numbered ”17 §” — so the chapter number is never needed
to unambiguously determine what ”17 §” refers to).

The code constructs an XPath expression, finds the node that the
link refers to, and stores it in the variable text. Then
it trims the string down to a suitable length (max 160 chars, in
this case) by using the user-defined
function tune-width, together with normalize-space and
substring. tune-width ensures that we end the string on a word
boundary. The result of all this is assigned to the title
attribute.

Generating a TOC: Again, if you look at the Swedish
copyright law, you will notice a big blue link titled ”Visa
innehållsförteckning”. Clicking on this yields (if you have a
browser that supports javascript and DOM level 1) a table of
contents (TOC), generated from chapter headings and other
headlines. It starts out hidden, since it usually is in the
way, but sometimes it’s very useful.

The XML document in itself does not contain any TOC data. To
generate the TOC, we use a number of mode-specific templates that
extract the relevant information from headlines contained in the
document, all triggered by an <xsl:apply-templates> call:
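The triggering call looks something like this (a sketch; the mode and
element names are my guesses, though the choice of selecting both
headlines and sections comes from the discussion below):

```xml
<xsl:apply-templates select="//headline | //section" mode="toc"/>
```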

Things to note: the text-to-XML conversion is responsible for
determining the ”level” of the headlines. Level 1 headlines are
usually associated with a chapter (though not always), and we use
some tests to determine if that is the case. If so, the resulting
link uses a ”#K<number>” anchor fragment, otherwise a
”#R<number>” fragment. ”K” is for ”kapitel” (chapter), while
”R” is for ”rubrik” (headline). Not strictly necessary, but I
prefer a link that explicitly says ”this link goes to chapter 4”,
rather than ”this link goes to the 14th headline”, particularly as
the number of headlines can change in the future, which would make
the link point to the wrong place.

I have a number of ”empty” templates. They are needed since, without
them, the base template kicks in and just copies the
entire text, which seriously messes up the TOC. Now, I should be
able to limit that with the select attribute of the
<xsl:apply-templates> tag, but I have been unsuccessful. (The
reason I select both headlines AND sections, then do nothing with
the sections, is that I’ve been experimenting with using the first
lines of each section in the TOC as well, but that came out too
messy.)

Accessing data in other documents: To properly understand a
section in a law, it helps to read court verdicts that reference
it. In part 2, I described how to fetch data from Domstolsväsendets
rättsinformation, which has this data. Basically, I have a
Python code snippet that goes through all 1800+ verdicts, uses the
parser that was mentioned in part 5 to find references
to laws and
sections therein, and then generates a ”cache.xml” file that
contains references to all verdicts, sorted by law, chapter and
section, like this:
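The structure is roughly like this (a sketch; the element and attribute
names, and the verdict ID, are invented for illustration):

```xml
<lawcache>
  <law id="1960:729">
    <chapter id="1">
      <section id="2">
        <verdict id="NJA 2004 s. 1">...</verdict>
      </section>
    </chapter>
  </law>
</lawcache>
```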

At the top level of the stylesheet, I select the relevant nodeset
from this cache document and store it in a variable. To be able
to select, I first need to know the id of the law (the
”SFS-nummer”), so I pass it in as a command line parameter:
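With xsltproc, that looks something like this (the parameter name is my
guess; --stringparam is the real flag, and the value is picked up in the
stylesheet with a matching <xsl:param>):

```shell
xsltproc --stringparam lawid "1960:729" law.xsl 729.xml > 729.html
```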

Again, a little extra work is done to make sure the explanatory
text is cut at a word boundary (we could’ve done that in
the Python code that generates cache.xml), and also that the text
for the actual link makes sense. You see, different courts have
different systems for assigning IDs to cases: HD (the Swedish
supreme court) uses the page number where the case is published in
that year’s anthology of relevant verdicts (Nytt Juridiskt Arkiv,
NJA), others use a court-specific identifier, and some use a
derivative of the date when the case was decided. Since these
identifiers represent different things, they are also represented
with different attributes in the XML file, and a little
<xsl:choose> trickery selects the most appropriate ID.

Optimization hints: Some of the law texts are quite big, and
processing them can take a long time. To speed things up, here are
some of the things I did:

Don’t use the Python interface to libxslt! I started out with
this, but it turned out to take twice or even three times as long
for a transformation, compared to running command-line xsltproc (win32 binaries)
through an os.system() call.

Use the --profile switch to xsltproc. It’s pointless to
optimize if you don’t know where the bottlenecks are, and with
xsltproc it’s so easy to find them.

Store frequently-referenced nodes, nodesets and values in
variables, instead of selecting them through XPath queries all the
time. Yeah, this one is kind of obvious, but it really helps,
particularly in those ”inner” templates that are used all the
time.

Just by using the above tips, the transformation of my standard
test case (1960:729) went
from 90 seconds to three. Still, I have other test cases that
take several minutes, so clearly there’s still some work to
do…

The second part of the convert-things-to-XML approach deals with
finding inline references to other paragraphs in the law
text. I’ve written about this before on this
blog, but to recap:

Swedish law text contains lots and lots of internal and external
references to other sections. These references have a
semi-standardized format, but they are clearly meant to be parsed
by humans, not machines.

Here, the string ”3 §” signifies a reference to the third section
of the current chapter of the current law. If we can identify that
reference and mark it up, we can make the ”3 §” into a hyperlink
leading to the definition of section 3. You know, the stuff
hyperlinking was designed for. Currently, this gets transformed
into the following XML (how the XML gets transformed into
clickable HTML is the subject of a later article):
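In the simplest case the marked-up result is a link element carrying the
resolved reference as attributes, along these lines (the surrounding text
is elided; the attribute vocabulary — law, piece, section — is the one
used elsewhere in this post):

```xml
... <link section="3">3 §</link> ...
```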

For simple references like that a regex might do, but for more complicated
ones things get way more complicated. Enter EBNF-grammar-based parsing
with the dynamic duo that is SimpleParse and mxTextTools. Also,
the book Text Processing in
Python by David Mertz should be mentioned, as it helped me in
the right direction when I realized regexes weren’t going to cut
it.

My previous
post describes the actual EBNF grammar and how SimpleParse is
used to build a parse tree from it, so you might want to read that too.

However, that is really only half of the problem. After parsing, we
have a tree of tokens that might (for a somewhat complicated
scenario) look like the following:
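Using the token names discussed below, such a tree might look like this (a
reconstruction for illustration, not actual SimpleParse output):

```
ExternalRefs: '14 § 1 st. 6) och 6 § varumärkes|lagen (1960:644)'
    MultipleGenericRefs: '14 § 1 st. 6) och 6 §'
        GenericRef: '14 § 1 st. 6)'
        GenericRef: '6 §'
    ExternalLawRef: 'varumärkes|lagen (1960:644)'
        LawRefID: '1960:644'
```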

It turns out this is a problem that can be solved in a rather generic
manner with a small amount of planning and a little
recursion. Basically, all tokens whose names end in ”Ref”
should generally end up formatted as a <link>. All
tokens underneath such tokens whose names end in ”RefID” are
used to find the attributes for these tags. Start at the root,
then recurse depth-first over all child nodes until
done. Sometimes there are nodes ending in Ref
underneath other nodes also ending in Ref; in those cases
it’s the top node that is turned into a <link> tag.

Of course, there are exceptions and things to be aware of. Two of
these are illustrated in the above example. To correctly insert
the law reference (the SFS id ”1960:644”) into the tag to be produced
from the MultipleGenericRefs token ”14 § 1 st. 6) och
6 §”, we have to plan ahead when dealing with the parent
node (ExternalRefs). The formatter has built-in knowledge
of the special handling needed, and, when encountering an
ExternalRefs node, finds the ExternalLawRef child
node, stores the underlying LawRefID value, and later
picks this value up when formatting the tags for the two
GenericRef tokens.

To make this looping and recursing easier, I built an OO wrapper
around the array-based parse tree that mxTextTools returns. This
was the subject of another
post.

Note also that the ExternalLawRef token did not result in
a <link> tag, but the underlying LawRefID
token did. This was a conscious decision (I thought it looked
better that way), and was implemented by creating a number of
extra subroutines in the formatter. Basically, the main function
acts as a generic dispatcher, by looping over all subtokens in a
tree and, for each token, checking if there’s a corresponding
format_tokenname() function. If so, it calls that;
otherwise it calls a generic formatter (which may, recursively, make
use of the generic dispatcher). The code is pretty simple, but is
indicative of how neatly stuff like this can be done in dynamic
languages:
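A sketch of that dispatch pattern (simplified, with made-up node and
method names; not the actual lagen.nu formatter):

```python
class Node:
    """Tiny stand-in for the OO wrapper around the parse tree."""
    def __init__(self, tag, children=(), text=''):
        self.tag, self.children, self.text = tag, list(children), text

class Formatter:
    def format(self, node):
        # look for a token-specific formatter, else fall back
        method = getattr(self, 'format_' + node.tag, self.format_generic)
        return method(node)

    def format_generic(self, node):
        # default: leaf text, or recurse over subtokens via the dispatcher
        if not node.children:
            return node.text
        return ''.join(self.format(c) for c in node.children)

    def format_LawRefID(self, node):
        # token-specific formatter: turn the ID itself into a link
        return '<link law="%s">%s</link>' % (node.text, node.text)
```

The getattr call with a default is what makes the dispatcher generic:
adding support for a new token type is just a matter of defining another
format_* method.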

The other weird thing is the ”|” sign in the term
”varumärkes|lagen”. This is Swedish for ”the trademark
law”, but since Swedish compounds words together, it creates an
interesting challenge for the EBNF grammar. Basically, I
cannot find a way to match a word that ends in a specific suffix,
such as ”lagen”. The resulting parser is always ”greedy”,
so there seems to be no way of matching these words without
matching all words. So, to fix this, I preprocess the text
before putting it through the parser, using normal Python regular
expressions, which can be non-greedy, to put in that ”|”
sign, which solves the problem. Then, after retrieving the string
from the tag formatter, I remove those signs.
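The preprocessing step can be sketched like this (a hypothetical pattern;
the actual one used by lagen.nu may differ):

```python
import re

def mark_suffix(text):
    """Insert a '|' before the 'lagen' suffix so the (greedy) EBNF
    parser can find the word boundary."""
    return re.sub(r'(\w+?)(lagen)\b', r'\1|\2', text)
```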

One particularly satisfying thing about the problem described in this
post is how well it lends itself to automated regression
testing. Any new feature can easily be specified in a test case
before coding begins, and after it’s done, it’s easy to verify
that nothing that previously worked has been broken. More on
regression testing in a later post.

We’re taking a break from our ”the tech behind lagen.nu” series (1, 2, 3, 4)
to bring you this information:

For those of my readers that cannot read Swedish, see, or discern
text in really blurry pictures: the above images tell that I am
now officially qualified for law school, starting
2005-01-16. I had good reason
to assume that I would qualify, but it’s still nice to have it in
black and white. Especially considering that tomorrow is my last day at
work. Now I can look forward to a full month of vacation without any
clouds over my head.

On a side note, the picture was taken with my new toy, a Sony Ericsson
P910i. This little device really is starting to grow on me. More
about it in a later post, as well.

If you look at a sample law text as they are presented in SFST,
they are 100% plaintext. In order to convert them to semantically
sensible XML, we must look for patterns in the plaintext, to
identify things like headlines, start of sections and references.

I did this with a two-part approach. First, I break down the text
into its individual paragraphs and determine what each paragraph
is. This analysis operates on a ”block level” — either a block of
text is a headline, part of an ordered list, the start of a
section etc., or it isn’t. A block can’t be half headline and half
ordered list.

Secondly, if the paragraph can contain references to other parts
of the law (or parts of other laws, for that matter), I analyze
the text in greater detail to find and resolve these
references. This step operates on a ‘token’ or character level —
a block of text can contain zero, one or many of these references.

The first part is fairly easy, at least conceptually. I wrote a
bunch of functions like is_chapter(p), is_section(p)
and is_headline(p), where each
individual function just performs a simple test. These are then
used from a simple loop that uses a bunch of local variables to
keep track of what kind of structures we’ve encountered so far —
a so-called state machine.
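In outline, the loop looks something like this (a toy sketch with
deliberately naive predicates; the real ones, as shown below, are much
hairier):

```python
import re

def classify(p):
    """Deliberately naive stand-ins for the is_* predicates."""
    if re.match(r'^\d+ [Kk]ap\.', p):
        return 'chapter'
    if re.match(r'^\d+ ?\w? §', p):
        return 'section'
    return 'paragraph'

def parse(paragraphs):
    """Emit pseudo-XML for a list of paragraphs, tracking state so
    that a new section implicitly closes the previous one."""
    out = []
    in_section = False
    for p in paragraphs:
        if classify(p) == 'section':
            if in_section:          # implicit state transition
                out.append('</section>')
            out.append('<section>')
            in_section = True
        out.append(p)
    if in_section:
        out.append('</section>')
    return out
```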

These functions started out as very simple regexp-based inline
expressions, but as I encountered more and more exceptions to my
simple rules, their complexity grew. The body of current Swedish
law is over 250 years old, and consistency has not been the law
makers’ forte. Here’s an example of how to recognize the
start of a section:

re_SectionId    = re.compile(r'^(\d+ ?\w?) §[ \.]') # used for both match+sub
re_SectionIdOld = re.compile(r'^§ (\d+ ?\w?).')     # as used in eg 1810:0926

def is_section(self,p):
    section_id = self.section_id(p)
    if section_id == None:
        return False
    if section_id == '1':
        if self.verbose: print "is_section: The section numbering's restarting"
        return True
    # now, if this sectionid is less than last section id, the
    # section is probably just a reference and not really the
    # start of a new section. One example of that is
    # /1991:1469#K1P7S1. We use util.numsort to get section id's
    # like "26 g" correct.
    a = [self.current_section,section_id]
    if a == util.numsort(a):
        # ok, the sort order's still the same, which means the
        # potential new section has a larger ID
        if self.verbose: print "is_section: '%s' looks like the start of the section, and it probably is (%s < %s)" % (
            p[:30], self.current_section, section_id)
        return True
    else:
        if self.verbose: print "is_section: Even though '%s' looks like the start of the section, the numbering's wrong (%s > %s)" % (
            p[:30], self.current_section, section_id)
        return False

def section_id(self,p):
    match = self.re_SectionId.match(p)
    if match:
        return match.group(1).replace(" ","")
    else:
        match = self.re_SectionIdOld.match(p)
        if match:
            return match.group(1).replace(" ","")
        else:
            return None

So, the start of a section looks like ‘<number> [letter] §’,
unless it looks like ‘§ <number> [letter]’, and as long as
the section id is larger than the previous section id, unless it’s
restarting at 1, and also taking into account that section ids can
contain an optional letter (like ’26 g’). Simple as that.

For example, below is the start of the Swedish copyright law
(which has, during development, been my foremost test case). Now,
if you don’t read Swedish, just know that the first paragraph
signifies the start of chapter one (”1 Kap.”), the second the
start of a section (”1 §”), and then there are three items in an
ordered list, and finally a plain ol’ paragraph of text. (By the
way, for my Swedish readers: the Swedish legal term ”paragraf” is
NOT the same as the English typographical term ”paragraph”, but
rather translates into ”section”. When the English word
”paragraph” is used here, it is in the same sense as the Swedish word
”stycke”.)

For the first paragraph, is_chapter() will return True,
so we transition to the in_chapter state. This will emit a
<chapter id="1"> tag to the result, along with
the text.

For the second paragraph, is_section() will return
True, transitioning us to the in_section state as
well. It’s important to realize that these states are mostly
independent; we can be in a chapter without being in a section, and
vice versa (some laws have only sections, no chapters). This
transition will of course emit a <section id="1"> tag
to the result.

For the third paragraph, is_ordered_list() will return
True, transitioning us to the in_ordered_list state, and
emitting <ol><li>1. skönlitterär eller
[...]</li> to the result. A couple of things to
note about this:

These tags may look like HTML tags, but they’re not. They do, in
this case, share the same semantics, though.

An HTML ordered list (<ol>) keeps track of
its numbering, and any user-agent should add the numbers to
the displayed result. Since this is not HTML, we don’t do
that, and instead keep the number that was in the original
text — mostly because we can’t be sure that no ordered list in
the entire body of Swedish law contains sequence gaps
or things like ”26 g”.

For the fourth, is_ordered_list() will again return
True, but now we’re already in the in_ordered_list
state, so we don’t emit the initial <ol> tag.

Then we skip some boring junk, until we encounter a normal
paragraph. There is no function named
is_ordinary_paragraph(); this is just inferred from the
fact that none of our other tests matched. Now, the start of a
normal paragraph is implicit evidence that our ordered list must
have ended, so the code that transitions into the
in_normal_paragraph state checks to see if we’re in the
in_ordered_list state, and if so, emits the trailing
</ol>. After that, the normal paragraph gets
emitted as <p>Till litterära verk [...]</p>.

Similarly, when we encounter the start of a new section (”2
§ Upphovsrätt innefattar”), we transition out of any
ordered lists we might be in, transition out of the in_section state,
emit </section>, and then transition into the same state
again.

The largest part of this high-level parsing was finding all the
different structures and discovering the implicit state transitions
that needed to take place. One interesting problem is determining
whether a given paragraph is a headline or just a really short
ordinary paragraph. Well, it turns out that a headline never ends
with a period, and an ordinary paragraph always does. Except for
headlines that end in ”m.m.” (Swedish for ”etc.”). And ordinary
paragraphs that end with ”,”, ”och” or ”samt”. But a headline is
always followed by a section, so we can peek ahead to see if the
next paragraph matches is_section(). Except for those cases where
a headline is followed by ANOTHER headline, which is then followed
by a section…
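The peek-ahead heuristic can be sketched like this (hypothetical,
simplified code, not the real implementation, which also handles the
”,”/”och”/”samt” endings):

```python
import re

def is_section(p):
    return re.match(r'^\d+ ?\w? §', p) is not None

def is_headline(paragraphs, i):
    """A headline doesn't end in '.' (except 'm.m.') and is followed
    by a section -- or by another headline that eventually is."""
    p = paragraphs[i].strip()
    if p.endswith('.') and not p.endswith('m.m.'):
        return False                 # ordinary paragraphs end in '.'
    if i + 1 >= len(paragraphs):
        return False
    nxt = paragraphs[i + 1]
    return is_section(nxt) or is_headline(paragraphs, i + 1)
```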

A lot of work, and there are still a lot of places
where we break (for example, things like definition lists, nested
ordered lists, and tables, are not yet supported). Furthermore,
tweaking to satisfy one case can easily make other cases break. I
will return to this problem in the part about regression testing.

Some other things: As you may have guessed by the usage of the
term ”emit”, I construct the XML by hand in a single pass. This
means that I write XML data straight to a file, without using DOM
or anything similar. It was just easier to start out that way. I
am thinking of overhauling the state machine mechanism a bit, and
I might take a DOM approach then.

Some other parts of the code that construct XML documents (like
the code that handles court verdicts) use DOM, and it works pretty
well, with one small exception: it does not play nice with XML
fragments in string form. I have a separate class for finding law
references in text (to be covered in deeper detail in the next
post), and the interface of that class is a plain string one: it
returns strings like ”<link law="1960:644" piece="2"
section="1">1 § 2 st.</link> varumärkeslagen”. Since
there isn’t any method on individual element objects like
ele.ParseAndAppendMixedContent(), I first have to create
a ”fake” XML document and transform it into a node tree with
parseString (from minidom):
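The workaround can be sketched like this (a sketch of the technique, not
the actual lagen.nu code):

```python
from xml.dom.minidom import parseString

def append_fragment(doc, parent, fragment):
    """Parse an XML fragment in string form and graft its nodes onto
    `parent`, via a throwaway 'fake' document."""
    fake = parseString('<fake>%s</fake>' % fragment)
    for child in list(fake.documentElement.childNodes):
        parent.appendChild(doc.importNode(child, True))
```

The wrapper element is needed because the fragment has mixed content (text
outside the link element), which parseString won't accept on its own.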

Another drawback of writing XML the raw way is that there is no
guarantee that your output will be valid. Special care is taken to
escape things like < and &, and to ensure the document is
well-formed, it’s also run through HTML Tidy (with the
options -q -n -xml --indent auto --char-encoding latin1),
which, despite its name, is an excellent tool for tidying up XML as
well.

There are a lot of ways to extract data from an HTML file. You can
do simple string
searching (by the way, why is the Python documentation for
basic string objects hidden under the non-descript heading
”Sequence types”, and why is there no reference to that part of
the documentation from the separate
string module, which hardly does anything?) and regexp
munging, or you can use more
sophisticated HTML
parsers. Funnily enough, there are two of these in the Python
standard library, and both of them are callback based — why no
tree-based interface? If the HTML code is modern and well-formed,
you can even use a vast array of XML tools (and if
it’s not, you can fix it with HTML Tidy).

I ended up using the BaseHTMLProcessor
approach from Dive
Into Python, which has a whole
chapter devoted to the art of HTML parsing. Basically, you
subclass BaseHTMLProcessor, implementing callbacks for various
tags, which are called as those tags are encountered in the
document. Your class is responsible for keeping track of whatever
state (i.e. what depth you are at in the document, what tags were
encountered before this one, and so on) that needs to be kept.

There are some things that are cumbersome with this approach. For
example, automatic HTML entity resolving would be good. The HTML
fragment
”<h1>r&auml;ksm&ouml;rg&aring;s</h1>”
represents a single header with the string ”räksmörgås” (a common
test phrase for
Swedish programmers), and so it should only result in three
callbacks: start_h1, handle_data (which should
be called with the string ”räksmörgås”), and end_h1.

Instead, the following callbacks are called:

start_h1

handle_data (called with the string ‘r‘)

handle_entityref (called with the string ‘auml‘)

handle_data (called with the string ‘ksm‘)

handle_entityref (called with the string ‘ouml‘)

…you get the idea. There exists a
mapping that helps with the entity resolving, but for the HTML
case, this could have been solved at a lower level.

Still, for the parsing problems I have, the
callback-based/keep-your-own-goddam-state approach works. Most of
the time I’m just concerned with finding the elements
in a table, meaning I have to keep track of what cells I’ve
seen and when a new table row starts, things like that. As I go
along, I build up a list of mappings or something similar, and then
just use that list once done. The calling code gets quite nice and
simple:
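A sketch of what that pattern can look like end to end (the TableParser class and the markup here are invented stand-ins, not the real lagen.nu code):

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Accumulates one list of cell texts per <tr>; the caller just iterates."""
    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows
        self.current = None  # cells of the row being built, or None
        self.cell = None     # text of the cell being built, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.current = []
        elif tag == 'td' and self.current is not None:
            self.cell = ''

    def handle_data(self, data):
        if self.cell is not None:
            self.cell += data

    def handle_endtag(self, tag):
        if tag == 'td' and self.cell is not None:
            self.current.append(self.cell.strip())
            self.cell = None
        elif tag == 'tr' and self.current:
            self.rows.append(self.current)
            self.current = None

p = TableParser()
p.feed('<table><tr><td>SFS-nummer</td><td>1960:729</td></tr></table>')
print(p.rows)  # [['SFS-nummer', '1960:729']]
```

All the fiddly state lives inside the parser class; the calling code only sees the finished list of rows.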

(Note that the ‘c’ object here is not a standard dictionary, but a
mapping-ish object that also keeps track of the order keys have
been inserted. That’s why it’s c.item('SFS-nummer') and
not c['SFS-nummer']. That, and the fact that I was too
lazy to implement the special
methods needed to do a proper Duck Typed
dictionary.)

The one exception is the problem of finding all the plaintext in a
law text like this
one, but that’s even easier: just increment a counter whenever a <pre> tag is encountered, and decrement it when seeing </pre>. In handle_entityref and handle_data, just
check if the counter is > 0 and, if so, append the text to a StringIO
object.
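That extractor fits in a dozen lines; here is a sketch using io.StringIO (the StringIO module in Python 2) and the modern html.parser module:

```python
import io
from html.parser import HTMLParser

class PlaintextExtractor(HTMLParser):
    """Collects everything inside <pre> elements into a string buffer."""
    def __init__(self):
        super().__init__()
        self.pre_depth = 0        # incremented on <pre>, decremented on </pre>
        self.buf = io.StringIO()

    def handle_starttag(self, tag, attrs):
        if tag == 'pre':
            self.pre_depth += 1

    def handle_endtag(self, tag):
        if tag == 'pre':
            self.pre_depth -= 1

    def handle_data(self, data):
        if self.pre_depth > 0:    # only keep text that is inside a <pre>
            self.buf.write(data)

p = PlaintextExtractor()
p.feed('<h1>ignored</h1><pre>1 § Law text here</pre>')
print(p.buf.getvalue())  # 1 § Law text here
```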

Now, the first thing that needs to be done is to actually get
the text of the laws from the web. Before that can be done, a
list of available laws must be fetched. In Swedish law, most
laws have nice IDs known as ”SFS-nummer”, usually of the form
”yyyy:nnn”, where yyyy is the year it was issued and nnn is
incremented for each law passed that year. Some of the older
laws don’t strictly follow this convention, and can have IDs
like ”1736:0123 2” or ”1844:50 s.2”.
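A regular expression for these IDs has to be loose enough to admit the old-style trailers; this pattern is my own sketch, not the one lagen.nu actually uses:

```python
import re

# 'yyyy:nnn', optionally followed by old-style suffixes like ' 2' or ' s.2'
SFS_RE = re.compile(r'(\d{4}):(\d+(?:\s+s?\.?\s?\d+)?)')

for sfs_id in ['1960:729', '1736:0123 2', '1844:50 s.2']:
    m = SFS_RE.match(sfs_id)
    print(m.group(1), m.group(2))
```

The year always comes out of group 1, while group 2 swallows whatever identifier tail follows the colon.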

To get a list of all laws passed between two years, one can use this form from Regeringskansliets
Rättsdatabaser. It uses the normal GET method, and so it’s
quite easy to construct a URL that will return all laws between,
say, 1600 and 1850 (linebreaks inserted for readability):

To fetch this, I used urllib2. Now for a little aside rant: why
are there two urllibs in the standard distribution? I understand
that basic urllib has a simple interface and urllib2 a more
complex one, but would it be so hard to design a single module
that lets you do simple things easily, and progress onto hard
things? In the Perl world, you can start with LWP::Simple and
then go on to more advanced stuff, but with Python it’s either
simple urllib requests with practically no control at all, or
urllib2 with its way-too-complex system of chained handlers. I
will return to this rant in a little bit, but for now let’s have
some useful content. This is the code used to fetch stuff:
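In spirit it is little more than a urlopen call with some error handling; a sketch in modern terms (urllib.request is urllib2’s Python 3 successor, and the retry count is an invented detail, not something from the original code):

```python
import urllib.request
import urllib.error

def fetch(url, retries=3):
    """GET a URL and return the body as bytes, with a few blunt retries."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.URLError:
            if attempt == retries - 1:
                raise  # out of retries: let the caller deal with it

# data: URLs need no network, so they make a handy smoke test
print(fetch('data:text/plain,hello'))  # b'hello'
```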

So, as long as your needs are simple, like just wanting to do a
simple GET or POST, urllib and/or urllib2 will work. However, I
encountered a more complex scenario when I wanted to download
court verdicts from Domstolsväsendets
rättsinformation: This is a web app that relies on HTTP
posts, HTTP redirects, session tracking cookies, frames,
javascript links and is, in general, incredibly fragile. The
slightest ”error” in what you send and the server answers with a
”500 Server Error: java.lang.NullPointerException”.

The first problem was that the application requires cookie
support, which urllib2 doesn’t have (as it doesn’t have any
concept of state between requests). At first I thought I could fix
the cookie support by reading and setting headers, the way it was
done back in the day when men were men and knew the cookie
spec by heart. Turns out the web service sets a cookie when
you issue a search request, but the answer from the search request
is an HTTP redirect. To get the resulting list of matches, you need
to present that same cookie that was set.

Now, let’s continue the rant: urllib2 blindly follows the
redirect, giving us no chance to set the Cookie header. From the
documentation, it appears that it should be possible to override
this behaviour by subclassing HTTPRedirectHandler and passing
the instance to build_opener, which creates a chain of instances
of BaseHandler or its subclasses. Reading the documentation
for urllib2 makes me think that someone’s OO/design patterns
fetish was not kept properly in check. Anyway, I could not get
that to work.
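For completeness, the subclassing looks roughly like this (a sketch, not the code I failed to get working): returning None from redirect_request stops the redirect from being followed, so the 3xx response surfaces as an HTTPError whose headers, including Set-Cookie, the caller can inspect.

```python
import urllib.request  # urllib2 in Python 2

class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Refuse to follow redirects; the 3xx then reaches the caller
    as an HTTPError carrying the response headers."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # None means: don't build a follow-up request

opener = urllib.request.build_opener(NoRedirect)
# opener.open(url) now raises HTTPError on a 302, with Set-Cookie
# available on the error object's .headers
```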

Another thing that bugs me about urllib2 is that it has no
support for the Robots Exclusion Standard
(RES). Right now, neither Regeringskansliets databaser
nor Domstolsväsendets rättsinformation has a /robots.txt, but if they put one in tomorrow I think I
should respect it.

I did briefly use ClientCookie,
which is an add-on module for urllib2 that provides automatic
cookie support, and it did solve my first problem. Although I
did not try it, it can also be used to provide RES support,
proper Referer setting, and some other goodies. It seems that
at least the cookie handling functionality of ClientCookie has
been folded into urllib2 in Python 2.4, which is a good thing.
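In modern Python that folded-in functionality lives in http.cookiejar (cookielib in Python 2.4); a sketch of an opener that keeps cookies across requests, including intermediate redirects:

```python
import http.cookiejar   # cookielib in Python 2.4+
import urllib.request

# The CookieJar records Set-Cookie headers from every response the
# opener sees -- including redirects -- and replays them automatically.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(jar))

# opener.open('http://...search...') would now carry the session
# cookie through the redirect to the result list
print(len(jar))  # 0 until a request has actually been made
```

This sidesteps the whole read-and-set-headers dance: the jar sees the Set-Cookie on the search response before the redirect is followed.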

However, some time after I first got some code to work with the
site, they changed something around and made it even more
fragile. No matter what I did, I couldn’t get the site to
respond with anything other than a ”500 Server Error”,
even though I checked the client-server communication when using
IE (with the excellent Fiddler utility),
and replicated the behaviour down to the exact header level.

So, I remembered that Erik had told me about
the virtues of webscraping using IE
and COM automation. Since I’m only running the code on
windows machines, giving up platform independence wasn’t that
big a deal, and the rich COM support in both Python and IE made
it quite easy (after installing pywin32 for
COM support). Here’s the basic code:

With a javascript- and browser-behaviour-dependent web app
such as this, you can really save yourself a whole lot of
trouble if your code can use the actual web browser instead of
trying to emulate it. For one thing,
behaviour implemented in javascript (OnClick handlers and the
like) is reproduced correctly without any extra work.

Well, that’s all for now about fetching stuff from the web. Next
installment will center around making sense of what we’ve just
fetched, i.e. parsing HTML and stuff.

Now that lagen.nu has been out for
some time, it might be a good
idea to write down what I’ve learned from it so far, in blog
form. Much of the discussion will be centered around python, a
language I’m far from proficient in, but it’s possible that
someone will learn at least something from it.

First, take a look at this
post that explains what lagen.nu is, from a user
perspective.

This post is about how the site is produced. When I started out, I
had no clear idea of what I wanted to do, other than to download
the text of all Swedish laws and convert it to some sort of nice
HTML. I knew I wanted to do as much as possible with static HTML
files, and I had a hunch that XML would be involved in some way.

So, essentially, the code only needs to run off-line, with no
GUI required.

I thought about doing this in C#, since it would be good
experience to build a project in a language for which expertise is
highly sought after. But since I’m no longer
programming for food (actually I am, for another four days,
but still), I took the opportunity to do it in Python, a language which I’ve
always liked but never become friends with.

From a high level, the code does the following:

Finds out what laws are available

Downloads the law text HTML documents

Converts the text to XML

Transforms the XML to HTML

There are some extra steps involved in creating the front page,
RSS feeds, and handling the verdicts database, but these are the
main steps.

The result of the program is a tree with static HTML files, ready
for deployment.
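The four steps boil down to a driver loop along these lines; every function name here is an invented placeholder standing in for the real (and much hairier) implementation:

```python
def list_laws():
    return ['1960:729']            # step 1: find out what laws exist

def download_html(sfs_id):
    return '<pre>...</pre>'        # step 2: fetch the law text HTML

def html_to_xml(html):
    return '<law>...</law>'        # step 3: parse the HTML into XML

def xml_to_html(xml, sfs_id):
    # step 4: render static HTML into the year/id directory layout
    year, num = sfs_id.split(':')
    return 'html/%s/%s.html' % (year, num)

for sfs_id in list_laws():
    page = xml_to_html(html_to_xml(download_html(sfs_id)), sfs_id)
    print(page)  # html/1960/729.html
```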

(That is, if I want to test the get method of the VerdictCollection class). To test the code, I just press
C-c C-c in the python editor window. The entire python buffer gets
sent to the python shell, and the last part (after if __name__
== "__main__":) executes.
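The idiom looks like this, with a hypothetical VerdictCollection standing in for the real class:

```python
class VerdictCollection:
    """Hypothetical stand-in for the class being tested."""
    def get(self, year):
        return 'verdicts for %d' % year

if __name__ == "__main__":
    # This is the part re-run on every C-c C-c: a quick ad-hoc test
    print(VerdictCollection().get(2004))
```

Because the guard only fires when the buffer runs as the main program, the same file can be imported as a module elsewhere without triggering the test code.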

Things that are good about this environment:

Free, in both senses of the word

The indentation support really works, which is quite important with python

Reasonably fast edit-run cycle

The interactive python shell

Things that are bad:

I can’t debug stuff. It seems like it should be
possible, but I have no pdb.exe, which seems to be a
requirement. In particular, it would be nice to be able to
automatically start debugging when an unhandled exception is
raised.

Copy and paste from the *Python* buffer has character set
problems. For example, if my code outputs a § sign, and I cut’n
paste it into another file, emacs will complain:

These default coding systems were tried:
iso-latin-1-dos
However, none of them safely encodes the target text.

This is bogus, since the § sign is perfectly legal in latin-1.

I use the standard python.org distribution of Python 2.3 (I
haven’t gotten around to upgrading to 2.4 yet), not the ActiveState
one. I tried the latter, and liked the fact that the win32com module is
bundled, but the python.org version is a leaner download and has a
more usable HTML help application (particularly the good index).

Rasmus blogs
about open standards and open access for law texts, a topic of
great interest to me as of late (or ‘obsession’, apparently :-). We
both agree that there is too little of that, and that the
much-heralded open government could do better in this area.

Anyway, I came across an interesting
report, a summary from a conference held about two years ago. It
featured various people working with legal information systems, both
in the government and private companies, sharing their views on
standardisation of document formats and systems. There are views from
the people behind Rixlex, Infodata and Notisum, amongst others, but
also an interesting view into the state of legal information standards
in Norway. They seem to be way ahead of Sweden in this area.

The general consensus seemed to be ”standardization is good, and we
should do it”, but with no real commitments or timetables. Maybe there
have been developments since then that I don’t know about. This was,
after all, two years ago.

There are now at least three document standards, or efforts to
create such, for marking up law texts and other legal documents on my
radar: uscfrag
(mentioned earlier), used
by Cornell University for marking up US Code, LegalXML, which seems to be
US-centric, and LEXML, which
appears to be more EU-centric. It even has its own Sourceforge page!

I had no idea so much was going on in so many committees when I
started working on the XMLization of Swedish law. In a way I’m glad
that I didn’t, since I probably would have focused too much on
adhering to these emerging standards and spent less time, you know, getting things done. Or worse, just waited for them to actually finish.