This year I attended FOSDEM in Brussels for the first time. I have been
doing free software for more than 20 years, but for some reason, I had never
been to FOSDEM. I was pleasantly surprised to see that it was much larger than I
thought and that it gathered thousands of people. This is by far the largest
free software event I have been to. My congratulations to the organizers and
volunteers, since this must be a huge effort to pull off.

I went to FOSDEM to present Logilab's latest project, a reboot of CubicWeb to
turn it into a web extension to browse the web of data. The recording of the
talk, the slides and the video of the demo are online. I hope you enjoy them and get in touch with me if you wish to comment or contribute.

As usual, the "hallway track" was the most useful for me and there are more sets of
slides I read than talks I could attend.

I met with Bram, the author of redbaron, and we
had a long discussion about programming in different languages.

I also met with Octobus. We discussed Heptapod, a project to add Mercurial support to Gitlab. Logilab
would back such a project with money if it were to become usable and (please)
move towards federation (with ActivityPub?) and user queries (with GraphQL?). We
also discussed the so-called oxidation of Mercurial, which consists of rewriting some parts in Rust. After a quick search I saw that tools like PyO3 can help write Python extensions in Rust.

Some of the talks that interested me included:

Memex, which reuses the name of the very first hypertext system described in the literature, and tries to implement a decentralized annotation system for the web. It reminded me of Hypothesis and the W3C Web Annotation recommendation, which they say they will be compatible with.

Web Components are one of the options to develop user interfaces in the
browser. I had a look at the Future of Web Components, which I
relate to the work we are doing with the CubicWeb browser (see above) and the
work the Virtual Assembly has been doing to implement Transiscope.

Pyodide, the scientific Python stack compiled to WebAssembly, which I am trying to compare to using Jupyter notebooks.

Chat-over-IMAP, yet another attempt at a chat protocol to rule them all. It is true that everyone has more than one email address, that email addresses are increasingly used as logins on many web sites, and that using these email addresses as instant-messaging / chat addresses would be nice. We will see if it takes off!

"Will my refactoring break my code ?" is the question the developer asks himself because he is not sure the tests cover all the cases. He should wonder, because tests that cover all the cases would be costly to write, run and maintain. Hence, most of the time, small decisions are made day after day to test this and not that. After some time, you could consider that in a sense, the implementation has become the specification and the rest of the code expects it not to change.

Let us assume you want to add a cache to a function that reads data from a database. The function is named read_from_db, takes an int parameter item_id and returns a dict mapping the attributes of the item to their values.

You could experiment with the new version of this function like so:

import laboratory

def read_from_db_with_cache(item_id):
    data = {}
    # some implementation with a cache
    return data

@laboratory.Experiment.decorator(candidate=read_from_db_with_cache)
def read_from_db(item_id):
    data = {}
    # fetch data from db
    return data

When you run the above code, calling read_from_db returns its result as usual, but thanks to laboratory, a call to read_from_db_with_cache is made and its execution time and result are compared with the first one. These measurements are logged to a file or sent to your metrics solution for you to compare and study.

In other words, things continue to work as usual as you keep the original function, but at the same time you experiment with its candidate replacement to make sure switching will not break or slow things down.

I like the idea! Thanks to the authors of Scientist and Laboratory, which are both available under the MIT license.

I had to work on enhancing an Angular-based application and wanted to provide the additional functionality as an isolated component that I could develop and test without messing with a large Angular controller that several other people were working on.

Here is my Angular+React "Hello World", with a couple of gotchas that were not underlined in the documentation and took me some time to figure out.

<react-component> is not a React component, but an Angular directive that delegates to a React component. Therefore, you should not expect the interface of this tag to be the same as the one of a React component. More precisely, you can only use the props attribute and cannot set your React properties by adding more attributes to this tag. If you want to be able to write something like <react-component firstname="person.firstname" lastname="person.lastname"> you will have to use reactDirective to create a specific Angular directive.

You have to set an object as the props attribute of the react-component tag, because it will be used as the value of this.props in the code of your React class. For example, if you set the props attribute to a string (person.name instead of person in the above example), you will have trouble using it on the React side, because you will get an object built from the enumeration of the string. Therefore, the above example cannot be made simpler. If we had written $scope.name = 'you', we could not have passed it correctly to the React component.

The above was tested with angular@1.5.8, ngreact@0.3.0, react@15.3.0 and react-dom@15.3.0.

All in all, it worked well. Thank you to all the developers and contributors of these projects.

Please forward it to whoever may be interested, underlining that pizzas
will be offered to refuel the chatters ;)

Conveniently placed a week after the Salt Conference, topics will
include anything related to salt and its uses, demos, new ideas,
exchange of salt formulas, commenting on the talks/videos of SaltConf, etc.

If you are interested in Salt, Python and Devops and will be in Paris at that time, we
hope to see you there!

As you can see, I had three types of calculators, hence at least three
Kata to practice, but as usual with beginners, it took us the whole tutorial to
get done with the first one.

The room was a class room that we set up as our coding dojo with the coder and his copilot working
on a laptop, facing the rest of the participants, with the large screen at
their back. The pair-programmers could freely discuss with the people facing
them, who were following the typing on the large screen.

We switched every ten minutes: the copilot became the coder, the coder went back to his seat in the class, and someone else stood up to become the copilot.

The session was allocated 3 hours split over two slots of 1h30. It took me less
than 10 minutes to open the session with the above slide, 10 minutes as first
coder and 10 minutes to close it. Over a time span of 3 hours, that left 150
minutes for coding, hence 15 people. Luckily, the whole group was about that
size and almost everyone got a chance to type.

I completely skipped explaining Python, its syntax and the unittest framework, and we jumped right into writing our first tests with if and print statements. Since they knew about other programming languages, they picked up the Python language on the way.

After more than an hour of slowly discovering Python and TDD, someone in the
room realized they had been focusing more on handling exception cases and
failures than implementing the parsing and computation of the formulas because
the specifications were not clearly understood. He then asked me the right
question by trying to define Reverse Polish Notation in one sentence and
checking that he got it right.

Different algorithms to parse and compute RPN formulas were devised at the blackboard during the pause, while part of the group went for a coffee break.

The implementation took about another hour to get right, with me making sure
they would not wander too far from the actual goal. Once the stack-based
solution was found and implemented, I asked them to delete the files, switch
coder and start again. They had forgotten about the Kata definition and were
surprised, but quickly enjoyed it when they realized that progress was much
faster on the second attempt.
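As an illustration, here is a minimal sketch of the kind of stack-based RPN evaluator and first unittest tests the group converged on; the function and test names are mine, not the exact code written during the dojo.

import unittest

def compute_rpn(formula):
    """Evaluate a Reverse Polish Notation formula given as a string,
    e.g. '3 4 +' -> 7."""
    stack = []
    for token in formula.split():
        if token in ('+', '-', '*', '/'):
            right = stack.pop()
            left = stack.pop()
            if token == '+':
                stack.append(left + right)
            elif token == '-':
                stack.append(left - right)
            elif token == '*':
                stack.append(left * right)
            else:
                stack.append(left / right)
        else:
            stack.append(float(token))
    return stack.pop()

class TestRPNCalculator(unittest.TestCase):
    def test_single_number(self):
        self.assertEqual(compute_rpn('5'), 5)

    def test_addition(self):
        self.assertEqual(compute_rpn('3 4 +'), 7)

    def test_nested_operations(self):
        self.assertEqual(compute_rpn('3 4 + 2 *'), 14)

if __name__ == '__main__':
    unittest.main()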

Since it is always better to show that you can walk the talk, I closed the session by practicing the RPN calculator kata myself in a bit less than 10
minutes. The order in which to write the tests is the tricky part, because it
can easily appear far-fetched for such a small problem when you already know an
algorithm that solves it.

Yams is a pythonic way to describe an entity-relationship model. It is used at the core of the CubicWeb semantic web framework in order to automate lots of things, including the generation and validation of forms. Although we had been using the MVC design pattern to write user interfaces with Qt and Gtk before we started CubicWeb, we never got to reuse Yams. I am on my way to fix this.

Here is the simplest possible example that generates a user interface (using dialog and python-dialog) to input data described by a Yams data model.
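The original code is not reproduced here, so below is a minimal sketch of the idea, assuming the pythondialog Dialog class and yams.buildobjs; the entity types, their attributes and the hand-rolled validation loop are simplifications of mine, whereas the real example derives the prompts and checks from the yams schema itself.

from dialog import Dialog
from yams.buildobjs import EntityType, String, Int

class Form(EntityType):
    title = String(required=True, maxsize=128)

class Question(EntityType):
    text = String(required=True, maxsize=256)
    number = Int()

def ask(dlg, label, converter):
    """Prompt until the input satisfies the expected type."""
    while True:
        code, answer = dlg.inputbox('%s?' % label)
        try:
            return converter(answer)
        except ValueError:
            dlg.msgbox('%r is not a valid value for %s' % (answer, label))

dlg = Dialog()
# the real example walks the yams schema to build the prompts and apply the
# declared constraints; here the same information is hard-coded for brevity
values = {'title': ask(dlg, 'title', str),
          'text': ask(dlg, 'text', str),
          'number': ask(dlg, 'number', int)}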

The result is a program that will prompt the user for the title of a form and the text/number of a question, then enforce the type constraints and display the inconsistencies.

The above is very simple and does very little, but if you read the documentation of Yams and if you think about generating the UI with Gtk or Qt instead of dialog, or if you have used the form mechanism of CubicWeb, you'll understand that this proof of concept opens a door to a lot of possibilities.

I will come back to this topic in a later article and give an example of integrating the above with pigg, a simple MVC library for Gtk, to make the programming of user-interfaces even more declarative and bug-free.

It started with a desire to draw diagrams of hierarchical systems with Python. Since this is similar to what we do in CubicWeb with schemas of the data model, I read the code and realized we had that graph submodule in the logilab.common library. This module uses dot from graphviz as a backend to draw the diagrams.

Reading about UML diagrams drawn with GraphViz, I learned about UMLGraph, which uses GNU Pic to draw sequence diagrams. Pic is a language based on groff and the pic2plot tool is part of plotutils (apt-get install plotutils). Here is a tutorial. I have found some Python code wrapping pic2plot, available as a plugin to wikipad. It is worth noticing that TeX seems to have a nice package for UML sequence diagrams called pgf-umlsd.

Since nowadays everything is moving into the web browser, I looked for a JavaScript library that does what graphviz does, and I found canviz, which looks nice.

If (only) I had time, I would extend pyreverse to draw sequence diagrams and not only class diagrams...

I upgraded to Debian Squeeze over the week-end and it broke my custom Xmodmap. While I was fixing it, I realized that the special keys of my Microsoft Natural keyboard that were not working under Lenny were now functional. The only piece missing was the "zoom" key. Here is how I got it to work.

I found on the askubuntu forum a solution to the same problem, but it is missing the following details.

The EuroSciPy 2010 conference will be held in Paris from July 8th to 11th at Ecole Normale Supérieure. Two days of tutorials, two days of conference, two interesting keynotes, a lightning talk session, an open space for collaboration and sprinting, thirty quality talks in the schedule and already 100 delegates registered.

I have been wondering for some time why debsign would not use the DEBSIGN_KEYID environment variable that I exported from my bashrc. Debian bug 444641 explains the trick: debsign ignores environment variables and sources ~/.devscripts instead. A simple export DEBSIGN_KEYID=ABCDEFG in ~/.devscripts is enough to get rid of the -k argument once and for all.

The Mercurial 1.5 sprint is taking place in our offices this week-end and pair-programming with Steve made me want a better looking terminal. Have you seen his extravagant zsh prompt? I used to have only 8 colors to decorate my shell prompt, but thanks to some time spent playing around, I now have 256.

Yay, I now have an orange prompt! I now need to write a script that will display useful information depending on the context. Displaying the status of the mercurial repository I am in might be my next step.

I have been doing free software since I discovered it existed. I bought an OpenMoko some time ago, since I am interested in anything that is open, including artwork like books, music, movies and... hardware.

I just learned about two lists, one at Wikipedia and another one at MakeOnline, but Google has more. Explore and enjoy!

As said in a previous article, I am convinced that part of the motivation for
making package sub-systems like the Python one, which includes distutils,
setuptools, etc, is that Windows users and Mac users never had the chance to use
a tool that properly manages the configuration of their computer system. They
just do not know what it would be like if they had at least a good package
management system and do not miss it in their daily work.

I looked for Windows package managers that claim to provide features similar to
Debian's dpkg+apt-get and here is what I found in alphabetical order.

Appupdater provides functionality similar to apt-get or yum. It automates the
process of installing and maintaining up to date versions of programs. It claims
to be fully customizable and is licensed under the GPL.

Win-get is an automated install system and software repository for Microsoft
Windows. It is similar to apt-get: it connects to a link repository, finds an
application and downloads it before performing the installation routine (silent
or standard) and deleting the install file.

It is written in Pascal and is set up as a SourceForge project, but not much has been done lately.

I have not used any of these tools, the above is just the result of some time
spent searching the web.

A more limited approach is to notify the user of the newer versions:

App-Get will show you a list of your installed applications. When an update is available for one of them, it will be highlighted and you will be able to update the specific applications in seconds.

GetIt is not an application-getter/installer. When you want to install a
program, you can look it up in GetIt to choose which program to install from a
master list of all programs made available by the various apt-get clones.

The appupdater project also compares itself to the programs automating the
installation of software on Windows.

I once read about a project to get the Windows kernel into the Debian
distribution, but can not find any trace of it... Remember that Debian is not
limited to the Linux kernel, so why not think about a very improbable
apt-get install windows-vista?

Today I felt like summing up my opinion on a topic that was discussed this year on
the Python mailing lists, at PyCon-FR, at EuroPython and EuroSciPy... packaging
software! Let us discuss the two main use cases.

The first use case is to maintain computer systems in production. A trait of production systems is that they cannot afford failures and are often deployed
on a large scale. It leaves little room for manually fixing problems. Either
the installation process works or the system fails. Reaching that level of
quality takes a lot of work.

The second use case is to facilitate the life of software developers and computer
users by making it easy for them to try out new pieces of software without much work.

The first use case has to be addressed as a configuration management
problem. There is no way around it. The best way I know of managing the
configuration of a computer system is called Debian. Its package format and its
tool chain provide a very extensive and efficient set of features for system
development and maintenance. Of course it is not perfect and there are missing
bits and open issues that could be tackled, like the dependencies between
hardware and software. For example, nothing will prevent you from installing on
your Debian system a version of a driver that conflicts with the version of the
chip found in your hardware. That problem could be solved, but I do not think
the Debian project is there yet, and I do not count it as a reason to reject Debian, since I have not seen any other competitor at the same level as Debian.

The second use case is kind of a trap, for it concerns most computer users and
most of those users are either convinced the first use case has nothing in
common with their problem or convinced that the solution is easy and requires
little work.

The situation is made more complicated by the fact that most of those users
never had the chance to use a system with proper package management tools. They
simply do not know the difference and do not feel like they are missing anything when using their system-that-comes-with-a-windowing-system-included.

Since many software developers have never had to maintain computer systems in
production (often considered a lower sysadmin job) and never developed packages
for computer systems that are maintained in production, they tend to think that
the operating system and their software are perfectly decoupled. They have no
problem trying to create a new layer on top of existing operating systems and
transforming an operating system issue (managing software installation) into a programming language issue (see CPAN, Python eggs and so many others).

Creating a sub-system specific to a language and hosting it on an operating
system works well as long as the language boundary is not crossed and there is
no competition between the sub-system and the system itself. In the Python
world, distutils, setuptools, eggs and the like more or less work with pure
Python code. They create a square wheel that was made round years ago by
dpkg+apt-get and others, but they help a lot of their users do something they
would not know how to do another way.

A wall is quickly hit though, as the approach becomes overly complex as soon as
they try to depend on things that do not belong to their Python sub-system. What
if your application needs a database? What if your application needs to link to
libraries? What if your application needs to reuse data from or provide data to
other applications? What if your application needs to work on different
architectures?

The software developers that never had to maintain computer systems in
production wish these tasks were easy. Unfortunately they are not easy and cannot be. As I said, there is no way around configuration management for the one
who wants a stable system. Configuration management requires both project
management work and software development work. One can have a system where
packaging software is less work, but that comes at the price of stability, reduced functionality and harder maintenance.

Since none of the two use cases will disappear any time soon, the only solution
to the problem is to share as much data as possible between the different tools
and let each one decide how to install software on his computer system.

In his keynote, Francesc Alted talked about starving CPUs. Thirty years back, memory and CPU frequencies were about the same. Memory speed kept up with the evolution of CPU speed for about ten years before falling behind. Nowadays,
memory is about a hundred times slower than the cache which is itself about
twenty times slower than the CPU. The direct consequence is that CPUs are
starving and spend many clock cycles waiting for data to process.

In order to improve the performance of programs, it is now required to know
about the multiple layers of computer memory, from disk storage to CPU. The
common architecture will soon count six levels: mechanical disk, solid state
disk, RAM, cache level 3, cache level 2, cache level 1.

Using optimized array operations, taking striding into account, processing data
blocks of the right size and using compression to diminish the amount of data
that is transferred from one layer to the next are four techniques that go a long
way on the road to high performance. Compression algorithms like Blosc increase
throughput for they strike the right balance between being fast and providing
good compression ratios. Blosc compression will soon be available in PyTables.
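As a small illustration of the compression idea, here is a sketch using the blosc Python bindings (the array size and the typesize value below are arbitrary choices of mine):

import numpy
import blosc

a = numpy.arange(1e6)        # one million float64 values
raw = a.tobytes()

# compress the raw buffer; typesize tells blosc the size of each element
compressed = blosc.compress(raw, typesize=8)
print(len(raw), '->', len(compressed), 'bytes')

# decompress and rebuild the array to check the round trip
restored = numpy.frombuffer(blosc.decompress(compressed), dtype=a.dtype)
assert (restored == a).all()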

Francesc also mentioned the numexpr extension to numpy, and its combination with PyTables named tables.Expr, which nicely and easily accelerates the computation of some expressions involving numpy arrays. In his list of references, Francesc cites Ulrich Drepper's article What every programmer should know about memory.
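For the record, here is the kind of expression numexpr accelerates (a minimal sketch; the arrays and the formula are arbitrary examples of mine):

import numpy
import numexpr

a = numpy.random.rand(1000000)
b = numpy.random.rand(1000000)

# numexpr compiles the expression and evaluates it blockwise,
# keeping intermediate results small enough to stay in cache
result = numexpr.evaluate('2*a + 3*b**2')

# same computation with plain numpy, which allocates full-size temporaries
assert numpy.allclose(result, 2*a + 3*b**2)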

Maciej Fijalkowski started his talk with a general presentation of the PyPy
framework. One uses PyPy to describe an interpreter in RPython, then generate
the actual interpreter code and its JIT.

Since PyPy has become more of a framework to write interpreters than a reimplementation of Python in Python, I suggested changing its misleading name to something like gcgc, the Generic Compiler for Generating Compilers. Maciej
answered that there are discussions on the mailing list to split the project in
two and make the implementation of the Python interpreter distinct from the GcGc
framework.

Maciej then focused his talk on his recent effort to rewrite in RPython the part
of numpy that exposes the underlying C library to Python. He says the benefits
of using PyPy's JIT to speedup that wrapping layer are already visible. He has
details on the PyPy blog. Gaël Varoquaux added that David Cournapeau has
started working on making the C/Python split in numpy cleaner, which would
further ease the job of rewriting it in RPython.

Damien Diederen talked about his work on CrossTwine Linker and compared it
with the many projects that are actively attacking the problem of speed that
dynamic and interpreted languages have been dragging along for years. Parrot
tries to be the über virtual machine. Psyco offers very nice acceleration, but currently only on 32-bit systems. PyPy might be what he calls the Right
Approach, but still needs a lot of work. Jython and IronPython modify the
language a bit but benefit from the qualities of the JVM or the CLR. Unladen
Swallow is probably the one that's most similar to CrossTwine.

CrossTwine considers CPython as a library and uses a set of C++ classes to
generate efficient interpreters that make calls to CPython's
internals. CrossTwine is a tool that helps improving performance by
hand-replacing some code paths with very efficient code that does the same
operations but bypasses the interpreter and its overhead. An interpreter built
with CrossTwine can be viewed as a JIT'ed branch of the official Python
interpreter that should be feature-compatible (and bug-compatible) with CPython.
Damien calls his approach "punching holes in the C substrate to get more speed" and
says it could probably be combined with Psyco for even better results.

CrossTwine works on 64-bit systems, but it is not (yet?) free software. It
focuses on some use cases to greatly improve speed and is not to be considered a
general purpose interpreter able to make any Python code faster.

Wolfram Alpha is a web front-end to a huge database of information covering very different topics, ranging from mathematical functions to genetics, geography, astronomy, etc.

When you search for a word, it will try to match it with one of the objects it has in its database and display all the information it has concerning that object. For example it can tell you a lot about Halley's Comet, including where it is at the moment you ask the query. This is the main difference with, say, Wikipedia, which will know a lot about that comet in general, but is not meant to compute its location in the sky at the moment you enter your query.

Searches are not limited to words. One can key in commands like weather in Paris in june 2009 or x^2+sin(x) and get results for those precise queries. The processing of the input query is far from bad, since it returns results to questions like what are the cities of France, but I would not call it state-of-the-art natural language processing, since that query returns the largest cities instead of just the cities it knows about, and the question what are the smallest cities of France will not return any result. Natural language processing is a very difficult problem, though, especially when done in the open world as is the case there, with an engine available to the wide public on the internet.

For more examples, visit the WolframAlpha website, where you will also be able to post feature requests or, if you are a developer, get documentation about the WolframAlpha API and maybe use it as a web service in your application when you need to answer certain types of questions.

Once again Logilab sponsored the EuroPython conference. We would like to thank
the organization team (especially John Pinner and Laura Creighton) for their
hard work. The Conservatoire is a very central location in Birmingham and
walking around the city center and along the canals was nice. The website was
helpful when preparing the trip and made it easy to find places where to eat and
stay. The conference program was full of talks about interesting topics.

I presented CubicWeb and spent a large part of my talk explaining what the semantic web is and what features we need in the tools we will use to be part of that web of data. I insisted on the fact that CubicWeb is made of two parts,
the web engine and the data repository, and that the repository can be used
without the web engine. I demonstrated this with a TurboGears application that
used the CubicWeb repository as its persistence layer. RQL in TurboGears! See my
slides and Reinout Van Rees' write-up.

Christian Tismer took over the development of Psyco a few months ago. He said he
recently removed some bugs that were show stoppers, including one that was
generating way too many recompilations. His new version looks very
promising. Performance improved, long numbers are supported, 64bit support may
become possible, generators work... and Stackless is about to be rebuilt on top of
Psyco! Psyco 2.0 should be out today.

I had a nice chat with Cosmin Basca about the Semantic Web. He suggested using
Mako as a templating language for CubicWeb. Cosmin is doing his PhD at DERI and
develops SurfRDF which is an Object-RDF mapper that wraps a SPARQL endpoint to
provide "discoverable" objects. See his slides and Reinout Van Rees' summary
of his talk.

I saw a lightning talk about the Nagare framework which refuses to use
templating languages, for the same reason we do not use them in CubicWeb. Is
their h.something the right way of doing things? The example reminds me of the
C++ concatenation operator. I am not really convinced by the continuation idea, since I have been for years a happy user of the reactor model that is implemented in frameworks like Twisted. Read the blog and documentation for more
information.

I had a chat with Jasper Op de Coul about Infrae's OAI Server and the work he
did to manage RDF data in Subversion and a relational database before publishing
it within a web app based on YUI. We commented on code that handles books and library catalogs. Part of my CubicWeb demo was about books in DBpedia and
cubicweb-book. He gave me a nice link to the WorldCat API.

Souheil Chelfouh showed me his work on Dolmen and Menhir. For several design
problems and framework architecture issues, we compared the solutions offered by
the Zope Toolkit library with the ones found by CubicWeb. I will have to read
more about Martian and Grok to make sure I understand the details of that
component architecture.

I had a chat with Martijn Faassen about packaging Python modules. A one sentence
summary would be that the Python community should agree on a meta-data format
that describes packages and their dependencies, then let everyone use the tool
he likes most to manage the installation and removal of software on his system.
I hope the work done during the last PyConUS and led by Tarek Ziadé arrived at the
same conclusion. Read David Cournapeau's blog entry about Python Packaging for
a detailed explanation of why the meta-data format is the way to go. By the way,
Martijn is the lead developer of Grok and Martian.

Godefroid Chapelle and I talked a lot about Zope Toolkit (ZTK) and CubicWeb. We
compared the way the two frameworks deal with pluggable components. ZTK has
adapters and a registry. CubicWeb does not use adapters as ZTK does, but has a
view selection mechanism that required a registry with more features than the
one used in ZTK. The ZTK registry only has to match a tuple (Interface, Class)
when looking for an adapter, whereas CubicWeb's registry has to find the views
that can be applied to a result set by checking various properties (a toy sketch of the idea follows the list below):

interfaces: all items of first column implement the Calendar Interface,

dimensions: more than one line, more than two columns,

types: items of first column are numbers or dates,

form: form contains key XYZ that has a value lower than 10,

session: user is authenticated,

etc.
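To make the comparison concrete, here is a toy sketch of such a predicate-based selection mechanism; it only illustrates the idea of scoring views against a result set and a session, and is not the actual CubicWeb API.

class View:
    """A view declares predicates; the registry scores it against a result set."""
    predicates = ()

    @classmethod
    def score(cls, rset, session):
        total = 0
        for predicate in cls.predicates:
            score = predicate(rset, session)
            if score == 0:
                return 0          # one failing predicate vetoes the view
            total += score
        return total

def multi_lines(rset, session):
    return 1 if len(rset) > 1 else 0

def authenticated(rset, session):
    return 1 if session.get('authenticated') else 0

class CalendarView(View):
    predicates = (multi_lines, authenticated)

class TableView(View):
    predicates = (multi_lines,)

def select_view(candidates, rset, session):
    """Return the candidate with the highest score, as a registry would."""
    scored = [(cls.score(rset, session), cls) for cls in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score else None

# usage: an authenticated session and a two-line result set select CalendarView
print(select_view([CalendarView, TableView],
                  rset=[('2009-07-01', 1), ('2009-07-02', 2)],
                  session={'authenticated': True}))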

As for Grok and Martian, I will have to look into the details to make sure
nothing evil is hiding there. I should also find time to compare zope.schema
and yams and write about it on this blog.

I presented CubicWeb at several conferences recently and I used the following as an introduction.

Web version numbers:

version 0 = the internet links computers

version 1 = the web links documents

version 2 = web applications

version 3 = the semantic web links data [we are here!]

version 4 = more personalization and fixes for the problems with privacy and security

... reach into the physical world, bits of AI, etc.

In his blog at MIT, Tim Berners-Lee calls version 0 the International Information Infrastructure, version 1 the World Wide Web and version 3 the Giant Global Graph. Read the details about the Giant Global Graph on his blog.

We recently added the book cube to our intranet in order for books available in our library to show up in the search results. Entering a couple of books using the default HTML form, even with the help of copy/paste from Google Books or Amazon, is boring enough to make one seek out other options.

As a Python and Debian user, I put the python-gdata package on my list of options, but quickly realized that the version in Debian is not current and that the books service is not yet accessible with the python gdata client. Both problems could be easily overcome since I could update Debian's version from 1.1.1 to the latest 1.3.1 and patch it with the book search support that will be included in the next release, but I went on exploring other options.

Amazon is the first answer that comes to mind when speaking of books on the net and pyAWS looks like a nice wrapper around the Amazon Web Service. The quickstart example on the home page does almost exactly what I was looking for. Trying to find a Debian package of pyAWS, I only came across boto, which appears to be general purpose.

Registering with Amazon and Google to get a key and use their web services is doable, but one wonders why something as common as books and public libraries would have to be accessed through private companies. It turns out Wikipedia knows of many book catalogs on the net, but I was looking for a site publishing data as RDF or part of the Linked Open Data initiative. I ended up with almost exactly what I needed.

The Open Library features millions of books and covers, publicly accessible as JSON using its API. There is even a dump of the database. End of search, be happy.

The next step is to use this service to enhance the cubicweb-book cube by allowing a user to add a new book to their collection by simply entering an ISBN. All data about the book can be fetched from the Open Library, including the cover and information about the author. You can expect such a new version soon... and we will probably get a new demo of CubicWeb online in the process, since all that data available as a dump is screaming for reuse, as others have already found out by making it available as RDF on AppEngine!
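As an illustration, here is a sketch of fetching book data by ISBN; the endpoint, parameters and response keys below reflect my reading of the Open Library books API and may differ from the current API.

import json
from urllib.request import urlopen

def fetch_book(isbn):
    """Query the Open Library books API and return the data for one ISBN."""
    url = ('https://openlibrary.org/api/books?bibkeys=ISBN:%s'
           '&format=json&jscmd=data' % isbn)
    with urlopen(url) as response:
        payload = json.load(response)
    return payload.get('ISBN:%s' % isbn)

book = fetch_book('0451526538')
if book:
    print(book['title'])
    print([author['name'] for author in book.get('authors', [])])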

As some readers of this blog may be aware, Logilab has been developing its own framework since 2001. It evolved over the years, trying to reach the main goal (managing and publishing data with style) and to incorporate the good ideas seen in the other Python frameworks Logilab developers had used. Now, companies other than Logilab have started providing services for this framework and it is stable enough for the core team to be confident in recommending it to third parties willing to build on it without suffering from the Tasmanian devil syndrome.

CubicWeb version 3.0 was released on the last day of 2008. That's 7 years of research and development and (at least) three rewrites that were needed to get this in shape. Enjoy it at http://www.cubicweb.org/ !

For those interested in the Semantic Web as much as we are at Logilab, the announcement of the new DBpedia release is very good news. Version 3.2 is extracted from the October 2008 Wikipedia dumps and provides three major improvements: the DBpedia Schema, which is a restricted vocabulary extracted from the Wikipedia infoboxes; RDF links from DBpedia to Freebase, the open-license database providing about a million things from various domains; and cleaner abstracts without the traces of Wikipedia markup that made them difficult to reuse.

DBpedia can be downloaded, queried with SPARQL or linked to via the Linked Data interface. See the about page for details.

It is important to note that ontologies are usually more of a common language for data exchange, meant for broad re-use, which means that they cannot enforce too many restrictions. By contrast, database schemas are more restrictive and allow for more interesting inferences. For example, a database schema may enforce that the Publisher of a Document is a Person, whereas a more general ontology will have to allow a Publisher to be a Person or a Company.
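In CubicWeb terms, such a restriction would be expressed in the Yams schema; here is a hedged sketch of what it could look like (the entity and relation names are mine, chosen for the example):

from yams.buildobjs import EntityType, SubjectRelation, String

class Person(EntityType):
    name = String(required=True)

class Document(EntityType):
    title = String(required=True)
    # the schema restricts the publisher to a Person; a looser ontology
    # would also accept a Company as the object of this relation
    publisher = SubjectRelation('Person', cardinality='1*')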

DBpedia provides its schema and moves forward by adding a mapping from that schema to actual ontologies like UMBEL, OpenCyc and Yago. This enables DBpedia users to infer from facts fetched from different databases, like DBpedia + Freebase + OpenCyc.
Moreover, 'checking' DBpedia's data against ontologies will help detect mistakes or weirdnesses in Wikipedia's pages. For example, if data extracted from Wikipedia's infoboxes states that "Paris was_born_in New_York", reasoning and consistency checking tools will be able to point out that a person may be born in a city, but a city cannot, hence the above fact is probably an error and should be reviewed.

With CubicWeb, one can easily define a schema specific to one's domain, then quickly set up a web application and easily publish the content of its database as RDF following a known ontology. In other words, CubicWeb makes almost no difference between a web application and a database accessible through the web.

Graphical user interfaces help command discovery, while command-line interfaces help command efficiency. This article tries to explain why. I reached it when reading the list of references from the introduction to Ubiquity, which is the best extension to Firefox I have seen so far. I expect to start writing Ubiquity commands soon, since I have already been using extensively the 'keyword shortcut' functionality of Firefox's bookmarks, and we have already done work in the area of 'language interaction', as they call it at Mozilla Labs, when working with Narval. Our Logilab Simple Desktop project, aka simpled, also goes in the same direction, since it tries to unify different applications into a coherent work environment by defining basic commands and shortcuts that can be applied everywhere and by making the rest of the functionalities accessible via a command-line interface.

The Openmoko Freerunner is a computer with embedded GSM, accelerometer and GPS. I got mine last week, after waiting a month for the batch to travel from Taiwan to the French company I bought it from. The first thing I had to admit was that some time will pass before it gets comfortable to use it as a phone. The current version of the system has many weird things in its user interface, and the phone works, but the other end of the call suffers from a very unpleasant echo.

I will try to install Debian, Qtopia and Om2008.8 to compare them. I also want to quickly get Python scripts to run on it and get back to Narval hacking. I had an agent running on a bulky Palm+GPS+radionetwork back in 1999 and I look forward to running on this device the same kind of funny things I was doing in AI research ten years ago.

Last week I bought a new laptop computer that can drive a 24" LCD monitor, which means I do not need my desktop computer any more. In the process of setting up that new laptop, I did what I have been wanting to do for years without finding the time: spend time on my ion3 config to make it more generic and create a small Python setup utility that can regenerate it from a template file and a keyboard layout.

If you take a look at the list of pending tickets, you will guess that I use a limited number of pieces of software during my work day and have tried to configure them so that they share common actions/shortcuts. This is what simpled is about: given a keyboard layout, generate the config files for the common tools so that actions/shortcuts are always on the same keys.

I use ion3, xterm+bash, emacs, mutt, firefox, gajim. Common actions are: open, save, close, move up/down/left/right, new frame or tab, close frame or tab, move to previous or next tab, etc.
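Here is a minimal sketch of the kind of setup utility described above; the template syntax, the file names and the layout mapping are hypothetical, only meant to illustrate generating config files from a keyboard layout.

from string import Template

# hypothetical mapping from abstract actions to keys for a given layout
QWERTY = {'open': 'o', 'save': 's', 'close': 'w',
          'next_tab': 'n', 'prev_tab': 'p'}

def generate(template_path, output_path, layout):
    """Fill a config template with the keys of the chosen layout."""
    with open(template_path) as template_file:
        template = Template(template_file.read())
    with open(output_path, 'w') as output:
        output.write(template.safe_substitute(layout))

# a template line could for instance read:  bind "$save" -> save-file
# generate('editor.cfg.in', 'editor.cfg', QWERTY)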

I will give news in this blog from time to time and announce it on mailing lists when version 0.1 will be out. If you want to give it a try, get the code from the mercurial repository.

While working on knowledge management and semantic web technologies, I came across the Simile project at MIT a few years back. I even had a demo of the Exhibit widget fetching then displaying data from our semantic web application framework back in 2006 at the Web2 track of Solutions Linux in Paris.

Now that we are using these widgets when implementing web apps for clients, I was happy to see that the projects got a life of their own outside of MIT and became full-fledged free-software projects hosted on Google Code. See Simile-Widgets for more details and expect us to provide a Debian package soon unless someone does it first.

Speaking of Debian, here is a nice demo of the Timeline widget presenting the Debian history.

We have been using many different tools for doing statistical analysis with Python, including R, SciPy, specific C++ code, etc. It looks like the growing audience of SciPy is now pushing to have dedicated modules in SciPy (let's call them SciKits). See this thread on the SciPy-user mailing-list.

The presentation of Python as a tool for applied mathematics got highlighted at the 2008 annual meeting of the Society for Industrial and Applied Mathematics (SIAM). For more information, read this blogpost and the slides.

At Google IO, a large part of the Tools track was dedicated to
AppEngine. Brett Slatkin gave a talk titled Building scalable Web
Applications with Google AppEngine which focused on optimizing the
server part of web apps. As other presenters demonstrated it, like
Steve Souders in his talk Even Faster Websites, optimizing the
browser part of webapps is not to be neglected either.

First of all, I must confess I am used to repeating that "early optimisation is the root of all evil" and "delay commitment until the last responsible moment". But reading about AppEngine and listening to
the Google IO talks, it appears that the tools we have today ask for
human intervention to reach web-scale performance, even when "we"
stands for "Google".

In order for web-scale applications to handle the kind of load they
are facing, they must be designed and implemented carefully. As
carefully as any application was designed before the exponential
growth of PC computation power let us move away from low-level
implementation details and made some inefficiencies acceptable as long
as the time spent developing was short enough.

It all depends on the parameters of your cost function, but for web-scale applications, it seems like we do not have enough computer time and cannot trade it for human time.

To get a better idea of the work constraints, one should know that a
disk seek is about 10ms, which means there will be a maximum of 100
accesses per second. On the other hand, if we need consistent data as
opposed to transactional data (the latter implying that data is
fetched each time it is asked for), data can be read from disk once
then cached. Subsequent reads are done from memory at a rate of about 4GB/sec, which means 4000 accesses per second if entities are around 1MB in size. The result of this back-of-the-envelope approximation is that one write costs about as much as 40 reads.
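The arithmetic behind this estimate, as a quick sketch:

seek_time = 0.010                          # one disk seek ~ 10 ms
writes_per_second = 1 / seek_time          # ~ 100 seeks, hence writes, per second

memory_bandwidth = 4e9                     # ~ 4 GB/s read from memory
entity_size = 1e6                          # entities around 1 MB
reads_per_second = memory_bandwidth / entity_size   # ~ 4000 cached reads per second

print(reads_per_second / writes_per_second)          # ~ 40 reads per write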

It follows that, although the actual time depends on the size and shape
of data, writes are very expensive compared to reads and both are
better done in batches to optimise disk access.

The AppEngine Datastore was designed with these constraints in mind. Entities are sets of property name/value pairs. Each entity may
have a parent. An entity without a parent is the root of a hierarchy
called an entity group.

Entities of the same group are stored on disk close to each other, but
two distinct entity groups may be stored on different computers. Read
access to entities of the same group is thus faster than read access
to entities of different groups.

Write access is serialized per entity group. As opposed to a traditional RDBMS that provides row locking, the datastore only provides entity group locking. Writes to a single entity group will always happen in sequence, even though the changes concern different entities.

There is no limit to the number of entity groups or to the number of
entities per group, but because of the locking strategy, large entity
groups will cause high contention and a lot of failed
transactions. Since writes are expensive, not thinking about write
throughput is a very bad idea when designing an AppEngine application
if one wants it to scale.

On the other hand, the parallel nature of the datastore makes it scale wide and there is no limit to the number of entity groups that can be
written to in parallel, nor to the number of reads that can be done in
parallel.

To understand this design in details, you will have to read about GFS,
BigTable and other technologies developed by Google to implement
large-scale clustering.

Counters are a good example to address when discussing write
throughput, because the datastore locking strategy makes writing to
global data very expensive.

Let us assume that we want to display on the main page of a wiki
application the total number of comments posted.

A global counter would serialize all its updates. If 100 users were to
add comments at the same time, some of them would have to wait several
seconds for their action to complete: one write for the comment, one
write for the counter, at most 100 writes per second for the counter
and a lot of time lost due to failed transactions that need to be restarted.

The solution to make the counter scale is to partition it among all
entity groups then sum these partial counters when the global value is
needed.

Since chances are low that a given user will write more than one
comment at a time, comment entities for a user can be grouped together
and a partial counter can be added to the same entity group. Creating
a new comment and increasing the partial counter will be done in the
same batch.

When a new request for the main page is received, the counter total is
looked up in the cache. If it is not found, all partial counters are
fetched and summed up, then the cache is refreshed with a short
timeout, for example one minute.
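Here is a sketch of this sharded counter pattern using the original google.appengine.ext.db and memcache APIs; the model names, the grouping of comments per user and the one-minute timeout are illustrative choices of mine.

from google.appengine.ext import db
from google.appengine.api import memcache

class PartialCounter(db.Model):
    """One counter per entity group, stored next to the comments it counts."""
    count = db.IntegerProperty(default=0)

class Comment(db.Model):
    text = db.TextProperty()

def add_comment(user_key, text):
    """Create the comment and bump the partial counter in one transaction,
    both inside the entity group rooted at user_key."""
    def txn():
        counter = PartialCounter.get_by_key_name('counter', parent=user_key)
        if counter is None:
            counter = PartialCounter(key_name='counter', parent=user_key)
        counter.count += 1
        Comment(parent=user_key, text=text).put()
        counter.put()
    db.run_in_transaction(txn)

def total_comments():
    """Serve the total from the cache, recomputing it at most once a minute."""
    total = memcache.get('comment_total')
    if total is None:
        total = sum(counter.count for counter in PartialCounter.all())
        memcache.set('comment_total', total, time=60)
    return total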

During the next minute, the counter will be "consistent", read: not too far off, and served extremely fast from the cache.

As a conclusion, the interface AppEngine exhibits today requires optimizing early, but I would bet that in the years to come, new
languages and domain-specific compilers or database engines will take
part of that burden off the hands of the developers.

Did not Yahoo and Google start developing Pig Latin and Sawzall to make it easier to write parallel data-processing programs? The same could happen here: describe a data model in a high-level language and get a tool that optimizes it for write contention and web-scale applications.

Several of us went to San Francisco last week to attend Google IO. As usual with conferences, meeting people was more interesting than listening to most talks. The AppEngine Fireside Chat was a Q&A session that lasted about an hour. Here is what I learned from this session and various chats with AppEngineers.

Google has decided to provide its scalable datastore architecture as a service. At this point, the datastore is the product and the goal is to make it as widely accessible as possible.

The google.appengine.api.datastore API alone would not have made for a very sexy launch. In order to attract more people and lower the bar the beginners would have to jump over, they looked for a higher level programming interface.

Since some people working at Google have been using Django and know it, they reimplemented part of its interface for defining data models. Late in the project, they added GQL because Django-like queries were a bit too difficult. In both cases, the goal was to make it easier for external developers to get started.

But Google is not in the business of providing web application frameworks and AppEngineers made explicit that they would not be officially supporting a specific framework or a specific version of a given framework (not even Django 0.96, although there is a django-appengine-helper project on code.google.com). They expect frameworks to be provided by communities of developers.

My conclusion is twofold:

They will be focusing on supporting other languages in AppEngine (I would bet on Java being the next one available) rather than extending Python frameworks support.

Anyone is free to join with his own framework and provide support for it, the One True Interface being the one defined by google.appengine.api.datastore, not the one defined by db.model and GQL.

This is why Logilab published its own framework running on App Engine as free software and is providing support for it: Logilab Appengine eXtension.