Tuesday, July 27, 2010

While it is a great positive change that data is being released through numerous efforts around the world, data release is not the same as Open Data release. A number of Canadian cities have announced Open Data initiatives, but they are not releasing Open Data. They are just releasing data. Of course, this is better than not releasing data. But let's at least be honest about what we are doing.

Why aren't they Open Data? Because their licenses are not Open Data licenses:

Not Open Data: Edmonton: "The City may, in its sole discretion, cancel or suspend your access to the datasets without notice and for any reason..." - from Terms of Use

Not Open Data: Vancouver: "The City may, in its sole discretion, cancel or suspend your access to the datasets without notice and for any reason..." - Terms of Use

Not Open Data: Ottawa: "The City may, in its sole discretion, cancel or suspend your access to the datasets without notice and for any reason..." - from Terms of Use

Not Open Data: Toronto: "The City may, in its sole discretion, cancel or suspend your access to the datasets without notice and for any reason..." - from Terms of Use

All of these licenses also suffer from the additional mis-feature of arbitrary retroactivity:

"The City may at any time and from time to time add, delete, or change the datasets or these Terms of Use. Notice of changes may be posted on the home page for these datasets or this page. Any change is effective immediately upon posting, unless otherwise stated"

These two clauses mean that there is no stability for someone using this data. If the city whose data they are using dislikes something they do or say (data-related or not), they can lose access. Or, if the city finds that many data users are doing things it does not like, it can change the terms of use to affect data previously obtained by users.

How to fix: obligatory versioning of both datasets and licenses, and removal of the above two clauses. When a dataset is released, it is given a version, and that release is matched to a license version (usually the most recent), which will always apply to that version of that data release. Any change to a license generates a new license version, applicable only to subsequent releases that choose to use the new license.
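The scheme above can be sketched in a few lines of code. This is a minimal, hypothetical illustration (the class and dataset names are mine, not from any real city's data portal): each release is pinned to an immutable license version, so publishing new terms never changes the terms attached to data already released.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LicenseVersion:
    version: str   # e.g. "1.0"
    text: str      # full license text; immutable once published

@dataclass(frozen=True)
class DatasetRelease:
    dataset: str              # e.g. "transit-stops"
    release: str              # e.g. "2010-07-27"
    license: LicenseVersion   # pinned forever to this release

class Catalog:
    """Append-only catalog: releases are never modified, only added."""
    def __init__(self):
        self.releases = []

    def publish(self, dataset, release, license):
        rel = DatasetRelease(dataset, release, license)
        self.releases.append(rel)
        return rel

# A license change produces a *new* LicenseVersion; earlier releases
# keep the terms they were published under.
v1 = LicenseVersion("1.0", "You may use, share, and adapt this data...")
catalog = Catalog()
old = catalog.publish("transit-stops", "2010-07-27", v1)

v2 = LicenseVersion("2.0", "New, more restrictive terms...")
new = catalog.publish("transit-stops", "2010-08-15", v2)

assert old.license.version == "1.0"  # unaffected by the license change
assert new.license.version == "2.0"
```

The `frozen=True` dataclasses make the point structurally: once a release exists, neither its data pointer nor its license can be swapped out from under a user.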

This is how things work in the Open Source world. It means that if you possess a piece of Open Source software under a specific license version, someone halfway across the world cannot turn you into a criminal and/or shut you down by retroactively changing the license. Of course, you may be shut out of the next version if they change its license, but that doesn't necessarily shut you down today. You have some level of stability.

An example: an SME builds a business based on data released by the cities. This business perhaps includes data mining tools that reveal some things that some of the cities do not want revealed or discussed. The cities change the license (remember: "...cancel or suspend ...without notice and for any reason...") or simply cancel or suspend the company's data access to shut it out, and the company goes out of business.

-----

So, if you want to release Open Source code or Open Data, you must be willing to accept that it will be used in ways that you (and/or your constituents) may find offensive. That is how it works.

where I've been working on large-scale journal visualization, a continuation of the Torngat project. I've been working on a couple of things, including applying Mulan to the multi-label problem of the corpus I am working with, so I can get precision and recall to evaluate this method empirically. My productivity has been hampered by a recurring stomach problem (which appears to be gone this last week: yay!), so I've not progressed as much as I would have wanted to... :-(

At the end of last week I gave a presentation at CSIRO (in the same building) on this work, entitled "Search refinement: visualizing research journals in semantic space". After this talk I had a discussion with Alex Krumpholz and Hanna Suominen, and it is looking like we will be working together on a project involving Torngat.

I've also enjoyed the company of John Maindonald, one of the important players in the R universe (I went on a very nice Sunday walk with him, his wife, and some of their friends). He's arranged an invite for me to talk tomorrow to the Canberra R Users Group about how I've used R in the Torngat project.

I also enjoyed an afternoon this past week meeting with the Australian National Data Service (ANDS) people, arranged by the wonderful Monica Omodei (formerly Berko), learning about their success in putting together ANDS and where they were going. They were also interested in Torngat, so I gave them a brief presentation on it.

A bit of a surprise collaboration: I have committed myself to helping improve the single-threaded Lucene indexing benchmark in the DaCapo Java benchmarks, after discussions with ANU's Steve Blackburn, a Java VM and GC guru. I've also committed to implementing a new multi-threaded indexing benchmark. Most of the code will be derived from existing code from my LuSql tool (actually from the as-yet-unreleased LuSql v1.0 codebase).

While it has been winter (spring/fall by Canadian standards...) here in Canberra, I have still been amazed at the fantastic birds that are (still) here. Like nothing we have at home are the loud and raucous-yet-endearing sulphur-crested cockatoos:

I think this is Mount Stromlo (the low hill/mountain to the right, with the mountains of the Brindabella Range (I think) in the background), seen from Black Mountain. You can see some of the Mount Stromlo Observatory as white dots on the crest of Mt. Stromlo. The observatory and the forest that was on Mt. Stromlo were mostly destroyed in the 2003 Canberra bushfires. When I was up at the observatory earlier in the month there were many burnt-out tree stumps to be seen. And 'roos. :-)

Thursday, July 08, 2010

The recent conference proceedings E-Government ICT Professionalism and Competences Service Science (IFIP International Federation for Information Processing, IFIP 20th World Computer Congress, Industry-Oriented Conferences, September 7-10, 2008, Milano, Italy) are of interest to those involved with ICT in government and eGovernment in general (although the conference is rather EU-centred). Note that the content is behind a paywall, so you can't read the articles unless you belong to an institution that has a subscription or you have one yourself.

"The selected software to develop the dashboard has been Pentaho suite, which is an open source application. It better fits all the project needs that can be summarized by the following drivers:
• Low license costs. It has no license costs.
• Low impact on current systems architecture. It does not need a complex integration with source systems.
• Availability of "off the shelf" features (reporting and KPIs analysis). It has rich libraries of graphical objects and reports to better show indicators.
• Short Time to Delivery. The Dashboard has been delivered in three months including a tuning phase in which some new features had been added."