Thursday, 30 June 2011

During the Expert Workshop last week in Copenhagen, we had a session talking about configuring IPT to reflect different organisational structures. I think it's worth to explain about that part as a blog post here, since some of our readers would like to help deploy IPT in the GBIF Network. It's usually started by questions like this:

Why am I asked for a password of the organisation that I choose to register IPT? Why am I asked again when I want to add an additional organisation?

The short answer is, by having the password of the organisation, that means you have got the permission from that organisation and the organisation is aware of the fact that you're registering an IPT against it.

So, why is this the way of registering an organisation?

The organisational structure

Remember the GBIF Network is not only a common pool of sharing biodiversity data, to form such a pool, it's also the social network in which biodiversity data publishers interact. IPT, serves as the technical skeleton of the network, needs to tie to the organisational structure in order to properly accredit institutions or individuals, by helping to reflect those relationships among the organisations, hosted resources, IPTs and endorsing GBIF nodes/participants. The relationship can be seen on the GBIF Registry.

Take VertNet Hosting as an example. VertNet Hosting is an IPT installation that hosts data resources authored by those users in the IPT. In the second half of its page on the Registry, you see:

Which means, this IPT has a public resource called uafmc_fish, which has been registered to the GBIF Network, and this IPT is registered against University of Kansas Biodiversity Research Center, which is the hosting organisation of IPT.

That means the uafmc_fish resource has been registered against University of Arkansas Collections Facility, UAFMC, which should have been added to the VertNet hosting IPT to be available for users to choose from, other than University of Kansas Biodiversity Research Center.

So the administrator of the VertNet Hosting IPT, in this case Ms. Laura Russell, had been asked twice the different passwords she needed to register her IPT and add another organisation.

The activities can be summarised as this graphic:

Which says, with Vertnet Hosting (IPT) registered against University of Kansas Biodiversity Research Center, uafmc_fish can be hosted by the IPT but tied to University of Arkansas Collections Facility, UAFMC. This means an organisation doesn't necessarily need to have the capacity to install IPT in order to host published resources. Any other IPT, with the agreement of hosting organisations, can host resources of others. This is why here the password is required.

The relationships of these units then are reflected on the Registry as:

By this design the relationship and accreditation of the GBIF Network is maintained.

That means, these 2 organisations are endorsed by USA, which is a member of GBIF, and since they are endorsed, they are therefore available in the organisation drop-down list of IPT.

The endorsement process

Chances are, you're looking for your organisation in that list and you are pretty sure it's not there. What should you do?

Either you're registering IPT, or one of your user is requesting an organisation that is not available yet, you should try talk to administration level people and seek to get endorsed by a GBIF member. Normally, ➊ you or the representative of your institution write to helpdesk@gbif.org, provide some background information and at least a technical contact, ➋ the helpdesk will look for appropriate node for you to get endorsed. Upon ➌ positive feedbacks from an endorsing node, ➍ the helpdesk will inform you the availability of your organisation and the password.

This process runs administratively because we rely on the social level to ensure the responsibility for the registered IPT and published resources. This also makes the accreditation goes to correct persons or organisations.

Friday, 24 June 2011

This
post should be read in the line of Tim’s post about Decoupling Components,
as it takes for granted some information written there.

During
the last week, I’ve been learning/working with some
technologies that are related to the decoupling of components we want to
accomplish. Specifically, I’ve
been working with the Synchronizer component of the event driven architecture Tim described.

Right now, the synchronizer takes the responses from the
resources and gets those responses into the occurrence store (MySQL as of today, but not final). But it has
more to it: The responses from the resources come typically from DiGIR, TAPIR
and BioCASe providers which render their responses into XML format. So how does
all this data ends up in the occurrence store? Well, fortunately my colleague Oliver Meyn
wrote a very useful library to unmarshall all these XML chunks into nice and simple objects, so on my side
I just have to worry about calling all those getter methods. Also, the synchronizer
acts as a listener to a message queue , queue that will store all the resource responses
that need to be handled. All the queue’s nuts & bolts were worked out by Tim
and Federico Méndez. So yes, it has been a nice collaboration from many
developers inside the Secretariat and it’s always nice to have this kind of head
start from your colleagues :)

So,
getting back to my duties, I have to take all these objects and start
populating the occurrence target store taking some precautions (e.g.: not
inserting duplicated occurrence records, checking that some mandatory fields
are not null and other requirements).

For
now, it’s in development mode, but I have managed to make some tests and
extract some metrics that show current performance and definitely leaves room for improvement. For the tests, first the message queue is loaded with some responses that need to be attended
and afterwards I execute the synchronizer which starts populating the occurrence
store. All these tests are done on my MacBook Pro, so definetely response times
will improve on a better box. So here are the metrics:

Environment:

MacBook
Pro 2.4 GHz Core2 Duo (4GB Memory)

Mac OS
X 10.5.8 (Leopard)

Message
Queue & MySQL DB reside on different machines, but same intranet.

Threads: synchronizer spawns 5 threads to attend the queue elements.

Message
queue: loaded with 552 responses (some responses are just empty to emulate a real world scenario).

Records in total: 70,326
occurrence records in total in all responses

Results Test 1 (without filtering out records):

Extracting
responses from queue

Unmarshalling

Inserting into a MySQL DB

202022
milliseconds (3 min, 22 secs)

Results Test 2 (filtering out records):

Extracting
from queue

Unmarshalling

Filtering out records (duplicates, mandatory
fields, etc)

Inserting into MySQL DB

over 30 minutes... (big FAIL)

So, as you see there is MUCH room for improvement. As I have just joined this project in particular, I need to start the long and tedious road of debugging why the huge difference, obviously the filtering out process needs huge improvement. Obvious solutions come to mind: increasing threads, improve memory consumption and other not so obvious solutions. I will try to keep you readers posted about this, and hopefully some more inspiring metrics, and for sure in a better box.

Monday, 20 June 2011

This is the third (and final) post related to the GBIF Metacatalogue Project. The first 2 were dedicated to explain how the data is harvested and how that information is stored in Apache Solr. Those post can be consulted in:

One the nicest features of Solr is that most of its functionalities are exposed via Rest API. This API can be used for different operations like: delete documents, post new documents and more important to query the index. In cases when the index is self-contained (i.e: doesn't depend of external services or storages to return valuable information) a very thin application client without any mediator is viable option. In general terms, "mediator" is a layer that handles the communication between the user interface and Solr, in some cases (possibly) that layer manipulates the information before send it to user interface. Metadata Web application is a perfect example of the scenario just described: it's basically an independent storage of documents that can be used to provide free and structured search. All the information is collected from several sources and then is store in the Solr index, even the full XML documents are stored in the Solr Index.
AJAX Solr is a Javascript framework that facilitates querying a Solr server and the display of results in a Web Application; to implement the remote communication only requires a way of sending requests to Solr, in our case, we used JQuery. In AJAX Solr the information is displayed to the end-user by widgets like: facets widgets, list of results, cloud widgets, etc.
Widgets

Widgets

In this context widgets are user interface componenents whose functionality doesn't depend on other widgets, each one has an specific responsabilty.
All the communication (between Solr and the UI) is handled by Manager whose main responsability is send the requests and communicates the responses to the widgets. The image below shows some widgets examples:
From the implementation point of view the code below shows how the manager is created and the way of attach widgets to it:

Other libraries

Some other components were used in order to provide a better user experience, those are:

JQuery/JQuery UI (http://jquery.com/): AJAX Solr requires a library to implement the AJAX requests. JQuery was chosen for this purpose. Additionally, several JQueryUI widgets are extensively used for a richer user experience.

SyntaxHighlighter(http://alexgorbatchev.com/SyntaxHighlighter/): this is a code syntax highlighter developed in JavaScript, This component is used for displaying the XML view of a metadata document.

The prototype application is available here; this application is 99.9% free of server-side code, there's only one line of code with server dependency, and is for indicate the Solr server url:

Friday, 17 June 2011

This week we hit 300 million indexed occurrence records. As you can see in the picture we have got a monitor set up that shows us our current record count. It started as an idea a few weeks ago but while at the Berlin Buzzwords conference (we were at about 298 million then) I decided it was time to do something about it.

I've been playing around with Scala a bit in the last few months so this was a good opportunity to try Lift, a web framework written in Scala. In the end it turns out that very little code was needed to create an auto-updating counter. There are three components:

We've got a DBUpdater object that uses Lift's Schedule (used to be called ActorPing which caused some confusion for me) to update its internal count of raw occurrence records every ten seconds. The beauty is that there is just one instance of this no matter how many clients are looking at the webpage.

The second part is a class that acts as a Comet adaptor called RawOccurrenceRecordCount which waits for updates from the DBUpdater and passes these on to the clients.

The last part is the Bootstrap code that schedules the first update of the DBUpdater and sets up the database connection and other stuff.

To get to this point, though, took quite some time as I have to say that the documentation for Lift is very lacking especially in explaining the basic concepts (I've read Simply Lift, bits and pieces in the Wiki and am halfway through Exploring Lift) for beginners like me. I'm really looking forward to Lift in Action and really hope it serves as a better introduction than the currently available documentation.

That said I liked the end product very much and I hope to be able to extend the work a bit more to incorporate more stats for our wallboard display but so far I haven't managed to call JavaScript functions from my Comet Actor. That's next on my list. Ideas for a wallboard are piling up and I hope to be able to continue doing it in Lift and Scala.

Wednesday, 15 June 2011

Over the last few years a number of new technologies have emerged (inspired largely by Google) to help wrangle Big Data. Things like Hadoop, HBase, Hive, Lucene, Solr and a host of others are becoming the "buzzwords" for handling the type of data that we at the secretariat are working with. As a number of our previous posts here have shown, the GBIF dev team is wholeheartedly embracing these new technologies, and we recently went to the Berlin Buzzwords conference (as a group) to get a sense of how the broader community is using these tools.

My particular interest is in HBase, which is a style of database that can handle "millions of columns and billions of rows". Since we're optimistic about the continued growth of the number of occurrence records indexed by GBIF, it's not unreasonable to think about 1 billion (10^9) indexed records within the medium-term, and while our current MySQL solution has held up reasonably well so far (now closing in on 300 million indexed records) it certainly won't handle an ever-growing future.

I'm now in the process of evaluating HBase's ability to respond to the kinds of queries we need to support, particularly downloads of large datasets corresponding to queries in the data portal. As in most databases, schema design is quite important in HBase, as is the selection of a "primary key" format for each table. A number of the talks at Berlin Buzzwords addressed these issues and I was very happy to hear from some of the core contributers to HBase and their conclusion that figuring out the right setup for any particular problem is far from trivial. Notable among the presenters were Jean-Daniel Cryans from StumbleUpon (a fellow Canadian, woot!) and Jonathan Gray from Facebook (with luck their slides will be up at the Buzzwords slides page soon). Jonathan's presentation especially gives a sense of what HBase is capable of given the truly huge amount of data Facebook drives through it (all of Facebook's messaging is held in HBase).

Apart from learning a number of new techniques and approaches to developing with HBase, more than anything I'm excited to dive into the details knowing such a strong and supportive community is out there to help me when I get stuck. You can follow along my testing and deliberations on the wiki page for our occurrence record project.

Avro is surprisingly difficult to get into, as it is lacking the most basic "getting started" documentation for a new-comer to the project. This post serves as a reminder to myself of what I did, and hopefully to help others get the hello world working quickly. If people find it useful, let's fill it out and submit it to the Avro wiki!

I am evaluating Avro to provide the high performance RPC chatter for lookup services while we process the content for the portal. I'll blog later about the performance compared to the Jersey REST implementation currently running.

Friday, 3 June 2011

I wanted to write about a MySQL performance optimization using partitioning as I recently applied it to the Harvesting and Indexing Toolkit’s (HIT) log table. The log table was already using a composite index (indexes on multiple columns), but as this table grew bigger and bigger (>50 million records) queries were being answered at a turtle’s pace.
To set things up, imagine that in the HIT application there is a log page that allows the user to tail the latest log messages in almost real time. Behind the scenes, the application is querying the log table every few seconds for the most recent logs, and the effect is a running view of the logs. The tail query used looks like this:

In effect this query asks: “give me the latest logs for datasource with id X having having at least a certain log level”.
Partitioning basically divides a table into different portions that are stored and can be queried separately. The benefit is that if a query only has to hit a small portion instead of the whole table, it can be answered faster.
There are different ways that you can partition tables in MySQL, and you can read about them all in the MySQL reference manual. I first experimented using Key partitioning using the table ID. Unfortunately, because different logs for a datasource could be spread across different partitions, the tail query would have to hit all partitions. To check how many partitions the query hits, I used the following query:

mysql> explain partitions select * from … ;

This resulted in an even slower response than without partitioning, so Tim thought about it from a different angle. He discovered a nice solution using Range partitioning by datasource ID instead. This way the table would get divided into ranges of datasources that are contiguous but not overlapping. A range size of 1000 was used, so the 1st partition would contain all logs for datasources with IDs between 0 – 999, the 2nd partition would contain all logs for datasources with IDs between 1000 – 1999 and so on. Part of the command used to apply Tim’s partitioning strategy (having 36 partitions) is displayed below:

Checking how many partitions the tail query would hit, I confirmed that it only ever uses a single partition. The result was impressive, and initial tests resulted in a speed-up of over 9000 times!
Important to note is that the primary key must include all fields in the partition. Therefore because we were partitioning using the datasource id, this field had to be included in the primary key before partitioning would work. Also, an index on the id was also added to further optimize the query - why not right?
The speed-up might be dramatic now, but as more log messages get written to a partition and it starts to swell, I envisage having to either delete old logs or repartition the table again using smaller range sizes in order to sustain good performance. There is a trade-off between the number of partitions and performance, so some tweaking is needed in every case I guess. Lastly, I’ll reiterate that improper partitioning can actually make things worse. Perhaps it could work for you too, but please apply with caution.

Wednesday, 1 June 2011

When updating a postgres table you sometimes want the update to happen in a specific order. For example I found myself in a situation when I wanted to assign new sequential ids to records in the alphabetical order given by a text string column.

With postgres 8.4 the solution using an updateable, ordered view didn't work (anymore?). After experimenting a little I found that clustering a table according to the desired order is a simple solution that works exactly as hoped for. Clustering changes the actual order of the table data instead of only adding a new index. And apparently postgres uses this native order for updates.