In my experience attempting to integrate microformats into XML structural summaries, every solution so far has been a workaround.

Microformats are integrated into an XHTML page through the ‘class’ attribute of an element. I won’t go into the issues with that approach; while the additional information embedded in the page is welcome, it doesn’t conform to the standardized integration model offered by XML. A good reference on integrating microformat information into a page and pulling it back out is here.

Microformats are not easily retrieved from a page because there is no way to know ahead of time which formats it contains. A workaround for creating an XML structural summary based on microformats is to extend the XML element model so that attributes, and further their values, are also indexed (making it possible to distinguish differing attributes). Since the structural summaries being developed using AxPREs are based on XPath expressions, they can handle microformats, but only with advance planning by the user.
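To make the XPath angle concrete: pulling hCalendar elements out of a page boils down to a query over the ‘class’ attribute. Below is a minimal sketch using the JDK’s built-in XPath support; the class and method names are mine, not DescribeX’s.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class MicroformatXPath {

    /** Count elements whose space-separated 'class' list contains the given token. */
    public static int countByClass(String xhtml, String token) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        // 'class' can hold several tokens, so pad with spaces before matching;
        // hCalendar events, for instance, are marked with class="vevent"
        String expr = "//*[contains(concat(' ', normalize-space(@class), ' '), ' " + token + " ')]";
        NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.NODESET);
        return hits.getLength();
    }
}
```

The catch, as noted above, is that you must already know which tokens (‘vevent’, ‘vcard’, …) to look for; nothing in the page advertises them.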

The screenshot below shows DescribeX with a P* summary of a collection of hCalendar files. Using Apache Lucene, the files are indexed to include regular text tokens, XML elements, XML attributes, and their associated values. On the right-hand side you can see a query that uses Lucene’s wildcard syntax, ‘*event*’, to search for ‘class’ attributes containing that term. The vertices in red represent the elements that contain it, and while it would be nice to assume that the descendants of the highlighted vertices are related to hCalendar events, that is not the case.
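The indexing idea can be sketched with a toy term dictionary: attributes and their values are stored as ordinary terms (using an invented ‘attr:name=value’ encoding here; DescribeX’s actual field layout differs), and a Lucene-style ‘*’ wildcard is expanded against the dictionary before the postings are unioned.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Pattern;

public class AttributeTermIndex {

    // term -> set of ids of documents containing it
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    public void add(int docId, String term) {
        postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
    }

    /** Expand a '*' wildcard against the term dictionary and union the postings.
        (Only '*' is translated; other regex metacharacters are left alone in this sketch.) */
    public Set<Integer> wildcard(String pattern) {
        Pattern re = Pattern.compile(pattern.replace("*", ".*"));
        Set<Integer> result = new TreeSet<>();
        for (Map.Entry<String, Set<Integer>> e : postings.entrySet()) {
            if (re.matcher(e.getKey()).matches()) {
                result.addAll(e.getValue());
            }
        }
        return result;
    }
}
```

A query like ‘attr:class=*event*’ then finds the documents whose ‘class’ values contain that substring, which is exactly why the match says nothing about whether the surrounding structure is an hCalendar event.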

I’m happy to say that two projects I work on, DescribeX (a team effort with Sadek Ali and Flavio Rizzolo) and VisTop-k, both supervised by Dr. Mariano Consens, will be demonstrated at IBM’s CASCON Technology Showcase on October 22–25, 2007. There were quite a few interesting projects last year and I’m looking forward to seeing what new ideas have arisen, especially since my Eclipse plug-in skills have improved tremendously. As a student I’m also looking forward to the food ;)

I’ve attached a (very, very) draft version of a guide to building a GEF-based graphical editor for Eclipse. It doesn’t have any pictures, but it is step by step and complete enough that I can recreate a plug-in from scratch without having to memorize the process. I wrote it because the initial learning curve for getting started with GEF was steeper than I anticipated. Feedback is always appreciated.

Sometimes it’s desirable to maintain library code separately from the plug-in being developed. Converting the library to a plug-in won’t change its structure, and minimal maintenance is required afterwards. To perform the conversion, follow these steps:

1. Right-click on the project representing the library, and in the context menu go to “PDE Tools” and select “Convert Project to Plug-in Projects…”

It’s that easy. The plug-in is automatically exported when running/debugging an Eclipse instance during plug-in development.

If your library itself relies on JAR files, those will have to be included in the manifest file. The manifest is also where you control which features are exposed to other plug-ins that depend on your library plug-in. In the manifest file:

– In the Runtime tab, select the packages you would like other plug-ins to have access to: preferably your library packages at a minimum, and optionally the external library packages that your library itself uses.

– In the Overview tab, you can rename your plug-in and set its version.

Once the plug-in is configured as desired, simply add it to the Dependencies tab of the other plug-in’s manifest file and you’ll have access to your library plug-in’s exported packages.
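For illustration, a minimal MANIFEST.MF for such a library plug-in might look like this (the bundle name, packages, and bundled JAR are invented for the example):

```text
Manifest-Version: 1.0
Bundle-ManifestVersion: 2
Bundle-Name: Example Library Plug-in
Bundle-SymbolicName: com.example.mylibrary
Bundle-Version: 1.0.0
Bundle-ClassPath: ., lib/external-dep.jar
Export-Package: com.example.mylibrary.core,
 com.example.mylibrary.util
```

A dependent plug-in would then list `com.example.mylibrary` under Require-Bundle in its own manifest.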

I did some more reading on GEF and have advanced considerably. I’ve added bookmarking of timeframes, and these are saved along with the search terms to the currently edited VisTopK file. I also added mouse-wheel zooming for when the document graph scrolls off the screen. The bookmarks are marked with a purple border similar to the screenshot in my previous post, but without the t# text. Maybe it should be there?

Although there was a long break between VisTopK-related posts, the project is now complete. A labelled screenshot is provided for your benefit, and the report will be uploaded soon as well. I’m very happy with the result and am excited by the possibilities the plug-in offers, as it allows integration with many other projects, existing and new.

I’ve created a screencast of VisTopK in action using the great application Wink, originally referenced from Greg’s blog entry, but WordPress doesn’t allow Flash (SWF) uploads. Anyone have any ideas on how/where I can post it? I tried converting it to an AVI to maybe post it to YouTube but the size of 1GB stopped that attempt cold.

Today I officially started coding for the project and was very productive! I needed to set up a set of inverted indexes over a collection of files. To build these indexes I used Apache Lucene, and was it ever easy. I had some difficulties initially because I was trying to use contributed modules (the in-memory indexer, to be exact; I figured it might come in handy at some point). I also, incorrectly, decided to use an embedded relational database to allow a more “natural” way to access the indexes. Based on the information found here, I gave HSQLDB a shot, and it was extremely easy to set up and use; in the end, though, I removed the relational database, used Lucene’s built-in query engine, and accessed the inverted indexes for the terms in the document collection directly.
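For anyone wondering what “inverted index” means here, the core structure is just a map from each term to the sorted list of documents containing it. A minimal sketch (nothing like Lucene’s actual on-disk format):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InvertedIndex {

    // term -> sorted list of ids of documents containing the term
    private final Map<String, List<Integer>> index = new HashMap<>();

    /** Tokenize very naively and record each term's posting. Documents must be
        added in increasing id order for the posting lists to stay sorted. */
    public void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            List<Integer> list = index.computeIfAbsent(token, t -> new ArrayList<>());
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
    }

    public List<Integer> postings(String term) {
        return index.getOrDefault(term.toLowerCase(), List.of());
    }
}
```

Lucene adds scoring, compression, and on-disk segment management on top of this idea, which is exactly the machinery a threshold algorithm wants to iterate over.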

Now it’s a matter of deciding whether to take TReX’s existing No-Random-Access (NRA) threshold algorithm code or roll my own. Reasons to stay away from the existing code: tight integration with TReX’s data structures, lack of parameters for its use, and a bad code smell. If I roll my own, the next question is whether to integrate it into Lucene or keep it as a simple external algorithm engine. An ambitious Lucene contrib vision, possibly? I’ve never contributed to open source, but I am truly inspired by the dedication required, as described in Karl Fogel’s (free) ebook Producing Open Source Software. Starting from scratch will also allow a nicer class hierarchy that can take advantage of some interesting concepts mentioned in the IO-Top-k paper, such as probabilistic inference or skew detection, to terminate the algorithm even sooner.
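If I do roll my own, the core of NRA is small enough to sketch in plain Java. Given per-term posting lists sorted by descending score, it does only sequential reads, maintains worst/best score bounds per candidate, and stops once the top-k’s worst scores can no longer be beaten. This is a simplified sketch (not TReX’s code) that assumes summation scoring and round-robin sorted access:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NraTopK {

    public static final class Posting {
        final int doc; final double score;
        public Posting(int doc, double score) { this.doc = doc; this.score = score; }
    }

    public static final class Candidate {
        public final int doc; public final double worst; public final double best;
        Candidate(int doc, double worst, double best) {
            this.doc = doc; this.worst = worst; this.best = best;
        }
    }

    /** lists[t] holds term t's postings sorted by descending score; a document's
        overall score is the sum of its per-term scores. */
    public static List<Candidate> topK(Posting[][] lists, int k) {
        int m = lists.length;
        double[] lastSeen = new double[m];                // score at the current depth of each list
        Map<Integer, double[]> partial = new HashMap<>(); // doc -> per-term scores (NaN = unseen)
        int maxDepth = 0;
        for (Posting[] l : lists) maxDepth = Math.max(maxDepth, l.length);

        List<Candidate> ranked = new ArrayList<>();
        for (int depth = 0; depth < maxDepth; depth++) {
            // one round-robin step of sequential (sorted) accesses; no random access
            for (int t = 0; t < m; t++) {
                if (depth >= lists[t].length) { lastSeen[t] = 0; continue; }
                Posting p = lists[t][depth];
                lastSeen[t] = p.score;
                double[] scores = partial.computeIfAbsent(p.doc, d -> {
                    double[] a = new double[m];
                    Arrays.fill(a, Double.NaN);
                    return a;
                });
                scores[t] = p.score;
            }

            // worst bound sums seen scores; best bound fills unseen terms with lastSeen
            ranked = new ArrayList<>();
            for (Map.Entry<Integer, double[]> e : partial.entrySet()) {
                double worst = 0, best = 0;
                double[] scores = e.getValue();
                for (int t = 0; t < m; t++) {
                    if (Double.isNaN(scores[t])) best += lastSeen[t];
                    else { worst += scores[t]; best += scores[t]; }
                }
                ranked.add(new Candidate(e.getKey(), worst, best));
            }
            ranked.sort((a, b) -> Double.compare(b.worst, a.worst));

            if (ranked.size() >= k) {
                double kthWorst = ranked.get(k - 1).worst;
                double unseenBound = 0;                   // best a wholly unseen doc could score
                for (double s : lastSeen) unseenBound += s;
                double otherBest = 0;                     // best a non-top-k candidate could score
                for (int i = k; i < ranked.size(); i++) {
                    otherBest = Math.max(otherBest, ranked.get(i).best);
                }
                if (kthWorst >= unseenBound && kthWorst >= otherBest) {
                    return ranked.subList(0, k);          // early termination: top-k cannot change
                }
            }
        }
        return ranked.subList(0, Math.min(k, ranked.size())); // lists exhausted
    }
}
```

The probabilistic-inference and skew-detection ideas from IO-Top-k would slot into the stopping test, trading the guaranteed bound for an earlier (approximate) exit.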

Once that’s done I’ll have a prototype for XML document collection indexing and retrieval using a threshold algorithm for top-k query processing.
Some related tools that seem interesting: Luke, a Lucene index-modification/viewing tool, very nice looking and feature-filled; and Lius, which I haven’t tried but which seems to do the same thing as Luke.

This project is for course credit and is supervised by Dr. Consens. I will provide a visualization of top-k results using static and dynamic views of Threshold Algorithm (TA) traces. This involves the use of inverted indexes built from an XML document collection, and will explore the effect of various factors such as compression, encoding, and retrieval/indexing algorithms. The deliverable will be an Eclipse plug-in, chosen for Eclipse’s extensibility and ease of integration with various components. My current thinking is a main view containing a stock-chart-style display plus a properties view. The visualization will be a front end to the XML-format trace data collected during a separate TA run; static and dynamic views will cover snapshots and temporal behaviour, respectively. The visualization will be based on the depiction in: IO-Top-k: Index-access Optimized Top-k Query Processing. Holger Bast, Debapriyo Majumdar, Ralf Schenkel, Martin Theobald, Gerhard Weikum.
It’s an interesting paper that provides a good grounding in TAs; the algorithms themselves are relatively easy to understand, but they require picking up the terminology of the field. The paper discusses the relationships, advantages, and differences among several TA variants, including No Random Access (NRA), Random Access (RA), and a combination of the two.