Just thought I’d post an update on our work since we have given the ok for our library administration to start internally testing the tool.

API Development:

Let's see, new enhancements to the API/backend server:

New session manager. This allows the software to capture and cache all user data over the life of a session, which is generically useful in that it lets users move quickly back and forth between search results (once a query has been run, it is never rerun for the life of your session). More interestingly, it forms the foundation for eventually letting the software provide users with saved session results, saved queries, saved search histories, saved citation lists, etc. We won't implement this fully in the first release (too bad; Jeremy has been cracking the whip lately to prevent feature creep, something I'm notorious for since some of my best ideas come to me after I've had a chance to visualize them within a realized application), but I'll push to have it included in our next development cycle.
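To make the caching idea concrete, here's a minimal sketch of how a session manager like this might cache query results (the class and method names are my own illustration, not the actual implementation): results are keyed by session and normalized query, so a repeated search within a session is served from the cache instead of being rerun.

```python
# Hypothetical sketch of a per-session query cache.
class SessionManager:
    def __init__(self):
        self._sessions = {}  # session_id -> {normalized_query: results}

    def _key(self, query):
        # normalize whitespace/case so trivially different queries hit
        # the same cache entry
        return " ".join(query.lower().split())

    def get_results(self, session_id, query, run_search):
        cache = self._sessions.setdefault(session_id, {})
        key = self._key(query)
        if key not in cache:
            cache[key] = run_search(query)  # only runs once per session
        return cache[key]

    def end_session(self, session_id):
        # saved sessions/queries/histories would hook in here before discard
        self._sessions.pop(session_id, None)
```

The same cache then becomes the natural place to persist a session's queries and results once saved searches are implemented.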

Citation services added. These allow users to email themselves citations. At this point, citations are emailed simply as HTML to an email client, but we'll look to add more robust citation services in the future so that citations can be emailed in a user-defined citation format.

I've added LDAP authentication, though it's not being implemented at this point.

Our custom OpenURL implementation has been tweaked to allow direct resolution of resources during the federated search process. This way, if you see a link on the results screen, you can feel confident that it's going to get you to the resource.

New enhancements to the UI services:

I'm sure Jeremy is continuously thankful that I'm not responsible for the graphics or look of this service. The other day, we were discussing icons/names for the tool. I suggested, somewhat tongue in cheek, "Ask Benny" and provided this icon:
He was remarkably quiet concerning this suggestion — I think he’s hoping it would all just go away.

So, actual improvements. First, we've done some work splitting results into more logical groups: a general search that queries our most widely used resources, an image search that will eventually query close to 15 million harvested items, and a "books and more" search that is limited to our ILS.

The results themselves are very clean. I think our two UI folks (Tami Herlocker and Seikyung Jung) have done a fabulous job of capturing our vision for a clean, minimalist UI. If you look at the example (below), you can see that they've set up filters, pagination (at the bottom of the page), and the list of databases that have returned results, in addition to the ability to go directly to a target's website and filter all the results by a particular target, material type, etc. Results are deduped by default by title and date. The beauty of this UI is that it's pretty much all XSLT-generated. Our UI folks and I have collaboratively designed the XSLTs (my role has been mostly to provide examples of how to interact with the results API and set up things like the deduping/pagination; I claim no credit for the look and feel of the UI, that's all them), setting up what is essentially a middleware layer between the user and the API's XML. I love it.

Anyway, lots of cool stuff going on with this project. I imagine the UI will continue to change in the next few days/weeks before we look to cut this over to our production environment, but at this point I'm pretty happy with the results. And hopefully our patrons will be as well. Given that our current vendor-supplied metasearch tool has set such a low bar, I don't think we have much to worry about. In fact, I'm more concerned about too much success. Given federated searching's fairly large footprint, I'm thinking we will have to carefully monitor resource usage for a while to make sure our current systems can handle the expected load.

I remember it wasn't too long ago that I could work close to 72 hours straight without crashing. Well, this weekend, I worked close to 34 hours from Friday to Saturday around 5 pm, and then I actually had to take a nap. A nap. Who takes naps? Of course, after my 1 1/2 hour power nap, I was ready to go again, but I'll admit, I didn't recover nearly as fast as I normally do. How sad.

Just a few random thoughts about Google Chat. Since I use my Gmail account for quite a bit of communication, I find that I'm nearly always logged in. Because of that, I use Google's Chat a lot. In other chat clients, I tend to never log my conversations, but just today I was wishing I had, and found that Google does. I'm not sure what I think about that, personally. On the one hand, it was really handy having an archive of this conversation; on the other, there are many potential areas for abuse. I realize that you can simply delete conversations as you would any email, but I'm not sure I like the direction Google has been going lately. In the past, Google took a very hands-off approach to software and feature development, essentially forcing users to opt in when giving up information. However, that seems to have changed to an opt-out model. It's a small change, but a big one.

I'm starting to work on a new webservice to run on one of our Unix boxes (thank you, MONO). Essentially, I'm looking to create a webservice that will be accessible via a WSDL file. Basically, this will let anyone pipe files to this webservice and translate metadata into or out of any metadata framework currently defined within MarcEdit (which is a few). It's not ready at this point, but it will be soon. When it's ready, I'll let folks know so that they can start testing it out.

Whew! It's been a while since my last post. Busy, busy, busy. Well, I've got a few posts I'd like to make tonight, so let's start with OSU's metasearch development. I've been busy: a new API for doing inline filtering, adding new metadata formats, and building sample XSLT transform types for our UI folks so that they can get to work on making it “pretty”.

So what am I still looking to do? Well, to start: harvest. Harvesting images first. I've identified some 15 million CONTENTdm images that we will be harvesting into our site. This should give our patrons access to one of the largest academic archive image repositories around. Second: faster. I'm going to start experimenting with the YAZ proxy to see if I can get some of these poky databases to respond a little faster. When working with just EBSCO, our catalog, and our harvested content, queries average about 6 seconds for response and rendering. When I add CSA into the mix, processing time jumps to about 12 seconds. Maybe the proxy can help (I hope).

So, a screenshot: here is the latest from our UI group. I think it's starting to look pretty good.

On the todo list:

Citations — building email citations for all resource types.

Saving/Exporting search results. This will allow users to save their search results to disk or online and just open these files any time that they like.

Caching queries. This is being done already to some degree — but this will be enhanced.

So let's talk about some of the questions I've received:

How’s OpenURL being utilized?
Well, this is actually the fun part. Since I've written my own OpenURL resolver, it's given us a lot of flexibility in terms of what gets resolved. For one, I'm actually resolving all resources. So, when a query is done, the OpenURL resolver checks to see if we have holdings. If we do, the resolver quickly resolves the entries to see if the user can actually get to the resource. If they can, the direct URL is offered to the UI. If not, the resolver quickly inserts a link directly to the journal instead. A link to the OpenURL resolver is also displayed to the user via the UI, so users can see if we have the resource in our catalog or ILL the item directly through the system. I guess the gist is, our OpenURL resolver is actually running the federated search. In most cases, items we query do not return URLs to the item; they return metadata. The OpenURL resolver solves a number of problems relating to resolving resources, as well as providing resolution capabilities for DOIs, PMIDs, LCCNs, OCLC numbers, etc.
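The decision flow described above can be sketched roughly as follows. This is an illustration of the logic, not the resolver's actual API; the function names, the use of ISSN as the holdings key, and the fallback URL shape are all assumptions for the example.

```python
# Rough sketch of the per-result resolution flow: check holdings, try a
# direct link, fall back to a journal link, or hand off to the resolver.
def resolve_result(citation, holdings, resolve_direct, journal_url):
    """Return (link, kind) for one federated-search result."""
    if citation["issn"] in holdings:
        direct = resolve_direct(citation)   # may fail and return None
        if direct:
            return direct, "direct"         # user can reach the item
        return journal_url(citation["issn"]), "journal"
    # no holdings: show the OpenURL resolver link (catalog check / ILL)
    return "openurl?issn=" + citation["issn"], "resolver"
```

The key point the post makes is that the same resolver handles DOIs, PMIDs, LCCNs, and OCLC numbers; in a sketch like this, those would just be additional keys checked before the ISSN.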

How do you plan on dealing with the knowledge-base?
This is a harder question. The program is metadata-based, which definitely saves me from having to code connectors, but as Roy Tennant has pointed out to me, you still have to maintain the metadata for the connectors, and if you have to deal with thousands of items, this could be tedious. One thing that I'm doing to try to make this easier is making the knowledge-base a community commodity. I've done this with the OpenURL implementation that I've created (I've been sharing it with some folks in Oregon that need an OpenURL resolver) and will do this with the metasearch tool as well. This way, the process becomes a community effort.

Another element that I've added to the knowledge-base management is the ability to create virtual collections. This will allow a user to create sets that can then inherit the properties of the virtual collection. An example of where this might be used: we subscribe to EBSCOhost, roughly 12 databases, each with its own connection profile. However, the profile is hardly unique; the only change from one profile to the next is essentially the database being connected to. By creating a virtual collection, I can propagate common properties to all items that belong to that collection. This way, if EBSCOhost changes its connection profile, I only have to modify the virtual collection and all 12 databases change as well. Pretty cool.
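The inheritance idea is simple enough to sketch. The field names and values below are made up for illustration; the point is just that a database's effective profile is its own overrides layered over its virtual collection's defaults, so editing the collection once re-propagates to every member.

```python
# Sketch of virtual-collection property inheritance in the knowledge-base.
def effective_profile(collection_defaults, database_overrides):
    profile = dict(collection_defaults)   # shared settings from the collection
    profile.update(database_overrides)    # per-database bits, e.g. the db id
    return profile

# Hypothetical EBSCOhost virtual collection and two member databases:
ebsco = {"host": "search.ebscohost.com", "syntax": "z39.50", "port": 210}
academic = effective_profile(ebsco, {"database": "aph"})
business = effective_profile(ebsco, {"database": "bth"})
```

If EBSCOhost changed its host or port, updating the `ebsco` defaults once would change all member databases the next time their profiles are computed.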

Filtering, FRBR, etc?
Filtering actually turned out to be easy: one API call and a specialized stylesheet, and filtering was a snap. FRBR will be coming. In many cases, resources provide subjects, authors, etc. We'll initially be setting up FRBR elements that show groupings of subjects and authors. We'll have to see what else we do from there.

What to do about databases that cannot be searched?
Well, I have a couple of ideas. We have a handful of very rich databases that store our database resources. I'm planning to add some API that will take the search criteria, break it down, and then find the databases within these resources that best match the search criteria.

Added a status window and the ability to cancel processing in the Batch Processing utility. This allows you to see how many files are being processed, as well as cancel the processing if you get tired of waiting (it will cancel after the in-process file has completed).

In adding the status window, I had to expose some new events in the MarcEngine. I'll rebuild the docs shortly to include those.

MARC21XML=>MADS stylesheet changes. I've included the MADS stylesheet from LC for folks wanting to use it; however, there are some problems that I've noted and sent in to Jackie R. at LC. The problems that I've found:

Slow. This stylesheet is just slow. Let's look at some benchmarking. Working with 8000 records: MARC=>MARCXML (4 seconds), MARC=>MODS (50 seconds), MARC=>DC (34 seconds) and finally, MARC=>MADS (520 seconds). I have no idea why this stylesheet is so slow, but it runs like it's stuck in mud.

Repeating elements. For example, the topic in the related element actually repeats itself. Here’s the original code:

<xsl:template name="topic">
  <topic>
    <xsl:if test="@tag=550 or @tag=750">
      <xsl:call-template name="subfieldSelect">
        <xsl:with-param name="codes">ab</xsl:with-param>
      </xsl:call-template>
    </xsl:if>
    <xsl:call-template name="setAuthority"/>
    <xsl:call-template name="chopPunctuation">
      <xsl:with-param name="chopString">
        <xsl:choose>
          <xsl:when test="@tag=180 or @tag=480 or @tag=580 or @tag=780">
            <xsl:apply-templates select="marc:subfield[@code='x']"/>
          </xsl:when>
          <xsl:otherwise>
            <xsl:call-template name="subfieldSelect">
              <xsl:with-param name="codes">ab</xsl:with-param>
            </xsl:call-template>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:with-param>
    </xsl:call-template>
  </topic>
  <xsl:apply-templates/>
</xsl:template>
The first part of this template is called if a MARC field 550 or 750 is encountered; however, rather than breaking out of the template, it allows the data to be extracted again later in the template. I've found that you can just remove the first section so that it looks like:

<xsl:template name="topic">
  <topic>
    <xsl:call-template name="setAuthority"/>
    <xsl:call-template name="chopPunctuation">
      <xsl:with-param name="chopString">
        <xsl:choose>
          <xsl:when test="@tag=180 or @tag=480 or @tag=580 or @tag=780">
            <xsl:apply-templates select="marc:subfield[@code='x']"/>
          </xsl:when>
          <xsl:otherwise>
            <xsl:call-template name="subfieldSelect">
              <xsl:with-param name="codes">ab</xsl:with-param>
            </xsl:call-template>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:with-param>
    </xsl:call-template>
  </topic>
  <xsl:apply-templates/>
</xsl:template>
But this is just one of what look like many problems with this stylesheet. Anyway, as always, the update can be downloaded from: MarcEdit50_Setup.exe

For those that have been interested and following development, I've completed the harvesting component of the metasearch tool. Basically, we are envisioning this tool as a hybrid search: we harvest as much data as we can, but federate the search when we have to. Then we bring together the results and rank them within the context of the returned results. Anyway, here's the updated search screen:

You can see from the screenshot that the search is querying ~25 databases in about 8 seconds. The reason we are getting such good results is that many of these items have been harvested and indexed within our MySQL harvest database (which, by the way, is internally normalized to Dublin Core; I realize we lose some granularity in the metadata, but for our purposes [search], I think that's OK, though I guess we'll see).
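The normalization step can be sketched like this. The mapping table below is a guess at the kind of crosswalk involved, not the actual one; the point is that source fields with no Dublin Core equivalent are simply dropped, which is exactly where the granularity loss comes from.

```python
# Simple Dublin Core elements (a subset of the 15-element set).
DC_FIELDS = {"title", "creator", "subject", "description", "date",
             "identifier", "type", "format", "source", "rights"}

def to_dublin_core(record, mapping):
    """Collapse a harvested source record into simple DC.
    Fields without a mapping are discarded."""
    dc = {}
    for src_field, value in record.items():
        dc_field = mapping.get(src_field)
        if dc_field in DC_FIELDS:
            dc.setdefault(dc_field, []).append(value)
    return dc
```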

Currently, the ranking algorithm is fairly simple. It uses the following to create a numeric rank:

Exact title match

In-string title match

Words in title (with the first word in the phrase ranking higher)

Match within the subjects

Match within the creators

In-string match in all metadata (all words together)

In-string match of each search word within the metadata
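A rough sketch of how these signals might combine into a single number follows. The weights here are hypothetical, chosen only to preserve the ordering implied by the list (title signals dominate, per-word matches contribute least); the actual algorithm's weights aren't published in this post.

```python
# Toy relevance ranker combining the signals listed above.
def rank(query, record):
    q = " ".join(query.lower().split())
    words = q.split()
    title = record.get("title", "").lower()
    subjects = [s.lower() for s in record.get("subjects", [])]
    creators = [c.lower() for c in record.get("creators", [])]
    meta = " ".join([title] + subjects + creators)

    score = 0
    if title == q:
        score += 100                       # exact title match
    if q in title:
        score += 50                        # in-string title match
    title_words = title.split()
    for i, w in enumerate(words):
        if w in title_words:
            score += 10 if i == 0 else 5   # first query word counts more
    if any(q in s for s in subjects):
        score += 8                         # match within the subjects
    if any(q in c for c in creators):
        score += 8                         # match within the creators
    if q in meta:
        score += 4                         # all words together, anywhere
    score += sum(2 for w in words if w in meta)  # each word, anywhere
    return score
```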

The number that comes up isn't a percentage in any sense of the word, but it does seem to do a pretty good job of putting the most relevant result in the returned record set at the top. Anyway, I have a list of 2500 actual user searches, and I'm going to be writing a script to beat the heck out of this tool, capturing error messages, time to process, number of results, etc., to see how this might work under load. Currently, we have a metasearch tool that we pay for, Innovative's MetaFind. However, looking at the numbers sent to us by III, usage for this tool (and you have to realize, it's been available for a year) has hovered around 90 queries a day. I know the system could easily handle this type of load, but we are expecting this to be successful.
–Terry

Due to a request by Dan Chudnov regarding batch processing a set of MARC authority records into MADS, I've made a couple of modifications to MarcEdit's Batch Processing tool. First, I've set up MarcEdit so that it can now batch process any file type that is currently defined in the XML functions list. Second, I added the MARC=>MADS crosswalk to MarcEdit.

Oregon obviously isn't known for its snow, especially on the Valley floor. But I got up today to ride my bike to work and what do I see... SNOW. Not a lot, but enough, and in March. How odd is that? Anyway, got to go hop on my bike.