I was intending to release the first version of the prototype this week, but I suspect it will be the end of next week now. I've come across a few problems. Well, the first isn't really a problem, but it is time consuming and must be done right. The question is: what rights package are we going to release the prototype under? Having been to the Web2Rights events and spoken at some length with Naomi from the project, I know how critical this is.

So I decided I had best go through the process the Web2Rights team have developed and check everything's in order. First I checked out their diagnostic tool. As we have developed a new product, the process is fairly straightforward, so I downloaded the appropriate checklist. The main things we have to consider are:

Any third-party library and tool use – this has already been attributed in the code where applicable. I just need to pull the copyright notices for the libraries I have used out to the top level.

Attributing the authors – I've assumed I can refer to the project by name rather than to the individual members, with LeedsMet named as the lead institution and funding from JISC.

JISC's policies – from what I have read they opt for an open source strategy; please correct me ASAP if I'm wrong about this.

Rather than pen our own licence, I would suggest using the GNU licence, which, according to the information on Web2Rights, is compliant with JISC policies. I also suggest going with their General Public License (GPL) rather than the Lesser General Public License (LGPL), for the reasons stated on their information pages.

Now I just have to cut and paste the licence header into all the code files and the problem is solved.
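For anyone following along, the header in question is the standard GNU-recommended notice block, shown here as a Java comment. The copyright line and the licence version are placeholders until we confirm them for the project:

```java
/*
 * Copyright (C) <year> <project / institution name>   // placeholder until confirmed
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program. If not, see <http://www.gnu.org/licenses/>.
 */
```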

The second problem is slightly less legal and slightly more complex: I'm talking documentation and support. So far I have the help file developed for the recent repository day, which I think needs a bit more detail. I also hacked the prototype this afternoon so the same document is available from the help menu.

While talking Tony through the prototype this morning, with a view to evaluation, it was clear from his comments that it's not as obvious as I first thought what to do with the application. I had intended to add some pop-up tips for the fields on the edit metadata panels, and some are already there, but it is clear that this is not going to be sufficient. I know this is a beta-stage prototype, but I think I need to rethink this before we do some serious evaluation.

So I’ve asked Nick and Tony to be guinea pigs to find out which bits critically need some inline help. This is scheduled over the next couple of weeks.

The final thing is the name. What are we to call this application? It is an automatic metadata generator, which is a bit long-winded and not very catchy. I keep referring to it as the Autogen, which really doesn't say much about it. Suggestions on a postcard please.

Today I started looking at the repository in more detail. There are several jobs that need to be done to ensure that metadata produced by the automatic generator prototype will be successfully integrated. The first of these is to define the subset of LOM that is to be applied to all LOs. Having come from eCat, the auto-generator currently implements a subset of the LOM standard. Discussions with Ben when we were looking at adapting eCat suggested that the subset it was using was well researched and the most practical for users.

From general discussion around the university, there are small pockets of individuals producing LOs in a packaged format. Some of these may be produced using Compendle, which is SCORM/IMS compliant and contains all the necessary metadata. Others may be using Wimba Create (formerly Course Genie), which I have been told produces little or no metadata (although this contradicts the information on their web site; it may be a version issue). Some may also be using the eCat plug-in for Word, which we promoted last year. For most, learning content will be produced using more familiar tools such as PowerPoint and Word documents, and possibly some web pages for the more adventurous. This type of content is the main focus for the auto-generator. But these different generation processes produce different metadata, and some produce none at all.

From the perspective of the repository, just about any type of content can be submitted or referenced, so this isn't a problem. The problem lies in what metadata (the application profile, from IntraLibrary's point of view) should be required. To account for those applications that produce and package LOs, all possible fields should be made available. However, for those with no metadata attached to their content and possibly no means of producing any, a very minimal set should be presented to get some basic information from the depositor.

So, bearing these two extremes in mind, I opted to set all metadata fields as optional, thus accounting for all possible subsets of LOM, but then presented the depositor with only a minimal mandatory set. This included the LO's title, description, keywords, author details (for content only), contribution date and rights details.
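Just to make that split concrete, the sketch below is the sort of thing I have in mind; the field names and LOM paths are illustrative rather than lifted from the prototype code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: flag which LOM fields the depositor must see,
// while everything else in the profile stays optional.
public class DepositProfile {

    public static Map<String, Boolean> fields() {
        Map<String, Boolean> required = new LinkedHashMap<String, Boolean>();
        required.put("general.title", true);                // LO title
        required.put("general.description", true);          // description
        required.put("general.keyword", true);              // keywords
        required.put("lifecycle.contribute.author", true);  // author details (content only)
        required.put("lifecycle.contribute.date", true);    // contribution date
        required.put("rights.description", true);           // rights details
        required.put("educational.context", false);         // example of an optional field
        return required;
    }
}
```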

The second job I needed to do was to test how the upload process works with already-packaged objects. For this I used the LOs produced by the Replica project. These are IMS content packages, most of which seem to be dated 2005. IntraLibrary should detect this and extract the embedded metadata to populate its own metadata records. This does not appear to be working for most of the objects uploaded. IntraLibrary also has a preview function which should work with most of the file types it can store; again, this only worked properly with some of the LOs. Downloading and unpacking these LOs was fine, and the content was the same as the originals. This might suggest either that IntraLibrary is not backwards compatible with previous standards or that the packages uploaded are not well formed.

The final job I had to do was to check the upload of external XML LOM files for attachment to content already uploaded. This was one of the features I most liked about IntraLibrary, and it also enables me to develop a standalone auto-generator with no packaging functionality. Again I was to be disappointed. Nothing happens when I use this facility in IntraLibrary. I've double-checked the XML format and it all seems correct, so at the moment I have no idea why this is happening.
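For the record, the kind of check I mean is essentially parse-and-validate against the LOM binding schema. The sketch below is roughly what that looks like in Java; the schema path is a placeholder and this isn't necessarily the exact code I ran:

```java
import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

// Quick sanity check on an exported LOM record before blaming IntraLibrary:
// parse it and validate it against a local copy of the IEEE LOM XSD.
// The schema location is a placeholder; use whatever copy is to hand.
public class LomCheck {
    public static void main(String[] args) throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new File("lomv1.0/lom.xsd"));
        Validator validator = schema.newValidator();
        validator.validate(new StreamSource(new File(args[0])));
        System.out.println(args[0] + " is well formed and schema valid");
    }
}
```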

Nick is arranging a meeting with IntraLibrary within the next couple of weeks. Hopefully we should be able to resolve these problems then.

Today the Streamline project finally got to show off the completed metadata prototype to non-project members. As part of a joint taster session with the Repository and PERsONA projects, we invited members of Leeds Met staff to enjoy a presentation and get hands-on with various tools. Over the course of the day various groups were entertained by Nick's presentation and his amazement at having got twenty-odd fans on his Leeds Met Facebook group. Just goes to show the power of social networking.

The majority of people were interested in getting a look inside the repository. I got the impression that at first they thought we were designing the interface to IntraLibrary. I mentioned this to John, who suggested that we need to devise a clear map of how all these applications and interfaces work together and which bits are being developed.

It was a little difficult to get people involved with the metadata tool unless I was there to show them around. The instruction sheet I had put together was a bit long-winded given the time people had to look at each of the tools available. Meg, promoter of the eCat application at the start of this project, had a go with it. She seemed suitably impressed, both by the application and by how we had incorporated the suggestions she had made when discussing the difficulties with eCat.

Others were impressed with the keyword generation process and interested in the method of text extraction we had used. Generally the response I got from those who tried out the prototype was good. Other than Meg, I was unsure whether the participants had previous metadata creation experience, although two were producing learning objects. I think this tool needs to be tested by individuals whose current workflow includes metadata creation, rather than by those who are new to the experience.

So far no one has added any comments to the blog John has set up for repository tool evaluation. 😦 So I can’t comment on that 😉

In a previous post I discussed using colour to identify for the user the source of each piece of generated metadata: whether it had come from the documents they had passed into the application or from one of the data collections. While I liked the idea, I felt it was too complex, which effectively defeated the whole reason for using it, so I dropped it and reverted to plain black text.

However, what is important for the user is to know whether or not the metadata is complete. By this I mean that those fields that are required to be filled are filled, and some warning is given to the user where this is not the case. I thought of using pop-up warnings that did not allow the user to export the metadata file (effectively completing the process) unless certain fields were filled in. Again, this was too complex from a coding perspective, and I felt it would be irritating to the user. There is also the fact that different repository and packaging systems don't all follow the LOM standard explicitly.

Thinking back to the colour coding, I decided to use it to provide non-intrusive warnings to the user. The human-readable output is colour coded red if a field is missing and essential to the LOM standard, and grey if it is missing and non-essential, as viewed here. This works much better than the pop-ups and also lets the user decide whether or not to fill in the relevant fields before exporting, allowing them to tailor the metadata produced to whatever application they intend to use it with.
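In code terms the rule boils down to something like the sketch below. This is a simplified illustration rather than the actual prototype code, and the class and method names are made up for the example:

```java
import java.awt.Color;
import javax.swing.JLabel;

// Sketch of the non-intrusive warning idea: colour the human-readable output
// according to whether a missing field matters. "Essential" here means
// essential to the LOM subset the prototype targets.
public class FieldColourer {

    public static void colour(JLabel valueLabel, String value, boolean essential) {
        boolean missing = (value == null || value.trim().length() == 0);
        if (!missing) {
            valueLabel.setForeground(Color.BLACK);   // present: plain text
        } else if (essential) {
            valueLabel.setForeground(Color.RED);     // missing and essential
        } else {
            valueLabel.setForeground(Color.GRAY);    // missing but optional
        }
    }
}
```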

The basic interface and functionality of the automatic metadata generator is almost complete. As identified in the report on eCat's use of metadata, there are three potential sources of metadata:

The learning object's (LO's) content and supporting documentation.

Persistent collections of data – data that can be reused with each new metadata file.

System data – data that can be generated from the system architecture and file formats etc.

The prototype has been designed to use collections of data for contributor information (both for content, LOM 2 Lifecycle, and for metadata, LOM 3 Meta-metadata), requirement information (LOM 4 Technical) and rights information (LOM 6 Rights). Personal preferences, another collection of data, can be used for LOM 2.1 Version, 2.2 Status, the majority of the LOM 5 Educational section and some aspects of the LOM 4 Technical section.

System data is captured to identify file types and sizes. Ideally, login information from the organisation's embedded systems should also be used to capture the user's personal details. This is not implemented at this time, as linking into university systems is a prolonged and difficult task. Instead, a separate flat-file login process has been set up to represent this, enabling the user to write their details once and reuse them many times.
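As a sketch of the write-once / use-many idea, the details could be kept in a simple properties file and read back each time metadata is generated. The file format and keys below are placeholders, not the prototype's actual format:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

// Sketch of a flat-file contributor store: details are written once to a
// properties file and loaded whenever a new metadata record is built.
public class ContributorStore {

    public static Properties load(String path) throws IOException {
        Properties details = new Properties();
        FileInputStream in = new FileInputStream(path);
        try {
            details.load(in);   // e.g. name=..., email=..., organisation=...
        } finally {
            in.close();
        }
        return details;
    }
}
```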

The final source of metadata is the LO itself and any organisational documentation or development notes (referred to as scripts). This data can be used to identify potential keywords and a possible classification for the LO. Classification has not been explored at this stage, as the LeedsMet repository (the main test bed) has not yet identified the classification system it is going to use with LOs. The money is currently on JAC, but we shall have to wait and see.

I have focused on Word docs to start with, but should be able to handle HTML, PowerPoint and possibly PDF by the end of the project. Mark did point out that there is a substantial difference between Word 2003 (my current version) and Word 2007 (the new XML-based format), but there are limitations to what we can achieve here. Maybe someone else can hack that for me 🙂
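For what it's worth, one common way of pulling the raw text out of a Word 97-2003 (.doc) file from Java is Apache POI's WordExtractor. The sketch below shows that route; I'm not claiming it is exactly what the prototype uses, just one plausible approach, and the file path comes from the command line:

```java
import java.io.FileInputStream;
import org.apache.poi.hwpf.extractor.WordExtractor;

// Minimal sketch: extract the plain text from a .doc file so the keyword
// generation can work on it. Assumes Apache POI is on the classpath.
public class DocText {
    public static void main(String[] args) throws Exception {
        FileInputStream in = new FileInputStream(args[0]);
        WordExtractor extractor = new WordExtractor(in);
        String text = extractor.getText();   // whole document as plain text
        System.out.println(text);
        in.close();
    }
}
```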

The aim of the extraction process is to generate a set of potential keywords from the documents supplied by the user. I have been running tests on some student essays, as their topics are easy to distinguish. This potential set is then presented to the user so they can select what they think is most appropriate or add new words. It's more of a brain stimulator than a definitive answer to the keyword generation problem.

To do this I've started with the basic methods used in Information Retrieval (IR). Simple term frequency (TF) scans the document, counting the number of times each word appears. There is usually some pre-processing of the document, such as removal of stop words and stemming. I've opted for just the stop-word removal, as stemming returns many words that don't convey the true contextual meaning from the perspective of keywords; for example, "computing" becomes "compute".
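The counting itself is nothing fancy; roughly the sketch below, where the stop list is a tiny illustrative sample rather than the real one:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Plain term frequency with stop-word removal and no stemming.
public class TermFrequency {

    private static final Set<String> STOP_WORDS = new HashSet<String>(
            Arrays.asList("the", "and", "of", "to", "a", "in", "is", "for"));

    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.length() == 0 || STOP_WORDS.contains(token)) {
                continue;                       // skip stop words and empties
            }
            Integer current = counts.get(token);
            counts.put(token, current == null ? 1 : current + 1);
        }
        return counts;
    }
}
```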

TF can be extended into various other methods, one of the most common being term frequency–inverse document frequency (tf-idf). This is basically a weighted measure across a document set (not a single document): it favours terms that occur frequently within a given document but in few of the other documents in the set. It can only be used if the user submits several documents to the auto-generator, so it has limitations.
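For completeness, the standard weighting is tfidf(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is how many of them contain the term. A sketch building on the term counts above (the document whose counts are passed in is assumed to be one of the documents in the set):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of tf-idf over a small document set, using the per-document
// term counts produced by the TermFrequency sketch above.
public class TfIdf {

    public static Map<String, Double> weights(Map<String, Integer> docCounts,
                                               List<Map<String, Integer>> allDocs) {
        Map<String, Double> weights = new HashMap<String, Double>();
        int n = allDocs.size();
        for (Map.Entry<String, Integer> entry : docCounts.entrySet()) {
            String term = entry.getKey();
            int df = 0;
            for (Map<String, Integer> doc : allDocs) {
                if (doc.containsKey(term)) {
                    df++;                       // document frequency of the term
                }
            }
            double idf = Math.log((double) n / df);
            weights.put(term, entry.getValue() * idf);
        }
        return weights;
    }
}
```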

The final method I tested was weighted document structure. This counts terms again, but gives greater weight to those that appear in headings and titles. It can be used both on a single document and on a document set.

Generally I found very little difference across the three methods. The top three to five terms tended to be the same (the ordering was often a little different), with the next five to ten words being a mix of useful and not-so-useful terms. No particular method stood out, but they all seemed to be putting relevant words at the top of their lists.

Now I need to consider how to mix these basic techniques with the different types of content the user may submit. A textual learning object may benefit from the weighted term frequency, whereas scripts and university documents may perform better using the cross-document-set tf-idf.

As a new member of the team I thought it would be useful to produce an overall architectural view of the main workflow packages that are to be built as part of the Streamline project. I produced a conceptual architecture model that identifies the key elements. The final aim of the working system is to allow access to the Streamline functionality via multiple mechanisms. Initially, however, user involvement will be via a Java-based GUI developed by Dawn. Eventually a web-based GUI will be the way to go, once the interface requirements have been developed, tested and evaluated by end users.

The underlying areas of the system will be required to do some number crunching, e.g. the latent semantic analysis which will allow related search results to be displayed to the user. Initially these components have been developed using C++ and Matlab, due to the suitability of these languages for rapid mathematical programming. The intention, however, is to convert all the components into Java in order to ease integration and allow the system to be opened up to third-party applications via web services and a Java API. A prototype of a parsing algorithm (to be used for metadata extraction) has also been developed in C++ by Elizabeth. To ease translation, a UML model of this prototype has been created. This should also ease further development of the prototype up to the stage where it is ready for translation and integration into the full system.

At the May Streamline meeting I presented the prototype in its current state (half baked). Since then I have had so many other things to do that I have had little time to fill in the missing pieces. That's now underway and I anticipate good progress this month. I should have something user-presentable at the very least.

Anyway, while hacking away I was thinking (I do that occasionally) about the comments made at the last meeting. One of the functions I have built in is a non-XML, human-readable version of the metadata content for a learning object; basically a plain-text version. This was done in response to some comments made at one of the eCat training sessions I attended back in January. One user pointed out that after filling in all (about a dozen) metadata screens, the only way to check what you had done was to go back through all of them. There was no overview of the metadata entered, except of course as the outputted XML.

I made this screen really simple: just headings and the metadata content associated with them, grouped in the UK LOM Core structure (I don't want it to become a long, overcomplicated input screen). One of the comments made by the Streamline members (I forget who) was: how do you identify what's missing and then direct the user to where they can correct it? This was a fair point. Well, I've tackled the first aspect of this by colour coding the output to highlight null or empty values. This screenshot shows the result. I also thought it might be useful to indicate the difference between manual, auto-generated and preference-generated input. I've got it working in a couple of instances, but wanted feedback on whether it is worthwhile, as it will take a couple of days to implement across the whole system.

I'm still working on the second part of that comment – directing the user to the correction point – and will get back to it in due course.