Meta

This post goes through some of the techy stuff behind it. If you’re just interested in features, I’m afraid there are none yet, except that you can now compare more than 4 repositories, and that’s as far as you’ll want to read.

Subversion and Trac

The first thing I did a few months ago was create a Subversion repository. I seem to have timed this quite well, learning how to use Subversion just as every man and his dog gets excited by Git.

I also installed Trac (using Dreamhost’s useful one-click install), which I haven’t really used other than to browse the code and view changes (the timeline).

It took me a while to get svn working well. Originally I would edit the files in a local working copy on my MacBook, and then use Subversion to load these to the server to test. Of course, this meant every little change had to be manually checked in before it could be tested. I got Apache and PHP working on the MacBook, and set up the MySQL database (on the Dreamhost server) to allow remote connections, which let me test the files in my local working copy directly. This seems to work well.

I don’t do much code writing or developing. Anyone glancing at my work will find that plainly obvious.

I’ve realised that when writing code I have a tendency to be very linear and to unconsciously put efficiency before good design. Anything more than one database call per page was unthinkable, and loading in other pages a sin, which often led to large, unwieldy while loops processing the results from one massive database call. Database calls are mixed with HTML output, which is mixed with logic.

This is just about OK when showing information about one Institutional Repository (IR). But when comparing several, the code becomes unreadable and is in no state to be reused. The key aspect of comparing a number of IRs is that you need to ascertain a number of facts for the page as a whole: for example, the earliest data collection date and the highest record count (needed for the chart), either of which could come from any of the IRs.
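As a rough illustration of those page-wide facts, here is a minimal sketch in Python rather than the site’s PHP. The `Repository` class and the `page_summary` function are hypothetical stand-ins, not the actual ircount code, and the counts are made up.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Repository:
    """Hypothetical stand-in for one repository's collected data."""
    name: str
    counts: dict  # maps collection date -> record count


def page_summary(repos):
    """Return (earliest collection date, highest record count) across all repositories."""
    earliest = min(min(r.counts) for r in repos)
    highest = max(max(r.counts.values()) for r in repos)
    return earliest, highest


# Made-up sample data: either fact can come from any of the repositories.
repos = [
    Repository("A", {date(2007, 1, 7): 1200, date(2007, 1, 14): 1250}),
    Repository("B", {date(2006, 11, 5): 300, date(2007, 1, 14): 4100}),
]
earliest, highest = page_summary(repos)
```

Here both facts happen to come from repository B, but nothing in the function depends on that.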

My aim was that the page files should be little more than calls to a few discrete functions.

I’m not quite there yet, but it’s a start. archive.php is mainly a set of function calls, but there is still too much in there, with random bits of HTML dotted around. There are also too many arrays holding information that the repository (PHP) objects can provide. include.php holds the functions, but is now a little unwieldy itself, with functions in no particular order. The third file of note is class.archive.php. This is a repository object, which can grab data from the database for the repository and return it to the calling page in various ways.

My original plan was to merge the code for showing one repository with the code for showing more than one, to make it easier to implement changes (not having to update two files). By the end of it, I’m now wondering if it would be easier to have two files again, for the little changes, which both call the same core functions.

One of the problems of the last version of ircount was that the URLs for the chart images often became too long (more than 2,000 characters), resulting in no chart being shown. The chart URL includes each data point separated by a comma, so with four repositories multiplied by 100 weeks (for example), multiplied by 4 or 5 characters per data point (a 4-digit number plus a comma), it soon adds up.
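The arithmetic can be sketched in a few lines of Python. The function name and the default of 5 characters per point are illustrative assumptions based on the figures above, and the estimate ignores the rest of the URL.

```python
URL_LIMIT = 2000  # approximate limit mentioned in this post


def estimated_data_chars(repo_count, points_per_repo, chars_per_point=5):
    """Rough size of the chart data: a 4-digit number plus a comma per point."""
    return repo_count * points_per_repo * chars_per_point


weekly_chars = estimated_data_chars(4, 100)  # four repositories, 100 weeks each
fits = weekly_chars < URL_LIMIT
```

With these numbers the data alone comes to 2,000 characters, so `fits` is false before the rest of the URL is even counted.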

One solution was to pass data only per month rather than per week (reducing the number of data points to roughly a quarter of the original). Another, which I will probably use in the future, would be to make use of Google Chart’s encoding function, made easy using these helpful functions.
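A minimal Python sketch of the monthly reduction might look like this. It assumes the weekly counts are held as a mapping from collection date to record count and keeps the last count seen in each calendar month; that choice, and the name `monthly_points`, are assumptions for illustration, not necessarily what ircount does.

```python
from datetime import date


def monthly_points(weekly_counts):
    """Keep the last weekly count seen in each calendar month, oldest month first."""
    by_month = {}
    for day in sorted(weekly_counts):
        # Later weeks overwrite earlier ones within the same (year, month).
        by_month[(day.year, day.month)] = weekly_counts[day]
    return [by_month[key] for key in sorted(by_month)]


# Made-up sample: four weeks in January, two in February.
weekly = {
    date(2008, 1, 6): 1000,
    date(2008, 1, 13): 1010,
    date(2008, 1, 20): 1025,
    date(2008, 1, 27): 1040,
    date(2008, 2, 3): 1050,
    date(2008, 2, 10): 1065,
}
monthly = monthly_points(weekly)
```

Six weekly points become two monthly ones here, which is where the "roughly a quarter" saving comes from over a longer run.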

Overcoming this in an efficient way was a challenge. Originally I had a Google Chart PHP object. IR data would be passed to it in one method, and another method would return the URL.

This seemed sensible at the time, but deciding which PHP object did what became confusing. For example, the first thing the chart object needed to do was decide if the URL would be too long for the repositories in question, taking into account the data we had for each repository. Should the chart object loop around each data point for each repository to first decide how many there are? Or should the repository object handle this by telling the chart object how much data it has? How do you avoid looping through the same data several times? Does it matter which object does the work? It’s for the chart, so the chart object should do it, yet other parts of the page may want this information about the repositories, so the repository objects should provide it for all.

In the end, I did away with the chart object and used a function instead, which is passed an array of repository objects, which in turn handle a lot of the work.
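The shape of that arrangement can be sketched roughly as follows, again in Python rather than PHP. Everything here is hypothetical: the `Repository` class, its `point_count` and `series` methods, the `chart_data` function, and the rule of keeping every fourth point when the data looks too long. The pipe between series is also an assumption, following Google Chart’s text data format.

```python
class Repository:
    """Hypothetical repository object; the real one loads its data from the database."""

    def __init__(self, name, weekly):
        self.name = name
        self.weekly = weekly  # weekly record counts, oldest first

    def point_count(self):
        """Tell the caller how much data there is, so it need not loop itself."""
        return len(self.weekly)

    def series(self, step=1):
        """Every step-th count, counted back from the most recent one."""
        return self.weekly[::-1][::step][::-1]


def chart_data(repos, max_chars=2000, chars_per_point=5):
    """Build comma-separated chart data, thinning it if it looks too long."""
    total_points = sum(r.point_count() for r in repos)
    step = 4 if total_points * chars_per_point >= max_chars else 1
    return "|".join(",".join(str(n) for n in r.series(step)) for r in repos)


repos = [Repository("A", [1200, 1250, 1300]), Repository("B", [300, 320])]
data = chart_data(repos)  # "1200,1250,1300|300,320"
```

The function only asks each repository how much data it has and for a series at a given step, so the same repository objects remain usable by other parts of the page.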

Future

The foundations are about there. For any page I (or anyone else) wish to create, a couple of lines are all that is needed to take one or more repository IDs passed in the URL and load in all the data for them, ready to be used as needed. We can then easily call a table or chart to display these repositories (or a subset of them).
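As a sketch of the first step, reading repository IDs from a URL, here is a short Python example. The repeated `id` query parameter and the example URL are assumptions made for illustration; this post does not say how ircount actually passes the IDs.

```python
from urllib.parse import urlparse, parse_qs


def repository_ids(url, param="id"):
    """Return integer repository IDs from a repeated query parameter, skipping non-numeric values."""
    values = parse_qs(urlparse(url).query).get(param, [])
    return [int(v) for v in values if v.isdecimal()]


# Hypothetical URL; "abc" is ignored because it is not a number.
ids = repository_ids("http://example.org/archive.php?id=12&id=34&id=abc")
```

Each ID in the resulting list would then be handed to a repository object to load its data.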

As I mentioned above, the only real improvements are the ability to show more than 4 repositories at once (the chart stops showing once you get to about nine repositories) and a more robust chart, which will now show in cases where it would have failed in the past.

Google Chart does have an encoding which allows far more data to be passed in a condensed way, and this PHP function looks very useful for using it.
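To show why the encoding helps, here is a minimal Python sketch of Google Chart’s extended encoding, in which each value is scaled to the range 0 to 4095 and written as two characters from a 64-character alphabet. This is not the PHP function mentioned above; the name `extended_encode` and the scaling against a caller-supplied `max_value` (which must be greater than zero) are assumptions.

```python
# The 64-character alphabet used by the extended encoding.
EXTENDED = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-."


def extended_encode(series_list, max_value):
    """Encode lists of non-negative numbers as an 'e:' chart data string."""

    def encode(value):
        # Scale to 0..4095, then split into two base-64 "digits".
        scaled = round(value * 4095 / max_value)
        scaled = max(0, min(4095, scaled))
        return EXTENDED[scaled // 64] + EXTENDED[scaled % 64]

    # Series are separated by commas; points within a series are not separated.
    return "e:" + ",".join("".join(encode(v) for v in s) for s in series_list)


encoded = extended_encode([[0, 64, 4095], [1]], max_value=4095)  # "e:AABA..,AB"
```

Each data point costs two characters instead of the four or five needed for a number plus a comma, at the price of some precision when counts are scaled down.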

If I were starting again today I would look to use a framework such as Zend Framework or CakePHP, or maybe even have a go at Ruby on Rails. But perhaps a third rewrite is a little over the top for now.

I need to tidy up the table view a bit (some nasty code there) and then look at a few new features, maybe collecting more data and exposing it in some computer-friendly formats such as Atom.

So ircount is really no more than a plaything for a bad coder to make mistakes and learn a little bit along the way, slowly. But if you have any ideas or thoughts, I would love to hear them.