Architects of the Information Age

Paul Miller reports on a recent UKOLN-organised event at the Office of the e-Envoy, and explores the need for an architecture to scope what we build online.

In July of this year, Interoperability Focus [1] organised a meeting at the Office of the e–Envoy [2], the Cabinet Office unit responsible for driving forward the UK's e–Government initiatives.

Across an increasing number of initiatives and programmes, there is a growing recognition of the need for common 'architectures' within which truly useful applications and services may be constructed.

Partly, these architectures form a philosophical basis within which developments may be undertaken. Further, though, such architectures in their broadest sense may include the specification of a common technical basis for working (such as the New Opportunities Fund's Technical Guidelines [3]), consideration of shared middleware services (such as the ATHENS [4] service which controls access to a range of resources within further and higher education), as well as often detailed technical specifications.

It remains important that such architectures not be driven forward solely in a technological context, but that their design, implementation and evolution continually be informed by institutional and user requirements and aspirations.

This one day colloquium [5] sought to encourage an open discussion of the issues related to a number of emerging architectures, with a view to informing those at an earlier stage in their deliberations, encouraging an information flow between more established infrastructures, and hopefully serving to reduce the potential for duplication of effort and the adoption of unnecessarily different solutions to essentially similar problems.

The proceedings were introduced by three presentations on quite different approaches: the DNER Architecture being developed by UKOLN for the UK's Further and Higher Education sector [6], the e-Government Interoperability Framework [7] (e-GIF) mandated across the UK Public Sector, and the Department for Culture, Media & Sport's (DCMS) vision for Culture Online [8].

The presentations themselves are available online [5]. Rather than discuss them, this paper seeks to draw out a number of the issues raised in the presentations and the ensuing wide–ranging discussion. In most cases, it is only possible to flag issues in need of further study, rather than to offer concrete solutions.

Architectures...

The term 'architecture' is increasingly used in discussions of information systems, and its meaning often varies quite markedly from one application to the next. In this paper, I broadly divide architectures into four types, using definitions of my own, as below.

A technical architecture is often sketched out for individual systems, or small clusters of systems. Such an architecture will invariably detail specifications for components of the whole, and address such issues as the protocols to be used for communication between one component and another. In most technical architectures, softer issues such as the purpose of the system or the manner in which users will interact with it are implicit at best and, more often than not, far from fully considered.

A functional architecture, such as that developed through the MODELS series [9], instead takes a process–driven view of the system. Through such a view, the architecture will often address the functions that such a system is expected to fulfil (allow discovery of records, allow a request to be made for the associated resource, etc.) or the functions that a user may wish to use it to undertake (discover records, request the associated resource, etc.). The two may appear similar in many cases, but functional architectures need to remain clear as to whether they are system– or user–focussed, or attempting explicitly to encompass both viewpoints. Such clarity from the outset makes understanding more complex aspects of the architecture simpler at a later date.

Perhaps less well developed is the idea of a landscape architecture. This serves to bound the realm of possibilities, to define what is 'in' and what is 'out', and to (ideally unambiguously) describe the relationships between users, resources, and technical systems. A landscape architecture might, for example, model large scale IPR issues and other relationships. The extent to which many of these relationships can be modelled and expressed at an architectural rather than individual system level is a question requiring further work.

The combination of all three, and more, is the information architecture. Such an architecture scopes the systems, data model, content, machine–machine (m2m) and machine–user interactions, and the environment within which interactions and transactions occur. Although there are many documented information architectures, few if any manage the totality of this definition today, although some, at least, are working towards it.

Terminological issues

One issue that comes up with great regularity in practically all domains is that of terminology [10]. People wish to increase the precision of description and discovery, and see the application of constrained sets of labels to this process as one means of solving current problems. Superficially, it appears an easy problem to solve.

As the work of the High Level Thesaurus Project (HILT) [11] has shown, a wide range of existing, incomplete, terminologies, category lists, classification schemes and thesauri are already in use, and there is marked reluctance to give any of them up, despite their faults. HILT proposes piloting a mapping agency, to look at the feasibility of allowing systems and users to cross–search resources catalogued using different schemes, but it is likely to be some years before the benefits of any such project are seen.

More immediately, there appears to be growing consensus on the need for a number of quick win solutions, with users in further and higher education, the cultural sector, government and beyond calling for someone to provide them with answers. It remains to be seen, of course, whether the same answers will suit all of these communities.

Crudely, the areas in which terminological control is most sought may be characterised as: categorisation of level or audience grouping for resources; a small set of broadly applicable subject terms; a small set of broadly applicable resource types; and a (potentially extremely large) set of names for geographical locations.

In some cases, the best solution may be to point to an existing resource and, despite its faults, suggest or mandate that it be used in preference to other existing resources. Projects funded by the New Opportunities Fund [12], say, might be encouraged to select subject terms from the UNESCO Thesaurus [13] for all of the content they create, as well as using any locally preferred terminological controls. In this way, the individual projects gain the precision and detail of their local systems, whilst ensuring a degree of interoperability and cross–search capability across the body of NOF material. In other cases, there is a need for a small group simply to create a new resource, filling a gap in existing provision, and solving a widely recognised problem. Where we consider creating new resources, we are quite explicitly not suggesting the construction of something approaching the size or complexity of the existing UNESCO Thesaurus [13] and its equivalents from other sectors. Rather, we are attempting to fill quite tightly defined gaps in existing provision, often by providing new resources comprising no more than a few hundred terms.
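The idea of pairing locally preferred terms with terms drawn from a shared thesaurus can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Record class, the cross_search function, the sample titles and the subject values are all hypothetical, and the shared terms shown have not been checked against the UNESCO Thesaurus itself.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A catalogue record carrying two layers of subject description."""
    title: str
    local_terms: set                                  # the project's own vocabulary
    shared_terms: set = field(default_factory=set)    # terms from a shared thesaurus


def cross_search(records, shared_term):
    """Return the titles of all records tagged with the given shared term."""
    return [r.title for r in records if shared_term in r.shared_terms]


# Hypothetical records from different projects, each with its own local terms.
records = [
    Record("Roman pottery from a northern fort", {"Samian ware"}, {"Archaeology"}),
    Record("Parish registers, 1750-1800", {"Baptisms"}, {"History"}),
    Record("Excavation archive, site 12", {"Trial trench"}, {"Archaeology"}),
]

print(cross_search(records, "Archaeology"))
```

The local terms stay available for detailed searching within a single project, whilst the shared terms are the only thing a cross–project search needs to understand.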

Audience and Level

Often, a resource will be intended primarily for a particular category of user. A government resource might be aimed at civil servants, or the general public, or tax payers, or school children. An educational resource might be aimed at pupils studying Key Stage 2 History in the English National Curriculum. Adding complexity, a Physics text book on superstring theory may be considered to be introductory in nature, provided the reader has a degree–level general awareness of Physics.

Increasingly, people are identifying the need to categorise the audience of a resource, but the approaches they wish to use in practice are often incompatible. The simple examples, above, show how quickly this area becomes complex. Within the Dublin Core Metadata Initiative (DCMI) [14] alone, both the Education and Government working groups have called for an extension to handle notions of Audience. Even if such an extension is granted, the terms each group chooses to fill this new element with will be far from interoperable in most cases.

In advance of any overarching set of terms to describe aspects of Audience — were such a thing even feasible — the Metadata for Education Group (MEG) [15] here in the UK is working on a document to define a controlled set of terms to describe UK educational levels. Similar work is required for other audience categories, and to explore the feasibility of joining such resources together, either within the UK or internationally.

Subject Terms

Despite the existence of large and often complex subject thesauri and classifications, such as those examined by HILT, there remains a perceived need for a single higher–level set of terms, ideally as small as is feasible. Such a set would not be intended to describe subject–specific detail, but rather to allow subject or domain–spanning services such as the Resource Discovery Network [16] to place resources in some form of context. To borrow an example used in several instances by HILT, such a set of terms would allow the user searching for Lotus to know whether resources being returned to them were concerned with engineering, biology, or computer software. The set would probably not contain the detail required to further specify automotive design for sports cars, a particular genus or species of flower, or to denote that Lotus 1–2–3 is a spreadsheet program.
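The Lotus example can be made concrete with a short, hedged sketch. The result titles, the three high–level terms and the group_by_context function are all invented here for illustration; they are not drawn from HILT or from any real service.

```python
from collections import defaultdict

# Hypothetical search hits for "Lotus", each paired with one high-level term.
results = [
    ("Lotus Elise suspension design", "Engineering"),
    ("Cultivating the sacred lotus", "Biology"),
    ("Lotus 1-2-3 macro guide", "Computer software"),
    ("Lotus Seven chassis notes", "Engineering"),
]


def group_by_context(hits):
    """Group result titles under their high-level subject term."""
    grouped = defaultdict(list)
    for title, term in hits:
        grouped[term].append(title)
    return dict(grouped)


for term, titles in group_by_context(results).items():
    print(term, "->", titles)
```

Note that the terms do no more than place each result in context; nothing here attempts to say that one hit concerns sports cars and another a spreadsheet program.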

In services, such as the RDN, that attempt to aggregate existing bodies of material, the greatest need is often for terms which denote what would be — within a single service or resource — almost ridiculously obvious. In the early development of the Arts & Humanities Data Service [17], for example, a great need was to be able to say unambiguously that resources being delivered to the central cross–domain portal from the Archaeology Data Service were about archaeology, that resources coming from the Performing Arts Data Service were music, film, etc. For those visiting the websites of each service provider directly, such statements were largely redundant; of course resources actually on the web site of the Archaeology Data Service were about archaeology, and because of this these 'obvious' statements were, initially, rarely made.

On the service provider sites there were similar problems, albeit at a different level of detail. Whilst all resources on the Archaeology Data Service might be about archaeology, visitors to the site needed to know that all resources returned to them from a search of English Heritage's Excavation Index were one type of excavation or another, or even that they were all in England; facts that would be 'obvious' to someone interacting directly with the Excavation Index itself, but which are much less so when searching a catalogue containing a wide range of archaeological resource types from countries all over the world.

There is, then, a need for a single set of high level terms — a genuine High Level 'Thesaurus' — upon which a wide range of communities might draw. Services with a clear need for this, such as the Resource Discovery Network, the Arts & Humanities Data Service, the New Opportunities Fund Digitisation Programme, the National Grid for Learning, the University for Industry and others have expressed interest in working towards some common solution. As elsewhere, the feasibility of agreeing the detail, rather than the high–level ideal, remains to be seen.

Ideally, such a resource would follow the model proposed for the new Government Category List [18], which is a set of a few hundred terms developed to describe resources of interest to the citizen and delivered by all branches of UK local, regional and national government. This Category List is not intended to replace the detail of existing departmental thesauri, but instead sits above all of them and provides a degree of cross–departmental interoperability. The Government Category List currently comprises 200–300 terms, clustered under twelve main headings.

Interest has been expressed in any cross–sectoral high level list being even smaller; perhaps no more than twenty terms in total. The reality of meeting even the needs of those services listed above is likely to result in a somewhat larger set, but decisions will need to be made early in the process about just how large any such set should become before it ceases to be High Level in any meaningful sense.

Resource Type

Resources are of many types, and take many forms. Resources can be physical or digital, and may be (either physical or digital) books, videos, audio recordings, etc. Information about the Type of a resource is important, and has implications for storage, usage, and preservation. Type is also closely bound up with format in many practical instances; the resource may be classified as being of the Type "video", and therefore of interest to the searcher, but stored in a North American rather than European Format on the tape and consequently unplayable.

Resource Type has long been identified as important to enumerate within the work of the Dublin Core Metadata Initiative [14], with a list of Types [19] being one of the first they produced. It remains unsatisfactory, though, and intermittently a source of much debate. Although the list is disliked, and widely considered to be far from useful, no one has yet managed to appease all of the interested parties and propose a replacement to which they are all happy to subscribe.

As our portals become increasingly Hybrid, accessing a wide range of physical and digital multiple media resources, useful enumerations of Type become ever more important, and this is yet another area in which the funding of a focussed piece of work would be of great potential benefit to the community.

Geography

Finally, place. Here, too, there is a need to be able to identify, consistently and unambiguously, the location that is being described. Is a resource about "Hull", for example, actually about the properly named Kingston–upon–Hull in England's East Riding of Yorkshire, about Hull in Quebec, Canada, or about one of the myriad other Hulls there must be in the world? One solution is to tie all such placenames to sets of co–ordinates, such as a UK National Grid Reference, or a latitude and longitude. This is not ideal, though, as complex or large features such as rivers and countries cover extensive areas and would be expensive to describe in this way, and less defined geographical concepts such as "The West Country" or America's "Mid–West" are actually quite difficult to place boundaries around.
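A minimal sketch of tying placenames to co–ordinates might look like the following. The GAZETTEER structure and the lookup function are hypothetical, and the latitudes and longitudes are approximate values supplied purely for illustration; point co–ordinates of this kind also do nothing for the rivers, countries and fuzzy regions mentioned above.

```python
# A toy gazetteer: one ambiguous name maps to several candidate places.
# Co-ordinates are approximate and included only to illustrate the idea.
GAZETTEER = {
    "Hull": [
        {"full_name": "Kingston-upon-Hull", "region": "East Riding of Yorkshire",
         "country": "England", "lat": 53.74, "lon": -0.33},
        {"full_name": "Hull", "region": "Quebec",
         "country": "Canada", "lat": 45.43, "lon": -75.71},
    ],
}


def lookup(name, country=None):
    """Return candidate places for a name, optionally narrowed by country."""
    candidates = GAZETTEER.get(name, [])
    if country is not None:
        candidates = [c for c in candidates if c["country"] == country]
    return candidates


print(len(lookup("Hull")))                          # both candidates
print(lookup("Hull", country="England")[0]["full_name"])
```

A name on its own returns every candidate; it is only the additional context (here, a country) that allows a record about "Hull" to be placed unambiguously.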

Time adds a further complicating factor. "Strathclyde" was, at one time, a territory stretching from western Scotland all the way to Wales. Between the boundary changes of the 1970s and 1990s it was a unit of Local Government in western and south–western Scotland. Now, it doesn't exist, except in the names of a few quasi–public sector services left after the last round of boundary reforms. Similarly, the city of York has had a number of names over the past 2,000 years, all with approximately the same centre, but covering very different areas. How do we cope with these changes in ways that reflect both that which is 'correct' (Cumbernauld really isn't in Strathclyde anymore) and the (probably different) ways in which people view the past and present Geographies around them?
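One way to record that a name was 'correct' only for a period is to attach a range of years to it. The sketch below is an assumption–laden illustration: the NamedArea class and names_in_year function are hypothetical, and the years given for the Strathclyde region and its successor authority are supplied here for the example rather than taken from any authoritative gazetteer. It also captures only the administrative answer, not the ways in which people continue to think about a place.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NamedArea:
    """An administrative area name together with the years it applied."""
    name: str
    valid_from: int            # first year the name applied (inclusive)
    valid_to: Optional[int]    # last year it applied (inclusive); None = current


def names_in_year(areas, year):
    """Return the area names that were in force during the given year."""
    return [
        a.name for a in areas
        if a.valid_from <= year and (a.valid_to is None or year <= a.valid_to)
    ]


# Illustrative history of the local government area containing Cumbernauld.
CUMBERNAULD_AREAS = [
    NamedArea("Strathclyde", 1975, 1996),
    NamedArea("North Lanarkshire", 1996, None),
]

print(names_in_year(CUMBERNAULD_AREAS, 1980))
print(names_in_year(CUMBERNAULD_AREAS, 2001))
```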

Geography is, as the paragraphs above can only begin to demonstrate, a complex problem. There are those who would argue that we should only implement a solution if it works for all the complexities of spatial and temporal fuzziness and that, since doing so even for the needs of UK Further and Higher Education would be prohibitively expensive, we just shouldn't bother. As elsewhere in this paper, though, I'd argue that there are things we can do, relatively easily and cheaply, that go some way towards a solution.

To begin with, we could all use a single authoritative list for naming modern places in the UK. Although it contains only around 250,000 terms, and is therefore far from complete, the existing 1:50,000 scale Ordnance Survey gazetteer might offer just such a list. It includes the names of places mapped on Ordnance Survey's 1:50,000 scale maps, and provides each name with a grid reference, locating it to a point within 1 kilometre of its position. Might 'the community' enter into discussion with Ordnance Survey about the feasibility of this resource entering the public domain, ensuring a degree of consistent naming, and acting as a far–from–expensive advert for Ordnance Survey products and good will?

The great success of the recent collaboration between JISC and the British Library, through which MIMAS were able to deliver ZETOC [20] to the Further and Higher Education sectors, is a good example of the way in which products previously considered as revenue generators can be freed up to a broad community of users, very probably reaping financial gains through other avenues (such as Inter Library Loan requests in this case) downstream.

Beyond this quick win, further work is needed with agencies such as Ordnance Survey, the Post Office/Consignia, the Office for National Statistics and others to explore the extent to which existing, extremely comprehensive, databases of place names and their hierarchical relationships to one another can enter more widespread usage.

Persistence

Things change. People reorganise websites and whole organisations. Publishing houses buy other publishing houses, and absorb their titles into the catalogue of the parent. Local government boundaries alter, and the academic world reaches new consensus about the ways in which species are categorised, or the Dynasty to which a particular Egyptian Pharaoh belonged. All of this is a not unwelcome fact of life.

These changes become a problem, though, when the wholly unnecessary changes made by others threaten the distributed philosophy at the heart of the architecture deployed by many current web–based services, and so central to the web–of–association notions of the Semantic Web [21].

To take a few simple examples, each of which is apparently obvious, and each of which is flouted again and again by those who should know better:

why should the URI of a significant and oft–cited document change, just because the Service hosting it has decided to reorganise their web site?

why should the fact that a project has ended, and transferred its results to a third party for archiving and continued provision, mean that users need a whole new set of references for documents and resources?

why should e–mail addresses and web site URIs across Government change overnight, just because of a Cabinet restructuring?

why should a reappraisal of scientific data, leading to a completely different interpretation, result in a report with the same identifier as its predecessor, making it impossible for others to compare the two interpretations as the reinterpretation effectively overwrites its predecessor on the Web?

At the heart of all of these, of course, is the requirement for appropriate persistence, and a need for information architects everywhere to devise solutions that separate the identification of the resource itself from the logical (and changeable) location in which it is made available.

The solutions to many problems of persistence are relatively straightforward:

implementers should be required to think about the implications for others of changes that they introduce

greater attention should be paid to available solutions, such as the Digital Object Identifier [22]

project funding bodies such as the JISC should be encouraged to include clauses on the transfer of URIs and the resources to which they resolve to designated archives at the end of projects, if appropriate.
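The separation of a resource's identifier from its current location, called for above, amounts to a layer of indirection. The sketch below shows that idea only; the Resolver class, the identifier and the URLs are hypothetical, and it is not a description of how the Digital Object Identifier system or any other real resolver is implemented.

```python
class Resolver:
    """Maps persistent identifiers to whatever location is current."""

    def __init__(self):
        self._locations = {}

    def register(self, identifier, url):
        """Record (or update) the current location for an identifier."""
        self._locations[identifier] = url

    def resolve(self, identifier):
        """Return the current location; raises KeyError if unknown."""
        return self._locations[identifier]


resolver = Resolver()

# The document is first published on the project's own web site...
resolver.register("example:report-42", "http://project.example/reports/42.html")

# ...and later transferred to an archive. Only the resolver entry changes;
# the identifier that others have cited stays exactly the same.
resolver.register("example:report-42", "http://archive.example/project-x/42.html")

print(resolver.resolve("example:report-42"))
```

Citations use the identifier, never the URL, so a web site reorganisation or the end of a project requires one update to the resolver rather than a new set of references for every user.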

Certification, Identification, and the like

As services increasingly provide access to resources over which they do not necessarily have control, for users from whom they are geographically and contractually remote, there is a growing need for more reliable identification of the components in any transaction, and certification that the identifications are accurate and authentic.

Building the Web of Mistrust

When dealing face–to–face with supposedly knowledgeable people, or entering the Bricks and Mortar premises of an allegedly reputable organisation, we are well equipped to make (admittedly sometimes wrong) judgements about the bona fide nature of those with whom we are dealing.

Online, it becomes much harder to reach realistic judgements based simply upon the appearance of a web site or the contents of an e–mail message. How many, for example, were duped by deplorable scams to obtain money from concerned individuals around the world, supposedly to assist in the relief effort in New York?

It is, it unfortunately appears, only common sense to mistrust online content, unless persuaded by some means to accept that which you see. Content, assertions, goods for sale, payment for goods, individuals, organisations and more all need to be unambiguously identified, and all of the identifications need to be certified by a trusted third party to whom the cautious or aggrieved can turn for assurance or redress. This is the possibly misnamed Web of Trust so essential to the further commercial expansion of the Web, but also increasingly important in cultural and educational contexts, where services and the very resources to which they point may be devolved to many players.

In identifying organisations, the commercial sector has made some progress in the development of digital certificates and certification agencies. In UK and European law, a suitable digital 'signature' on an electronic document is considered binding. Identifying people, transactions and individual resources is proving a greater problem to solve. Civil rights concerns, whether real or unfounded, prevent unambiguous ubiquitous identification of individuals. Early ideas within government of using the National Insurance number to identify citizens interacting online with government have been dropped, and attention is now turning to the use of task– or purpose–oriented identifiers, such as a student number for educational interactions, a National Insurance number for claiming benefit, etc. In a number of local authorities, trials are underway in the use of smartcards that identify and authenticate users for participation in a number of transactions with local government.

It is important to remember, though, that these identifications and authentications of users should serve a defensible purpose. It is unreasonable, for example, to require users to identify themselves and log in just to view your web site, or to interact with resources over which there are no relevant usage restrictions. Further, any system of identification must be capable of identifying all potential users, and should not add to the problems of the socially excluded. How many people in the UK, for example, might not have a National Insurance number, or even know it if they have one? It is also illegal, of course, to make use of a user's identification or registration information for other purposes (such as marketing) without their permission.

Although it seems unlikely that the current public mood will tolerate the creation and application of a single identifier for individuals, there may well be scope for the development of a single standard for gathering, storing, and utilising such identifications, building upon existing developments within education (such as ATHENS [4] and its successor) and beyond.

Being inclusive

Many of the problems facing those building the DNER are also faced by the architects of the Government Portals, or of Culture Online, or the National Grid for Learning. There are also similarities with commercial service developments, and the different communities have a great deal to offer one another, and a great deal to learn, assuming we can weaken the barriers of language, working practice and financial model which keep us apart. Interoperability Focus and others already work to actively cross some of these barriers, especially within the relative familiarity of the public sector. There is certainly scope for more to become involved in this work, and for its active extension into the commercial sphere.

Discovering what the user wants

Information architects, and others, routinely make sweeping assumptions about what users want, often based upon their own behaviour or upon unrealistic expectations of the 'typical user', were such an individual to exist. Only rarely do we engage actively in finding out what users really want, and this is one area in which Government and its focus groups have made valuable progress. The rest of the community has much to learn from these focus groups, and there may be potential for harnessing them to the benefit of a broader set of service builders than simply those within central government.

The SuperHighway Code

The e–Government Interoperability Framework (e–GIF) [7] serves as a blueprint for services provided by Government, and for those parts of external services which wish to interact with Government. At its heart, the e–GIF mandates a set of commonly deployed Web and industry standards, and selects XML as the syntax of choice for exchanging data. In many ways, this document is a first step towards a sort of 'Highway Code' for the public sector. The Highway Code for road users in the UK doesn't specify what colour your car should be, how many wheels it should have, or how big its engine should be, but it does make sure we all drive on the same side of the road, give way to vehicles coming from the right, and generally interact with other road users relatively painlessly. Analogously, the e–GIF doesn't specify software, hardware, or day–to–day working practice, but does ensure the viability of efficient information transfer.
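The value of agreeing a common exchange syntax, whatever software sits at either end, can be illustrated with Python's standard xml.etree.ElementTree module. The element names (record, title, category), the sample values and the two helper functions are assumptions made for this sketch; they are not taken from the e–GIF or from any schema it mandates.

```python
import xml.etree.ElementTree as ET


def to_xml(record):
    """Serialise a flat dictionary as a simple XML document (sender side)."""
    root = ET.Element("record")
    for key, value in record.items():
        child = ET.SubElement(root, key)
        child.text = value
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Parse the same XML back into a dictionary (receiver side)."""
    root = ET.fromstring(text)
    return {child.tag: child.text for child in root}


# A hypothetical record exchanged between two otherwise unrelated systems.
record = {"title": "Council tax bands", "category": "Housing"}

message = to_xml(record)
print(message)
print(from_xml(message) == record)
```

Neither function needs to know anything about the hardware, software or working practices of the other party; the shared syntax is the only point of agreement.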

There would appear to be scope for taking a document such as the e–GIF as a model, and seeking to develop it into some form of SuperHighway Code for 'responsible' providers of Information Age services, whether in the public or private sectors. Such a document might move beyond the topics covered in the e–GIF into notions of Persistence and some of the other issues raised in this paper.

It would be interesting, indeed, to see if sufficient interest and support could be gathered to make such a notion real.

Towards a shared infrastructure?

The DNER Architecture [6] introduces the useful notion of Shared Services. These services are aspects of Middleware that are provided by some third party for the benefit of the community as a whole, rather than embedded within each content service in turn. Such Shared Services might be provided once (ATHENS [4] supports users across Higher Education) or might be federated in some fashion, with individual institutions conceivably taking responsibility for certifying their own members to those services they wish to use, for example.
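The federated variant, in which each institution vouches for its own members and a shared service simply asks the right institution, might be sketched as follows. Every name here (InstitutionDirectory, SharedAuthService, the institutions and the user identifiers) is hypothetical, and the sketch makes no claim about how ATHENS or any service within the DNER actually works.

```python
class InstitutionDirectory:
    """Held by one institution, which certifies its own members."""

    def __init__(self, name, members):
        self.name = name
        self._members = set(members)

    def vouches_for(self, user_id):
        return user_id in self._members


class SharedAuthService:
    """A shared service that content providers consult, instead of each
    provider embedding its own user database."""

    def __init__(self):
        self._institutions = {}

    def join(self, directory):
        """Add an institution's directory to the federation."""
        self._institutions[directory.name] = directory

    def is_certified(self, institution, user_id):
        """Ask the user's home institution whether it vouches for them."""
        directory = self._institutions.get(institution)
        return directory is not None and directory.vouches_for(user_id)


auth = SharedAuthService()
auth.join(InstitutionDirectory("University A", ["alice"]))
auth.join(InstitutionDirectory("College B", ["bob"]))

print(auth.is_certified("University A", "alice"))   # vouched for by home institution
print(auth.is_certified("College B", "alice"))      # not a member there
```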

Within the DNER Architecture, ATHENS is the only existing example of such a service, but clear uses are identified for shared Collection Description Services, Authentication Services, Authorisation Services, Service Description Services, and even others such as Personalisation and Preference Services.

This is an area of ongoing development for the DNER, but is equally of value to other communities. It would be valuable, at an early stage, to broaden at least the intellectual debate to include other viewpoints (as is already happening, to the extent possible with current funding), and ideally to begin exploration of the potential for shared services across domains, or at least for the establishment of similarly structured services within domains, that might communicate at a later date if required.

Conclusions

An effective and useful Information Architecture is a complex proposition, requiring careful planning and design, and an awareness of many different issues. It seems that we have a good understanding of technical architectures, with many of the necessary building blocks essentially in place. Functional architectures, too, are increasingly well developed, especially from the perspective of the system. There is still scope, though, for more work to understand the functions that real users actually wish to perform. The areas in need of most work (and those concentrated on in the body of this paper) fall much more readily under the less well understood landscape architecture, and within the overarching information architecture itself, as components that make the technical and functional substructure genuinely useful rather than merely technically elegant.

As in the real world, information architectures need to be driven by real world requirements, rather than merely the research interests and obsessions of their designers. Many an elegant and technically sophisticated building has been loathed by its unfortunate occupants who find it impossible to inhabit, and the same is all too true of information systems.

As a plea from someone who grew up in one, please, let us not build the Internet equivalent of a New Town...!