Text mining is about looking for patterns in natural language text, and may be defined as the process of analyzing text to extract information from it for particular purposes. In previous work, we claimed that compression is a key technology for text mining, and backed this up with a study that showed how particular kinds of lexical tokens—names, dates, locations, etc.—can be identified and located in running text, using compression models to provide the leverage necessary to distinguish different token types (Witten et al., 1999)

2000/2

Text categorization using compression models

Eibe Frank, Chang Chui, Ian H. Witten

Text categorization, or the assignment of natural language texts to predefined categories based on their content, is of growing importance as the volume of information available on the internet continues to overwhelm us. The use of predefined categories implies a “supervised learning” approach to categorization, where already-classified articles - which effectively define the categories - are used as “training data” to build a model that can be used for classifying new articles that comprise the “test data”. This contrasts with “unsupervised” learning, where there is no training data and clusters of like documents are sought amongst the test articles. With supervised learning, meaningful labels (such as keyphrases) are attached to the training documents, and appropriate labels can be assigned automatically to test documents depending on which category they fall into.

2000/3

Reserved for Sally Jo

2000/4

Interactive machine learning—letting users build classifiers

Malcolm Ware, Eibe Frank, Geoffrey Holmes, Mark Hall, Ian H. Witten

According to standard procedure, building a classifier is a fully automated process that follows data preparation by a domain expert. In contrast, <I>interactive</I>machine learning engages users in actually generating the classifier themselves. This offers a natural way of integrating background knowledge into the modeling stage—so long as interactive tools can be designed that support efficient and effective communication. This paper shows that appropriate techniques can empower users to create models that compete with classifiers built by state-of-the-art learning algorithms. It demonstrates that users—even users who are not domain experts—can often construct good classifiers, without any help from a learning algorithm, using a simple two-dimensional visual interface. Experiments demonstrate that, not surprisingly, success hinges on the domain: if a few attributes can support good predictions, users generate accurate classifiers, whereas domains with many high-order attribute interactions favor standard machine learning techniques. The future challenge is to achieve a symbiosis between human user and machine learning algorithm.

Keyphrases provide semantic metadata that summarize and characterize documents. This paper describes Kea, an algorithm for automatically extracting keyphrases from text. Kea identifies candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a machine learning algorithm to predict which candidates are good keyphrases. The machine learning scheme first builds a prediction model using training documents with known keyphrases, and then uses the model to find keyphrases in new documents. We use a large test corpus to evaluate Kea's effectiveness in terms of how many author-assigned keyphrases are correctly identified. The system is simple, robust, and publicly available.

2000/6

μ-Charts and Z: hows, whys and wherefores

Greg Reeve, Steve Reeves

In this paper we show, by a series of examples, how the μ-chart formalism can be translated into Z. We give reasons for why this is an interesting and sensible thing to do and what it might be used for.

2000/7

One dimensional non-uniform rational B-splines for animation control

Abdelaziz Mahoui

Most 3D animation packages use graphical representations called motion graphs to represent the variation in time of the motion parameters. Many use two-dimensional B-splines as animation curves because of their power to represent free-form curves. In this project, we investigate the possibility of using One-dimensional Non-Uniform Rational B-Spline (NURBS) curves for the interactive construction of animation control curves. One-dimensional NURBS curves present the potential of solving some problems encountered in motion graphs when two-dimensional B-splines are used. The study focuses on the properties of One-dimensional NURBS mathematical model. It also investigates the algorithms and shape modification tools devised for two-dimensional curves and their port to the One-dimensional NURBS model. It also looks at the issues related to the user interface used to interactively modify the shape of the curves.

Algorithms for feature selection fall into two broad categories: <I>wrappers</I>that use the learning algorithm itself to evaluate the usefulness of features and <I>filters</I>that evaluate features according to heuristics based on general characteristics of the data. For application to large databases, filters have proven to be more practical than wrappers because they are much faster. However, most existing filter algorithms only work with discrete classification problems. This paper describes a fast, correlation-based filter algorithm that can be applied to continuous and discrete problems. The algorithm often out-performs the well-known ReliefF attribute estimator when used as a preprocessing step for naïve Bayes, instance-based learning, decision trees, locally weighted regression, and model trees. It performs more feature selection than ReliefF does-reducing the data dimensionality by fifty percent in most cases. Also, decision and model trees built from the prepocessed data are often significantly smaller.

2000/9

A development environment for predictive modelling in foods

G. Holmes, M.A. Hall

WEKA (Waikato Environment for Knowledge Analysis) is a comprehensive suite of Java class libraries that implement many state-of-the-art machine learning/data mining algorithms. Non-programmers interact with the software via a user interface component called the Knowledge Explorer.

Applications constructed from the WEKA class libraries can be run on any computer with a web browsing capability, allowing users to apply machine learning techniques to their own data regardless of computer platform. This paper describes the user interface component of the WEKA system in reference to previous applications in the predictive modeling of foods.

2000/10

Benchmarking attribute selection techniques for data mining

Mark A. Hall, Geoffrey Holmes

Data engineering is generally considered to be a central issue in the development of data mining applications. The success of many learning schemes, in their attempts to construct models of data, hinges on the reliable identification of a small set of highly predictive attributes. The inclusion of irrelevant, redundant and noisy attributes in the model building process phase can result in poor predictive performance and increased computation.

Attribute selection generally involves a combination of search and attribute utility estimation plus evaluation with respect to specific learning schemes. This leads to a large number of possible permutations and has led to a situation where very few benchmark studies have been conducted.

This paper presents a benchmark comparison of several attribute selection methods. All the methods produce an attribute ranking, a useful devise of isolating the individual merit of an attribute. Attribute selection is achieved by cross-validating the rankings with respect to a learning scheme to find the best attributes. Results are reported for a selection of standard data sets and two learning schemes C4.5 and naïve Bayes.

2000/11

Steve Reeves, Greg Reeve

2000/12

Malika Mahoui, Sally Jo Cunningham

Transaction logs are invaluable sources of fine-grained information about users' search behavior. This paper compares the searching behavior of users across two WWW-accessible digital libraries: the New Zealand Digital Library's Computer Science Technical Reports collection (CSTR), and the Karlsruhe Computer Science Bibliographies (CSBIB) collection. Since the two collections are designed to support the same type of users-researchers/students in computer science a comparative log analysis is likely to uncover common searching preferences for that user group. The two collections differ in their content, however; the CSTR indexes a full text collection, while the CSBIB is primarily a bibliographic database. Differences in searching behavior between the two systems may indicate the effect of differing search facilities and content type.

99/1

Lexical attraction for text compression

Joscha Bach, Ian H. Witten

New methods of acquiring structural information in text documents may support better compression by identifying an appropriate prediction context for each symbol. The method of “lexical attraction” infers syntactic dependency structures from statistical analysis of large corpora. We describe the generation of a lexical attraction model, discuss its application to text compression, and explore its potential to outperform fixed-context models such as word-level PPM. Perhaps the most exciting aspect of this work is the prospect of using compression as a metric for structure discovery in text.

99/2

Generating rule sets from model trees

Geoffrey Holmes, Mark Hall, Eibe Frank

Knowledge discovered in a database must be represented in a form that is easy to understand. Small, easy to interpret nuggets of knowledge from data are one requirement and the ability to induce them from a variety of data sources is a second. The literature is abound with classification algorithms, and in recent years with algorithms for time sequence analysis, but relatively little has been published on extracting meaningful information from problems involving continuous classes (regression).

Model trees-decision trees with linear models at the leaf nodes-have recently emerged as an accurate method for numeric prediction that produces understandable models. However, it is well known that decision lists-ordered sets of If-Then rules-have the potential to be more compact and therefore more understandable than their tree counterparts.

In this paper we present an algorithm for inducing simple, yet accurate rule sets from model trees. The algorithm works by repeatedly building model trees and selecting the best rule at each iteration. It produces rule sets that are, on the whole, as accurate but smaller than the model tree constructed from the entire dataset. Experimental results for various heuristics which attempt to find a compromise between rule accuracy and rule coverage are reported. We also show empirically that our method produces more accurate and smaller rule sets than the commercial state-of-the-art rule learning system Cubist.

99/3

A diagnostic tool for tree based supervised classification learning algorithms

Leonard Trigg, Geoffrey Holmes

The process of developing applications of machine learning and data mining that employ supervised classification algorithms includes the important step of knowledge verification. Interpretable output is presented to a user so that they can verify that the knowledge contained in the output makes sense for the given application. As the development of an application is an iterative process it is quite likely that a user would wish to compare models constructed at various times or stages.

One crucial stage where comparison of models is important is when the accuracy of a model is being estimated, typically using some form of cross-validation. This stage is used to establish an estimate of how well a model will perform on unseen data. This is vital information to present to a user, but it is also important to show the degree of variation between models obtained from the entire dataset and models obtained during cross-validation. In this way it can be verified that the cross-validation models are at least structurally aligned with the model garnered from the entire dataset.

This paper presents a diagnostic tool for the comparison of tree-based supervised classification models. The method is adapted from work on approximate tree matching and applied to decision trees. The tool is described together with experimental results on standard datasets.

99/4

Feature selection for discrete and numeric class machine learning

Mark A. Hall

Algorithms for feature selection fall into two broad categories: <I>wrappers</I>use the learning algorithm itself to evaluate the usefulness of features, while <I>filters</I>evaluate features according to heuristics based on general characteristics of the data. For application to large databases, filters have proven to be more practical than wrappers because they are much faster. However, most existing filter algorithms only work with discrete classification problems.

This paper describes a fast, correlation-based filter algorithm that can be applied to continuous and discrete problems. Experiments using the new method as a preprocessing step for naïve Bayes, instance-based learning, decision trees, locally weighted regression, and model trees show it to be an effective feature selector- it reduces the data in dimensionality by more than sixty percent in most cases without negatively affecting accuracy. Also, decision and model trees built from the pre-processed data are often significantly smaller.

99/5

Browsing tree structures

Mark Apperley, Robert Spence, Stephen Hodge, Michael Chester

Graphic representations of tree structures are notoriously difficult to create, display, and interpret, particularly when the volume of information they contain, and hence the number of nodes, is large. The problem of interactively browsing information held in tree structures is examined, and the implementation of an innovative tree browser described. This browser is based on distortion-oriented display techniques and intuitive direct manipulation interaction. The tree layout is automatically generated, but the location and extent of detail shown is controlled by the user. It is suggested that these techniques could be extended to the browsing of more general networks.

99/6

Facilitating multiple copy/past operations

Mark Apperley, Jay Baker, Dale Fletcher, Bill Rogers

Copy and paste, or cut and paste, using a clipboard or paste buffer has long been the principle facility provided to users for transferring data between and within GUI applications. We argue that this mechanism can be clumsy in circumstances where several pieces of information must be moved systematically. In two situations - extraction of data fields from unstructured data found in a directed search process, and reorganisation of computer program source text - we present alternative, more natural, user interface facilities to make the task less onerous, and to provide improved visual feedback during the operation.

For the data extraction task we introduce the Stretchable Selection Tool, a semi-transparent overlay augmenting the mouse pointer to automate paste operations and provide information to prompt the user. We describe a prototype implementation that functions in a collaborative software environment, allowing users to cooperate on a multiple copy/paste operation. For text reorganisation, we present an extension to Emacs, providing similar functionality, but without the collaborative features.

99/7

Automating iterative tasks with programming by demonstration: a user evaluation

Gordon W. Paynter, Ian H. Witten

Computer users often face iterative tasks that cannot be automated using the tools and aggregation techniques provided by their application program: they end up performing the iteration by hand, repeating user interface actions over and over again. We have implemented an agent, called Familiar, that can be taught to perform iterative tasks using programming by demonstration (PBD). Unlike other PBD systems, it is domain independent and works with unmodified, widely-used, applications in a popular operating system. In a formal evaluation, we found that users quickly learned to use the agent to automate iterative tasks. Generally, the participants preferred to use multiple selection where possible, but could and did use PBD in situations involving iteration over many commands, or when other techniques were unavailable.

99/8

A survey of software requirements specification practices in the New Zealand software industry

Lindsay Groves, Ray Nickson, Greg Reeve, Steve Reeves, Mark Utting

We report on the software development techniques used in the New Zealand software industry, paying particular attention to requirements gathering. We surveyed a selection of software companies with a general questionnaire and then conducted in-depth interviews with four companies. Our results show a wide variety in the kinds of companies undertaking software development, employing a wide range of software development techniques. Although our data are not sufficiently detailed to draw statistically significant conclusions, it appears that larger software development groups typically have more well-defined software development processes, spend proportionally more time on requirements gathering, and follow more rigorous testing regimes.

99/9

The LRU*WWW proxy cache document replacement algorithm

Chung-yi Chang, Tony McGregor, Geoffrey Holmes

Obtaining good performance from WWW proxy caches is critically dependent on the document replacement policy used by the proxy. This paper validates the work of other authors by reproducing their studies of proxy cache document replacement algorithms. From this basis a cross-trace study is mounted. This demonstrates that the performance of most document replacement algorithms is dependent on the type of workload that they are presented with. Finally we propose a new algorithm, LRU*, that consistently performs well across all our traces.

99/10

Reduced-error pruning with significance tests

Eibe Frank, Ian H. Witten

When building classification models, it is common practice to prune them to counter spurious effects of the training data: this often improves performance and reduces model size. "Reduced-error pruning" is a fast pruning procedure for decision trees that is known to produce small and accurate trees. Apart from the data from which the tree is grown, it uses an independent "pruning" set, and pruning decisions are based on the model's error rate on this fresh data. Recently it has been observed that reduced-error pruning overfits the pruning data, producing unnecessarily large decision trees. This paper investigates whether standard statistical significance tests can be used to counter this phenomenon.

The problem of overfitting to the pruning set highlights the need for significance testing. We investigate two classes of test, "parametric" and "non-parametric." The standard chi-squared statistic can be used both in a parametric test and as the basis for a non-parametric permutation test. In both cases it is necessary to select the significance level at which pruning is applied. We show empirically that both versions of the chi-squared test perform equally well if their significance levels are adjusted appropriately. Using a collection of standard datasets, we show that significance testing improves on standard reduced error pruning if the significance level is tailored to the particular dataset at hand using cross-validation, yielding consistently smaller trees that perform at least as well and sometimes better.

The Waikato Environment for Knowledge Analysis (Weka) is a comprehensive suite of Java class libraries that implement many state-of-the-art machine learning and data mining algorithms. Weka is freely available on the World-Wide Web and accompanies a new text on data mining [1] which documents and fully explains all the algorithms it contains. Applications written using the Weka class libraries can be run on any computer with a Web browsing capability; this allows users to apply machine learning techniques to their own data regardless of computer platform.

99/12

Pace Regression

Yong Wang, Ian H. Witten

This paper articulates a new method of linear regression, “pace regression”, that addresses many drawbacks of standard regression reported in the literature—particularly the subset selection problem. Pace regression improves on classical ordinary least squares (OLS) regression by evaluating the effect of each variable and using a clustering analysis to improve the statistical basis for estimating their contribution to the overall regression. As well as outperforming OLS, it also outperforms—in a remarkably general sense—other linear modeling techniques in the literature, including subset selection procedures, which seek a reduction in dimensionality that falls out as a natural byproduct of pace regression. The paper defines six procedures that share the fundamental idea of pace regression, all of which are theoretically justified in terms of asymptotic performance. Experiments confirm the performance improvement over other techniques.

99/13

A compression-based algorithm for Chinese word segmentation

W.J. Teahan, Yingying Wen, Rodger McNab, Ian H. Witten

The Chinese language is written without using spaces or other word delimiters. Although a text may be thought of as a corresponding sequence of words, there is considerable ambiguity in the placement of boundaries. Interpreting a text as a sequence of words is beneficial for some information retrieval and storage tasks: for example, full-text search, word-based compression, and keyphrase extraction.

We describe a scheme that infers appropriate positions for word boundaries using an adaptive language model that is standard in text compression. It is trained on a corpus of pre-segmented text, and when applied to new text, interpolates word boundaries so as to maximize the compression obtained. This simple and general method performs well with respect to specialized schemes for Chinese language segmentation.

Existing clustering methods for the semi-parametric mixture distribution perform well as the volume of data increases. However, they all suffer from a serious drawback in finite-data situations: small outlying groups of data points can be completely ignored in the clusters that are produced, no matter how far away they lie from the major clusters. This can result in unbounded loss if the loss function is sensitive to the distance between clusters.

This paper proposes a new distance-based clustering method that overcomes the problem by avoiding global constraints. Experimental results illustrate its superiority to existing methods when small clusters are present in finite data sets; they also suggest that it is more accurate and stable than other methods even when there are no small clusters.

99/15

99/16

The Niupepa Collection: Opening the blinds on a window to the past

Te Taka Keegan, Sally Jo Cunningham, Mark Apperley

This paper describes the building of a digital library collection of historic newspapers. The newspapers (Niupepa in Maori), which were published in New Zealand during the period 1842 to 1933, form a unique historical record of the Maori language, and of events from an historical perspective. Images of these newspapers have been converted to digital form, electronic text extracted from these, and the collection is now being made available over the Internet as a part of the New Zealand Digital Library (NZDL) project at the University of Waikato.

98/1

Boosting trees for cost-sensitive classifications

Kai Ming Ting, Zijian Zheng

This paper explores two boosting techniques for cost-sensitive tree classification in the situation where misclassification costs change very often. Ideally, one would like to have only one induction, and use the induced model for different misclassification costs. Thus, it demands robustness of the induced model against cost changes. Combining multiple trees gives robust predictions against this change. We demonstrate that ordinary boosting combined with the minimum expected cost criterion to select the prediction class is a good solution under this situation. We also introduce a variant of the ordinary boosting procedure which utilizes the cost information during training. We show that the proposed technique performs better than the ordinary boosting in terms of misclassification cost. However, this technique requires to induce a set of new trees every time the cost changes. Our empirical investigation also reveals some interesting behavior of boosting decision trees for cost-sensitive classification.

98/2

Generating accurate rule sets without global optimization

Eibe Frank, Ian H. Witten

The two dominant schemes for rule-learning, C4.5 and RIPPER, both operate in two stages. First they induce an initial rule set and then they refine it using a rather complex optimization stage that discards (C4.5) or adjusts (RIPPER) individual rules to make them work better together. In contrast, this paper shows how good rule sets can be learned one rule at a time, without any need for global optimization. We present an algorithm for inferring rules by repeatedly generating partial decision trees, thus combining the two major paradigms for rule generation-creating rules from decision trees and the separate-and-conquer rule-learning technique. The algorithm is straightforward and elegant: despite this, experiments on standard datasets show that it produces rule sets that are as accurate as and of similar size to those generated by C4.5, and more accurate than RIPPER's. Moreover, it operates efficiently, and because it avoids postprocessing, does not suffer the extremely slow performance on pathological example sets for which the C4.5 method has been criticized.

98/3

VQuery: a graphical user interface for Boolean query Specification and dynamic result preview

Steve Jones

Textual query languages based on Boolean logic are common amongst the search facilities of on-line information repositories. However, there is evidence to suggest that the syntactic and semantic demands of such languages lead to user errors and adversely affect the time that it takes users to form queries. Additionally, users are faced with user interfaces to these repositories which are unresponsive and uninformative, and consequently fail to support effective query refinement. We suggest that graphical query languages, particularly Venn-like diagrams, provide a natural medium for Boolean query specification which overcomes the problems of textual query languages. Also, dynamic result previews can be seamlessly integrated with graphical query specification to increase the effectiveness of query refinements. We describe VQuery, a query interface to the New Zealand Digital Library which exploits querying by Venn diagrams and integrated query result previews.

98/4

Revising <I>Z</I>: semantics and logic

Martin C. Henson, Steve Reeves

We introduce a simple specification logic <I>Z</I>c comprising a logic and semantics (in <I>ZF</I> set theory). We then provide an interpretation for (a rational reconstruction of) the specification language <I>Z</I> within <I>Z</I>c. As a result we obtain a sound logic for <I>Z</I>, including the schema calculus. A consequence of our formalisation is a critique of a number of concepts used in <I>Z</I>. We demonstrate that the complications and confusions which these concepts introduce can be avoided without compromising expressibility.

98/5

A logic for the schema calculus

Martin C. Henson, Steve Reeves

In this paper we introduce and investigate a logic for the schema calculus of <I>Z</I>. The schema calculus is arguably the reason for <I>Z</I>'s popularity but so far no true calculus (a sound system of rules for reasoning about schema expressions) has been given. Presentations thus far have either failed to provide a calculus (e.g. the draft standard [3]) or have fallen back on informal descriptions at a syntactic level (most text books e.g. [7[). Once the calculus is established we introduce a derived equational logic which enables us to formalise properly the informal notations of schema expression equality to be found in the literature.

98/6

New foundations for <I>Z</I>

Martin C. Henson, Steve Reeves

We provide a constructive and intensional interpretation for the specification language <I>Z</I> in a theory of operations and kinds <I>T</I>. The motivation is to facilitate the development of an integrated approach to program construction. We illustrate the new foundations for <I>Z</I> with examples.

98/7

Predicting apple bruising relationships using machine learning

G. Holmes, S.J. Cunningham, B.T. Dela Rue, A.F. Bollen

Many models have been used to describe the influence of internal or external factors on apple bruising. Few of these have addressed the application of derived relationships to the evaluation of commercial operations. From an industry perspective, a model must enable fruit to be rejected on the basis of a commercially significant bruise and must also accurately quantify the effects of various combinations of input features (such as cultivar, maturity, size, and so on) on bruise prediction. Input features must in turn have characteristics which are measurable commercially; for example, the measure of force should be impact energy rather than energy absorbed. Further, as the commercial criteria for acceptable damage levels change, the model should be versatile enough to regenerate new bruise thresholds from existing data.

Machine learning is a burgeoning technology with a vast range of potential applications particularly in agriculture where large amounts of data can be readily collected [1]. The main advantage of using a machine learning method in an application is that the models built for prediction can be viewed and understood by the owner of the data who is in a position to determine the usefulness of the model, an essential component in a commercial environment.

98/8

An evaluation of passage-level indexing strategies for a technical report archive

Michael Williams

Past research has shown that using evidence from document passages rather than complete documents is an effective way of improving the precision of full-text database searches. However, passage-level indexing has yet to be widely adopted for commercial or online databases.

This paper reports on experiments designed to test the efficacy of passage-level indexing with a particular collection of a full-text online database, the New Zealand Digital Library. Discourse passages and word-window passages are used for the indexing process. Both ranked and Boolean searching are used to test the resulting indexes.

Overlapping window passages are shown to offer the best retrieval performance with both ranked and Boolean queries. Modifications may be necessary to the term weighting methodology in order to ensure optimal ranked query performance.

Managing the organizational and software complexity of a comprehensive digital library presents a significant challenge. Different library collections each have their own distinctive features. Different presentation languages have structural implications such as left-to-right writing order and text-only interfaces for the visually impaired. Different media involve different file formats, and-more importantly-radically different search strategies are required for non-textual media. In a distributed library, new collections can appear asynchronously on servers in different parts of the world. And as searching interfaces mature from the command-line era exemplified by current Web search engines into the age of reactive visual interfaces, experimental new interfaces must be developed, supported, and tested. This paper describes our experience, gained from operating a substantial digital library service over several years, in solving these problems by designing an appropriate software architecture.

98/10

Experiences with a weighted decision tree learner

John G. Cleary, Leonard E. Trigg

Machine learning algorithms for inferring decision trees typically choose a single “best” tree to describe the training data. Recent research has shown that classification performance can be significantly improved by voting predictions of multiple, independently produced decision trees. This paper describes an algorithm, OB1, that makes a weighted sum over many possible models. We describe one instance of OB1, that includes <I>all</I> possible decision trees as well as naïve Bayesian models. OB1 is compared with a number of other decision tree and instance based learning alogrithms on some of the data sets from the UCI repository. Both an information gain and an accuracy measure are used for the comparison. On the information gain measure OB1 performs significantly better than all the other algorithms. On the accuracy measure it is significantly better than all the algorithms except naïve Bayes which performs comparably to OB1.

98/11

An entropy gain measure of numeric prediction performance

Leonard Trigg

Categorical classifier performance is typically evaluated with respect to error rate, expressed as a percentage of test instances that were not correctly classified. When a classifier produces multiple classifications for a test instance, the prediction is counted as incorrect (even if the correct class was one of the predictions). Although commonly used in the literature, error rate is a coarse measure of classifier performance, as it is based only on a single prediction offered for a test instance. Since many classifiers can produce a class distribution as a prediction, we should use this to provide a better measure of how much information the classifier is extracting from the domain.

Numeric classifiers are a relatively new development in machine learning, and as such there is no single performance measure that has become standard. Typically these machine learning schemes predict a single real number for each test instance, and the error between the predicted and actual value is used to calculate a myriad of performance measures such as correlation coefficient, root mean squared error, mean absolute error, relative absolute error, and root relative squared error. With so many performance measures it is difficult to establish an overall performance evaluation.

The next section describes a performance measure for machine learning schemes that attempts to overcome the problems with current measures. In addition, the same evaluation measure is used for categorical and numeric classifier.

Component-based information systems development is an area of research and practice of increasing importance. Information Systems developers have realised that traditional approaches to IS engineering produce monolithic, difficult to maintain, difficult to reuse systems. In contrast, the use of software components, which embody data, functionality and well-specified and understood interfaces, makes interoperable, distributed and highly reusable IS components feasible. Component-based approaches to IS engineering can be used at strategic and organisational levels, to model business processes and whole IS architectures, in development methods which utilise component-based models during analysis and design, and in system implementation. Reusable components can allow end users to compose and configure their own Information Systems, possibly from a range of suppliers, and to more tightly couple their organisational workflows with their IS support.

This workshop proceedings contains a range of papers addressing one or more of the above issues relating to the use of component models for IS development. All of these papers were refereed by at least two members of an international workshop committee comprising industry and academic researchers and users of component technologies. Strategic uses of components are addressed in the first three papers, while the following three address uses of components for systems design and workflow management. Systems development using components, and the provision of environments for component management are addressed in the following group of five papers. The last three papers in this proceedings address component management and analysis techniques.

All of these papers provide new insights into the many varied uses of component technology for IS engineering. I hope you find them as interesting and useful as I have when collating this proceedings and organising the workshop.

98/13

An analysis of usage of a digital library

Steve Jones, Sally Jo Cunningham, Rodger McNab

As experimental digital library testbeds gain wider acceptance and develop significant user bases, it becomes important to investigate the ways in which users interact with the systems in practice. Transaction logs are one source of usage information, and the information on user behaviour can be culled from them both automatically (through calculation of summary statistics) and manually (by examining query strings for semantic clues on search motivations and searching strategy). We conduct a transaction log analysis on user activity in the Computer Science Technical Reports Collection of the New Zealand Digital Library, and report insights gained and identify resulting search interface design issues.

98/14

Measuring ATM traffic: final report for New Zealand Telecom

John Cleary, Ian Graham, Murray Pearson, Tony McGregor

The report describes the development of a low-cost ATM monitoring system, hosted by a standard PC. The monitor can be used remotely returning information on ATM traffic flows to a central site. The monitor is interfaces to a GPS timing receiver, which provides an absolute time accuracy of better than 1 usec. By monitoring the same traffic flow at different points in a network it is possible to measure cell delay and delay variation in real time, and with existing traffic. The monitoring system characterises cells by a CRC calculated over the cell payload, thus special measurement cells are not required. Delays in both local area and wide-area networks have been measured using this system. It is possible to measure delay in a network that is not end-to-end ATM, as long as some cells remain identical at the entry and exit points. Examples are given of traffic and delay measurements in both wide and local area network systems, including delays measured over the Internet from Canada to New Zealand.

98/15

Despite its simplicity, the naïve Bayes learning scheme performs well on most classification tasks, and is often significantly more accurate than more sophisticated methods. Although the probability estimates that it produces can be inaccurate, it often assigns maximum probability to the correct class. This suggests that its good performance might be restricted to situations where the output is categorical. It is therefore interesting to see how it performs in domains where the predicted value is numeric, because in this case, predictions are more sensitive to inaccurate probability estimates.<P>

This paper shows how to apply the naïve Bayes methodology to numeric prediction (i.e. regression) tasks, and compares it to linear regression, instance-based learning, and a method that produces “model trees”-decision trees with linear regression functions at the leaves. Although we exhibit an artificial dataset for which naïve Bayes is the method of choice, on real-world datasets it is almost uniformly worse than model trees. The comparison with linear regression depends on the error measure: for one measure naïve Bayes performs similarly, for another it is worse. Compared to instance-based learning, it performs similarly with respect to both measures. These results indicate that the simplistic statistical assumption that naïve Bayes makes is indeed more restrictive for regression than for classification.

98/16

Link as you type: using key phrases for automated dynamic link generation

Steve Jones

When documents are collected together from diverse sources they are unlikely to contain useful hypertext links to support browsing amongst them. For large collections of thousands of documents it is prohibitively resource intensive to manually insert links into each document. Users of such collections may wish to relate documents within them to text that they are themselves generating. This process, often involving keyword searching, distracts from the authoring process and results in material related to query terms but not necessarily to the author's document. Query terms that are effective in one collection might not be so in another. We have developed Phrasier, a system that integrates authoring (of text and hyperlinks), browsing, querying and reading in support of information retrieval activities. Phrasier exploits key phrases which are automatically extracted from documents in a collection, and uses them as link anchors and to identify candidate destinations for hyperlinks. This system suggests links into existing collections for purposes of authoring and retrieval of related information, creates links between documents in a collection and provides supportive document and link overviews.

98/17

Melody based tune retrieval over the World Wide Web

David Bainbridge, Rodger J. McNab, Lloyd A. Smith

In this paper we describe the steps taken to develop a Web-based version of an existing stand-alone, single-user digital library application for melodical searching of a collection of music. For the three key components: input, searching, and output, we assess the suitability of various Web-based strategies that deal with the now distributed software architecture and explain the decisions we made. The resulting melody indexing service, known as MELDEX, has been in operation for one year, and the feed-back we have received has been favorable.

98/18

Making oral history accessible over the World Wide Web

David Bainbridge, Sally Jo Cunningham

We describe a multimedia, WWW-based oral history collection constructed from off-the-shelf or publicly available software. The source materials for the collection include audio tapes of interviews and summary transcripts of each interview, as well as photographs illustrating episodes mentioned in the tapes. Sections of the transcripts are manually matched to associated segments of the tapes, and the tapes are digitized. Users search a full-text retrieval system based on the text transcripts to retrieve relevant transcript sections and their associated audio recordings and photographs. It is also possible to search for photos by matching text queries against text descriptions of the photos in the collection, where the located photos link back to their respective interview transcript and audio recordings.

1997

97/1

A dynamic and flexible representation of social relationships in CSCW

Steve Jones, Steve Marsh

CSCW system designers lack effective support in addressing the social issues and interpersonal relationships which are linked with the use of CSCW systems. We present a formal description of trust to support CSCW system designers in considering the social aspects of group work, embedding those considerations in systems and analysing computer supported group processes.

We argue that trust is a critical aspect in group work, and describe what we consider to be the building blocks of trust. We then present a formal notation for the building blocks, their use in reasoning about social interactions and how they are amended over time.

We then consider how the formalism may be used in practice, and present some insights from initial analysis of the behaviour of the formalism. This is followed by a description of possible amendments and extensions to the formalism. We conclude that it is possible to formalise a notion of trust and to model the formalisation by a computational mechanism.

97/2

Design issues for World Wide Web navigation visualisation tools

Andy Cockburn, Steve Jones

The World Wide Web (WWW) is a successful hypermedia information space used by millions of people, yet it suffers from many deficiencies and problems in support for navigation around its vast information space. In this paper we identify the origins of these navigation problems, namely WWW browser design, WWW page design, and WWW page description languages. Regardless of their origins, these problems are eventually represented to the user at the browser's user interface. To help overcome these problems, many tools are being developed which allow users to visualise WWW subspaces. We identify five key issues in the design and functionality of these visualisation systems: characteristics of the visual representation, the scope of the subspace representation, the mechanisms for generating the visualisation, the degree of browser independence, and the navigation support facilities. We provide a critical review of the diverse range of WWW visualisation tools with respect to these issues.

97/3

Stacked generalization: when does it work?

Kai Ming Ting, Ian H. Witten

Stacked generalization is a general method of using a high-level model to combine lower-level models to achieve greater predictive accuracy. In this paper we address two crucial issues which have been considered to be a 'black art' in classification tasks ever since the introduction of stacked generalization in 1992 by Wolpert: the type of generalizer that is suitable to derive the higher-level model, and the kind of attributes that should be used as its input.

We demonstrate the effectiveness of stacked generalization for combining three different types of learning algorithms, and also for combining models of the same type derived from a single learning algorithm in a multiple-data-batches scenario. We also compare the performance of stacked generalization with published results arcing and bagging.

97/4

Browsing in digital libraries: a phrase-based approach

Craig Nevill-Manning, Ian H. Witten, Gordon W. Paynter

A key question for digital libraries is this: how should one go about becoming familiar with a digital collection, as opposed to a physical one? Digital collections generally present an appearance which is extremely opaque-a screen, typically a Web page, with no indication of what, or how much, lies beyond: whether a carefully-selected collection or a morass of worthless ephemera; whether half a dozen documents or many millions. At least physical collections occupy physical space, present a physical appearance, and exhibit tangible physical organization. When standing on the threshold of a large library one gains a sense of presence and permanence that reflects the care taken in building and maintaining the collection inside. No-one could confuse it with a dung-heap! Yet in the digital world the difference is not so palpable.

97/5

A graphical notation for the design of information visualisations

Matthew C. Humphrey

Visualisations are coherent, graphical expressions of complex information that enhance people's ability to communicate and reason about that information. Yet despite the importance of visualisations in helping people to understand and solve a wide variety of problems, there is a dearth of formal tools and methods for discussing, describing and designing them. Although simple visualisations, such as bar charts and scatterplots, are easily produced by modern interactive software, novel visualisations of multivariate, multirelational data must be expressed in a programming language. The Relational Visualisation Notation is a new, graphical language for designing such highly expressive visualisations that does not use programming constructs. Instead, the notation is based on relational algebra, which is widely used in database query languages, and it is supported by a suite of direct manipulation tools. This article presents the notation and examines the designs of some interesting visualisations.

97/6

Applications of machine learning in information retrieval

Sally Jo Cunningham, James Littin, Ian H. Witten

Information retrieval systems provide access to collections of thousands, or millions, of documents, from which, by providing an appropriate description, users can recover any one. Typically, users iteratively refine the descriptions they provide to satisfy their needs, and retrieval systems can utilize user feedback on selected documents to indicate the accuracy of the description at any stage. The style of description required from the user, and the way it is employed to search the document database, are consequences of the indexing method used for the collection. The index may take different forms, from storing keywords with links to individual documents, to clustering documents under related topics.

97/7

Computer concepts without computers: a first course in computer science

Geoffrey Holmes, Tony C. Smith, William J. Rogers

While some institutions seek to make CS1 curricula more enjoyable by incorporating specialised educational software [1] or by setting more enjoyable programming assignments [2], we have joined the growing number of Computer Science departments that seek to improve the quality of the CS1 experience by focusing student attention away from the computer monitor [3,4]. Sophisticated computing concepts usually reserved for senior level courses are presented in a <I>popular science</I> manner, and given equal time alongside the essential introductory programming material. By exposing students to a broad range of specific computational problems we endeavour to make the introductory course more interesting and enjoyable, and instil in students a sense of vision for areas they might specialise in as computing majors.

97/8

A sight-singing tutor

Lloyd A. Smith, Rodger J. McNab

This paper describes a computer program designed to aid its users in learning to sight-sing. Sight-singing-the ability to sing music from a score without prior study-is an important skill for musicians and holds a central place in most university music curricula. Its importance to vocalists is obvious; it is also an important skill for instrumentalists and conductors because it develops the aural imagination necessary to judge how the music should sound, when played (Benward and Carr 1991). Furthermore, it is an important skill for amateur musicians, who can save a great deal of rehearsal time through an ability to sing music at sight.

97/9

Stacking bagged and dagged models

Kai Ming Ting, I.H. Witten

In this paper, we investigate the method of stacked generalization in combining models derived from different subsets of a training dataset by a single learning algorithm, as well as different algorithms. The simplest way to combine predictions from competing models is majority vote, and the effect of the sampling regime used to generate training subsets has already been studied in this context-when bootstrap samples are used the method is called bagging, and for disjoint samples we call it dagging. This paper extends these studies to stacked generalization, where a learning algorithm is employed to combine the models. This yields new methods dubbed bag-stacking and dag-stacking.

We demonstrate that bag-stacking and dag-stacking can be effective for classification tasks even when the training samples cover just a small fraction of the full dataset. In contrast to earlier bagging results, we show that bagging and bag-stacking work for stable as well as unstable learning algorithms, as do dagging and dag-stacking. We find that bag-stacking (dag-stacking) almost always has higher predictive accuracy than bagging (dagging), and we also show that bag-stacking models derived using two different algorithms is more effective than bagging.

97/10

Extracting text from Postscript

Craig Nevill-Manning, Todd Reed, Ian H. Witten

We show how to extract plain text from PostScript files. A textual scan is inadequate because PostScript interpreters can generate characters on the page that do not appear in the source file. Furthermore, word and line breaks are implicit in the graphical rendition, and must be inferred from the positioning of word fragments. We present a robust technique for extracting text and recognizing words and paragraphs. The method uses a standard PostScript interpreter but redefines several PostScript operators, and simple heuristics are employed to locate word and line breaks. The scheme has been used to create a full-text index, and plain-text versions, of 40,000 technical reports (34 Gbyte of PostScript). Other text-extraction systems are reviewed: none offer the same combination of robustness and simplicity.

97/11

Gathering and indexing rich fragments of the World Wide Web

Geoffrey Holmes, William J Rogers

While the World Wide Web (WWW) is an attractive option as a resource for teaching and research it does have some undesirable features. The cost of allowing students unlimited access can be high-both in money and time; students may become addicted to 'surfing' the web-exploring purely for entertainment-and jeopardise their studies. Students are likely to discover undesirable material because large scale search engines index sites regardless of their merit. Finally, the explosive growth of WWW usage means that servers and networks are often overloaded, to the extent that a student may gain a very negative view of the technology.

We have developed a piece of software which attempts to address these issues by capturing rich fragments of the WWW onto local storage media. It is possible to put a collection onto CD ROM, providing portability and inexpensive storage. This enables the presentation of the WWW to distance learning students, who do not have internet access. The software interfaces to standard, commonly available web browsers, acting as a proxy server to the files stored on the local media, and provides a search engine giving full text searching capability within the collection.

97/12

Using model trees for classification

Eibe Frank, Yong Wang, Stuart Inglis, Geoffrey Holmes, Ian H. Witten

Model trees, which are a type of decision tree with linear regression functions at the leaves, form the basis of a recent successful technique for predicting continuous numeric values. They can be applied to classification problems by employing a standard method of transforming a classification problem into a problem of function approximation. Surprisingly, using this simple transformation the model tree inducer M5', based on Quinlan's M5, generates more accurate classifiers than the state-of-the-art decision tree learner C5.0, particularly when most of the attributes are numeric.

97/13

Discovering inter-attribute relationships

Geoffrey Holmes

It is important to discover relationships between attributes being used to predict a class attribute in supervised learning situations for two reasons. First, any such relationship will be potentially interesting to the provider of a dataset in its own right. Second, it would simplify a learning algorithm's search space, and the related irrelevant feature and subset selection problem, if the relationships were removed from datasets ahead of learning. An algorithm to discover such relationships is presented in this paper. The algorithm is described and a surprising number of inter-attribute relationships are discovered in datasets from the University of California at Irvine (UCI) repository.

97/14

Learning from batched data: model combination vs data combination

Kai Ming Ting, Boon Toh Low, Ian H. Witten

When presented with multiple batches of data, one can either combine them into a single batch before applying a machine learning procedure or learn from each batch independently and combine the resulting models. The former procedure, data combination, is straightforward; this paper investigates the latter, model combination. Given an appropriate combination method, one might expect model combination to prove superior when the data in each batch was obtained under somewhat different conditions or when different learning algorithms were used on the batches. Empirical results show that model combination often outperforms data combination even when the batches are drawn randomly from a single source of data and the same learning method is used on each. Moreover, this is not just an artifact of one particular method of combining models: it occurs with several different combination methods.

We relate this phenomenon to the learning curve of the classifiers being used. Early in the learning process when the learning curve is steep there is much to gain from data combination, but later when it becomes shallow there is less to gain and model combination achieves a greater reduction in variance and hence a lower error rate.

The practical implication of these results is that one should consider using model combination rather than data combination, especially when multiple batches of data for the same task are readily available. It is often superior even when the batches are drawn randomly from a single sample, and we expect its advantage to increase if genuine statistical differences between the batches exist.

In the interest of digital libraries, it is advisable that designers be aware of the potential behaviour of the users of such a system. There are two distinct parts under investigation, the interaction between traditional libraries involving the seeking and retrieval of relevant material, and the reading and storage behaviours ensuing. Through this analysis, the findings could be incorporated into digital library facilities. There has been copious amounts of research on information seeking leading to the development of behavioural models to describe the process. Often research on the information seeking practices of individuals is based on the task and field of study. The information seeking model, presented by Ellis et al. (1993), characterises the format of this study where it is used to compare various research on the information seeking practices of groups of people (from academics to professionals). It is found that, although researchers do make use of library facilities, they tend to rely heavily on their own collections and primarily use the library as a source for previously identified information, browsing and interloan. It was found that there are significant differences in user behaviour between the groups analysed. When looking at the reading and storage of material it was hard to draw conclusions, due to the lack of substantial research and information on the topic. However, through the use of reading strategies, a general idea on how readers behave can be developed. Designers of digital libraries can benefit from the guidelines presented here to better understand their audience.

97/16

Proceeding of the INTERACT97 Combined Workshop on CSCW in HCI-Worldwide

Matthias Rauterberg, Lars Oestreicher, John Grundy

This is the proceedings for the INTERACT97 combined workshop on “CSCW in HCI-worldwide”. The position papers in this proceedings are those selected from topics relating to HCI community development worldwide and to CSCW issues. Originally these were to be two separate INTERACT workshops, but were combined to ensure sufficient participation for a combined workshop to run.

The combined workshop has been split into two separate sessions to run in the morning of July 15th, Sydney, Australia. One to discuss the issues relating to the position papers focusing on general CSCW systems, the other to the development of HCI communities in a worldwide context. The CSCW session uses as a case study a proposed groupware tool for facilitating the development of an HCI database with a worldwide geographical distribution. The HCI community session focuses on developing the content for such a database, in order for it to foster the continued development of HCI communities. The afternoon session of the combined workshop involves a joint discussion of the case study groupware tool, in terms of its content and likely groupware facilities.

The position papers have been grouped into those focusing on HCI communities and hence content issues for a groupware database, and those focusing on CSCW and groupware issues, and hence likely groupware support in the proposed HCI database/collaboration tools. We hope that you find the position papers in this proceedings offer a wide range of interesting reports of HCI community development worldwide, leading CSCW system research, and that a groupware tool supporting aspects of a worldwide HCI database can draw upon the varied work reported.

97/17

Internationalising a spreadsheet for Pacific Basin languages

Robert Barbour, Alvin Yeo

As people trade and engage in commerce, an economically dominant culture tends to migrate language into other recently contacted cultures. Information technology (IT) can accelerate enculturation and promote the expansion of western hegemony in IT. Equally, IT can present a culturally appropriate interface to the user that promotes the preservation of culture and language with very little additional effort. In this paper a spreadsheet is internationalised to accept languages from the Latin-1 character set such as English, Maori and Bahasa Melayu (Malaysia's national language). A technique that allows a non-programmer to add a new language to the spreadsheet is described. The technique could also be used to internationalise other software at the point of design by following the steps we outline.

97/18

Localising a spreadsheet: an Iban example

Alvin Yeo, Robert Barbour

Presently, there is little localisation of software to smaller cultures if it is not economically viable. We believe software should also be localised to the languages of small cultures in order to sustain and preserve these small cultures. As an example, we localised a spreadsheet from English to Iban. The process in which we carried out the localisation can be used as a framework for the localisation of software to languages of small ethnic minorities. Some problems faced during the localisation process are also discussed.

97/19

Strategies of internationalisation and localisation: a postmodernist/s perspective

Alvin Yeo, Robert Barbour

Many software companies today are developing software not only for local consumption but for the rest of the world. We introduce the concepts of internationalisation and localisation and discuss some techniques using these processes. An examination of postmodern critique with respect to the software industry is also reported. In addition, we also feature our proposed internationalisation technique that was inspired by taking into account the researches of postmodern philosophers and mathematicians. As illustrated in our prototype, the technique empowers non-programmers to localise their own software. Further development of the technique and its implications on user interfaces and the future of software internationalisation and localisation are discussed.

97/20

Language use in software

Alvin Yeo, Robert Barbour

Many of the popular software we use today are in English. Very few software applications are available in minority languages. Besides economic goals, we justify why software should be made available to smaller cultures. Furthermore, there is evidence that people learn and progress faster in software in their mother tongue (Griffiths et at, 1994) (Krock, 1996). We hypothesise that experienced users of English spreadsheet can easily migrate to a spreadsheet in their native tongue i.e. Bahasa Melayu (Malaysia's national language). Observations made in the study suggest that the native speakers of Bahasa Melayu had difficulties with the Bahasa Melayu interface. The subjects' main difficulty was their unfamiliarity with computing terminology in Bahasa Melayu. We present possible strategies to increase the use of Bahasa Melayu in IT. These strategies may also be used to promote the use of other minority languages in IT.

97/21

Usability testing: a Malaysian study

Alvin Yeo, Robert Barbour, Mark Apperley

An exploratory study of software assessment techniques is conducted in Malaysia. Subjects in the study comprised staff members of a Malaysian university with a high Information Technology (IT) presence. The subjects assessed a spreadsheet tool with a Bahasa Melayu (Malaysia's national language) interface. Software evaluation techniques used include the think aloud method, interviews and the System Usability Scale. The responses in the various techniques used are reported and initial results indicate idiosyncratic behaviour of Malaysian subjects. The implications of the findings are also discussed.

97/22

Inducing cost-sensitive trees via instance-weighting

Kai Ming Ting

We introduce an instance-weighting method to induce cost-sensitive trees in this paper. It is a generalization of the standard tree induction process where only the initial instance weights determine the type of tree (i.e., minimum error trees or minimum cost trees) to be induced. We demonstrate that it can be easily adopted to an existing tree learning algorithm.

Previous research gave insufficient evidence to support the fact that the greedy divide-and-conquer algorithm can effectively induce a truly cost-sensitive tree directly from the training data. We provide this empirical evidence in this paper. The algorithm employing the instance-weighting method is found to be comparable to or better than both C4.5 and C5 in terms of total misclassification costs, tree size and the number of high cost errors. The instance-weighting method is also simpler and more effective in implementation than a method based on altered priors.

97/23

Fast convergence with a greedy tag-phrase dictionary

Ross Peeters, Tony C. Smith

The best general-purpose compression schemes make their gains by estimating a probability distribution over all possible next symbols given the context established by some number of previous symbols. Such context models typically obtain good compression results for plain text by taking advantage of regularities in character sequences. Frequent words and syllables can be incorporated into the model quickly and thereafter used for reasonably accurate prediction. However, the precise context in which frequent patterns emerge is often extremely varied, and each new word or phrase immediately introduces new contexts which can adversely affect the compression rate

A great deal of the structural regularity in a natural language is given rather more by properties of its grammar than by the orthographic transcription of its phonology. This implies that access to a grammatical abstraction might lead to good compression. While grammatical models have been used successfully for compressing computer programs [4], grammar-based compression of plain text has received little attention, primarily because of the difficulties associated with constructing a suitable natural language grammar. But even without a precise formulation of the syntax of a language, there is a linguistic abstraction which is easily accessed and which demonstrates a high degree of regularity which can be exploited for compression purposes-namely, lexical categories.

97/24

Tag based models of English text

W. J. Teahan, John G. Cleary

The problem of compressing English text is important both because of the ubiquity of English as a target for compression and because of the light that compression can shed on the structure of English. English text is examined in conjunction with additional information about the parts of speech of each word in the text (these are referred to as “tags”). It is shown that the tags plus the text can be compressed more than the text alone. Essentially the tags can be compressed for nothing or even a small net saving in size. A comparison is made of a number of different ways of integrating compression of tags and text using an escape mechanism similar to PPM. These are also compared with standard word based and character based compression programs. The result is that the tag character and word based schemes always outperform the character based schemes. Overall, the tag based schemes outperform the word based schemes. We conclude by conjecturing that tags chosen for compression rather than linguistic purposes would perform even better.

97/25

Musical image compression

David Bainbridge, Stuart Inglis

Optical music recognition aims to convert the vast repositories of sheet music in the world into an on-line digital format [Bai97]. In the near future it will be possible to assimilate music into digital libraries and users will be able to perform searches based on a sung melody in addition to typical text-based searching [MSW+96]. An important requirement for such a system is the ability to reproduce the original score as accurately as possible. Due to the huge amount of sheet music available, the efficient storage of musical images is an important topic of study.

This paper investigates whether the “knowledge” extracted from the optical music recognition (OMR) process can be exploited to gain higher compression than the JBIG international standard for bi-level image compression. We present a hybrid approach where the primitive shapes of music extracted by the optical music recognition process-note heads, note stems, staff lines and so forth-are fed into a graphical symbol based compression scheme originally designed for images containing mainly printed text. Using this hybrid approach the average compression rate for a single page is improved by 3.5% over JBIG. When multiple pages with similar typography are processed in sequence, the file size is decreased by 4-8%.

Section 2 presents the relevant background to both optical music recognition and textual image compression. Section 3 describes the experiments performed on 66 test images, outlining the combinations of parameters that were examined to give the best results. The initial results and refinements are presented in Section 4, and we conclude in the last section by summarizing the findings of this work.

97/26

Correcting English text using PPM models

W. J. Teahan, S. Inglis, J. G. Cleary, G. Holmes

An essential component of many applications in natural language processing is a language modeler able to correct errors in the text being processed. For optical character recognition (OCR), poor scanning quality or extraneous pixels in the image may cause one or more characters to be mis-recognized; while for spelling correction, two characters may be transposed, or a character may be inadvertently inserted or missed out.

This paper describes a method for correcting English text using a PPM model. A method that segments words in English text is introduced and is shown to be a significant improvement over previously used methods. A similar technique is also applied as a post-processing stage after pages have been recognized by a state-of-the-art commercial OCR system. We show that the accuracy of the OCR system can be increased from 95.9% to 96.6%, a decrease of about 10 errors per page.

The problem of extracting Instruction Level Parallelism at levels of 10 instructions per clock and higher is considered. Two different architectures which use speculation on memory accesses to achieve this level of performance are reviewed. It is pointed out that while this form of speculation gives high potential parallelism it is necessary to retain execution state so that incorrect speculation can be detected and subsequently squashed. Simulation results show that the space to store such state is a critical resource in obtaining good speedup. To make good use of the space it is essential that state be stored efficiently and that it be retired as soon as possible. A number of techniques for extracting the best usage from the available state storage are introduced.

97/28

Effects of re-ordered memory operations on parallelism

Richard H. Littin, John G. Cleary

The performance effect of permitting different memory operations to be re-ordered is examined. The available parallelism is computed using a machine code simulator. A range of possible restrictions on the re-ordering of memory operations is considered: from the purely sequential case where no re-ordering is permitted; to the completely permissive one where memory operations may occur in any order so that the parallelism is restricted only by data dependencies. A general conclusion is drawn that to reliably obtain parallelism beyond 10 instructions per clock will require an ability to re-order all memory instructions. A brief description of a feasible architecture capable of this is given.

The idea for a specific industry session at OZCHI was first mooted at the 1995 conference in Wollongong, during questions following a session of short papers which happened (serendipitously) to be presented by people from industry. An animated discussion took place, most of which was about how OZCHI could be made more relevant to people in industry, be it working as usability consultants, or working within organisations either as usability professionals or as `champions of the cause'. The discussion raised more questions than answers, about the format of such as session, about the challenges of attracting industry participation, and about the best way of publishing the results. Although no real solutions were arrived at, it was enough to place an industry session on the agenda for OZCHI'96.

97/30

Adaptive models of English text

W. J. Teahan, John G. Cleary

High quality models of English text with performance approaching that of humans is important for many applications including spelling correction, speech recognition, OCR, and encryption. A number of different statistical models of English are compared with each other and with previous estimates from human subjects. It is concluded that the best current models are word based with part of speech tags. Given sufficient training text, they are able to attain performance comparable to humans.

97/31

A graphical user interface for Boolean query specification

Steve Jones, Shona McInnes

On-line information repositories commonly provide keyword search facilities via textual query languages based on Boolean logic. However, there is evidence to suggest that the syntactical demands of such languages can lead to user errors and adversely affect the time that it takes users to form queries. Users also face difficulties because of the conflict in semantics between AND and OR when used in Boolean logic and English language. We suggest that graphical query languages, in particular Venn-like diagrams, can alleviate the problems that users experience when forming Boolean expressions with textual languages. We describe Vquery, a Venn-diagram based user interface to the New Zealand Digital Library (NZDL). The design of Vquery has been partly motivated by analysis of NZDL usage. We found that few queries contain more than three terms, use of the intersection operator dominates and that query refinement is common. A study of the utility of Venn diagrams for query specification indicates that with little or no training users can interpret and form Venn-like diagrams which accurately correspond to Boolean expressions. The utility of Vquery is considered and directions for future work are proposed.