Attempts to overcome limitations in the attribute-value representation for
machine learning have led to much interest in learning from structured
data, concentrated in the research areas of inductive logic programming (ILP)
and multi-relational data mining (MRDM).
The expressiveness and encapsulation of the object-oriented
data model have led to its widespread adoption in software and
database design. The considerable congruence between this model and
individual-centred models in inductive logic programming presents new
opportunities for domain-specific mining of object data.

The thesis investigates the use of object-orientation in knowledge
representation for multi-relational data mining. The thesis proposes a
language for expressing object model metaknowledge and uses it
to extend the reasoning mechanisms of an object-oriented logic. A
refinement operator is then defined and used for feature search in an
object-oriented propositionalisation-based ILP classifier. An algorithm
is proposed for reducing the large number of redundant features typical of
propositionalisation. A data mining system based on the refinement operator
is implemented and demonstrated on a real-world
computational linguistics task and compared with a conventional ILP
system.

Contributions

The core contribution of the thesis is the CoSinus algorithm and
its underlying CorLog logical framework. CoSinus is both an
algorithm and a data mining system for inductive logic programming under
the object model, in particular a feature search for propositionalisation
using an object-specific refinement operator.
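
The propositionalisation step underlying this feature search can be illustrated with a minimal sketch (all predicate and feature names here are hypothetical, and this is not the CoSinus implementation): each feature is a first-order query over the background knowledge, and each individual is mapped to the Boolean vector of queries it satisfies.

```python
# Illustrative sketch of propositionalisation: relational examples are
# mapped to a Boolean feature table, where each feature is a first-order
# query that either holds for an individual or does not.

# Hypothetical background knowledge: parent/2 facts over individuals.
FACTS = {
    ("parent", "ann", "bob"),
    ("parent", "bob", "cat"),
}

# Each feature is a name plus a Boolean test over one individual.
FEATURES = [
    ("has_child",  lambda x: any(f[0] == "parent" and f[1] == x for f in FACTS)),
    ("has_parent", lambda x: any(f[0] == "parent" and f[2] == x for f in FACTS)),
    ("has_grandchild", lambda x: any(
        f[0] == "parent" and f[1] == x and
        any(g[0] == "parent" and g[1] == f[2] for g in FACTS)
        for f in FACTS)),
]

def propositionalise(individuals):
    """Map each individual to its Boolean feature vector."""
    return {ind: [test(ind) for _, test in FEATURES] for ind in individuals}

table = propositionalise(["ann", "bob", "cat"])
```

The resulting attribute-value table can then be handed to any propositional learner, which is the point of the technique.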
CorLog extends an existing object-oriented logical framework,
enriching the logic programming basis of ILP with a new set of
features that guide induction and define a strong bias, leading to a
smaller hypothesis space while retaining the expressiveness of the object
logic. The object model gives rise to a much more sophisticated form of
type safety and type correctness,
involving conjunctions of simple, parametric and abstract classes
and constraints on co-existing classes, as well as mechanisms to abstract
existing structure in terms of these classes. Relations may take ordered
arguments, extending the basis of value restriction in search from classes
to constraints involving constants, such as simple integer inequalities.
The framework
gives rise to a redefinition of valid substitution under the new data model,
informing a moded, classed refinement operator with desirable
optimality properties, implemented in a three-tiered (feature, rule, theory)
search taking advantage of the natural data encapsulation of the object model.
Further aspects of the object model are exploited to derive search bounds.
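
The pruning effect of class declarations on refinement can be illustrated with a small sketch (the class hierarchy and relation signatures below are hypothetical, and this is not CorLog itself): candidate literals are generated only when the classes of the variables they mention conform to the declared argument classes of the relation, so ill-typed specialisations never enter the search.

```python
# Illustrative sketch: class declarations prune a refinement step.
# Candidate literals are generated only when their argument classes
# are consistent with the classes of the clause's variables.

# Hypothetical class hierarchy: child class -> parent class.
SUBCLASS = {"noun": "word", "verb": "word", "word": "token"}

def is_subclass(c, d):
    """True if class c equals d or is a (transitive) subclass of d."""
    while c is not None:
        if c == d:
            return True
        c = SUBCLASS.get(c)
    return False

# Hypothetical relation signatures: name -> required argument classes.
SIGNATURES = {
    "head_of": ("verb", "noun"),
    "succ":    ("token", "token"),
}

def refinements(var_classes):
    """All literals over the known variables whose argument classes
    conform to the relation's signature."""
    variables = list(var_classes)
    out = []
    for rel, (c1, c2) in SIGNATURES.items():
        for v in variables:
            for w in variables:
                if is_subclass(var_classes[v], c1) and \
                   is_subclass(var_classes[w], c2):
                    out.append((rel, v, w))
    return out

# A clause with X:verb and Y:noun admits head_of(X, Y), but the
# ill-typed head_of(Y, X) is never generated.
cands = refinements({"X": "verb", "Y": "noun"})
```

The class check replaces a purely syntactic generate-and-test step, which is the sense in which the object model constrains induction.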
While other algorithms perform inductive logic programming,
and some specifically propose object-like data mining in
class hierarchies, no system known to the author at the time of writing
covers the object-oriented data model as comprehensively,
or uses this data model to constrain induction as closely,
and it is in this respect that much of the innovation lies.
Such an approach to induction is immediately useful not only for domains
with a strong notion of individual but also where objects naturally
belong in a class taxonomy — particularly ones which are defined at
multiple levels of granularity — or which adopt complex datatypes modelled
independently of their constituents.
The thesis argues and demonstrates that for the de facto
forms of domain description in use in ILP today, such data is inadequately
modelled but may be exploited for induction.

The second contribution is the Refer algorithm, which overcomes the
often prohibitively high dimensionality of the feature sets produced by
propositionalisation.
This is of clear interest to the thesis, since the
core inductive logic programming technique of
CoSinus
is propositionalisation. Refer is a post-processing feature reduction
stage for sets of Boolean features over examples, each example
labelled with one of any number of possible class labels.
Refer is therefore
general to the large class of
learners producing such data and uses the notion
of logical coverage to efficiently
determine the logical redundancy of a feature for
building a classification rule. Similar feature reduction algorithms have
been proposed. However, the innovation here lies in the
partitioning of examples into a set of disjoint subsets whose examples are
mutually close in terms of Hamming distance. Comparisons between each of
these subsets lead to a greater reduction than comparing the example set as a
whole. Its relevance to the thesis is that it
supplements the redundancy-reducing mechanisms of CoSinus, which use
a metaknowledge-based definition of redundancy, and further filters
logically redundant hypotheses before presentation to the learner.
A number of variants are presented and analysed,
and improvements in predictive performance,
quality of induced theories and reductions in induction time are demonstrated.
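
The coverage-based notion of redundancy at the heart of this reduction can be sketched as follows (a simplification that omits the Hamming-distance partitioning, with hypothetical data; this is not the Refer implementation): a feature is dropped when every pair of differently-labelled examples it separates is already separated by a retained feature, since such a feature can contribute nothing further to a classification rule.

```python
from itertools import combinations

# Illustrative sketch of coverage-based feature redundancy: feature f
# is logically redundant given g if every pair of differently-labelled
# examples that f separates is also separated by g.

def separated_pairs(col, labels):
    """Index pairs with different labels that this feature separates."""
    n = len(col)
    return {(i, j) for i, j in combinations(range(n), 2)
            if labels[i] != labels[j] and col[i] != col[j]}

def reduce_features(columns, labels):
    """Greedily drop features whose label-relevant distinctions are
    subsumed by an already-kept feature; return kept feature indices."""
    kept, kept_pairs = [], []
    for idx, col in enumerate(columns):
        pairs = separated_pairs(col, labels)
        if any(pairs <= p for p in kept_pairs):
            continue  # redundant: adds no new distinction
        kept.append(idx)
        kept_pairs.append(pairs)
    return kept

labels  = ["pos", "pos", "neg", "neg"]
columns = [                 # one column per feature, one entry per example
    [1, 1, 0, 0],           # separates every pos/neg pair
    [1, 0, 0, 0],           # distinctions subsumed by the first feature
    [1, 1, 1, 0],           # likewise subsumed
]
kept = reduce_features(columns, labels)
```

Partitioning the examples into Hamming-close subsets, as described above, makes these pairwise comparisons both cheaper and more effective than comparing over the example set as a whole.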

The third main contribution is the application of machine learning to
a learning task from the field of computational linguistics.
A detailed linguistic corpus is
processed with a sophisticated, domain-specific pre-processor
to produce a highly structured
representation of thousands of English-language sentences, including their
parse trees (the principal compositional structure), grammatical categories
(the principal inheritance structure),
parts of speech, etc. A complex domain library is defined which
allows querying of these sentences in a highly object-oriented manner
encompassing the modelling constructs of CorLog and
CoSinus as well
as augmenting the representation with further, derived structure via additional rules.
Two inductive logic programming systems, one which uses the object model
and one which does not,
are used to solve the problem of labelling the nodes within
a sentence with their grammatical functions.
In particular, the thesis studies
the labelling of a node as the subject of a sentence,
or otherwise.
Similar problems have been
approached many times in computational linguistics,
but typically by statistical methods, rather than the symbolic,
structural approach taken by ILP.
Theories expressed
in symbolic logic are of interest to linguists, since they
directly describe rules for these important linguistic features.
More generally, models for labelling grammatical functions of words
are immediately useful for document analysis and retrieval applications.
The work tests the thesis' claims in a practical
setting by establishing a mapping between the framework of the
object model and the conventional logic programming basis and inductive bias
of existing ILP learners.
Results from applying the learners under various parameter settings and
combinations of available background knowledge are compared, showing that
by using the object-oriented data model for representation and induction,
a higher-quality set of hypotheses is searched and more comprehensible models
with better predictive accuracy are produced compared to an existing,
established ILP learner.
Furthermore, it is shown that object-oriented data mining is a viable approach
for real-world problems in
computational linguistics, and in particular the labelling of
grammatical functions.