Abstract

For several decades, researchers around the globe have been investigating practical solutions to the enduring problem of understanding the visual content of an image. One might think of this quest as an effort to emulate the human visual system. Despite these endeavours, visual tasks that are trivial for humans, such as segmenting objects in a scene, remain a significant challenge for machines. Even on the few occasions where a computer's processing power is adequate to accomplish the task, the issue of public distrust of fully autonomous solutions to critical applications remains unresolved.

The principal purpose of this thesis is to propose novel computer vision, machine learning, and pattern recognition techniques that incorporate humans' abstract knowledge into practical models through simple yet effective methodologies. High-level information provided by users in the decision-making loop of such interactive systems enhances the efficacy of vision algorithms, while the machines in turn reduce users' labour by filtering results and completing mundane tasks on their behalf.

In this thesis, we first review interactive approaches to vision tasks, before scrutinising relevant aspects of human-in-the-loop methodologies and highlighting their current shortcomings in object recognition applications. Our survey of the literature reveals that the difficulty of harnessing users' abstract knowledge is among the major complications of human-in-the-loop algorithms. We therefore propose two novel methodologies to capture and model such high-level sources of information. One builds textual descriptors that are compatible with discriminative classifiers; the other is based on the random naive Bayes algorithm and suits generative classification frameworks.

We further investigate the well-known problem of fusing images' low-level features with users' high-level information. Our next contribution is therefore a novel random-forest-based human-in-the-loop framework that efficiently fuses the visual features of images with user-provided information, yielding fast predictions and superior classification performance. In this method, users' abstract knowledge is harnessed in the form of their answers to perceptual questions about images. In contrast to generative Bayesian frameworks, this is a direct discriminative approach that fuses the information sources in the preliminary stages of the prediction process.
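As a minimal illustrative sketch, not the framework developed in the thesis, such early fusion can be as simple as concatenating an image's feature vector with a numeric encoding of the user's answers, so that a single random forest may split on either source. The yes/no/unsure encoding and the per-answer confidence score below are hypothetical choices for illustration only:

```python
def encode_answer(answer, confidence):
    """Map a yes/no/unsure answer plus the user's confidence to a
    pair of numeric features (hypothetical encoding)."""
    value = {"yes": 1.0, "no": 0.0, "unsure": 0.5}[answer]
    return [value, confidence]

def fuse(visual_features, answers):
    """Early fusion: concatenate visual features with encoded answers
    into one vector a discriminative classifier can consume directly."""
    fused = list(visual_features)
    for answer, confidence in answers:
        fused.extend(encode_answer(answer, confidence))
    return fused

visual = [0.12, 0.85, 0.33]                 # e.g. colour-histogram entries
answers = [("yes", 0.9), ("unsure", 0.4)]   # answers to perceptual questions
print(fuse(visual, answers))                # [0.12, 0.85, 0.33, 1.0, 0.9, 0.5, 0.4]
```

Because the fusion happens before any prediction is made, the forest can learn interactions between visual evidence and user answers, rather than combining two finished posteriors after the fact.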

We subsequently present generative frameworks that model each source of information individually and determine which is most effective for class-label prediction. We propose two intelligent human-in-the-loop fusion algorithms: the first is a greedy technique based on a modified naive Bayes model, while the second is built on a feedforward neural network. Through experiments on a variety of datasets, we show that these intelligent fusion methods, which select among information sources, outperform their competitors in fine-grained visual categorisation tasks.
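The naive Bayes side of this can be sketched as follows; the class names and per-source likelihoods are invented for illustration, and the greedy variant in the thesis additionally decides which sources to include, which this sketch omits. Under the naive (conditional independence) assumption, the posterior is the prior multiplied by each source's class likelihood, then renormalised:

```python
def naive_bayes_fuse(prior, source_likelihoods):
    """Fuse per-source class likelihoods under the naive assumption:
    posterior(c) ∝ prior(c) × Π_s P(observation_s | c)."""
    posterior = dict(prior)
    for likelihood in source_likelihoods:
        for c in posterior:
            posterior[c] *= likelihood[c]
    z = sum(posterior.values())          # normalising constant
    return {c: p / z for c, p in posterior.items()}

prior = {"sparrow": 0.5, "finch": 0.5}
visual = {"sparrow": 0.6, "finch": 0.4}  # hypothetical image-classifier scores
answer = {"sparrow": 0.9, "finch": 0.2}  # hypothetical likelihood of a user's "yes"
print(naive_bayes_fuse(prior, [visual, answer]))  # posterior favours "sparrow" (≈0.87)
```

A greedy selection step would then add sources one at a time, keeping only those whose inclusion improves the fused prediction.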

We additionally present methodologies that reduce unnecessary human involvement in mundane tasks by calling on users only where their abstract knowledge is truly indispensable. Our proposed algorithm is based on information theory and recent image annotation techniques. It determines the most efficient sequence of information to request from the humans in the decision-making loop, minimising their engagement in routine tasks and freeing them for more abstract functions. Our experimental results show that peak performance is reached faster than with random question-ordering baselines.
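The information-theoretic core of such question ordering can be sketched as follows, assuming (hypothetically) yes/no questions with known per-class answer likelihoods: ask first the question whose answer is expected to reduce the entropy of the class posterior the most. The classes, questions, and probabilities below are illustrative only:

```python
import math

def entropy(dist):
    """Shannon entropy (bits) of a class distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def posterior(prior, likelihood, answer_is_yes):
    """Bayes update after a yes/no answer; likelihood[c] = P(yes | c)."""
    post = {c: prior[c] * (likelihood[c] if answer_is_yes else 1 - likelihood[c])
            for c in prior}
    z = sum(post.values())
    return {c: p / z for c, p in post.items()}

def expected_information_gain(prior, likelihood):
    """Expected entropy reduction from asking one yes/no question."""
    p_yes = sum(prior[c] * likelihood[c] for c in prior)
    gain = entropy(prior)
    for answer, p_answer in ((True, p_yes), (False, 1 - p_yes)):
        if p_answer > 0:
            gain -= p_answer * entropy(posterior(prior, likelihood, answer))
    return gain

prior = {"sparrow": 0.4, "finch": 0.4, "wren": 0.2}
questions = {
    "has red breast?": {"sparrow": 0.1, "finch": 0.9, "wren": 0.1},
    "is it a bird?":   {"sparrow": 1.0, "finch": 1.0, "wren": 1.0},  # uninformative
}
best = max(questions, key=lambda q: expected_information_gain(prior, questions[q]))
print(best)  # the red-breast question wins; "is it a bird?" gains nothing
```

Ranking questions this way means the user is only consulted while their answers still carry information, which is exactly where random orderings waste human effort.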

Our final major contribution in this thesis is a novel remedy for the curse of dimensionality in pattern recognition problems, theoretically grounded in mutual information and Fano's inequality. Our approach isolates the most discriminative descriptors and can thereby enhance the accuracy of classification algorithms. Selecting a subset of relevant features is vital for designing robust human-in-the-loop vision models: our selection technique eliminates redundant and irrelevant visual and textual features, and its influence on the improvement of various human-in-the-loop algorithms proves fundamental in our experiments.
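The idea can be sketched on toy data: estimate the mutual information I(X; Y) between each discrete feature and the class labels, then keep the highest-scoring features. (Fano's inequality lower-bounds the classification error by the remaining conditional entropy, which motivates preferring high-MI features.) The data below is fabricated for illustration, and this plug-in MI estimate is a simplification of the thesis method:

```python
import math
from collections import Counter

def mutual_information(feature, labels):
    """Plug-in estimate of I(X; Y) in bits for one discrete feature
    column X and the class labels Y."""
    n = len(labels)
    px = Counter(feature)
    py = Counter(labels)
    pxy = Counter(zip(feature, labels))
    mi = 0.0
    for (x, y), count in pxy.items():
        p_xy = count / n
        # p_xy * n * n / (count_x * count_y) == p_xy / (p_x * p_y)
        mi += p_xy * math.log2(p_xy * n * n / (px[x] * py[y]))
    return mi

labels      = [0, 0, 0, 0, 1, 1, 1, 1]
informative = [0, 0, 0, 0, 1, 1, 1, 1]   # perfectly predicts the label
irrelevant  = [0, 0, 1, 1, 0, 0, 1, 1]   # independent of the label
for name, feature in [("informative", informative), ("irrelevant", irrelevant)]:
    print(name, round(mutual_information(feature, labels), 3))
# informative 1.0
# irrelevant 0.0
```

Ranking features by this score and discarding the low-MI tail removes exactly the redundant and irrelevant descriptors that inflate dimensionality without aiding prediction.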