@PHDTHESIS{Cooper_Sign_2010b,
author = {Helen Cooper},
title = {Sign Language Recognition: Generalising to More Complex Corpora},
school = {Centre for Vision, Speech and Signal Processing, University of Surrey},
year = {2010},
abstract = {The aim of this thesis is to find new approaches to Sign Language
Recognition (SLR) which are suited to working with the limited corpora
currently available. Data available for SLR is of limited quality;
low resolution and frame rates make the task of recognition even
more complex. The content is rarely natural, concentrating on isolated
signs and filmed under laboratory conditions. In addition, the amount
of accurately labelled data is minimal. To this end, several contributions
are made: Tracking the hands is eschewed in favour of detection-based
techniques, which are more robust to noise; these are investigated
for both whole signs and for linguistically-motivated sign sub-units,
to make best use of limited data
sets. Finally, an algorithm is proposed to learn signs from the inset
signers on TV, with the aid of the accompanying subtitles, thus increasing
the corpus of data available. Tracking fast-moving hands under laboratory
conditions is a complex task; with real-world data the challenge
is even greater. When using tracked data as a base for
SLR, the errors in the tracking are compounded at the classification
stage. Proposed instead, is a novel sign detection method, which
views space-time as a 3D volume and the sign within it as an object
to be located. Features are combined into strong classifiers using
a novel boosting implementation designed to create optimal classifiers
over sparse datasets. Using boosted volumetric features, on a robust
frame differenced input, average classification rates reach 71\%
on seen signers and 66\% on a mixture of seen and unseen signers,
with individual sign classification rates reaching 95\%. Using a
classifier-per-sign approach to SLR means that data sets need to contain
numerous examples of the signs to be learnt. Instead, this thesis proposes
learnt classifiers to detect the common sub-units of sign. The responses
of these classifiers can then be combined for recognition at the
sign level. This approach requires fewer examples per sign to be
learnt, since the sub-unit detectors are trained on data from multiple
signs. It is also faster at detection time since there are fewer
classifiers to consult, the number of these being limited by the
linguistics of sign and not the number of signs being detected. For
this method, appearance based boosted classifiers are introduced
to distinguish the sub-units of sign. Results show that, when combined
with temporal models, these novel sub-unit classifiers can outperform
similar classifiers learnt on tracked results. As an added benefit,
since the sub-units are linguistically derived they can be used independently
to help linguistic annotators. Since sign language data sets are
costly to collect and annotate, there are not many publicly available.
Those which are, tend to be constrained in content and often taken
under laboratory conditions. However, in the UK, the British Broadcasting
Corporation (BBC) regularly produces programs with an inset signer
and corresponding subtitles. This provides a natural signer, covering
a wide range of topics, in real world conditions. While it has no
ground truth, it is proposed that the translated subtitles can provide
weak labels for learning signs. The final contributions of this thesis
lead to an innovative approach to learn signs from these co-occurring
streams of data. Using a unique, temporally constrained version
of the Apriori mining algorithm, similar sections of video are identified
as possible sign locations. These estimates are improved upon by
introducing the concept of contextual negatives, removing contextually
similar noise. Combined with an iterative honing process to enhance
the localisation of the target sign, 23 word/sign combinations are
learnt from a 30 minute news broadcast, providing a novel method
for automatic data set creation.},
url = {http://personal.ee.surrey.ac.uk/Personal/H.Cooper/research/papers/SLR_GeneralisingtoMoreComplexCorpora.pdf}
}