Abstract:

A method and system are provided for spoken document retrieval using
multiple search transcription indices. The method includes receiving a
query input formed of one or more query terms and determining a type of a
query term, wherein a type includes a term in a speech recognition
vocabulary or a term not in a speech recognition vocabulary. One or more
indices of search transcriptions are selected for searching the query
term based on the type of the query term. The one or more indices are
generated using different speech transcription methods. The results for
the query term are scored by the one or more indices and the results of
the one or more indices for the query term are merged. The results of the
one or more query terms are then merged to provide the results for the
query.

Claims:

1. A method of spoken document retrieval using multiple search
transcription indices, the method comprising: receiving a query input
formed of one or more query terms; for each query term determining a type
of the query term, wherein a type includes a term in a speech recognition
vocabulary or a term not in a speech recognition vocabulary; selecting one
or more indices of search transcriptions for searching the query term
based on the type of the query term; scoring the results from the one or
more indices; and merging the results of the one or more indices for the
query term.

2. The method as claimed in claim 1, including merging the results of the
one or more query terms to provide results for the query.

3. The method as claimed in claim 1, wherein the one or more indices are
generated using different speech transcription methods.

4. The method as claimed in claim 1, wherein the one or more indices are
taken from the group of: word indices, sub-word word-fragment indices,
sub-word phonetic indices, sub-word syllable indices, word and sub-word
indices using timestamps, and word and sub-word indices using offsets.

5. The method as claimed in claim 1, wherein merging of the results of the
one or more indices for a query term uses a Threshold Algorithm to
simultaneously scan results of different indices to aggregate retrieved
document scores.

6. The method as claimed in claim 5, wherein the Threshold Algorithm for
merging the results of the one or more indices for a query term uses
AND/OR semantics between the indices of this single term.

7. The method as claimed in claim 5, wherein merging of the results of the
one or more indices for a query term includes a weighted sum with weights
allocated for different indices.

8. The method as claimed in claim 2, wherein merging of the results of the
one or more query terms to provide results for a query uses a Threshold
Algorithm to simultaneously scan results of different query terms to
aggregate retrieved document scores.

9. The method as claimed in claim 8, wherein the Threshold Algorithm for
merging the results of the one or more query terms for a query uses query
semantics and phrase search according to the semantics in the query
between query terms.

10. The method as claimed in claim 1, wherein scoring the results from the
one or more indices includes one of the group of: scoring results from a
word index using weighted term frequency; scoring results for a sub-word
index using computed confusion cost of sub-word in a query term; Boolean
model scoring; vector space scoring; or edit distance scoring.

11. The method as claimed in claim 1, wherein for a query term formed of a
term in the speech recognition vocabulary, one or more word indices and
one or more sub-word indices are selected.

12. The method as claimed in claim 1, wherein for a query term formed of a
term not in the speech recognition vocabulary, one or more sub-word
indices are selected.

13. The method as claimed in claim 1, wherein for a query formed of at
least one term in the speech recognition vocabulary and at least one term
not in the speech recognition vocabulary, merging the results of query
terms includes combining word and sub-word indices where the indices are
combined using timestamps or offsets of the occurrence of the word or
sub-word in a transcript.

14. A computer software product for spoken document retrieval using
multiple search transcription indices, the product comprising a
computer-readable storage medium in which a computer program comprising
computer-executable instructions is stored, which instructions, when read
and executed by a computer, perform the following steps: receiving a query
input formed of one or more query terms; for each query term determining a
type of the query term, wherein a type includes a term in a speech
recognition vocabulary or a term not in a speech recognition vocabulary;
selecting one or more indices of search transcriptions for searching the
query term based on the type of the query term; scoring the results from
the one or more indices; and merging the results of the one or more
indices for the query term.

15. A method of providing a service to a customer over a network for
spoken document retrieval, the service comprising: receiving a query input
formed of one or more query terms; for each query term determining a type
of the query term, wherein a type includes a term in a speech recognition
vocabulary or a term not in a speech recognition vocabulary; selecting one
or more indices of search transcriptions for searching the query term
based on the type of the query term; scoring the results from the one or
more indices; and merging the results of the one or more indices for the
query term.

16. A search system for spoken document retrieval using multiple search
transcription indices, the system comprising: a processor; a query input
means, wherein a query is formed of one or more query terms; means for
determining a type of the query term by reference to a speech recognition
vocabulary, wherein a type includes a term in a speech recognition
vocabulary or a term not in a speech recognition vocabulary; means for
selecting one or more indices of search transcriptions for searching the
query term based on the type of the query term; means for scoring the
results from the one or more indices; and means for merging the results of
the one or more indices for the query term.

17. The search system as claimed in claim 16, including means for merging
the results of the one or more query terms to provide results for the
query.

18. The search system as claimed in claim 16, wherein the one or more
indices are taken from the group of: word indices, sub-word word-fragment
indices, sub-word phonetic indices, sub-word syllable indices, word and
sub-word indices using timestamps, and word and sub-word indices using
offsets.

19. The search system as claimed in claim 16, wherein the means for
merging of the results of the one or more indices for a query term uses a
Threshold Algorithm to simultaneously scan results of different indices
to aggregate retrieved document scores.

20. The search system as claimed in claim 17, wherein the means for
merging of the results of the one or more query terms to provide results
for a query uses a Threshold Algorithm to simultaneously scan results of
different query terms to aggregate retrieved document scores.

Description:

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001]The present application is related to the following application with
a common assignee, U.S. patent application Ser. No. 11/781,285 (Attorney
Docket No. IL9-2007-0042US1) filed Jul. 23, 2007, titled "Method and
System for Indexing Speech Data".

FIELD OF THE INVENTION

[0002]This invention relates to the field of spoken document retrieval
using a search query. In particular, the invention relates to using
multiple speech transcription indices in spoken document retrieval.

BACKGROUND OF THE INVENTION

[0003]The rapidly increasing amount of spoken data calls for solutions to
index and search this data. The classical approach consists of converting
the speech to word transcripts using large vocabulary continuous speech
recognition (LVCSR) tools. In the past decade, most of the research
efforts on spoken data retrieval have focused on extending classical
information retrieval (IR) techniques to word transcripts.

[0004]However, a significant drawback of such approaches is that search on
queries containing out-of-vocabulary (OOV) terms will not return any
results. OOV terms are words missing in the automatic speech recognition
(ASR) system vocabulary. Those words are replaced in the output
transcript by alternatives that are probable, given the recognition
acoustic model and the language model. It has been experimentally
observed that over 10% of user queries can contain OOV terms, as queries
often relate to named entities that typically have a poor coverage in the
ASR vocabulary.

[0005]In many applications, the OOV rate may get worse over time unless
the recognizer's vocabulary is periodically updated.

[0006]An approach for solving the OOV issue consists of converting the
speech to phonetic transcripts and representing the query as a sequence
of phones. Such transcripts can be generated by expanding the word
transcripts into phones using the pronunciation dictionary of the ASR
system. This kind of transcript is acceptable for searching OOV terms that
are phonetically close to in-vocabulary (IV) terms.

[0007]Another way would be to use a sub-word based language model, where
the sub-words are phones, syllables, or word-fragments. The retrieval is
based on searching the sequence of sub-words representing the query in the
sub-word transcripts. The main drawback of this approach is the inherent
high error rate of the transcripts. As a result, such sub-word approaches
cannot be an alternative to word transcripts for searching IV query terms
that are part of the vocabulary of the ASR system.

[0008]Many techniques can be used to generate transcripts. Above are
described sub-word-based and word-based approaches that have been used
for IR on speech data; the former suffers from low accuracy and the
latter from limited vocabulary of the recognition system.

SUMMARY OF THE INVENTION

[0009]According to a first aspect of the present invention there is
provided a method of spoken document retrieval using multiple search
transcription indices, the method comprising: receiving a query input
formed of one or more query terms; for each query term determining a type
of the query term, wherein a type includes a term in a speech recognition
vocabulary or a term not in a speech recognition vocabulary; selecting
one or more indices of search transcriptions for searching the query term
based on the type of the query term; scoring the results from the one or
more indices; and merging the results of the one or more indices for the
query term.

[0010]According to a second aspect of the present invention there is
provided a computer software product for spoken document retrieval using
multiple search transcription indices, the product comprising a
computer-readable storage medium in which a computer program comprising
computer-executable instructions is stored, which instructions, when read
and executed by a computer, perform the following
steps: receiving a query input formed of one or more query terms; for
each query term determining a type of the query term, wherein a type
includes a term in a speech recognition vocabulary or a term not in a
speech recognition vocabulary; selecting one or more indices of search
transcriptions for searching the query term based on the type of the
query term; scoring the results from the one or more indices; and merging
the results of the one or more indices for the query term.

[0011]According to a third aspect of the present invention there is
provided a method of providing a service to a customer over a network for
spoken document retrieval, the service comprising: receiving a query
input formed of one or more query terms; for each query term determining
a type of the query term, wherein a type includes a term in a speech
recognition vocabulary or a term not in a speech recognition vocabulary;
selecting one or more indices of search transcriptions for searching the
query term based on the type of the query term; scoring the results from
the one or more indices; and merging the results of the one or more
indices for the query term.

[0012]According to a fourth aspect of the present invention there is
provided a search system for spoken document retrieval using multiple
search transcription indices, the system comprising: a processor; a query
input means, wherein a query is formed of one or more query terms; means
for determining a type of a query term by reference to a speech
recognition vocabulary, wherein a type includes a term in a speech
recognition vocabulary or a term not in a speech recognition vocabulary;
means for selecting one or more indices of search transcriptions for
searching the query term based on the type of the query term; means for
scoring the results from the one or more indices; and means for merging
the results of the one or more indices for the query term.

[0013]A general retrieval model is provided for vocabulary-independent
search that combines retrieval on different speech transcripts generated
according to different methods. This is different from meta-search that
sends the whole query to multiple search engines and then combines the
results. In this disclosure, for each query term it is decided to which
search engines to send it according to the type of the term. Then, the
results for each term are combined and, finally, the results of all terms
are combined.

BRIEF DESCRIPTION OF THE DRAWINGS

[0014]The subject matter regarded as the invention is particularly pointed
out and distinctly claimed in the concluding portion of the
specification. The invention, both as to organization and method of
operation, together with objects, features, and advantages thereof, may
best be understood by reference to the following detailed description
when read with the accompanying drawings in which:

[0015]FIG. 1 is a schematic diagram showing indexing of speech data;

[0016]FIG. 2 is a block diagram of a system in accordance with the present
invention;

[0017]FIG. 3 is a block diagram of a computer system in which the present
invention may be implemented;

[0018]FIG. 4 is a flow diagram of a method in accordance with an aspect of
the present invention;

[0019]FIG. 5 is a flow diagram of an overall method in accordance with the
present invention;

[0020]FIG. 6 is a further flow diagram of the overall method in accordance
with the present invention;

[0021]FIG. 7 is a block diagram of a system of providing a combined word
and sub-word index as used in an aspect of the present invention;

[0022]FIG. 8 is a flow diagram of a method of generating the combined word
and sub-word index of FIG. 7; and

[0023]FIG. 9 is a flow diagram of a method of searching the combined word
and sub-word index of FIG. 7.

[0024]It will be appreciated that for simplicity and clarity of
illustration, elements shown in the figures have not necessarily been
drawn to scale. For example, the dimensions of some of the elements may
be exaggerated relative to other elements for clarity. Further, where
considered appropriate, reference numbers may be repeated among the
figures to indicate corresponding or analogous features.

DETAILED DESCRIPTION OF THE INVENTION

[0025]In the following detailed description, numerous specific details are
set forth in order to provide a thorough understanding of the invention.
However, it will be understood by those skilled in the art that the
present invention may be practiced without these specific details. In
other instances, well-known methods, procedures, and components have not
been described in detail so as not to obscure the present invention.

[0026]A retrieval model for vocabulary-independent search is provided.
Speech transcripts are generated according to different methods and
retrieval on each transcript is carried out according to different
methods.

[0027]Referring to FIG. 1, a schematic diagram is shown of indexing of
speech data 100. A first speech file 101 is transcribed into multiple
transcripts 121-123 according to different methods 111-113. Each
transcript 121-123 is indexed in a separate index 131-133. In this way,
there are several indices 131-133 of the same first speech file 101. A
speech file 101 may be speech data recorded in any form and stored for
retrieval.

[0028]FIG. 1 shows a second speech file 102 which in turn is transcribed
into multiple transcripts 124-125 according to different methods 114-115.
In this way, there are several indices 134-135 of the second speech file
102. It should be noted that a method 114-115 of transcribing the second
speech file 102 may be the same as one of the methods of transcribing
111-113 the first file 101.

[0029]The transcripts 121-125 may be generated according to different
methods 111-115, for example, word decoding, sub-word decoding, phonetic
representation of word decoding, etc.

[0030]A query may be submitted in the following different forms or as a
combination of forms, including:

[0031]a written query in which a user inputs the text of the query;

[0032]a spoken query in which a user speaks the text of the query and a
speech recognition system transcribes the query; and/or

[0033]an image query in which the user supplies an image containing text
of the query and an optical character recognition (OCR) system is used to
find the written text of the query.

[0034]A query may take the form of a single query term in the form of a
keyword or a query phrase formed of a plurality of words to be searched
as a single entity. A query may alternatively be formed of multiple query
terms to be searched independently. The query terms have Boolean or phrase
constraints between them (for example, Term1 AND Term2 OR Term3).

[0035]A search is made for speech files which are relevant to a query. A
query submitted to a search system may include query terms which are in
vocabulary (IV) or out of vocabulary (OOV). The vocabulary of the
automatic speech recognition (ASR) system used to generate the word
transcripts is given and, therefore, IV and OOV terms can be identified
in a query.

[0036]In the described method and system, a query is divided into query
terms which can each be identified as IV or OOV terms. In the described
method and system, a query term formed of a phrase is split into
individual query terms which are either IV or OOV and recombined as a
phrase during a merging process described further below.

[0037]In the described method and system, for each query term, it is
decided to which search engines or indices to send the query term for
speech file retrieval based on the type of the query term (IV or OOV) and
the final result list is based on the combination of the results returned
by all the different search engines or indices.
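
The routing decision described above can be sketched as follows. This is a minimal illustration, not the described implementation: the index names `word` and `phone`, the function names, and the case-insensitive vocabulary lookup are assumptions made for the example. The rule that IV terms go to word and sub-word indices while OOV terms go to sub-word indices only follows the selection described for the method.

```python
from typing import Dict, List, Sequence, Set

# Hypothetical index names; the text only distinguishes word and sub-word indices.
WORD_INDICES = ["word"]
SUBWORD_INDICES = ["phone"]


def term_type(term: str, asr_vocabulary: Set[str]) -> str:
    """Return "IV" if the term is in the ASR vocabulary, otherwise "OOV".

    Lowercasing before the lookup is an assumption of this sketch.
    """
    return "IV" if term.lower() in asr_vocabulary else "OOV"


def select_indices(term: str, asr_vocabulary: Set[str]) -> List[str]:
    """IV terms go to word and sub-word indices; OOV terms to sub-word indices only."""
    if term_type(term, asr_vocabulary) == "IV":
        return WORD_INDICES + SUBWORD_INDICES
    return list(SUBWORD_INDICES)


def route_query(terms: Sequence[str], asr_vocabulary: Set[str]) -> Dict[str, List[str]]:
    """Map every query term to the indices it should be sent to."""
    return {term: select_indices(term, asr_vocabulary) for term in terms}


vocabulary = {"george"}
routing = route_query(["George", "bondesque"], vocabulary)
```

Here `routing` sends "George" to both indices and "bondesque" to the sub-word index only.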

[0038]Referring to FIG. 2, a block diagram is shown of a system for speech
file retrieval 200. The system 200 includes a search system 210 including
an input mechanism 201 for a query. A query term processor 202 is
provided for dividing the query into query terms and determining the type
of the query terms. Determining the type of the query terms may include
referencing an ASR vocabulary 203 to determine if a query term is an IV
term or an OOV term. The ASR vocabulary 203 may be provided as part of
the search system 210 or may be referenced from a remote location.

[0039]The search system 210 includes an index selector 204 for selecting
indices 211-213 for searching for a query term. A search mechanism 205
sends the query terms to the selected indices 211-213. A scoring
mechanism 206 is provided either associated with the indices 211-213 or
provided in the search system 210 for scoring the results of speech files
located in the indices 211-213 for a query term.

[0040]The search system 210 also includes a first merging mechanism 207
for merging the results of a query term provided from different indices
211-213. A second merging mechanism 208 is provided for merging the
results of all the query terms including handling query terms which form
a phrase in the query, with a results output 209 for the final query
results.

[0041]Referring to FIG. 3, an exemplary system for implementing a search
system includes a data processing system 300 suitable for storing and/or
executing program code including at least one processor 301 coupled
directly or indirectly to memory elements through a bus system 303. The
memory elements can include local memory employed during actual execution
of the program code, bulk storage, and cache memories which provide
temporary storage of at least some program code in order to reduce the
number of times code must be retrieved from bulk storage during
execution.

[0042]The memory elements may include system memory 302 in the form of
read only memory (ROM) 304 and random access memory (RAM) 305. A basic
input/output system (BIOS) 306 may be stored in ROM 304. System software
307 may be stored in RAM 305 including operating system software 308.
Software applications 310 may also be stored in RAM 305.

[0043]The system 300 may also include a primary storage means 311 such as
a magnetic hard disk drive and secondary storage means 312 such as a
magnetic disc drive and an optical disc drive. The drives and their
associated computer-readable media provide non-volatile storage of
computer-executable instructions, data structures, program modules and
other data for the system 300. Software applications may be stored on the
primary and secondary storage means 311, 312 as well as the system memory
302.

[0044]The computing system 300 may operate in a networked environment
using logical connections to one or more remote computers via a network
adapter 316.

[0045]Input/output devices 313 can be coupled to the system either
directly or through intervening I/O controllers. A user may enter
commands and information into the system 300 through input devices such
as a keyboard, pointing device, or other input devices (for example,
microphone, joy stick, game pad, satellite dish, scanner, or the like).
Output devices may include speakers, printers, etc. A display device 314
is also connected to system bus 303 via an interface, such as video
adapter 315.

[0046]The method is now described in more detail. A query is composed of
IV and OOV terms. Let the query be denoted as
$Q = (iv_1, \ldots, iv_n, oov_1, \ldots, oov_m)$, where each $iv_i$ is an
IV term and each $oov_i$ is an OOV term. In a general setup, there may be
several indices of transcripts, where each transcript has been produced
according to different methods. It needs to be decided for each term in
which of its forms to query it (i.e., word or sub-word) and to which
indices to send it.

[0047]A simple setup is first described in which there is one index based
on word transcripts for IV terms and one index based on sub-word
transcripts for OOV terms. A general case is then described.

[0048]In the simple case, the query is decomposed into query terms such
that each $iv_i$ is sent as a sub-query to a word index, and each
$oov_i$ is converted to its sub-word representation and is sent as a
sub-query to a sub-word index. Each index runs its sub-query using a form
of scoring and returns a list of speech files sorted by decreasing order
of relevance to the given sub-query. The scores may be for example in the
range [0, 1] with 1 being the most relevant result.

[0049]It is assumed that each transcript referenced by an index relates to a
unique speech file, and the next step is to merge the lists of results
from the indices into a single list of files that are most relevant to
the sub-query using some aggregation function to combine the scores from
the different lists.

[0050]This aggregation is known as the top-k problem: return the best k
documents (in this case speech files) that match the query. It should be
noted that for large collections with a low correlation between the lists
of results, the merging can be a time consuming process since it may
require scanning a large number of documents from each list until such a
set of k documents is found.

[0051]To make this process efficient, the Threshold Algorithm (also known
as TA), described in "Optimal aggregation algorithms for middleware" by
Ronald Fagin, et al., published in the Journal of Computer and System
Sciences 66 (2003) 614-656, is used. The TA algorithm
consists of scanning simultaneously the different lists row by row. For
each scanned document in the current row, it finds its score in the other
lists and then it uses the aggregate function to find its total score.
Then, it calculates the total score of the current row using the same
aggregate function. The algorithm stops when there are k documents with a
total score larger than the last row total score.
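
The row-by-row scan can be sketched as below. This is a simplified illustration under stated assumptions rather than the described system: result lists are held in memory as `(doc_id, score)` pairs, a document absent from a list is given a score of 0.0 in that list, and the function name `threshold_algorithm` is hypothetical. The stop test uses "at least the threshold", as in Fagin et al.

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple

ResultList = List[Tuple[str, float]]  # (doc_id, score), sorted by decreasing score


def threshold_algorithm(lists: Sequence[ResultList],
                        aggregate: Callable[[Sequence[float]], float],
                        k: int) -> ResultList:
    """Return the top-k documents by aggregated score, scanning lists row by row."""
    # Random access: score of a document in each list (assumed 0.0 if absent).
    lookup = [dict(lst) for lst in lists]
    totals: Dict[str, float] = {}
    depth = max((len(lst) for lst in lists), default=0)
    for row in range(depth):
        # Sorted access: visit the document found at this row of every list.
        for lst in lists:
            if row < len(lst):
                doc = lst[row][0]
                if doc not in totals:
                    totals[doc] = aggregate([tbl.get(doc, 0.0) for tbl in lookup])
        # Total score of the current row; no unseen document can exceed it.
        threshold = aggregate(
            [lst[row][1] if row < len(lst) else 0.0 for lst in lists])
        top = heapq.nlargest(k, totals.items(), key=lambda item: item[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])


list_a = [("d1", 0.9), ("d2", 0.8), ("d3", 0.1)]
list_b = [("d2", 0.7), ("d1", 0.3), ("d3", 0.2)]
best = threshold_algorithm([list_a, list_b], sum, k=1)
```

In this example the scan stops after the second row, without reading the third row of either list, because the best document already scores above the row total.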

[0052]The general setup is now described in which there may be several
word indices and several sub-word indices. Moreover, each $iv_i$ term
can be sent to one or more word indices and to one or more sub-word
indices using its sub-word representation; each $oov_i$ term can be
sent to one or more sub-word indices using its N-best sub-word
representation.

[0054]In this general case, the TA is run as follows. For each query term,
it is decided to which indices to send it and then it is sent as a
sub-query to each of the selected indices. A list of results is returned
from each index sorted by decreasing score order. A TA is applied on
those lists and a single list of documents is obtained that contains the
query term in at least one of the indices. This step is referred to as
local TA.

[0055]After a local TA is carried out for all query terms, a global TA is
applied to all the lists to get the final set of top-k documents that
best match the query.

[0056]The TA can be run with query semantics, such as OR or AND semantics.
In OR semantics, a document is returned even if it is not relevant
according to some lists. In AND semantics, a document that is not
relevant in one list is rendered not relevant regardless of its final
aggregated score. In one embodiment, the local TA may be run in OR
semantics: it is sufficient that the query term appears in the document
according to any chosen index to consider the document as relevant to
this query term. The global TA may be run in the semantics defined in the
query.
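
The difference between OR and AND semantics can be illustrated with a small merge function. This sketch merges the lists exhaustively instead of running a TA scan, purely to keep the semantics visible; the function name `merge`, the unit weights, and the sample scores are illustrative assumptions.

```python
from typing import Dict, List, Set

Scores = Dict[str, float]  # doc_id -> score in one result list


def merge(lists: List[Scores], weights: List[float], semantics: str) -> Scores:
    """Weighted-sum merge of result lists under "OR" or "AND" semantics."""
    doc_sets: List[Set[str]] = [set(lst) for lst in lists]
    if not doc_sets:
        return {}
    if semantics == "AND":
        # A document missing from any list is not relevant.
        docs = set.intersection(*doc_sets)
    else:
        # OR: a document is kept if it appears in at least one list.
        docs = set.union(*doc_sets)
    return {doc: sum(w * lst.get(doc, 0.0) for w, lst in zip(weights, lists))
            for doc in docs}


# Local merge (OR) for a term searched in two indices.
term_a = merge([{"d1": 0.5, "d2": 0.4}, {"d2": 0.2, "d3": 0.6}], [1.0, 1.0], "OR")
# Local merge (OR) for a term searched in a single index.
term_b = merge([{"d2": 0.3, "d3": 0.1}], [1.0], "OR")
# Global merge with the semantics of the query, here AND.
final = merge([term_a, term_b], [1.0, 1.0], "AND")
```

Document `d1` survives the local OR merge for the first term but is dropped by the global AND merge because the second term does not match it.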

[0057]Referring to FIG. 4, a schematic flow diagram 400 shows the
processing of a sub-query for a query term. A query term 401 is selected
and sent to multiple indices 411-412. Each index 411-412 returns a
results list 421-422 with sorted results. A local TA 430 is applied to
the results lists 421-422 to obtain a combined result list 441 for the
query term.

[0058]Referring to FIG. 5, a schematic flow diagram 500 shows the
processing of a query. A query 501 is input and divided into query terms
401-402. For each query term 401-402 the flow of FIG. 4 is carried out
sending the query term 401, 402 to multiple indices 411-412. The results
lists 421-422 of the indices 411-412 are combined by applying the local
TA 430 and returning a combined results list 441, 442.

[0059]A global TA 502 is applied to aggregate the results lists 441, 442
of the sub-queries to return a final list of results 503 for the query
501. The global TA 502 is also responsible for combining query terms
which formed a phrase in the query 501.

[0060]A worked example is provided. The query is George AND bondesque.
There are two query terms, "George" and "bondesque"; "George" is IV and
"bondesque" is OOV.

[0061]Suppose there are multiple indices:

[0062]Index 1 based on word decoding;

[0063]Index 2 based on phonetic decoding;

[0064]Index 3 based on phonetic representation of the 1-best path of the
word decoding.

[0074]Finally, the global TA (aggregate sum with AND semantics) will
return the result list: (10, 0.55) (9, 0.15)

[0075]Referring to FIG. 6, a flow diagram 600 of the described method is
shown. An input query is received 601. A first query term is selected 602
from the query. The type of the query term is determined 603, for
example, if the query term is IV or OOV. Indices are selected for the
query term and the query term is sent 604 to appropriate indices
depending on its form. For example, an IV query term may be sent to both
word and sub-word indices while an OOV query term may be sent to sub-word
indices only. Both forms of query term may be sent to a combined word and
sub-word index as described further below. A query term which is a phrase
is divided into individual query terms formed of either IV or OOV terms.

[0076]A sorted list of results is returned 605 from each index. The lists
of results are merged 606 to return 607 a single results list for the
query term.

[0077]It is determined 608 if there is a next query term. If so, the
method loops 609 to process the next query term. If there are no more
query terms, the results lists for the various query terms are merged
610. This merging may use the query semantics used in the original query
including any phrase semantics. A final result list for the query is
returned 611.

[0078]Further specific embodiment details are now provided of indexing and
retrieval.

[0079]An ASR system is used for transcribing speech data. It works in
speaker-independent mode. For best recognition results, an acoustic model
and a language model are trained in advance on data with similar
characteristics. The ASR system generates word lattices. A compact
representation of a word lattice called a word confusion network (WCN) is
used. Each edge (u, v) is labeled with a word hypothesis and its
posterior probability, i.e., the probability of the word given the
signal. One of the main advantages of WCN is that it also provides an
alignment for all of the words in the lattice. Although WCNs are more
compact than word lattices, in general the 1-best path obtained from WCN
has a better word accuracy than the 1-best path obtained from the
corresponding word lattice.

[0080]Word decoding can be converted to its phonetic representation using
the pronunciation dictionary of the ASR system.

[0081]Phonetic output is generated using a word-fragment decoder, where
word-fragments are defined as variable-length sequences of phones. The
decoder generates 1-best word-fragments that are then converted into the
corresponding phonetic strings.

[0082]Example indices may include a word index on the word confusion
network (WCN); a word phone index, which is a phonetic N-gram index of the
phonetic representation of the 1-best word decoding; and a phone index,
which is a phonetic N-gram index of the 1-best fragment decoding.

[0083]An example of an indexing model is provided. For phonetic
transcripts, N-grams of phones are extracted from the transcripts and
indexed. The document is extended at its beginning and its end with
wildcard phones such that each phone appears in exactly N N-grams. Space
characters are ignored. In order to compress the phonetic index, each
phone is represented by a single character.
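
The N-gram extraction can be sketched as follows. This is a minimal illustration that assumes padding with N-1 wildcards at each end, which is what makes every phone appear in exactly N N-grams. The wildcard symbol `*`, the function name `phone_ngrams`, and the sample phones are assumptions of the example.

```python
from typing import List, Sequence, Tuple

WILDCARD = "*"  # assumed wildcard symbol; the text does not name one


def phone_ngrams(phones: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Extract N-grams of phones after padding both ends with wildcards."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if not phones:
        return []
    # N-1 wildcards on each side so that every phone falls in exactly N N-grams.
    padded = [WILDCARD] * (n - 1) + list(phones) + [WILDCARD] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


grams = phone_ngrams(["k", "ae", "t"], 3)
```

For three phones and N=3 this yields five trigrams, starting with `("*", "*", "k")` and ending with `("t", "*", "*")`.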

[0084]Both word and phonetic transcripts are indexed in an inverted index.
Each occurrence of a unit of indexing (word or N-gram of phones) u in a
transcript D is indexed with its position. In addition, for WCN indexing,
the confidence level of the occurrence of u at the time t that is
evaluated by its posterior probability Pr(u|t,D) is stored.
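
A positional inverted index of this kind might look like the sketch below. The class and field names are hypothetical, and an in-memory dictionary stands in for whatever storage the described system uses. The confidence field is populated only for WCN indexing, as stated above.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class Posting(NamedTuple):
    doc_id: str
    position: int
    # Posterior probability Pr(u|t,D) for WCN indexing, otherwise None.
    confidence: Optional[float]


class InvertedIndex:
    """Maps an indexing unit (a word or an N-gram of phones) to its occurrences."""

    def __init__(self) -> None:
        self._postings: Dict[str, List[Posting]] = defaultdict(list)

    def add(self, unit: str, doc_id: str, position: int,
            confidence: Optional[float] = None) -> None:
        """Record one occurrence of a unit at a position in a transcript."""
        self._postings[unit].append(Posting(doc_id, position, confidence))

    def postings(self, unit: str) -> List[Posting]:
        """Return all recorded occurrences of a unit (empty if unseen)."""
        return list(self._postings.get(unit, []))


index = InvertedIndex()
index.add("george", "file-1", position=12, confidence=0.8)
index.add("george", "file-2", position=3, confidence=0.4)
hits = index.postings("george")
```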

[0086]A word search is used for retrieving from indices based on word
transcripts and phonetic search for retrieving from indices based on
phonetic transcripts. In one embodiment, a combination of the Boolean
Model and the Vector Space Model with modifications of the term frequency
and document frequency is used to determine the relevance of a document
to a query. Afterward, an aggregate score is assigned to the result based
on the scores of the query terms from the search on the different indices.

[0087]A weighted sum is used as the aggregate function for both the local
and global TA. It is given in the following formula:

##EQU00001##

where score(Q,Di) is the score of a document D in list i for query Q
and wi is the weight assigned to this list. For example, weights in
the local TA are respectively 5, 3, 2 for the word, word phone and phone
indices. In the global TA, the weights of IV and OOV terms are
respectively equal to 2 and 1.
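
The weighted-sum aggregation can be sketched as follows. This is a minimal
Python illustration, not the described implementation: the function name
`aggregate_score`, the dictionary keys and the sample scores are
hypothetical, and treating a list that did not return the document as
contributing 0 is an assumption.

```python
def aggregate_score(scores, weights):
    """Weighted sum of per-list scores for one document.

    scores  -- mapping list name -> score of the document in that list
    weights -- mapping list name -> weight assigned to that list
    A list that did not return the document contributes 0 (assumption).
    """
    return sum(weights[name] * scores.get(name, 0.0) for name in weights)


# Example weights quoted in the text.
LOCAL_WEIGHTS = {"word": 5, "word_phone": 3, "phone": 2}
GLOBAL_WEIGHTS = {"iv": 2, "oov": 1}

# Made-up per-index scores for one document and one query term.
local = aggregate_score(
    {"word": 0.4, "word_phone": 0.2, "phone": 0.1}, LOCAL_WEIGHTS
)
```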

[0088]Example scoring methods used in the indices are now described. In a
word search, a scoring method of weighted term frequency may be used.
This approach can be used only for IV terms. The posting lists are
extracted from the word inverted index. The classical TFIDF method is
extended using the confidence level provided by the WCN.

[0089]Let the sequence of all the occurrences of an IV term u in the
document D be denoted by occ(u,D)=(t_1, t_2, . . . , t_n).
The term frequency of u in D, tf(u,D), is given by the following formula:

$$tf(u,D) = \sum_{i=1}^{n} \Pr(u|t_i,D)$$

The computation of the document frequency is not modified.
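
A minimal Python sketch of a confidence-weighted term frequency follows.
It assumes that the term frequency is the sum of the posterior
probabilities Pr(u|t_i,D) over the occurrences of u in D, which is one
natural way to extend TFIDF with WCN confidence levels; the function name
`weighted_tf` and the sample posteriors are hypothetical.

```python
def weighted_tf(posteriors):
    """Sum of Pr(u|t_i, D) over all occurrences t_i of u in D (assumed form)."""
    return sum(posteriors)


# Made-up posteriors for three occurrences t_1, t_2, t_3 of an IV term u.
# A classical term frequency would count 3 here; the weighted variant
# discounts the low-confidence occurrences.
tf = weighted_tf([0.9, 0.5, 0.1])
```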

[0090]In the following, an approach for fuzzy phonetic search is
presented.

[0091]Although this approach is more appropriate for OOV query terms, it
can also be used for IV query terms. However, the retrieval will probably
be less accurate, since the space character is ignored during the indexing
process of the phonetic transcripts.

[0092]If the query term is OOV, it is converted to its N-best phonetic
pronunciations using the joint maximum entropy N-gram model. For ease of
presentation, a fuzzy phonetic search is first described using only the
1-best pronunciation and in the next section, it is extended to N-best. If
the query term is IV, it is converted to its phonetic representation.

[0093]The search is decomposed into two steps: query processing and then,
pruning and scoring.

[0094]In query processing, each pronunciation is represented as a phrase
of N-grams of phones. As for indexing, the query is extended at its
beginning and its end with wildcard phones such that each phone appears
in N N-grams. For example, the sequence of phones (A,B,C) with N=2
generates the phrase "?A AB BC C?" where ? is the wildcard phone.
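
The wildcard padding and N-gram extraction can be sketched in a few lines
of Python. The function name `phone_ngrams` is hypothetical, and the sketch
assumes that each phone is represented by a single character, as in the
indexing model.

```python
def phone_ngrams(phones, n=2, wildcard="?"):
    """Pad a phone sequence with n-1 wildcards per side and list its N-grams.

    With this padding every phone of the input appears in exactly n N-grams.
    """
    padded = [wildcard] * (n - 1) + list(phones) + [wildcard] * (n - 1)
    return ["".join(padded[i:i + n]) for i in range(len(padded) - n + 1)]


# The example from the text: phones (A,B,C) with N=2.
phrase = " ".join(phone_ngrams("ABC", n=2))
```

The same function can be applied to a transcript at indexing time and to a
pronunciation at query time, so that both sides are padded identically.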

[0095]During the query processing, several fuzzy matches for the phrase
representation of the query are retrieved from the phonetic inverted
index.

[0096]In order to control the level of fuzziness, the following two
parameters are defined: δ_i, the maximal number of inserted
N-grams, and δ_d, the maximal number of deleted N-grams. Those
parameters are used in conjunction with the inverted indices of the
phonetic transcript to efficiently find a list of indexed phrases that
differ from the query phrase by at most δ_i insertions
and δ_d deletions of N-grams. Note that a substitution is also
allowed, since it can be expressed as an insertion and a deletion. At the
end of this stage, a list of fuzzy matches is obtained and, for each
match, the list of documents in which it appears.

[0097]The next step consists of pruning some of the matches using a cost
function and then scoring each document according to its remaining
matches. Consider a query term ph represented by the following sequence
of phones (p1, p2, . . . , pn) and ph' a sequence of
phones (p'1, p'2, . . . , p'm) that appears in the indexed
corpus and that was matched to ph.

[0098]Define the confusion cost of ph with respect to ph' to be the
smallest sum of insertion, deletion, and substitution penalties
required to change ph into ph'. A penalty α_i is assigned to
each insertion and a penalty α_d to each deletion. For
substitutions, a different penalty is given to substitutions that are
more likely to happen than others. Seven groups of phones have been
identified that are more likely to be confused with each other, denoted
as metaphone groups. A penalty is assigned to each substitution
depending on whether the substituted phones are in the same metaphone
group (α_sm) or not (α_s). The penalty factors are
determined such that 0≦α_sm≦α_i, α_d, α_s≦1. Note that this cost
differs from the classical Levenshtein distance since it is non-symmetric
and different penalties are assigned to each kind of error. The
similarity of ph with ph' is derived from the confusion cost of ph with
ph'.

[0099]A dynamic programming algorithm is used in order to compute the
confusion cost; it extends the commonly used algorithm that computes the
Levenshtein distance. The described implementation is fail-fast since the
procedure is aborted if it is discovered that the minimal cost between
the sequences is greater than a certain threshold, θ(n), given by
the following formula:

θ(n) = θ · n · max(α_i, α_d, α_s)

where θ is a given parameter, 0≦θ<1. Note that the
case of θ=0 corresponds to exact match.

[0100]The cost matrix, C, is an (n+1)×(m+1) matrix. The element C(i,j)
gives the confusion cost between the subsequences (p_1, p_2, . . . ,
p_i) and (p'_1, p'_2, . . . , p'_j). C is filled using a
dynamic programming algorithm. During the initialization of C, the first
row and the first column are filled. They correspond to the case that one
of the subsequences is empty: C(0,0)=0, C(i,0)=i·α_d and
C(0,j)=j·α_i.

[0101]After the initialization step, each row i is traversed to compute
the values of C(i, j) for each value of j. The following recursion is
used to fill in row i:

C(i,j) = min[C(i-1,j)+α_d, C(i,j-1)+α_i, C(i-1,j-1)+cc(p_i,p'_j)],
for 1≦j≦m.

cc(p_i, p'_j) represents the cost of the confusion of p_i and
p'_j, and it is computed in the following way:

[0102]if p_i = p'_j, cc(p_i, p'_j) = 0,

[0103]if p_i and p'_j are in the same metaphone group, cc(p_i,
p'_j) = α_sm,

[0104]if p_i and p'_j are not in the same metaphone group,
cc(p_i, p'_j) = α_s.

[0105]After the filling of row i, the computation is aborted if:

$$\min_{0 \le j \le m} C(i,j) > \theta(n)$$

[0106]The similarity of ph with respect to ph', sim(ph, ph'), is defined
as follows: if the computation is aborted, sim(ph, ph') is 0; otherwise

$$sim(ph,ph') = 1 - \frac{C(n,m)}{n \cdot \max(\alpha_i, \alpha_d, \alpha_s)}$$

[0107]Note that 0≦sim(ph, ph')≦1. Finally, the score of ph in a document
D, score(ph,D), is computed using TFIDF. The term frequency of ph in D,
tf(ph,D), is defined by the following formula:

$$tf(ph,D) = \sum_{ph' \in D} sim(ph,ph')$$

and the document frequency of ph in the corpus, df(ph), by

$$df(ph) = |\{D \mid \exists\, ph' \in D \ \text{s.t.}\ sim(ph,ph') > 0\}|$$

[0108]Example: consider a query term represented by the sequence of phones
(A,B,C,D,E), and let N=2, δ_i=δ_d=2 and
max(α_i, α_d, α_s)=1. Search for fuzzy
matches of the phrase "?A AB BC CD DE E?" where ? is the wildcard phone.
In the matched phrase "YA AB BX XC CD DE EY", there are two inserted
bi-grams, BX and XC, and one deleted bi-gram, BC. This match corresponds
to the sequence of phones ABXCDE, which differs from the query by a
single inserted phone X, so that

$$sim(ABCDE, ABXCDE) = 1 - \frac{\alpha_i}{5}$$
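
The pruning step can be sketched in Python as a fail-fast variant of the
Levenshtein dynamic program. This is an illustrative sketch only. It
assumes that the similarity is one minus the confusion cost divided by
n·max(α_i, α_d, α_s); the function names, the penalty values and the
metaphone group used below are hypothetical. The clamp to 0 in
`similarity` is an added guard and not part of the description.

```python
def confusion_cost(ph, ph_prime, groups, a_i, a_d, a_s, a_sm, theta):
    """Confusion cost of ph with respect to ph_prime, or None if aborted.

    ph is assumed to be non-empty; groups is a list of sets of phones.
    """
    n, m = len(ph), len(ph_prime)
    threshold = theta * n * max(a_i, a_d, a_s)
    # Row 0: the prefix of ph is empty, so j phones were inserted.
    prev = [j * a_i for j in range(m + 1)]
    for i in range(1, n + 1):
        # Column 0: the prefix of ph_prime is empty, so i phones were deleted.
        cur = [i * a_d]
        for j in range(1, m + 1):
            p, q = ph[i - 1], ph_prime[j - 1]
            if p == q:
                cc = 0.0
            elif any(p in g and q in g for g in groups):
                cc = a_sm  # same metaphone group
            else:
                cc = a_s
            cur.append(min(prev[j] + a_d, cur[j - 1] + a_i, prev[j - 1] + cc))
        if min(cur) > threshold:
            return None  # fail-fast: no cell of this row is within budget
        prev = cur
    return prev[m]


def similarity(ph, ph_prime, groups, a_i, a_d, a_s, a_sm, theta):
    """1 - cost / (n * max penalty), or 0 when the computation is aborted."""
    cost = confusion_cost(ph, ph_prime, groups, a_i, a_d, a_s, a_sm, theta)
    if cost is None:
        return 0.0
    return max(0.0, 1.0 - cost / (len(ph) * max(a_i, a_d, a_s)))


# Hypothetical penalties with max(a_i, a_d, a_s) = 1, as in the example.
params = dict(a_i=0.8, a_d=0.9, a_s=1.0, a_sm=0.5, theta=0.5)
sim = similarity("ABCDE", "ABXCDE", groups=[], **params)
```

With these values the only edit is the insertion of X, so the cost is
α_i and the similarity is 1 - α_i/5. Only two rows of the matrix are kept
in memory, which is sufficient because each row depends only on the
previous one.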

[0109]The word and sub-word indices used by the described method and
system may be combined as described in U.S. patent application Ser. No.
11,781,285 (Attorney Docket No. IL9-2007-0042US1) filed Jul. 23, 2007,
titled "Method and System for Indexing Speech Data". In another
embodiment, word and sub-word indices may also be combined based on
offsets within a transcript instead of timestamps. The combining of the
indices based on timestamps or offsets may be carried out by the TA.

[0110]FIG. 7 shows a system 700 for generating word and sub-word indices
which are combined using timestamps.

[0111]Speech data 701 is transcribed by an automatic speech recognition
(ASR) system 702 to convert it into word transcripts 707. The ASR system
702 contains a word vocabulary 704 of IV terms 705 which it recognizes
from the input speech 701. Terms which are in the speech data 701, but
which are not in the word vocabulary 704 of the ASR system 702, are OOV
terms 706.

[0112]The ASR system 702 working with the word vocabulary 704 can
recognize only terms in the word vocabulary 704. The OOV terms 706 are
output in the word transcripts 707 as terms from the word vocabulary 704
that are probable given the recognition acoustic model and the language
model.

[0113]A sub-word ASR 709 is also provided which converts the speech data
701 to sub-word transcripts 708. The sub-words are typically phones,
morphemes, syllables, or a sequence of phones. The sub-word ASR 709 works
with language models to transcribe the speech data 701 into sub-words. In
particular, it includes a sub-word vocabulary 703 to recognize sub-words
in the speech data 701.

[0114]Both word and sub-word transcripts 707, 708 are built on all the
speech data. The word transcript 707 contains a transcript of all the
words in the speech data 701 using the IV terms in the word vocabulary
704. All OOV terms 706 in the speech data 701 will be transcribed
incorrectly as IV terms since the OOV terms cannot be recognized using
word transcription. The sub-word transcript 708 will contain the sub-word
transcription of all the terms (IV and OOV) in the speech data 701 into
sub-words from the sub-word vocabulary 703.

[0115]An indexing system 710 is provided that includes an indexing means
711 for processing terms in the word and sub-word transcripts 707, 708 to
index the terms for search retrieval. The indexing system 710 may access
the transcripts 707, 708 to be processed via a network. The indexing
system 710 includes a first index 712 for word transcripts 707, and a
second index 713 for sub-word transcripts 708.

[0116]In each of the indices 712, 713 the transcribed word or sub-word is
stored as a unit 714 with a transcript or speaker identifier 715 and a
timestamp 716 (for example, of the form of a start time and duration).
The timestamp 716 of a unit 714 (word or sub-word) represents the time
information in the speech data 701 about this unit 714. Generally, it is
a start time of the unit 714 in the speech 701 and its duration.

[0117]Optionally, additional information 717 may also be included in the
indices for a word 707 or sub-word 708. The additional information 717
may include, for example, posterior probability, rank relative to other
hypotheses, etc. The posterior probability is the probability of the unit
given the signal. In the case that the ASR hesitates between several
alternatives (e.g. between "will come" and "welcome"), the rank is the
relative position of the alternative among the other alternatives
occurring at the same time.

[0118]The first and second indices 712, 713 may be combined in a single
index containing words and sub-words with an indication as to whether a
stored unit is a word or a sub-word, so that the different categories can
be independently searched.

[0119]A search system 720 is provided for searching the first and second
indices 712, 713 for a query term. In the described system, the search
system 720 may be combined with the search system 210 of FIG. 2. A query
term is input into the search system 720. The search system 720 may be
used when a query term is recognized as a hybrid including both IV and
OOV terms. The search system 720 may also be used for a query term which
is an IV term using both the word index and the sub-word index. The
search system 720 may further be used for OOV terms using only the
sub-word index.

[0120]The search system 720 may access the indices 712, 713 remotely via a
network. The search system 720 includes a query input means 721 in which
a user can input a query. The query may take the form of a single keyword
or multiple keywords to be searched as a phrase within the query.

[0121]The search system 720 includes a query term processing means 722
which includes: a means for extracting 723 individual words from the
query term; a means for retrieving 724 a posting list from the word index
712; a means for retrieving 725 a posting list from the sub-word index
713; and a merging means 726. The merging means 726 merges the posting
lists from the word and sub-word indices 712, 713 using the timestamps
716 stored in the indices 712, 713.

[0122]An indexing algorithm is described with reference to FIG. 8. FIG. 8
shows a flow diagram 800 of a method of indexing in the combined word and
sub-word index. An input 801 of a corpus of both word transcripts and
sub-word transcripts of speech data is received. The corpus is processed,
by parsing 802 the transcripts and extracting the units for processing.

[0123]A first unit (n=1) in the form of a word or sub-word for indexing is
retrieved 803. The unit is stored 804 in the index including:

[0124]a transcript or speaker identifier; [0125]a timestamp, for example,
a start time and duration; and [0126]optionally, additional data provided
by the transcript on the unit (e.g. the posterior probability, the name
of the audio file, the word offset, etc.).

[0127]It is then determined 805 if there is a next unit for processing. If
so, the method loops 806 and increments to unit n=n+1 807, and retrieves
803 and stores 804 the next unit. If there are no more units for
processing, an index of the corpus of transcripts is output 808.

[0128]The indexing model is generic and is the same for all the different
types of transcripts (e.g. one-best path, lattice, confusion network)
providing a timestamp for each unit. The word and sub-word transcripts
are indexed in two different indices.
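
The indexing loop of FIG. 8 can be sketched in Python as follows. The
names `Occurrence` and `build_index`, the tuple layout of the input and the
sample data are assumptions made for illustration; the description does
not prescribe a data layout.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Occurrence:
    transcript_id: str                 # transcript or speaker identifier
    start: float                       # timestamp: start time in seconds
    duration: float                    # timestamp: duration in seconds
    posterior: Optional[float] = None  # optional additional data


def build_index(units):
    """Index (unit, transcript_id, start, duration) tuples by unit text."""
    index = defaultdict(list)
    for text, transcript_id, start, duration in units:
        # One stored entry per occurrence of a word or sub-word.
        index[text].append(Occurrence(transcript_id, start, duration))
    return dict(index)


# Word and sub-word transcripts are indexed in two separate indices.
word_index = build_index([("call", "d1", 1.0, 0.3), ("call", "d2", 4.0, 0.3)])
subword_index = build_index([("K", "d1", 1.0, 0.1), ("AO", "d1", 1.1, 0.1)])
```

Because both indices store the same timestamp fields, their posting lists
can later be merged on time regardless of the kind of transcript.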

[0129]A search algorithm is described with reference to FIG. 9. FIG. 9
shows a flow diagram 900 of a method of searching for a query term in the
form of a phrase which may include both IV and OOV terms.

[0130]An input of a phrase to search in the speech data is received 901.
The query is parsed 902 in order to extract the query words. It is then
determined 903 for each query word, if it is an IV term.

[0131]For IV query terms, the posting lists are retrieved 904 from the
word based index. Non-IV terms, which must be OOV query terms, are
converted to sub-words and the posting list of each sub-word is
retrieved 905 from the sub-word index.

[0132]The different posting lists are merged 906 according to the
timestamp of the occurrences in order to create results matching the
query. It is checked 907 that the words and sub-words appear in the right
order according to their begin times, and it is checked 908 that the
difference in time between adjacent words/sub-words is reasonable (for
example, less than 0.5 seconds).
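
The phrase search of FIG. 9 can be sketched in Python as follows. This is
a simplified illustration under stated assumptions: postings are
(transcript id, start time, duration) tuples, the gap between adjacent
units is measured from the end of one unit to the start of the next, and
`to_subwords` is a hypothetical stand-in for the conversion of an OOV term
to sub-words.

```python
def search_phrase(query_words, word_index, subword_index, vocabulary,
                  to_subwords, max_gap=0.5):
    """Return (transcript_id, start) for each match of the phrase."""
    # Steps 903-905: one posting list per IV word or per OOV sub-word.
    posting_lists = []
    for word in query_words:
        if word in vocabulary:
            posting_lists.append(word_index.get(word, []))
        else:
            for sub in to_subwords(word):
                posting_lists.append(subword_index.get(sub, []))
    if not posting_lists:
        return []
    # Step 906: merge on timestamps, extending partial matches unit by unit.
    matches = [[occ] for occ in posting_lists[0]]
    for postings in posting_lists[1:]:
        extended = []
        for match in matches:
            doc, start, duration = match[-1]
            for nxt in postings:
                n_doc, n_start, _ = nxt
                in_order = n_start > start                      # step 907
                close = n_start - (start + duration) < max_gap  # step 908
                if n_doc == doc and in_order and close:
                    extended.append(match + [nxt])
        matches = extended
    return [(m[0][0], m[0][1]) for m in matches]


# Made-up indices: "call" is IV, "AB" is OOV and is split into phones.
word_index = {"call": [("d1", 1.0, 0.3), ("d1", 9.0, 0.3)]}
subword_index = {"A": [("d1", 1.4, 0.1)], "B": [("d1", 1.5, 0.1)]}
hits = search_phrase(["call", "AB"], word_index, subword_index,
                     vocabulary={"call"}, to_subwords=list)
```

Only the occurrence of "call" at 1.0 s is followed closely enough by the
phones A and B, so the occurrence at 9.0 s is discarded during the merge.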

[0133]The set of all the exact matches of the given phrase in the speech
corpus is output 909 as the search result.

[0134]It should be noted that for keyword searches (and not phrase
searches as described above), the indexing model allows use of the
classical approaches which: [0135]handle queries containing only IV
terms using the word based index; [0136]handle queries containing only
OOV terms using the sub-word based index; and [0137]handle queries
containing both IV and OOV terms by unifying results retrieved
respectively from the word based and the sub-word based indices.

[0138]The described searching method can be used by any search engine on
speech data including, for example, call center recorded calls, broadcast
news, etc.

[0139]The method of using the combined word and sub-word index also
permits a ranking model based on temporal proximity. In one embodiment of
a ranking model, for OOV term ranking, information provided by the
phonetic index is used. A higher rank is given to occurrences of OOV
terms that contain phones that are close in time to each other. A scoring
function is defined that is related to the average gap in time between
the different phones.

[0140]A keyword k is converted to the sequence of phones (p_0^k, .
. . , p_l^k). The normalized score, score(k,t_0^k,D), of a
keyword k=(p_0^k, . . . , p_l^k), where each p_i^k
occurs at time t_i^k with a duration d_i^k in the
transcript D, can be defined by the following formula:

##EQU00007##

[0141]The above formula is just an example of a ranking formula that takes
into account the time information extracted from the index, and that
also justifies the need to index timestamp information. This ranking
method can be combined with classical ranking methods such as tfidf (term
frequency, inverse document frequency), edit distance etc.

[0142]A search system combining results of multiple indices may be
provided as a service to a customer over a network.

[0143]The invention can take the form of an entirely hardware embodiment,
an entirely software embodiment or an embodiment containing both hardware
and software elements. In a preferred embodiment, the invention is
implemented in software, which includes but is not limited to firmware,
resident software, microcode, etc.

[0144]The invention can take the form of a computer program product
accessible from a computer-usable or computer-readable medium providing
program code for use by or in connection with a computer or any
instruction execution system. For the purposes of this description, a
computer usable or computer readable medium can be any apparatus that can
contain, store, communicate, propagate, or transport the program for use
by or in connection with the instruction execution system, apparatus or
device.

[0145]The medium can be an electronic, magnetic, optical, electromagnetic,
infrared, or semiconductor system (or apparatus or device) or a
propagation medium. Examples of a computer-readable medium include a
semiconductor or solid state memory, magnetic tape, a removable computer
diskette, a random access memory (RAM), a read only memory (ROM), a rigid
magnetic disk and an optical disk. Current examples of optical disks
include compact disk read only memory (CD-ROM), compact disk read/write
(CD-R/W), and DVD.

[0146]Improvements and modifications can be made to the foregoing without
departing from the scope of the present invention.