Abstract

The retrieval of OCR degraded text using n-gram formulations within a probabilistic retrieval system is examined in this paper. Direct retrieval of documents using n-gram databases of 2- and 3-grams or 2-, 3-, 4- and 5-grams resulted in improved retrieval performance over standard (word based) queries on the same data once degradation reached 10 percent or worse. A second method, which uses n-grams to identify appropriate matching and near matching terms for query expansion, is also described; it too performed better than standard queries. This method was less effective than direct n-gram query formulations but can likely be improved with alternative query component weighting schemes and measures of term similarity. Finally, a web based retrieval application using n-gram retrieval of OCR text and display, with query term highlighting, of the source document image is described.

1 Introduction

A major problem with retrieval of OCR text from image data is the inevitable...

Citations

...Intelligent Document Understanding System (IDUS) developed at Lockheed-Martin Corporation [8] and INQUERY, a full text probabilistic retrieval system developed at the University of Massachusetts [9], =-=[1]-=-. The system allows retrieval of text images based on automatic OCR and logical text analysis of image content [11]. In this paper, we describe our efforts in using n-grams in query formulations and e...

...g an Intelligent Document Understanding System (IDUS) developed at Lockheed-Martin Corporation [8] and INQUERY, a full text probabilistic retrieval system developed at the University of Massachusetts =-=[9]-=-, [1]. The system allows retrieval of text images based on automatic OCR and logical text analysis of image content [11]. In this paper, we describe our efforts in using n-grams in query formulations ...

...erms to the original query. This approach has wide appeal since it could be largely language independent and could be applied to various concept (word) representations such as phonemes, soundex codes =-=[14, 13]-=- or for spelling correction [12], using differing retrieval engines [2], or as a means to summarize the content of a document [3]. 3.1 An N-gram to Term Database Expanding user provided query terms wi...

...us concept (word) representations such as phonemes, soundex codes [14, 13] or for spelling correction [12], using differing retrieval engines [2], or as a means to summarize the content of a document =-=[3]-=-. 3.1 An N-gram to Term Database Expanding user provided query terms with matching and closely matching words is accomplished by converting the query terms to their 2-gram components and querying a co...
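The excerpt above breaks off mid-sentence, so the exact database design is not recoverable from this page. As a minimal sketch of the general idea it describes (convert a query term to its 2-gram components and look up terms that share them), the following assumes a simple in-memory inverted index. The names `bigrams`, `build_index`, and `candidate_terms`, the sample vocabulary, and the ranking by shared 2-gram count are all illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict


def bigrams(term: str) -> set[str]:
    """Return the set of overlapping 2-grams in a lowercased term."""
    term = term.lower()
    return {term[i:i + 2] for i in range(len(term) - 1)}


def build_index(vocabulary: list[str]) -> dict[str, set[str]]:
    """Map every 2-gram to the vocabulary terms that contain it (hypothetical index)."""
    index: dict[str, set[str]] = defaultdict(set)
    for term in vocabulary:
        for gram in bigrams(term):
            index[gram].add(term)
    return index


def candidate_terms(query_term: str, index: dict[str, set[str]]) -> list[tuple[str, int]]:
    """Rank indexed terms by how many 2-grams they share with the query term."""
    shared: dict[str, int] = defaultdict(int)
    for gram in bigrams(query_term):
        for term in index.get(gram, ()):
            shared[term] += 1
    # Most shared 2-grams first; ties broken alphabetically for stable output.
    return sorted(shared.items(), key=lambda item: (-item[1], item[0]))


# Made-up vocabulary containing two OCR-style corruptions of "retrieval".
index = build_index(["retrieval", "retrieva1", "rctrieval", "image"])
print(candidate_terms("retrieval", index))
# → [('retrieval', 8), ('retrieva1', 7), ('rctrieval', 6)]
```

Terms with no 2-gram in common with the query term (here, "image") never enter the candidate list, which is what makes this lookup usable as a first filter before a finer similarity measure is applied.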

... best image analysis system. Although OCR errors have little effect on retrieval with good quality input, effectiveness can be significantly reduced in short texts with poor image or scanning quality =-=[7, 4]-=-. In an effort to reduce these losses, we have incorporated the use of n-grams for the representation of document words and user query terms using a probabilistic retrieval system with no modification...

...ith no modification to the evaluation process. N-grams have been frequently used for word representation to address issues such as multiple character sets, language independence, and spell correction =-=[5]-=-. We have incorporated this research in retrieval and highlighting of image query text using an Intelligent Document Understanding System (IDUS) developed at Lockheed-Martin Corporation [8] and INQUER...

...largely language independent and could be applied to various concept (word) representations such as phonemes, soundex codes [14, 13] or for spelling correction [12], using differing retrieval engines =-=[2]-=-, or as a means to summarize the content of a document [3]. 3.1 An N-gram to Term Database Expanding user provided query terms with matching and closely matching words is accomplished by converting th...

... measure by applying a "cost" of additional operations, such as character transposition, or special substitutions, as with common OCR errors [13]. We chose to keep the method simple and use =-=Ukkonen's [10]-=- Qgram Distance measure, which counts the number of n-grams contained in two words versus the number they share. The simplest form of this measure is $QD(s, t) = |G(s)| + |G(t)| - 2\,|G(s) \cap G(t)|$ ...
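The Qgram Distance quoted above can be sketched in a few lines, reading G(s) as the set of n-grams of a word s and QD(s, t) as |G(s)| + |G(t)| - 2|G(s) ∩ G(t)|. Two points are assumptions here, since the excerpt is truncated: n-grams are treated as a set rather than a multiset, and words are not padded with boundary characters. The function names are illustrative.

```python
def ngrams(word: str, n: int = 2) -> set[str]:
    """Return G(word): the set of overlapping n-grams (no padding, assumed)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def qgram_distance(s: str, t: str, n: int = 2) -> int:
    """QD(s, t) = |G(s)| + |G(t)| - 2 * |G(s) & G(t)|, using set semantics."""
    gs, gt = ngrams(s, n), ngrams(t, n)
    return len(gs) + len(gt) - 2 * len(gs & gt)


print(qgram_distance("retrieval", "retrieval"))  # → 0 (identical words)
print(qgram_distance("retrieval", "retrieva1"))  # → 2 (one corrupted character)
print(qgram_distance("retrieval", "image"))      # → 12 (no shared 2-grams)
```

A single substituted final character changes only one 2-gram in each word, so the distance stays small. That is the property that makes this measure attractive for matching OCR-corrupted terms.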

...roach has wide appeal since it could be largely language independent and could be applied to various concept (word) representations such as phonemes, soundex codes [14, 13] or for spelling correction =-=[12]-=-, using differing retrieval engines [2], or as a means to summarize the content of a document [3]. 3.1 An N-gram to Term Database Expanding user provided query terms with matching and closely matching...

...ine (non n-gram) database and queries. These experiments were performed using four different databases that were randomly degraded using data developed by the University of Nevada at Las Vegas (UNLV) =-=[6]-=-. Higher word error rates were used to further degrade the database when more severe corruption was to be tested. Words were corrupted using character recognition confusions typical of OCR systems, in...

...correction [5]. We have incorporated this research in retrieval and highlighting of image query text using an Intelligent Document Understanding System (IDUS) developed at Lockheed-Martin Corporation =-=[8]-=- and INQUERY, a full text probabilistic retrieval system developed at the University of Massachusetts [9], [1]. The system allows retrieval of text images based on automatic OCR and logical text analy...

...text probabilistic retrieval system developed at the University of Massachusetts [9], [1]. The system allows retrieval of text images based on automatic OCR and logical text analysis of image content =-=[11]-=-. In this paper, we describe our efforts in using n-grams in query formulations and expansions with the goal of obtaining enhanced retrieval performance on OCR data. The following sections contain des...
