Abstract:

An automatic paraphrase acquisition technique is provided. A common theme
of the various embodiments described herein resides in careful design of
simple tasks that can elicit the necessary information for the automated
process. These tasks are performed quickly and inexpensively. By
gathering the results produced, paraphrases can be generated
automatically using the method and/or system.

Claims:

1. A method for acquiring paraphrases for use in natural language
processing applications, the method comprising: receiving raw text as
input; sentence breaking the raw text into individual sentences;
providing the individual sentences and a corresponding survey to an
annotating source, wherein the annotating source conducts the survey
based on the individual sentences; receiving results of the survey from
the annotating source; filtering the survey results; providing the filtered
survey results and a second survey to the annotating source, wherein the
annotating source conducts the second survey based on the filtered
results; receiving results of the second survey from the annotating
source; and generating paraphrases based on the results of the second
survey.

2. The method as set forth in claim 1 wherein the raw text is provided by
a database.

3. The method as set forth in claim 1 wherein the providing the
individual sentences and a survey is based on a script.

4. The method as set forth in claim 1 wherein the filtering is based on a
script.

5. The method as set forth in claim 1 wherein the providing the filtered
survey results and the second survey is based on a script.

6. The method as set forth in claim 1 wherein the generating paraphrases
comprises generating paraphrase pairs.

7. The method as set forth in claim 1 wherein the generating paraphrases
comprises generating paraphrases in a many-to-one mapping of paraphrases.

8. A system for acquiring paraphrases for use in natural language
processing applications, the system comprising: an input for raw text; a
processor operative to break the raw text into individual sentences,
provide the individual sentences to an annotating source to conduct a
survey based on the individual sentences, receive results of the survey,
filter the results, provide the filtered results to the annotating source
to conduct a second survey, receive results of the second survey, and
generate paraphrases based on the results of the second survey; and an
output for the paraphrases.

9. The system as set forth in claim 8 further comprising a database for
the raw text.

10. The system as set forth in claim 8 wherein the processor comprises a
script to provide the individual sentences to the annotating source.

11. The system as set forth in claim 8 wherein the processor comprises a
script to filter the results of the survey.

12. The system as set forth in claim 8 wherein the processor comprises a
script to provide the filtered results to the annotating source.

13. The system as set forth in claim 8 wherein the processor is operative
to generate paraphrase pairs.

14. The system as set forth in claim 8 wherein the processor is operative
to generate paraphrases on a many-to-one mapping basis.

15. The system as set forth in claim 8 wherein the annotating source
comprises a survey platform.

16. A system for acquiring paraphrases for use in natural language
processing applications, the system comprising: means for receiving raw
text as input; means for sentence breaking the raw text into individual
sentences; means for providing the individual sentences and a
corresponding survey to an annotating source, wherein the annotating
source conducts the survey based on the individual sentences; means for
receiving results of the survey from the annotating source; means for
filtering the survey results; means for providing the filtered survey
results and a second survey to the annotating source, wherein the
annotating source conducts the second survey based on the filtered
results; means for receiving results of the second survey from the
annotating source; and means for generating paraphrases based on the
results of the second survey.

Description:

BACKGROUND

[0001] The ability to recognize many different ways of expressing the same
or similar meaning is important to many Natural Language Processing (NLP)
applications, such as question answering, searching, etc. Paraphrase
acquisition is the process used to address this issue. However, the
current technology and approaches implementing paraphrase acquisition are
inefficient and/or inadequate.

[0002] In this regard, paraphrase corpora are typically obtained either
through manual annotations or through machine learning. Manual annotation
of paraphrases is typically expensive and time consuming, while machine
learned paraphrases are often error prone.

BRIEF DESCRIPTION

[0003] In one aspect of the presently described embodiments, the method
comprises receiving raw text as input, sentence breaking the raw text
into individual sentences, providing the individual sentences and a
corresponding survey to an annotating source, wherein the annotating
source conducts the survey based on the individual sentences, receiving
results of the survey from the annotating source, filtering the survey
results, providing the filtered survey results and a second survey to the
annotating source, wherein the annotating source conducts the second
survey based on the filtered results, receiving results of the second
survey from the annotating source, and generating paraphrases based on
the results of the second survey.

[0004] In another aspect of the presently described embodiments, the raw
text is provided by a database.

[0005] In another aspect of the presently described embodiments, the
providing the individual sentences and a survey is based on a computing
script.

[0006] In another aspect of the presently described embodiments, the
filtering is based on a computing script.

[0007] In another aspect of the presently described embodiments, the
providing the filtered survey results and the second survey is based on a
computing script.

[0008] In another aspect of the presently described embodiments, the
generating paraphrases comprises generating paraphrase pairs.

[0009] In another aspect of the presently described embodiments, the
generating paraphrases comprises generating paraphrases in a many-to-one
mapping of paraphrases.

[0010] In another aspect of the presently described embodiments, the
system comprises an input for raw text, a processor operative to break
the raw text into individual sentences, provide the individual sentences
to an annotating source to conduct a survey based on the individual
sentences, receive results of the survey, filter the results, provide the
filtered results to the annotating source to conduct a second survey,
receive results of the second survey, and generate paraphrases based on
the results of the second survey, and an output for the paraphrases.

[0011] In another aspect of the presently described embodiments, the
system further comprises a database for the raw text.

[0012] In another aspect of the presently described embodiments, the
processor comprises a computing script to provide the individual sentences
to the annotating source.

[0013] In another aspect of the presently described embodiments, the
processor comprises a computing script to filter the results of the
survey.

[0014] In another aspect of the presently described embodiments, the
processor comprises a computing script to provide the filtered results to
the annotating source.

[0015] In another aspect of the presently described embodiments, the
processor is operative to generate paraphrase pairs.

[0016] In another aspect of the presently described embodiments, the
processor is operative to generate paraphrases on a many-to-one mapping
basis.

[0017] In another aspect of the presently described embodiments, the
annotating source comprises a survey platform.

[0018] In another aspect of the presently described embodiments, the
system comprises means for receiving raw text as input, means for
sentence breaking the raw text into individual sentences, means for
providing the individual sentences and a corresponding survey to an
annotating source, wherein the annotating source conducts the survey
based on the individual sentences, means for receiving results of the
survey from the annotating source, means for filtering the survey
results, means for providing the filtered survey results and a second
survey to the annotating source, wherein the annotating source conducts
the second survey based on the filtered results, means for receiving
results of the second survey from the annotating source, and means for
generating paraphrases based on the results of the second survey.

BRIEF DESCRIPTION OF THE DRAWINGS

[0019] FIG. 1 is an example of a survey according to the presently
described embodiments;

[0020] FIG. 2 is an example of a survey according to the presently
described embodiments;

[0021] FIG. 3 is an illustration of a method according to the presently
described embodiments; and

[0022] FIG. 4 is an illustration of a system according to the presently
described embodiments.

DETAILED DESCRIPTION

[0023] The presently described embodiments relate to automatic paraphrase
acquisition. A common theme of the various embodiments described herein
resides in careful design of simple tasks that can elicit the necessary
information for the automated process. In at least one form, these tasks
are performed quickly and inexpensively by untrained non-expert workers
or survey respondents. By gathering, filtering and/or analyzing the
results produced by the workers or respondents, paraphrases can be
generated automatically to achieve the objectives of the presently
described embodiments.

[0024] The presently described embodiments address the problems noted
above. The approach is exemplified by an application case in the domain
of sentiment analysis. However, this approach can also be applied to
other application domains such as question answering, searching,
information extraction and information retrieval.

[0025] Briefly, current research on paraphrase acquisition focuses on
recognizing any arbitrary pairs of paraphrases.

[0026] The output of such systems is represented as follows.

[0027] Expression A=Expression B

[0028] Expression B=Expression C

[0029] Expression C=Expression D

[0030] However, for real-world NLP applications, such as question
answering, searching, etc., the presently described embodiments recognize
that the more desirable paraphrase format is a many-to-one mapping, such
as the following.

[0031] Expression B=Expression A

[0032] Expression C=Expression A

[0033] Expression D=Expression A

[0034] In the above format, `Expression A` is the chosen standard
expression and serves as the key for extracting and retrieving data.
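
By way of non-limiting illustration, the difference between the two formats can be sketched in Python. The names `pairwise`, `many_to_one` and `to_standard` are hypothetical and are not part of the described embodiments; the sketch only shows why a single standard expression is convenient as a lookup key.

```python
# Pairwise output: each expression is linked only to a neighbouring one,
# so reaching "Expression A" from "Expression D" requires chaining.
pairwise = [
    ("Expression A", "Expression B"),
    ("Expression B", "Expression C"),
    ("Expression C", "Expression D"),
]

# Many-to-one output: every variant maps directly to the standard expression.
many_to_one = {
    "Expression B": "Expression A",
    "Expression C": "Expression A",
    "Expression D": "Expression A",
}


def to_standard(expression, mapping, standard="Expression A"):
    """Return the standard expression for a variant, or None if unknown."""
    if expression == standard:
        return standard
    return mapping.get(expression)
```

With the many-to-one form, a single dictionary lookup yields the key used for extracting and retrieving data.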

[0035] Based on this stipulation, the approach according to the presently
described embodiments aims to automatically generate paraphrases in the
above format. This relies on careful design of simple tasks that can elicit the
necessary information for this automated process of paraphrase
generation. Selected tasks can be performed quickly and inexpensively by
untrained non-expert workers or respondents, such as the ones on the
Amazon Mechanical Turk platform or any other automated survey platform.
It should be appreciated that any suitable source to provide results and,
thus, annotate the sentences will suffice. For example, in the absence of
a suitable survey platform, one may wish to hire workers to provide
survey results to annotate the sentences. In any event, by gathering the
results produced by the workers or respondents, paraphrases are generated
automatically by the presently described embodiments. To illustrate how
this goal is achieved, an example for sentiment analysis is described
below.

[0036] In the domain of sentiment analysis, it is important to identify
various ways of expressing the same opinion. For example, all of the
following sentences indicate the same opinion of `The construction
quality of the camera is bad.`

[0037] (1) One thing I have to mention is that the battery door keeps
falling off.

[0038] (2) On my recent trip to California, I dropped my camera and it
broke into two parts.

[0039] (3) I have to say the build of this camera is rather disappointing.

[0040] To generate paraphrases for `The construction quality of the camera
is bad` from (1)-(3), the following two types of information are used.
First, whether or not the sentence expresses a negative opinion regarding
the construction quality of the camera is assessed. Second, the exact
portion of the sentence that indicates that opinion is determined. To
obtain such information, surveys may be designed and used. Such surveys
may be given to a sampling of people, as described above, to complete the
surveys to generate data or results. Thus, the respondents to the surveys
are providing annotations to the text.

[0041] With reference to FIG. 1, Survey 1 (shown at 100) asks the workers
to judge whether a sentence (e.g. sentences (1), (2) or (3)) indicates an
opinion towards a certain feature of the camera, and if so, whether the
opinion is positive, negative or neutral. As noted, a well-designed
survey at this stage allows the system to begin to determine paraphrases
for statements such as, "The construction quality of the camera is bad."
For example, the example annotations for sentences (1)-(3) are shown in
FIG. 1. As can be seen, the survey 100 includes a Feature Name field 102,
and various response fields such as Not Invoked field 104, Positive field
106, Negative field 108 and Neutral field 110. The survey respondents are
prompted and able to select one of the response fields for each feature.
In the example shown, a negative response is provided for the feature
Construction Quality for the sentences. So, sentences (1)-(3) (or at
least parts thereof) are, thus far in the processing, candidates to be
paraphrases for the statement, "The construction quality of the camera is
bad." In the same survey, the respondent selected Not Invoked for the
features Picture Quality and Battery Life, in view of the same sentences.
Therefore, for Picture Quality or Battery Life statements, sentences
(1)-(3) are not likely to be or include suitable paraphrases, at least
based on this data.
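
As a minimal sketch, and only under stated assumptions, the Survey 1 responses described above could be held in a simple record and queried for candidate sentences. The class `Survey1Response`, the function `candidate_sentences` and the dictionary layout are hypothetical; the feature names and response fields are those shown in FIG. 1.

```python
from dataclasses import dataclass, field

# Response fields 104-110 of survey 100.
RESPONSE_FIELDS = ("Not Invoked", "Positive", "Negative", "Neutral")


@dataclass
class Survey1Response:
    sentence: str
    # Feature name -> one of RESPONSE_FIELDS (assumed layout).
    selections: dict = field(default_factory=dict)


def candidate_sentences(responses, feature, polarity):
    """Return the sentences whose selection for `feature` equals `polarity`."""
    return [
        r.sentence
        for r in responses
        if r.selections.get(feature, "Not Invoked") == polarity
    ]


# Selections matching the example annotations for sentences (1)-(3).
fig1_selections = {
    "Construction Quality": "Negative",
    "Picture Quality": "Not Invoked",
    "Battery Life": "Not Invoked",
}
sentences = [
    "One thing I have to mention is that the battery door keeps falling off.",
    "On my recent trip to California, I dropped my camera and it broke into two parts.",
    "I have to say the build of this camera is rather disappointing.",
]
responses = [Survey1Response(s, dict(fig1_selections)) for s in sentences]

negative_build = candidate_sentences(responses, "Construction Quality", "Negative")
```

In this sketch all three sentences are returned as candidates for Construction Quality, while a query for Picture Quality or Battery Life opinions returns none.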

[0042] In an experiment, 2000 sentences were randomly selected from a
database of camera reviews. Each sentence was annotated by two online
workers separately. 855 gold-standard annotations were obtained. Each
annotation comprised a sentence labeled with a feature and a sentiment
toward that feature. An annotation was considered "gold" when both
annotators marked the same sentiment toward the same feature.
These 855 sentences were then used in Survey 2, as
described below.
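
The agreement rule for gold-standard annotations can be illustrated with a short Python sketch. The function name `gold_annotations` and the input layout (one dictionary per annotator, mapping each sentence to a set of feature and sentiment pairs) are assumptions made for illustration, not the script actually used in the experiment.

```python
def gold_annotations(annotator_a, annotator_b):
    """Keep only (feature, sentiment) labels that both annotators marked.

    Each argument maps a sentence to a set of (feature, sentiment) tuples.
    """
    gold = []
    for sentence, labels_a in annotator_a.items():
        agreed = labels_a & annotator_b.get(sentence, set())
        for feature, sentiment in sorted(agreed):
            gold.append((sentence, feature, sentiment))
    return gold


# Toy data: the annotators agree on the first sentence only.
a = {
    "The battery door keeps falling off.": {("Construction Quality", "Negative")},
    "The pictures are fine.": {("Picture Quality", "Positive")},
}
b = {
    "The battery door keeps falling off.": {("Construction Quality", "Negative")},
    "The pictures are fine.": {("Picture Quality", "Neutral")},
}
gold = gold_annotations(a, b)
```

Set intersection is used here so that a sentence labeled with several features contributes only the labels on which the two annotators match.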

[0043] With reference to FIG. 2, Survey 2 (shown at 200) asks the workers
to point out the exact portion of the sentence that indicates an opinion.
The opinion and its associated feature name are displayed along with the
sentence in which they appear. Such information is automatically
generated from the results derived from Survey 1.

[0044] The expected answer for this example is `I dropped my camera and it
broke into two parts,` or simply `my camera broke into two parts.`

[0045] Based on these two results:

[0046] 1. Given that by now we already know that the sentence `On my
recent trip to California, I dropped my camera and it broke into two
parts.` expresses a Negative opinion toward the feature Construction
Quality; and

[0047] 2. the exact portion of the sentence to indicate that opinion is
`my camera broke into two parts,` we can automatically generate the
following paraphrase pair:

[0048] My camera broke into two parts.=The construction quality of the
camera is bad.
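
The combination of the two survey results into a paraphrase pair can be sketched as follows. The table `STANDARD_EXPRESSIONS`, which maps a feature and sentiment to a standard expression, and the function `make_paraphrase_pair` are hypothetical; the description does not specify how the standard expression is stored or how the selected span is normalized.

```python
# Assumed lookup from (feature, sentiment) to the chosen standard expression.
STANDARD_EXPRESSIONS = {
    ("Construction Quality", "Negative"):
        "The construction quality of the camera is bad.",
}


def make_paraphrase_pair(span, feature, sentiment):
    """Pair the span selected in Survey 2 with the standard expression
    for the feature and sentiment established in Survey 1."""
    text = span.strip()
    if not text:
        raise ValueError("empty span")
    # Light normalization (an assumption): capitalize and end with a period.
    text = text[0].upper() + text[1:]
    if not text.endswith("."):
        text += "."
    return (text, STANDARD_EXPRESSIONS[(feature, sentiment)])


pair = make_paraphrase_pair(
    "my camera broke into two parts", "Construction Quality", "Negative"
)
```

Because every span labeled with the same feature and sentiment is paired with the same standard expression, repeated calls naturally build up a many-to-one mapping.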

[0049] As noted, using this method, we automatically generated 855
paraphrase pairs for the original 2000 sentences. An initial
investigation shows that about 87% of the generated paraphrases are
valid. This approach delivers high accuracy for low cost. We acquired the
results in a short time, e.g. a few days, with a minimal total cost,
including fees paid to the annotation sources. Using this approach, the
desired many-to-one mapping of paraphrases can be achieved. For example,
many paraphrases (such as selected parts of (1) to (3) above) are mapped
to the statement, "The construction quality of the camera is bad."
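
For orientation, the reported figures can be worked through in a few lines of Python. The yield follows directly from the numbers above; the count of valid pairs is an extrapolation that assumes the roughly 87% rate from the initial investigation holds across all 855 pairs, which the description does not state.

```python
total_sentences = 2000
gold_pairs = 855
approx_valid_rate = 0.87  # "about 87%" per the initial investigation

# Share of the original sentences that produced a paraphrase pair.
yield_rate = gold_pairs / total_sentences

# Extrapolation only: assumes the rate applies to all 855 pairs.
approx_valid_pairs = round(gold_pairs * approx_valid_rate)
```

Under that assumption, about 43% of the sentences yield a pair and roughly 744 of the pairs would be valid.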

[0050] We further tested the quality and effectiveness of the acquired
paraphrases by using them as our training data to perform a sentence
level sentiment extraction. The goal is to extract camera features and
their associated polarity values from the sentences. Our initial
experiment gives results that are comparable to state-of-the-art systems.
Furthermore, we compared the results from training on the acquired
paraphrases with the results from training on the original sentences from
which the paraphrases were derived and found that the former outperforms
the latter, which suggests that the paraphrases are indeed helpful for
this sentiment extraction task.

[0051] With reference to FIG. 3, a method 300 according to the presently
described embodiments is illustrated. It should be appreciated that the
method 300 may be implemented in a variety of manners, including the
implementation of various software techniques and hardware
configurations. One such implementation will be described in connection
with FIG. 4. However, it should be appreciated that a variety of such
configurations are contemplated by the presently described embodiments.

[0052] With reference back now to FIG. 3, the method 300 includes first
obtaining raw text or corpora (at 302). It should be appreciated that the
raw text can be obtained by the contemplated system from a variety of
sources including suitable databases. Next, the raw text is broken up
into, for example, individual sentences using, for example, a sentence
breaking routine (at 304). It should be appreciated that the parsing or
breaking of the text into sentences, or other units such as paragraphs,
phrases, words or groups of words, can be accomplished using a variety of
techniques familiar to those in the field. However, in at least one form,
this procedure is accomplished on an automated basis by suitable
processors.
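
One of many possible sentence breaking routines is sketched below in Python. It is a deliberately naive, regular-expression-based splitter chosen for brevity, not the routine used in the described embodiments; the function name `break_sentences` is hypothetical, and the rule will mis-split abbreviations such as "e.g." that a production sentence breaker would handle.

```python
import re


def break_sentences(raw_text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", raw_text.strip())
    return [p for p in parts if p]


review = (
    "I have to say the build of this camera is rather disappointing. "
    "The battery door keeps falling off! Is that normal?"
)
units = break_sentences(review)
```

The same function could be replaced by a splitter for paragraphs, phrases or other units without changing the steps that follow.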

[0053] The individual sentences are then provided or uploaded to an
annotating source for processing (at 306). It will be understood that
this processing includes the completion of surveys as described above.
The surveys are also uploaded to the annotating source. Next, the results
of the survey are obtained and/or downloaded to a suitable system (at
308). In some cases, the results are filtered to prepare for further
stages of annotation (at 310). It should be further understood that the
filtered results are then provided or uploaded to the annotating source
(e.g. in a first stage or stage 1 of the process) (at 312) and used in a
second survey (which is also uploaded) in manners similar to those
described above (e.g. in a second stage or stage 2 of the process). The
results are obtained and downloaded (at 314). The results of the second
survey can then be used to generate paraphrases and/or paraphrase pairs
as described above (at 316). Of course, further stages, e.g. up to stage
n, may be added to the process.
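
The flow of method 300 can be summarized in a hedged Python sketch. Everything named here is an assumption for illustration: `acquire_paraphrases`, the survey identifiers, the record layout, and in particular `stub_annotating_source`, which stands in for human respondents and does not represent any real survey platform interface.

```python
import re


def acquire_paraphrases(raw_text, annotating_source, keep, generate):
    """Two-stage flow; numerals in comments refer to the steps of method 300."""
    # 302/304: obtain raw text and break it into sentences (naive splitter).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", raw_text.strip()) if s]
    # 306/308: provide the sentences with the first survey; receive results.
    first_results = annotating_source("survey-1", sentences)
    # 310: filter the first-survey results.
    filtered = [r for r in first_results if keep(r)]
    # 312/314: provide filtered results with the second survey; receive results.
    second_results = annotating_source("survey-2", filtered)
    # 316: generate paraphrases from the second-survey results.
    return [generate(r) for r in second_results]


def stub_annotating_source(survey_id, items):
    """Stand-in for human respondents; real results would come from people."""
    if survey_id == "survey-1":
        return [
            {"sentence": s, "feature": "Construction Quality", "sentiment": "Negative"}
            if "broke" in s
            else {"sentence": s, "feature": None, "sentiment": None}
            for s in items
        ]
    # survey-2: pretend the respondent marked the text after the last comma.
    return [dict(r, span=r["sentence"].split(", ")[-1]) for r in items]


pairs = acquire_paraphrases(
    "On my recent trip to California, my camera broke into two parts. "
    "The weather was nice.",
    stub_annotating_source,
    keep=lambda r: r["sentiment"] is not None,
    generate=lambda r: (r["span"], r["feature"], r["sentiment"]),
)
```

Passing the annotating source, the filter and the generator as parameters keeps the sketch independent of any particular survey platform or filtering criterion.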

[0054] As noted above, it should be appreciated that the pruning of
non-informative phrases as discussed by way of example above (e.g. in
connection with FIG. 2), allows for the system to then more effectively
acquire more accurate paraphrases and enhance the process. In this way,
many-to-one mappings of paraphrases can be accomplished in which
non-informative content (such as the phrase "On my recent trip to
California," in the example above) has been pruned out, according to the
presently described embodiments.

[0055] With reference now to FIG. 4, a system implementation of the
presently described embodiments is illustrated. As noted above, the
system may be configured in a variety of manners; however, any such
system will be efficiently designed to prune non-informative phrases from
sentences to achieve the goals of the presently described embodiments.

[0056] As shown, a system 400 includes the processing module 402, input
raw data source 404, an annotating source 406 and an output 408. The
processing module 402 further includes a micro-processor 410, a raw data
buffer 412 and a results storage device 414. Also shown within the
processing module 402 are computing scripts used for various stages in
the process (at 420). More particularly, computing script modules for
stages 420-1, 420-2, . . . 420-n are shown. Again, the processing module
402 may be implemented using a variety of software techniques and
hardware configurations. The system shown is merely representative and
exemplary in nature.

[0057] Raw data source 404 may likewise take a variety of forms. Raw data
source 404 may comprise a database, or other server device that will
provide sufficient text or corpora to the system for training purposes.

[0058] The annotating source 406 also may take a variety of forms. In one
form, workers on the Amazon Mechanical Turk platform are used as the
annotating source 406. However, as noted above, it should be appreciated
that any source of survey results will suffice.

[0059] In operation, the micro-processor 410 uploads the raw data from the
raw data source 404 through, in one form, the raw data buffer 412, to the
annotating source 406 based on the script module for stage 1 of the
process. It should be understood that the raw data, in at least one form,
is broken up into individual sentences (or other units) as described
above (e.g. before or during uploading to the annotating source) by the
micro-processor 410 or other suitable processor. The annotating source
provides results (e.g. survey results obtained from respondents to Survey
1) to the micro-processor which stores the results in a storage, such as
results storage 414. The results from Survey 1 are filtered and uploaded
for stage 2 of processing by the micro-processor 410 based on the script
for stage 2. In this regard, the micro-processor 410 provides the
selected results and Survey 2 to the annotating source to obtain results
from stage 2 (e.g. results from Survey 2). Once the results for the
second stage are obtained, the paraphrases or paraphrase pairs may be
obtained, or further stages may be implemented. Also, it should be
appreciated that the uploading and downloading may be initiated and/or
accomplished manually or automatically.
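
The operation described above, generalized to stages 1 through n, can be sketched in Python under stated assumptions. The function `run_stages`, the per-stage dictionary with `survey` and `prepare` entries, and the dictionary used as results storage are all hypothetical stand-ins for script modules 420-1 to 420-n and results storage 414; the annotating source is again a stub rather than a real platform.

```python
def run_stages(units, stage_scripts, annotating_source):
    """Run each stage script in order, storing the results of every stage."""
    results_storage = {}  # stage number -> results (stand-in for storage 414)
    current = list(units)
    for number, script in enumerate(stage_scripts, start=1):
        # The stage script selects or filters what is uploaded for this stage.
        uploaded = script["prepare"](current)
        # The annotating source conducts the stage's survey on the upload.
        results = annotating_source(script["survey"], uploaded)
        results_storage[number] = results
        current = results
    return results_storage


def stub_source(survey, items):
    """Stub: tags each item with the survey name instead of asking people."""
    return [f"{survey}:{item}" for item in items]


scripts = [
    {"survey": "S1", "prepare": lambda items: items},
    {"survey": "S2", "prepare": lambda items: [i for i in items if "a" in i]},
]
storage = run_stages(["a", "b"], scripts, stub_source)
```

Adding a further stage amounts to appending another entry to `scripts`, which mirrors the way additional script modules may be added to the processing module.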

[0060] It will be appreciated that variants of the above-disclosed and
other features and functions, or alternatives thereof, may be combined
into many other different systems or applications. Various presently
unforeseen or unanticipated alternatives, modifications, variations or
improvements therein may be subsequently made by those skilled in the art
which are also intended to be encompassed by the following claims.