
Abstract:

Systems and methods for efficiently detecting and coordinating step
changes, trends, cycles, and bursts affecting lexical items within data
streams are provided. Data streams can be sourced from documents that can
optionally be labeled with metadata. Changes can be grouped across
lexical and/or metavalue vocabularies to summarize the changes that are
synchronous in time. The methods described herein can be applied either
retrospectively to a corpus of data or in a streaming mode.

Claims:

1. A method for detecting and coordinating change events in a data stream
comprising: monitoring, by a processor, over time, a probability of
occurrence of lexical items in a data stream comprising a plurality of
lexical items and a metavalue associated therewith, according to a
lexical occurrence model, to detect a plurality of change events in the
data stream; applying, by the processor, a significance test to the
change events to determine if the change events are statistically
significant; applying, by the processor, an interestingness test to the
change events to determine a measure of interest (I) indicating whether
the change events are likely to be of interest to a user, the
interestingness test defined using conditional mutual information between
the lexical items (W) and the lexical occurrence model (M) given a time
span (T) as provided by the relationship: I(W;M|T)=H(W|T)-H(W|M,T) where
H represents conditional entropy; and grouping the change events across
the lexical items and the metavalue to summarize the change events that
are synchronous in time, the grouping forming a set of grouped change
events.

3. The method of claim 1, wherein the lexical items in the data stream
comprise at least one of a single word, a symbol, a number, a date, a
place, a named-entity, a URL, textual data, multimedia data, and a token.

4. The method of claim 1, wherein the metavalue associated with the
lexical items includes at least one of external metadata and internal
metadata.

5. The method of claim 1, wherein the probability of occurrence of the
lexical items in the data stream is monitored over time according to the
lexical occurrence model to detect at least one of a step change, a
trend, a cycle, and a burst in the data stream.

6. The method of claim 1, wherein the lexical occurrence model includes
at least one of a piecewise-constant lexical occurrence model and a
piecewise-linear lexical occurrence model.

7. The method of claim 1, wherein the lexical occurrence model includes a
periodic component to detect cyclic change events and a piecewise-linear
component to detect acyclic change events.

8. A non-transitory computer readable storage medium comprising computer
readable instructions that, when executed by a processor, cause the
processor to perform operations comprising: monitoring, over time, a
probability of occurrence of lexical items in a data stream comprising a
plurality of lexical items and a metavalue associated therewith,
according to a lexical occurrence model, to detect a plurality of change
events in the data stream; applying a significance test to the change
events to determine if the change events are statistically significant;
applying an interestingness test to the change events to determine a
measure of interest (I) indicating whether the change events are likely
to be of interest to a user, the interestingness test defined using
conditional mutual information between the lexical items (W) and the
lexical occurrence model (M) given a time span (T) as provided by the
relationship: I(W;M|T)=H(W|T)-H(W|M,T) where H represents conditional
entropy; and grouping the change events across the lexical items and the
metavalue to summarize the change events that are synchronous in time,
the grouping forming a set of grouped change events.

10. The non-transitory computer readable storage medium of claim 8,
wherein the lexical items in the data stream comprise at least one of a
single word, a symbol, a number, a date, a place, a named entity, a
URL, textual data, multimedia data, and a token, and the metavalue
associated therewith.

11. The non-transitory computer readable storage medium of claim 8,
wherein the metavalue associated with the lexical items includes at least
one of external metadata and internal metadata.

12. The non-transitory computer readable storage medium of claim 8,
wherein the instructions for monitoring the probability of occurrence of
the lexical items in the data stream over time cause the processor to
detect at least one of a step change, a trend, a cycle, and a burst in
the data stream.

13. The non-transitory computer readable storage medium of claim 8,
wherein the lexical occurrence model includes at least one of a
piecewise-constant lexical occurrence model and a piecewise-linear
lexical occurrence model.

15. A system for detecting and coordinating change events in a data
stream, comprising: a processor; a memory in communication with the
processor, the memory having stored thereon instructions, executable by
the processor to cause the processor to perform operations comprising:
monitoring, over time, a probability of occurrence of lexical items in a
data stream comprising a plurality of lexical items and a metavalue
associated therewith, according to a lexical occurrence model, to detect
a plurality of change events in the data stream; applying a significance
test to the change events to determine if the change events are
statistically significant; applying an interestingness test to the change
events to determine a measure of interest (I) indicating whether the
change events are likely to be of interest to a user, the interestingness
test defined using conditional mutual information between the lexical
items (W) and the lexical occurrence model (M) given a time span (T) as
provided by the relationship: I(W;M|T)=H(W|T)-H(W|M,T) where H
represents conditional entropy; and grouping the change events across the
lexical items and the metavalue to summarize the change events that are
synchronous in time, the grouping forming a set of grouped change events.

16. The system of claim 15, wherein the data stream comprises a text
stream, and the lexical items comprise at least one of a single word, a
symbol, a number, a date, a place, a named entity, a URL, textual data,
multimedia data, and a token.

17. The system of claim 15, wherein the metavalue comprises at least one
of external metadata and internal metadata.

18. The system of claim 15, wherein the instructions for monitoring the
probability of occurrence of the lexical items in the data stream over
time cause the processor to detect at least one of a step change, a
trend, a cycle, and a burst in the data stream.

19. The system of claim 15, wherein the lexical occurrence model includes
at least one of a piecewise-constant lexical occurrence model and a
piecewise-linear lexical occurrence model.

20. The system of claim 15, wherein the lexical occurrence model includes
a periodic component to detect cyclic change events and a
piecewise-linear component to detect acyclic change events.

Description:

CROSS REFERENCE TO RELATED APPLICATION

[0001] This application is a continuation of U.S. application Ser. No.
12/325,157, filed Nov. 29, 2008, the entirety of which is incorporated
herein by reference.

TECHNICAL FIELD

[0002] The present disclosure relates generally to identifying trends in a
data set and, more particularly, to systems and methods for detecting and
coordinating changes in lexical items.

BACKGROUND

[0003] Text streams are ubiquitous and contain a wealth of information,
but are typically orders of magnitude too large in scale for
comprehensive human inspection. Organizations often collect voluminous
corpora of data continuously over time. The data may be, for example,
email messages, transcriptions of customer comments or of phone
conversations, recordings of phone conversations, medical records,
news-feeds, or the like. Analysts in an organization may wish to learn
about the contents of the data and the changes that occur over time,
including when and why, such that they may understand and/or act upon the
information contained within the data. Because of the large volume of
data, reading each document in the corpora of data individually to
determine the changes and summarize the contents can be expensive as well
as difficult or impossible.

SUMMARY

[0004] The present disclosure describes systems and methods for
efficiently detecting step changes, trends, cycles, and bursts affecting
lexical items within one or more data streams. The data stream can be a
text stream that includes, for example, documents and can optionally be
labeled with metadata. These changes can be grouped across lexical and/or
metavalue vocabularies to summarize the changes that are synchronous in
time. A lexical item can include a single word, a set of words, symbols,
numbers, dates, places, named-entities, URLs, textual data, multimedia
data, other tokens, and the like. A metavalue can include information
about incoming text or other incoming data. Metadata can be external
metadata or internal metadata. External metadata can include facts about
the source of the document. Internal metadata can include labels inferred
from the content. Examples of metavalues include, but are not limited to,
information about the source, geographic location, current event data,
data type, telecommunications subscriber account data, and the like.

[0005] In one embodiment of the present disclosure, a method for
efficiently detecting and coordinating change events in data streams can
include receiving a data stream. The data stream can include various
lexical items and one or more metavalues associated therewith. The method
can further include monitoring a probability of occurrence of the lexical
items in the data stream over time according to a lexical occurrence
model to detect a plurality of change events in the data stream. The
method can further include applying a significance test and an
interestingness test. The significance test can be used to determine if
the change events are statistically significant. The interestingness test
can be used to determine if the change events are likely to be of
interest to a user. The interestingness test can be defined using
conditional mutual information between the lexical items and the lexical
occurrence model given a time span to determine the amount of information
that is derived from the change event. The method can further include
grouping the change events across the lexical items and the metavalue to
summarize the change events that are synchronous in time. The method can
further include presenting, via an output device, a summarization of the
grouped change events to the user.

[0006] In some embodiments, the change events are step changes, trends,
cycles, or bursts in the data stream.

[0007] In some embodiments, the lexical occurrence model is a
piecewise-constant lexical model, for example, based upon a Poisson or
other distribution. In other embodiments, the lexical occurrence model is
a piecewise-linear lexical model, for example, based upon a Poisson or
other distribution. In still other embodiments, the lexical occurrence
model includes a piecewise-linear component and periodic component to
detect the change events in the data stream for recent data and long-span
data, respectively.

[0008] In some embodiments, the interestingness test can be defined by the
relationship:

I(W;M|T)=H(W|T)-H(W|M,T)

to determine the amount of information that is derived from the change
event.

[0009] In some embodiments, the method can further include applying the
monitoring step in a stream analysis mode. In a stream analysis mode, the
lexical occurrence model includes a slowly-evolving periodic component
for modeling regular cyclic changes, together with a piecewise-linear
component for modeling irregular acyclic changes that may occur over
either long or short timescales.

[0010] According to another embodiment of the present disclosure, a
computer readable medium can include computer readable instructions that,
when executed, perform the steps of the aforementioned method.

[0011] According to another embodiment of the present disclosure, a
computing system for detecting and coordinating change events in data
streams can include a processor, an output device, and a memory in
communication with the processor. The memory can be configured to store
instructions, executable by the processor to perform the steps of the
aforementioned method.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] FIG. 1 schematically illustrates a computing system for use in
accordance with various exemplary embodiments of the present disclosure.

[0013] FIG. 2 schematically illustrates a system architecture for
implementing a CoCITe (Coordinating Changes In Text) tool in a
retrospective analysis mode of operation in accordance with an exemplary
embodiment of the present disclosure.

[0014] FIG. 3 schematically illustrates a system architecture for
implementing a CoCITe tool in a stream analysis mode of operation in
accordance with an exemplary embodiment of the present disclosure.

[0015] FIG. 4 schematically illustrates a method for operating a CoCITe
tool in accordance with an exemplary embodiment of the present
disclosure.

[0016] FIG. 5 schematically illustrates a method for operating a CoCITe
tool in accordance with another exemplary embodiment of the present
disclosure.

[0017] FIG. 6 is an exemplary graph of a two-segment lexical occurrence
model with periodic modulation, according to the present disclosure.

[0018] FIG. 7 illustrates an exemplary optimization of lexical occurrence
model components, according to the present disclosure.

[0019] FIG. 8 is an exemplary graph of a significance test for
change-points, according to the present disclosure.

[0020] FIG. 9 is an exemplary graph of the likelihood computation time for
two exemplary likelihood computation methods, according to the present
disclosure.

[0021] FIG. 10 is an exemplary log-scale plot of the average per-word CPU
time to optimize a piecewise-linear model as a function of length of data
for two exemplary likelihood computation methods, according to the
present disclosure.

[0022] FIG. 11 is an exemplary graph of several profiles that show various
exemplary types of step events, each with an onset phase shown in bold
including one or more change-points, according to the present disclosure.

[0023] FIG. 12 is an exemplary graph of several profiles that show various
exemplary types of burst events, each with an offset phase shown in bold
including the onset phase, according to the present disclosure.

[0024] FIG. 13 is an exemplary plot of events on the ln-w plane, according
to the present disclosure.

[0025] FIG. 14 is an exemplary table summarizing results obtained by
applying a CoCITe method to various corpora, according to the present
disclosure.

[0026] FIG. 15 is an exemplary plot of two of the responses to the initial
greeting prompt for an Interactive Voice Response (IVR) application for
an electronics company over a 90-day period, according to the present
disclosure.

[0027] FIG. 16 is an exemplary plot for flight status requests at the
initial greeting for an airline application, according to the present
disclosure.

[0028] FIG. 17 is an exemplary table illustrating the top ten clusters
including a start date, the number of words, and the metavalues (states)
in each cluster for a plurality of events, according to the present
disclosure.

[0030] FIG. 19 is an exemplary plot of the profile of the burst event
using daily data for the death of Princess Diana, according to the
present disclosure.

[0031] FIG. 20 is an exemplary table illustrating event clusters for Enron
in the year 2000, according to the present disclosure.

[0032] FIG. 21 is an exemplary plot for daily and weekly periodic
variation for hourly data acquired from an IVR application, according to
the present disclosure.

[0033] FIG. 22 is an exemplary plot of data acquired from an IVR
application, according to the present disclosure.

[0034] FIG. 23 is an exemplary plot of Botnet activity as detected by an
exemplary CoCITe tool, according to the present disclosure.

DETAILED DESCRIPTION

[0035] As required, detailed embodiments of the present disclosure are
disclosed herein. It must be understood that the disclosed embodiments
are merely examples of the disclosure, which may be embodied in various
and alternative forms, and combinations thereof. As used herein,
the word "exemplary" is used expansively to refer to embodiments that
serve as an illustration, specimen, model or pattern. The figures are not
necessarily to scale and some features may be exaggerated or minimized to
show details of particular components. In other instances, well-known
components, systems, materials or methods have not been described in
detail in order to avoid obscuring the present disclosure. Therefore,
specific structural and functional details disclosed herein are not to be
interpreted as limiting, but merely as a basis for the claims and as a
representative basis for teaching one skilled in the art to variously
employ the present disclosure.

[0036] By way of example and not limitation, consider a flow of text in
the form of a stream of documents, each labeled with a time stamp and
optionally with metadata, for example, the values of zero or more
metavariables of the source. Each document can contain a set of words.
The analysis described herein is also applicable to more general lexical
items, such as, for example, phrases and non-local conjunctions. Given
the enormous volumes of text currently being acquired and stored in many
domains, it is impractical for human analysts to scan these volumes in
order to find and summarize the important changes that are occurring,
especially in a timely manner. Accordingly, the present disclosure
provides systems and methods for detecting changes in frequency of
occurrence of lexical items, either overall or for particular metavalues,
localizing these changes in time, and coordinating changes that are
synchronous in time across both lexical and metavalue vocabularies into
higher-order events.

[0037] The present disclosure approaches the term "event" from a
statistical view as would be understood by one skilled in the art. The
output of a system according to the present disclosure can be a set of
ranked groups, each of which can include one or more sets of lexical
items and metavalues together with a description of the timing of the
event, which can be a step, trend, cycle, burst, or the like. It is
contemplated that the system output can be accompanied by original
versions of documents that can be presented to an analyst for inspection.

[0038] Aspects of the present disclosure can be applied to documents of
any length, although accuracy has been found to increase for documents
that are relatively short. Documents can be divided into smaller
documents, paragraph by paragraph, sentence by sentence, word by word, or
character by character, for example. Some exemplary documents include:
[0039] search queries; [0040] instant messages; [0041] text messages;
[0042] customer care data, such as, but not limited to human-machine
dialogues (e.g.,

[0050] Metadata, if available, is valuable in several respects. Changes
are often concentrated in sub-streams of the text flow characterized by
particular metavalues. Hence, performing change-detection for individual
metavalues or groups thereof focuses the search where necessary and
avoids dilution. In addition, distinct groups of changes often overlap in
time and share words or metavalues. Also, availability of metadata helps
the coordination of changes into distinct events and avoids confusion.
From an analyst's perspective, having a change-event labeled with a
metavalue or group of metavalues helps to contextualize the change-event
and aids in understanding the change-event.

[0051] The potential disadvantages of using sub-streams are a loss of
power after separating the data into sub-streams for analysis, and
additional computational burden. To alleviate these disadvantages, the
present disclosure can impose a size limit on the metavalue vocabulary,
for example, by grouping metavalues to reduce computational burden. Size
limitations, if needed, can depend on the data set and the computational
resources available. A metavalue vocabulary size on the order of tens can
be preferable to one on the order of hundreds.

[0052] Conventional statistical tools can test two predetermined time
intervals for whether the frequency of a given lexical item changed. In
one embodiment of the present disclosure, neither the time intervals nor
the number of changes are predetermined. In one embodiment of the present
disclosure, the occurrences of the lexical item in a given text stream
are modeled by a Poisson process, and changes are expressed in terms of
the intensity of this process. The present disclosure can be fit to other
models, such as, but not limited to, processes described by generalized
Poisson distributions, binomial distributions, or negative binomial
distributions.

[0053] The present disclosure provides systems and methods for detecting
and coordinating changes of lexical items in the following exemplary
respects: [0054] The lexical vocabulary is not prescribed, although it
can be seeded with items of particular interest. [0055] Multiple
change-points for each lexical item can be detected using a dynamic
programming algorithm that ensures optimality. [0056] The Poisson
intensity parameter is assumed to be piecewise-linear. In addition to
step changes, this allows the event occurrence rate to trend upwards or
downwards in between the change-points. [0057] A multi-phase periodic
modulation can be superimposed on the intensity.

[0058] This allows for regular (e.g., weekly) cycles, and avoids the
redundant discovery of these as change-points. [0059] A measure of
interestingness is introduced. This weights each change-point by how much
information it provides, and complements the more conventional measure of
statistical significance. [0060] Metadata are expressly incorporated into
the analysis. [0061] Individual atomic changes affecting word/metavalue
combinations are grouped together where these are likely to arise from a
common cause. This provides a structured output that is easier for a
human analyst to assess.

[0062] Referring now to the drawings wherein like numerals represent like
elements throughout the drawings, FIG. 1 illustrates an exemplary
computing system 100 with which the present disclosure can be
implemented. The illustrated system 100 includes a system bus 102 that
couples various system components including a processor 104, a system
memory 106, a read only memory (ROM) 108, and a random access memory
(RAM) 110 to the processor 104. Other system memory can be available for
use as well. It can be appreciated that the present disclosure can
operate on a computing system with more than one processor 104 or on a
group or cluster of computing systems networked together to provide
greater processing capability. The system bus 102 can be any of several
types of bus structures including a memory bus or memory controller, a
peripheral bus, and a local bus using any of a variety of bus
architectures. A basic input/output system (BIOS), containing the basic
routines that help to transfer information between elements within the computing
system 100, such as during start-up, is typically stored in ROM 108. The
illustrated computing system 100 further includes a storage device 112,
such as a hard disk drive, a magnetic disk drive, an optical disk drive,
tape drive, or the like. The storage device 112 is connected to the
system bus 102 by a drive interface. The drives and the associated
computer readable media provide nonvolatile storage of computer readable
instructions, data structures, program modules, and other data for the
computing system 100. The basic components are known to those of skill in
the art and appropriate variations are contemplated depending on the type
of system, such as whether the system is a small, handheld computing
device, a desktop computer, a computer server, a network cluster, and the
like.

[0063] Although the exemplary environment described herein employs the
hard disk, it should be appreciated by those skilled in the art that
other types of computer readable media which can store data that are
accessible by a computer, such as magnetic cassettes, flash memory cards,
digital versatile disks, cartridges, RAMs, ROMs, a cable or wireless
signal containing a bit stream and the like, can also be used in the
exemplary operating environment.

[0064] To enable user interaction with the computing system 100, an input
device 114 represents any number of input mechanisms, such as a
microphone for speech, a touch-sensitive screen for gesture or graphical
input, keyboard, mouse, motion input, and the like. An output device 116
can also be one or more of a number of output means, such as a display,
monitor, projector, touch screen, multi-touch screen, or other output
device capable of presenting results data to an analyst in a visual
manner.

[0065] In some instances, multimodal systems enable a user to provide
multiple types of input to communicate with the computing system 100. A
communications interface 118 generally governs and manages the user input
and system output. There is no restriction on the present disclosure
operating on any particular hardware arrangement and therefore the basic
features here may be substituted, removed, added to, or otherwise
modified for improved hardware or firmware arrangements as they are
developed.

[0066] Referring now to FIG. 2, a system architecture 200 for implementing
a CoCITe (Coordinating Changes In Text) tool in a retrospective analysis
mode of operation is illustrated in accordance with an exemplary
embodiment of the present disclosure. The illustrated system architecture
200 includes a CoCITe tool 202 that can be configured to operate in a
retrospective analysis mode. In an exemplary embodiment, a corpus of data
204 is received at the CoCITe tool 202, analyzed over a specified period
of time according to a lexical occurrence model 206, output to a
visualization interface 208 (realized via one or more output devices
116), and presented to an end user, such as an analyst, in a graph, plot,
table, or other visualization. In the retrospective analysis mode, all
modeling and visualization covers the specified period of time.

[0067] Referring now to FIG. 3, a system architecture 300 for implementing
a CoCITe tool 302 in a stream analysis mode of operation is illustrated
in accordance with an exemplary embodiment of the present disclosure. The
illustrated system architecture 300 includes a CoCITe tool 302 that can
be configured to operate in a stream analysis mode. In an exemplary
embodiment, a corpus of data 304 is received at the CoCITe tool 302 and
analyzed together with a history file 306. A new history file 306 can be
generated together with the output of the change-detection algorithms
described herein. The history file 306 can include past data that is
useful for future analyses to create future training models in
conjunction with new data. The history file 306 does not grow without
bound because model segments are regularly transitioned to permanent
status and the history file 306 is updated accordingly. As the time span
lengthens, the first segment of the fitted model eventually becomes
permanent and the start point moves forward to the end of that segment.
Both temporary and permanent models go into the visualization covering
any time-span.

[0068] In the stream analysis mode, the CoCITe tool 302 can create
permanent segments (permanent models 308) of the lexical occurrence model
from temporary models 310 as the span of incoming data moves forward in
time. Accordingly, the CoCITe tool 302 can receive data on an on-going
basis, analyze the data, output results to a visualization interface 312
(realized via one or more output devices 116), and present the results to an end
user, such as an analyst, in a graph, plot, table, or other
visualization. In the stream analysis mode, new data arrives on an
on-going basis, existing models are extended and updated, and an
arbitrary time-span can be used for visualization.

[0069] The stream analysis mode improves efficiency over the retrospective
analysis mode because earlier data is already pre-processed for model
training and new data can be added expeditiously. The stream analysis
mode also decouples optimization of model components. The periodic
component changes slowly and the model is thereby trained using smoothed
data from a long time-span. The piecewise-linear component may change
quickly and the model is thereby trained using fully-detailed recent
data.

[0070] Referring now to FIG. 4, a method 400 for operating a CoCITe tool
202, 302 is illustrated, according to an exemplary embodiment of the
present disclosure. It should be understood that the illustrated method
400 can be performed by a CoCITe tool 202, 302 operating in a
retrospective analysis mode or a stream analysis mode as described above.
It should be understood that the steps of the method 400 are not
necessarily presented in any particular order and that performance of
some or all the steps in an alternative order(s) is possible and is
contemplated. The steps have been presented in the demonstrated order for
ease of description and illustration. Steps can be added, omitted and/or
performed simultaneously without departing from the scope of the appended
claims. It should also be understood that the illustrated method 400 can
be ended at any time. Some or all steps of this process, and/or
substantially equivalent steps, can be performed by execution of
computer-readable instructions included on a computer readable medium.

[0071] The method 400 begins and flow proceeds to block 402 wherein one or
more data streams including one or more documents each optionally labeled
with metadata are received at the CoCITe tool 202, 302. It should be
understood that the use of the term "documents" here is merely exemplary
and the data stream can alternatively include raw or unformatted text, or
other lexical items. Flow can proceed to block 404 wherein a
determination is made as to whether a lexical vocabulary is prescribed.
If a lexical vocabulary is not prescribed, flow can proceed to block 406
wherein a lexical vocabulary can be discovered. Flow can then proceed to
block 408 wherein the probability of occurrence of lexical items in the
incoming data streams over time is monitored. If a lexical vocabulary is
prescribed, flow can proceed directly to block 408. At block 410, changes
can be coordinated across lexical items and metadata. Flow can then
proceed to block 412 wherein results can be output for visualization in
the form of a graph, plot, table, or other visualization. The method can
end.

[0072] Referring now to FIG. 5, a method 500 for operating a CoCITe tool
202, 302 is illustrated, according to another exemplary embodiment of the
present disclosure. It should be understood that the illustrated method
500 can be performed by a CoCITe tool 202, 302 operating in a
retrospective analysis mode or a stream analysis mode as described above.
It should be understood that the steps of the method 500 are not
necessarily presented in any particular order and that performance of
some or all the steps in an alternative order(s) is possible and is
contemplated. The steps have been presented in the demonstrated order for
ease of description and illustration. Steps can be added, omitted and/or
performed simultaneously without departing from the scope of the appended
claims. It should also be understood that the illustrated method 500 can
be ended at any time. Some or all steps of this process, and/or
substantially equivalent steps, can be performed by execution of
computer-readable instructions included on a computer readable medium.

[0073] The method 500 begins and flow proceeds to block 502 wherein one or
more data streams including one or more documents each optionally labeled
with metadata can be received at the CoCITe tool 202, 302. At block 504,
an acyclic component of the lexical occurrence model can be defined such
that documents containing a particular lexical item are assumed to occur
at a rate described by an intensity function that is piecewise-linear
over time. For example, a Poisson distribution model or other
distribution models can be used. Each linear piece of the model is
referred to herein as a segment. There is no prescribed number of
segments. The acyclic component can be used to model step changes,
trends, and bursts in the incoming lexical items.

[0074] At block 506, an optional cyclic component of the lexical
occurrence model can be defined such that a multi-phase periodic
modulation can be superimposed on the intensity function. The cyclic
component can be used to model regular cyclic changes in rate and can
have multiple periods and phases. FIG. 6 illustrates a two-segment model
with periodic modulation that is modeled after a cyclic component of an
exemplary lexical occurrence model.

[0075] At block 508, the acyclic and cyclic model components are optimized
using a dynamic programming algorithm. The optimization maximizes the
likelihood of the data, which can be computed as the product of the
probabilities of the actual data values.

[0076] Referring briefly to FIG. 7, an exemplary optimization of the
lexical model components using a dynamic programming algorithm is
illustrated. The dynamic programming algorithm can optimize likelihood
for the piecewise-linear component given the most recent data. There is
no prescribed limit to the number of model segments in the optimization.
An overall quadratic-time implementation is contemplated. Measures of
significance and interest at change-points are used in the optimization.
The dynamic programming algorithm can use a maximum-likelihood procedure,
such as the exemplary procedure described herein below, to optimize the
periodic component.

[0077] At block 510, a significance test for change-points is applied.
Various exemplary significance tests are described herein below for a
piecewise-constant model and a piecewise-linear model. FIG. 8 is an
exemplary graph of a significance test for change-points, according to
the present disclosure. The difference in piecewise-linear segments is
shown, If both piecewise-linear segments are constant, a 2×2
contingency table can be used. Otherwise, a standard F-test can be used
to compare separate models (solid line) with a single model spanning both
segments (dashed line). A continuity test can reveal if the slope changes
but the intercept does not, then one less parameter is needed in the
overall model, The F-test comparing separate models with a weighted
two-phase regression model (green line).

[0078] At block 512, an interestingness test for change-points is applied.
The most significant changes are often not the most interesting. When
large amounts of data are received, a ranking based on significance can
obscure interesting changes affecting rare events. Accordingly, a measure
of interest or otherwise termed "interestingness" can be defined using
conditional mutual information between the lexical item (W) and the model
(M) given time (T):

I(W;M|T)=H(W|T)-H(W|M,T)

where H(•|•) is conditional entropy. The measure of interest quantifies
the amount of information that can be learned from the change in the
model, allowing for the fact that the models may each depend on time (a
trend segment). The measure is defined to cover all situations and can
therefore be used to rank changes consistently. From an analyst's
perspective, consistency of the interestingness measure is decisive.

[0079] At block 514, the change-points are coordinated. The
change-detection procedure typically produces a large volume of output.
An exemplary method for coordinating changes can identify change-events
as graph nodes, create edges between nodes that share words and/or
metavalues, run a clustering algorithm, and output a list of clusters
ranked by the measure of interest.
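A hedged sketch of this grouping step: change events become graph nodes,
near-synchronous events sharing a word or metavalue are linked, and
connected components become clusters ranked by interest. The ChangeEvent
fields and the union-find clustering are illustrative stand-ins for the
disclosure's algorithm, not a reproduction of it.

```python
from dataclasses import dataclass

@dataclass
class ChangeEvent:
    word: str
    metavalue: str
    start: int       # change-point (bin index)
    interest: float  # measure of interest I for this change-point

def cluster_events(events, max_start_gap=1):
    """Group change events into clusters via connected components."""
    parent = list(range(len(events)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, a in enumerate(events):
        for j in range(i + 1, len(events)):
            b = events[j]
            # Edge: near-synchronous in time, sharing a word or metavalue.
            if abs(a.start - b.start) <= max_start_gap and (
                    a.word == b.word or a.metavalue == b.metavalue):
                parent[find(i)] = find(j)

    clusters = {}
    for i, ev in enumerate(events):
        clusters.setdefault(find(i), []).append(ev)
    # Rank clusters by their most interesting member.
    return sorted(clusters.values(),
                  key=lambda c: max(e.interest for e in c), reverse=True)
```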

[0080] In addition to the above, an optional bigram check can be
implemented. Changes often occur for different words at the same time but
for different reasons. Metadata do not always exist and may not be
sufficient to separate node clusters. A bigram check can be used to add
an edge connecting events with distinct words only if the bigram
(document co-occurrence) frequency exceeds a threshold. The bigram check
is an effective filter against spurious combinations. It provides an
unbiased estimate of the true frequency of an arbitrary bigram from
merged priority-weighted samples of consolidated documents. The bigram
check is efficient and reliable and yields no false positives; most false
zeroes have true frequencies below the threshold values.
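A minimal sketch of the thresholded co-occurrence filter; the
priority-weighted sampling of consolidated documents described above is
not reproduced here, and plain document counts are used instead, with an
illustrative threshold:

```python
from itertools import combinations
from collections import Counter

def bigram_frequencies(documents):
    """Count document co-occurrence of word pairs (each doc is a set of words)."""
    counts = Counter()
    for doc in documents:
        for pair in combinations(sorted(doc), 2):
            counts[pair] += 1
    return counts

def allow_edge(word_a, word_b, counts, threshold=5):
    """Connect events with distinct words only if the words co-occur in
    at least `threshold` documents."""
    if word_a == word_b:
        return True
    pair = tuple(sorted((word_a, word_b)))
    return counts[pair] >= threshold
```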

[0081] At block 516, the results are output for visualization.
Visualization can be in the form of a graph, plot, table, or other
visualization output presented on one or more output devices 116. The method
500 can end.

[0082] Provided below are two exemplary models, a piecewise-constant
lexical occurrence model and a piecewise-linear lexical occurrence model.
These models are provided for further explanation of the aforementioned
systems and methods and are not intended to limit the scope of the
appended claims.

Exemplary Piecewise-Constant Lexical Occurrence Model

A. Text Data Stream

[0083] In one embodiment of the present disclosure, a piecewise-constant
model is used to detect and coordinate changes in lexical items. In this
embodiment, a typical source of lexical items, structured into documents,
each labeled with a time stamp and optionally with metadata, is
considered. An assumption is that each document contains a set of lexical
items that are of interest. In some embodiments, a prescribed vocabulary
is used. In other embodiments, an open-ended vocabulary is used. An
open-ended vocabulary can be acquired, for example, as part of the
analysis. In still other embodiments, a vocabulary can be seeded with
lexical items. The internal structure of each document can be ignored,
thereby treating each document or the collective whole of documents as a
set of words. Exceptions can include lexical items of interest that are
either n-grams or non-local conjunctions of words, in which case the
vocabulary of these can be prescribed in advance.

[0084] A system of the present disclosure can be used in either a
retrospective mode or a streaming mode. In retrospective mode, a corpus
of text files is presented for end-to-end processing. In streaming mode,
a summary file (previously generated by the system) is presented together
with the most recent data. A new or updated summary file can be generated
together with the output of the change-detection algorithms. The summary
file can contain enough information about the history for the system to
be able to reproduce the results as though it were done retrospectively,
but in far less time. Data can be carried forward from summary file to
summary file until a time horizon is reached which can depend on recent
change-points, so the summary file does not grow without bound.

[0085] In either mode, the system creates regular bins of data, for
example, daily, weekly, monthly, yearly, etc. The system can ignore the
arrival time of each document within each bin. For each bin, the system
can obtain frequency data: numbers of documents labeled with particular
metavalues, and numbers of documents labeled with particular metavalues
and containing particular words. The system can ignore multiple
occurrences of words within documents. In many instances, the presence of
a word in a document is more important than repetitions thereof because
repetitions often add little further information.
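A hedged sketch of this binning step, assuming each document arrives as a
(timestamp, metavalue, words) triple; the two count tables correspond to
the document totals and word-document counts used by the models that
follow:

```python
from collections import defaultdict

def bin_counts(documents, bin_of):
    """Build n[m][t] (documents with metavalue m in bin t) and
    f[w][m][t] (documents with metavalue m in bin t containing word w).

    `documents` yields (timestamp, metavalue, set_of_words) triples;
    `bin_of` maps a timestamp to a bin, e.g. lambda ts: ts.date() for
    daily bins.
    """
    n = defaultdict(lambda: defaultdict(int))
    f = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for timestamp, m, words in documents:
        t = bin_of(timestamp)
        n[m][t] += 1
        for w in set(words):  # presence, not repetition, is counted
            f[w][m][t] += 1
    return n, f
```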

[0086] Text streams always suffer from missing data. For this reason, the
system does not make any assumption that successive bins correspond to
regular time increments. If successive bins do correspond to regular time
increments, the system can be tolerant of bins that are empty or that
contain no data for particular metavalues.

[0087] The system analyzes frequencies of lexical items relative to
documents. If the number of documents in each bin varies substantially
then this can be separately tracked, but of greater interest here is the
content of these documents. This makes the analysis more robust to
missing data.

B. Poisson Likelihood

[0088] By way of example, consider a stream of bins of documents,
containing $n_{mt}$ documents labeled with metavalue m in the bin at t,
where $1 \leq m \leq M$ and t is discrete: $t = 1, \ldots, T$. Let the
(unknown) probability that a document labeled with metavalue m in the bin
at t contains word (or lexical item) w be $p_{wmt}$, and the measured
number of documents labeled with metavalue m in the bin at t that contain
word w be $f_{wmt}$. Assume a Poisson model for this quantity, i.e.

$$f_{wmt} \sim \mathrm{Poi}(n_{mt}\,p_{wmt})$$

where the present disclosure temporarily conflates the random variable
with the measured value.

[0089] In one embodiment, the Poisson parameter $p_{wmt}$ is
piecewise-constant in time. Let there be I time segments where the ith
segment starts at $s_i$ and ends at $e_i = s_{i+1} - 1$, with $s_1 = 1$
and $e_I = T$. Assume for now that this time-segmentation is known. We
also define $e_0 = 0$ and $s_{I+1} = T + 1$ for convenience, and $s_i$,
$i = 2, \ldots, I$ are referred to below as change-points. Let $T_i$
denote the time range $[s_i, e_i]$, and define

$$N_{mi} = \sum_{t=s_i}^{e_i} n_{mt}, \qquad
F_{wmi} = \sum_{t=s_i}^{e_i} f_{wmt}$$

For word w and metavalue m, the overall log-likelihood is provided by
equation (1), below.

[0090] The second term in equation (2) does not depend on the model or
segmentation and can be treated as constant during the optimization.
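As a hedged illustration only: equations (1) through (3) are not
reproduced in this text, so the sketch below reconstructs the
model-dependent part of a constant segment's log-likelihood directly
from the stated Poisson model $f_{wmt} \sim \mathrm{Poi}(n_{mt} p_{wmt})$.
At the maximum-likelihood rate $\hat{r} = F/N$ this part reduces to
$F \log \hat{r} - F$; the constant terms mentioned above are dropped.

```python
import math

def segment_loglik(f, n, s, e):
    """Model-dependent part of the Poisson log-likelihood for a constant
    segment [s, e] (inclusive), evaluated at the maximum-likelihood rate
    F/N. f[t] and n[t] are the per-bin word-document and document counts;
    terms that do not depend on the model (log f_t! and f_t log n_t) are
    dropped because they cancel when comparing segmentations.
    """
    F = sum(f[t] for t in range(s, e + 1))
    N = sum(n[t] for t in range(s, e + 1))
    if F == 0 or N == 0:
        return 0.0
    r_hat = F / N
    return F * math.log(r_hat) - F  # = sum_t [f_t log r - n_t r] at r = F/N
```

This quantity plays the role of L(s, τ) in the dynamic programming
recursion described below.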

C. Multi-Phase Periodic Modulation

[0091] The subscripts w and m are dropped hereinafter for brevity. Suppose
that for a word w and a metavalue m, there is a periodic modulation where
each bin t is labeled with a phase p from some set P. For example, for
daily binning P={Monday, . . . , Sunday}, or for hourly binning
P={0, . . . , 23}. More complex forms of cyclic behavior can also be
accommodated. There is no requirement for a fixed period on t because of
the possibility of missing data or, for example, to accommodate a monthly
variation and the fact that the months have unequal length. In this
embodiment, the present disclosure assumes that the time-segmentation is
known. Let $T_p$ denote the subset of T with phase p, and $T_{ip}$
denote the subset of $T_i$ with phase p. Also let

where $q_p \geq 0$ is common for all segments. Because only |P|-1 of
these values are independent, the present disclosure sets the largest
equal to one, and if all the remaining $q_p$ also equal one then there
is no periodic effect. The present disclosure can also map the phases to
a smaller set where the values of $q_p$ are similar. For daily binning,
for example, it has been found that different behavior is seen at
weekends compared with weekdays, but the weekend days are similar to each
other, as are the weekdays. P is then binary. This mapping can be
discovered automatically using a dynamic programming algorithm that
optimizes both the final number of phases and the mapping.

These may be solved for the |P|-1 independent values of $q_p$, and hence
the present disclosure obtains $\{r_i\}_{i=1,\ldots,I}$ using
equation (4). For a two-phase periodic modulation, equation (5)
transforms into a polynomial equation of degree I for the unknown $q_p$,
which can be solved exactly for $I \leq 4$ or numerically for any I.
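Equation (5) itself is not reproduced in this text, so the sketch below
substitutes a naive moment estimate for the periodic parameters: the
ratio of observed to model-expected counts within each phase, rescaled so
the largest q_p is one, together with the weekday/weekend phase mapping
mentioned above. It illustrates the idea only and is not the patented
estimator.

```python
import datetime
from collections import defaultdict

def phase_of(day: datetime.date) -> str:
    """Binary phase mapping for daily bins: weekdays behave alike, as do
    weekend days (a mapping the disclosure can discover automatically)."""
    return "weekend" if day.weekday() >= 5 else "weekday"

def estimate_q(f, n, r, phase=phase_of):
    """Naive moment estimate of the periodic parameters q_p.

    f[t], n[t]: observed counts per bin t; r[t]: the unmodulated model
    rate for bin t. Each q_p is the ratio of observed to expected counts
    over the bins of phase p, rescaled so the largest equals one.
    """
    obs, exp = defaultdict(float), defaultdict(float)
    for t in f:
        p = phase(t)
        obs[p] += f[t]
        exp[p] += n[t] * r[t]
    q = {p: obs[p] / exp[p] if exp[p] > 0 else 1.0 for p in exp}
    top = max(q.values())
    return {p: v / top for p, v in q.items()}
```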

D. Dynamic Programming Optimization

[0093] In this embodiment, the present disclosure assumes that the time
segmentation (equivalently the set of change-points $s_i$,
$i = 2, \ldots, I$) is unknown, although this may not necessarily be the
case. A dynamic programming algorithm can be used to efficiently find the
optimum segmentation. The periodic modulation parameters $q_p$ are
assumed known. The reason for this is that these are global parameters,
and to attempt to optimize these at the same time as the segmentation
would violate the Bellman principle of optimality. If
$\{q_p\}_{p \in P}$ are unknown then the method below can be iterated:
initially the present disclosure assumes all $q_p = 1$, finds the optimum
segmentation, and then solves equation (5) for $q_p$. The method can
repeat. This method generally converges after two or three iterations.

[0094] In one embodiment, the dynamic programming algorithm can be
represented as follows. Let [0095] A(J, τ) be the total log-likelihood
(excluding the constant term) for an optimal J-segment model on
$1 \leq t \leq \tau$, [0096] B(J, τ) be the location of the most recent
change-point (start of segment J) for this model, and [0097] L(s, τ) be
the contribution to the log-likelihood for the data from s to τ
inclusive, assuming a constant Poisson intensity optimized on that
interval, and ignoring the constant term. Then from equation (3), the
present disclosure derives equation (6), below.

[0109] In step 2(b), if a (J-1)-segment model exists on [1, s-1] (for
some s>1) then the latest segment on [s, τ] can potentially be appended
to it, giving a J-segment model on [1, τ]. The restriction sig(s)
denotes that the potential change-point at s satisfies both the criterion
of significance and that of interestingness. It is these criteria that
limit the number of segments I discovered: it is not uncommon for no
significant changes to be discovered, in which case the procedure
terminates with I=1.

[0110] This procedure is optimal: recursively, the optimal segmentation
into I segments on [1, T] must be given by the maximum over s of the
optimal segmentation into I-1 segments on [1, s-1] combined with a single
segment on [s, T]. And, no segmentation into fewer than I segments is
expected to give a higher likelihood than the optimum for I.
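The recursion lends itself to a compact implementation. The sketch below
is a simplified reconstruction of the A/B recursion described above for
the piecewise-constant case, reusing segment_loglik from the earlier
sketch; a crude likelihood-gain threshold min_gain stands in for the
sig(s) significance and interestingness tests, which the disclosure
defines separately.

```python
def optimal_segmentation(f, n, T, max_segments=5, min_gain=2.0):
    """Dynamic-programming search for an optimal piecewise-constant
    segmentation of bins 1..T.

    A[J][tau]: best log-likelihood of a J-segment model on [1, tau].
    B[J][tau]: start of segment J (the most recent change-point).
    With prefix sums making segment_loglik O(1), the triple loop is
    ~O(Imax * T^2), matching the quadratic time noted in the disclosure.
    """
    NEG = float("-inf")
    A = [[NEG] * (T + 1) for _ in range(max_segments + 1)]
    B = [[0] * (T + 1) for _ in range(max_segments + 1)]

    for tau in range(1, T + 1):
        A[1][tau] = segment_loglik(f, n, 1, tau)
        B[1][tau] = 1

    for J in range(2, max_segments + 1):
        for tau in range(J, T + 1):
            for s in range(2, tau + 1):  # candidate change-point at s
                if A[J - 1][s - 1] == NEG:
                    continue
                cand = A[J - 1][s - 1] + segment_loglik(f, n, s, tau)
                # sig(s) stand-in: accept the extra change-point only if
                # it buys at least min_gain log-likelihood.
                if cand > A[J][tau] and cand > A[J - 1][tau] + min_gain:
                    A[J][tau] = cand
                    B[J][tau] = s

    # Back-trace the best model on [1, T].
    J = max(j for j in range(1, max_segments + 1) if A[j][T] > NEG)
    segments, tau = [], T
    while J >= 1:
        s = B[J][tau]
        segments.append((s, tau))
        tau, J = s - 1, J - 1
    return list(reversed(segments))
```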

[0111] Various additional quantities are also stored during step 2(b) for
recovery during the back-trace for the optimum segmentation, including
the model parameters for the Jth segment [s, τ] (which for the
piecewise-linear model will be $\hat{a}_J$, $\hat{b}_J$), and the
measures of significance and interestingness for the change-point at s.
These quantities are then available for output at the end of the
procedure.

E. Significance Test for Change-Points

[0112] In an exemplary test for significance of a potential change-point
at s, let $s_{J-1} = B(J-1, s-1)$ be the start of the previous segment
J-1, and $e_{J-1} = s-1$ be its end. In one embodiment, the estimated
rate $\hat{r}_J$ from equation (7) can be significantly different from
that for the previous segment, which can be given by equation (8), below.

These two proportions can be compared using standard methods, for
example, a 2×2 contingency table using Fisher's method for small
frequencies and the chi-square test for large frequencies. If some
$q_p \neq 1$ then the denominators can take non-integer values, but the
nearest integer can be used.
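A hedged sketch of such a test using SciPy's standard contingency-table
routines; the counts are the F and N totals of the two segments, and the
switch-over threshold between Fisher's exact test and the chi-square test
is illustrative:

```python
from scipy.stats import chi2_contingency, fisher_exact

def step_significance(F_prev, N_prev, F_curr, N_curr, small=20):
    """p-value of a 2x2 contingency test comparing the occurrence
    proportions of two adjacent constant segments: Fisher's exact test
    for small counts, chi-square otherwise."""
    table = [[F_prev, N_prev - F_prev],
             [F_curr, N_curr - F_curr]]
    if min(F_prev, F_curr) < small:
        _, p = fisher_exact(table)
    else:
        _, p, _, _ = chi2_contingency(table)
    return p

# Example: 5 of 1000 documents before the change-point vs. 25 of 1000
# after it is a significant step.
assert step_significance(5, 1000, 25, 1000) < 0.01
```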

F. Measure of Interest for Change-Points

[0113] The most significant changes are often not the most interesting
ones. If a word (or more generally a lexical item) is relatively frequent
then changes affecting it are likely to be significant. However, changes
affecting less frequent items may be of greater interest to an analyst of
the data, in which case it is inappropriate to rank the items by
significance level. For this reason, the present disclosure can use a
separate criterion of interestingness, in addition to significance, both
as a test for acceptance of a potential change-point and as a ranking
criterion. A measure of interestingness provided herein is based upon
information theory.

[0114] The null hypothesis is that there is no change in rate at s, that
is, $r_J = r_{J-1}$. The present disclosure can test this hypothesis to
measure both significance and interestingness using the estimated values
from equation (7) and equation (8). The principal difference between
these two measures can be summarized as follows: if the null hypothesis
is false, then as the amount of data increases, the significance test
statistic increases in magnitude without bound, and the measure of
interest converges to a finite value depending only on $r_{J-1}$ and
$r_J$.

[0115] The degree of interest of a change in rate (from $r_{J-1}$ to
$r_J$) can be measured by the amount of information conveyed by this
change. To evaluate this, the present disclosure can compare two possible
models on the latest segment [s, τ]: the model derived for that segment
($r_J$) and the model extrapolated from the previous segment
($r_{J-1}$). The present disclosure can define the following three
variables: [0116] W: Bernoulli random variable for presence of a word
within a document, [0117] M: Bernoulli random variable for selecting
between the two models: 0 for $r_{J-1}$, 1 for $r_J$, [0118] T:
Discrete uniform random variable taking a value from s to τ.

[0119] The conditional mutual information between W and M given T can be
defined as shown below in equation (9):

$$I(W; M|T) = H(W|T) - H(W|M,T) \qquad (9)$$

where $H(\cdot|\cdot)$ is conditional entropy:

$$H(Y|X) = -\sum_x \sum_y P(x,y)\,\log_2 P(y|x)$$

I(W; M|T) measures the amount of information regarding W brought by
knowledge of M that is not already contained in T. A reason for adopting
this definition conditional on T is that this definition also covers the
case where the segments are not constant but involve trends. For the
piecewise-constant model, T conveys no information about W. Let
$P(M=1) = \theta$, and $L_J = \tau - s + 1$ be the length of the Jth
segment. If the variables W, M, T are independent, the joint distribution can be
given by

Equation (10) can be evaluated using the estimated values $\hat{r}_J$,
$\hat{r}_{J-1}$ from equations (7) and (8) with $\theta = 1/2$. It can be
appreciated that $I_{r_{J-1};r_J} \geq 0$, with
$I_{r_{J-1};r_J} = 0 \Leftrightarrow r_J = r_{J-1}$. Also,
$I_{r_{J-1};r_J} \leq 1$, with
$I_{r_{J-1};r_J} = 1 \Leftrightarrow r_{J-1} = 0, r_J = 1$ or vice versa.
The parameter w can control the sensitivity of the measure for infrequent
events; for example, as the value decreases, the sensitivity of the
measure increases. A value w=0.1 is a good compromise in practice. A
desirable feature of the interestingness measure is that it gives greater
weight to a small increment from close to zero than it does to the same
increment from higher up, which has less novelty value, as illustrated in
the following table:

[0122] With this formulation, the recursion step is $\sim O(T^2)$ in
time. The space requirements are quite modest: in addition to the above
linear arrays, $A(\cdot,\cdot)$ and $B(\cdot,\cdot)$ are each
$\sim O(I_{\max} T)$, where $I_{\max}$ is the maximum number of segments
permitted.
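The linear arrays mentioned above can be realized as prefix sums, so that
the F and N totals for any candidate segment are O(1) lookups rather than
O(T) sums. A minimal sketch, complementing the earlier segment_loglik
example:

```python
import math
from itertools import accumulate

def build_prefix_sums(f, n, T):
    """Prefix sums so F and N over any segment [s, e] are O(1) lookups,
    which is what keeps the recursion ~O(T^2) overall."""
    Fcum = [0] + list(accumulate(f[t] for t in range(1, T + 1)))
    Ncum = [0] + list(accumulate(n[t] for t in range(1, T + 1)))
    return Fcum, Ncum

def fast_segment_loglik(Fcum, Ncum, s, e):
    """O(1) version of segment_loglik from the earlier sketch."""
    F = Fcum[e] - Fcum[s - 1]
    N = Ncum[e] - Ncum[s - 1]
    return F * math.log(F / N) - F if F > 0 and N > 0 else 0.0
```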

Piecewise-Linear Lexical Occurrence Model

A. Poisson Likelihood

[0123] If the Poisson probability with which a lexical item occurs in a
document ($p_{wmt}$) trends gradually up or down over time, the
piecewise-constant model can represent this as a flight of steps, which
is suboptimal. Trends can be accommodated by assuming more generally that
$p_{wmt}$ is piecewise-linear. As above, it is initially assumed that the
segmentation is known. Again, the subscripts w and m are dropped for
brevity, and a periodic modulation is allowed for.

[0124] For the ith segment, let

$$p_t = q_p r_t \quad \text{for } t \in T_p$$

where $r_t = a_i + b_i(t - e_{i-1})$, with $e_{i-1} = s_i - 1$ being the
end of the previous segment. For a constant segment the coefficient
$b_i$ is zero. The log-likelihood equation (1) becomes equation (15),
below.

[0125] Again the final term does not depend on the model or segmentation,
and is the same constant term as before. Taking the partial derivative
with respect to $q_p$, equation (15) becomes equation (16), below.

[0126] Given a segmentation and a model in the form
$\{a_i, b_i\}_{i=1,\ldots,I}$, the present disclosure can obtain $q_p$ by
setting equation (16) to zero. However, maximizing equation (15) directly
with respect to $\{a_i, b_i\}_{i=1,\ldots,I}$ is not as simple because
the algorithm would involve additional iteration loops and would be too
slow.

B. Trend Segment Parameter Estimation

[0127] 1) Weighted Linear Regression: Because the log-likelihood is hard
to maximize for $a_i$, $b_i$, the present disclosure can use weighted
linear regression instead. Consider the regression model

[0130] Setting equation (16) to zero and substituting for the
weighted-least-squares estimates $\hat{a}_i$, $\hat{b}_i$ also enables us
to re-estimate the periodic modulation parameters $q_p$ from these
quantities to derive equation (27), below:

for all $p \in P$, where $\delta_{pm} = 1$ if p=m, otherwise zero. The
nullspace of this matrix (found using a singular value decomposition) is
spanned by the vector of reciprocals of the nonzero periodic parameters
and, once found, the nonzero periodic parameters can be scaled so that
the largest is equal to one.

[0131] 2) Likelihood Adjustment: If we assume $a_i = \hat{a}_i +
\varepsilon$, $b_i = \hat{b}_i + \delta$, substitute into the
contribution to the log-likelihood equation (15) from the ith segment,
set the derivatives with respect to $\varepsilon$ and $\delta$ to zero,
and expand to first order in $\varepsilon$ and $\delta$, then we get the
following pair of equations that are linear in these increments:

The equations immediately above can be solved for $\varepsilon$ and
$\delta$, giving improved estimates of the parameters, and the process
can be iterated. Generally, this process converges after one or two
iterations. The present embodiment now has estimates of $a_i$ and $b_i$
that maximize the likelihood; however, the likelihood is maximized at the
expense of additional summations over the data. Fortunately, the
weighted-least-squares estimates are usually very close to the maximum
likelihood estimates, so this step can be omitted if computational
efficiency is a priority.
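A brief numpy sketch of a weighted trend-segment fit. The disclosure's
weights $v_t$ are not reproduced here; weighting each bin by $n_t$ (so
that better-populated bins count more) is this sketch's own assumption:

```python
import numpy as np

def fit_trend_segment(f, n, s, e):
    """Weighted least-squares fit of r_t = a + b*(t - (s - 1)) on the
    segment [s, e], regressing the observed proportions f_t / n_t.
    Returns (a_hat, b_hat, weighted_rss)."""
    t = np.arange(s, e + 1)
    x = (t - (s - 1)).astype(float)  # offset from end of previous segment
    y = np.array([f[i] / n[i] for i in t])
    w = np.array([n[i] for i in t], dtype=float)  # assumed weights v_t

    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    resid = y - X @ beta
    return float(beta[0]), float(beta[1]), float(resid @ (w * resid))
```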

[0132] 3) Segment Constant vs. Trend: The decision as to whether to treat
the latest segment spanning [s, τ] as constant or trend can be based on
any combination of the following exemplary criteria: [0133] Absolute
value of the slope parameter; [0134] Change in $r_t$ over the length of
the segment; [0135] Significance of the regression slope; [0136]
Likelihood using the trend model compared to that for the constant model.
In practice, each of the aforementioned criteria has been found to be
useful. In general, each constant segment introduces one less parameter
into the overall model, resulting in a simpler description of the data.

C. Dynamic-Programming Optimization of PLM

[0137] The present embodiment can assume that the segmentation is not
known, although this is not necessarily the case. The optimization
proceeds similarly to that described above for the piecewise-constant
model. If the periodic modulation parameters $q_p$ are not known, as is
usually the case, then the procedure is to initially assume all
$q_p = 1$, find the optimum segmentation and model, re-estimate $q_p$
using equation (27), and repeat. Two or three iterations of this process
are generally sufficient.

[0138] The likelihood contribution L(s, τ) for the Jth segment [s, τ]
is obtained using equation (13) for a constant segment. For a trend
segment, equation (28), as shown below, is used.

The present embodiment defers consideration of how to express this in
terms of differences in cumulative values at segment endpoints. The
regression parameters and the residual sum of squares can all be
evaluated using linear-time arrays for the quantities defined in
equations (20) and (22), namely equation (29),

and so forth. All the quantities in equation (20) through equation (24)
can be obtained in this way, and also the regression parameters
$\hat{a}_J$, $\hat{b}_J$ from equation (25), the RSS from equation (26),
and the periodic modulation parameters from equation (27).

[0139] With the segment model and likelihood available for [s, τ], the
optimization can proceed once the restriction sig(s) is defined for
segments that may involve trends.

D. Significance Tests for PLM Change-Points

[0140] 1) Difference Between Regression Lines: Let $s_J = s$,
$e_J = \tau$ be the start and end of the Jth segment, and
$s_{J-1} = B(J-1, s-1)$, $e_{J-1} = s-1$ be the start and end of the
previous segment. Also define $e_{J-2} = s_{J-1} - 1$. Two tests can be
used for each candidate change-point. A first test can be used to decide
whether a significant change exists. A second test can be used to decide
what form the significant change takes.

[0141] The first test may be used when at least one of the two segments is
a trend. The null hypothesis (H_0) is that there is no change; that
is, the Jth segment is a linear extrapolation of the (J-1)st. A single
regression line is first fit through both segments as described above
to obtain the residual sum of squares RSS_0 using equation (26). The
alternative hypothesis (H_1) is that there is a change-point at s,
and RSS_1 is obtained as the sum of the residual sums of squares
over the two segments, fitted separately. Then the F-statistic, below,

defines the critical region. The number of degrees of freedom in the
denominator is n - m, where n = e_J - s_{J-1} + 1 is the total number of
data points in the two segments and m = 4 is the total number of
estimated parameters in the separate models. Although this test, and a
similar one in the next section, assumes normal residuals, the tests
have been found to work well in this application nevertheless.
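A sketch of the test follows (Python with scipy; the numerator degrees
of freedom, m - m0 = 2, are an assumption here, the text above fixing
only the denominator degrees of freedom n - m):

    from scipy import stats

    def change_point_f_test(rss0, rss1, n, m=4, m0=2):
        """F-test: H0 = single regression line (m0 parameters) against
        H1 = two separately fitted segments (m parameters in total),
        with n = e_J - s_{J-1} + 1 data points."""
        f = ((rss0 - rss1) / (m - m0)) / (rss1 / (n - m))
        p_value = stats.f.sf(f, m - m0, n - m)
        return f, p_value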

[0142] 2) Difference between Regression Slopes: If a change-point
involving a trend is significant then the next question that needs to be
addressed is whether the change involves a discontinuity (as for the
piecewise-constant model) or merely a corner, in which case the slope
changes but the intercept does not. A corner introduces one less
parameter into the overall model, resulting in a simpler description of
the data. To test whether a change involves a discontinuity, a modified
two-phase linear regression can be used that incorporates the weights
v_t. The null hypothesis H_0 is that the regression lines for segments
J-1 and J coincide at e_{J-1}:

a_{J-1} + b_{J-1}(e_{J-1} - e_{J-2}) = a_J

[0143] The above constraint can be incorporated into the weighted squared
error criterion using a Lagrange multiplier:

[0144] If a change-point is determined to be continuous with a corner then
the two-phase regression model can be adopted, as determined above for
both segments. However, if two consecutive change-points consist of such
corners then the middle segment would inherit two distinct models from
the separate two-phase regressions, and these would have to be
reconciled. So, instead, the present embodiment makes an adjustment to
the model for one segment only, depending on the type of the Jth segment,
as shown below.

Trend: set a′_J = a_{J-1} + b̂_{J-1}(e_{J-1} - e_{J-2})

Constant: set b̂′_{J-1} = (a_J - a_{J-1})/(e_{J-1} - e_{J-2})

In the first case the intercept of the Jth segment is adjusted to match
the end of the (J-1)st segment, whereas in the second the slope of the
(J-1)st segment, which has to be a trend, is adjusted to match the
intercept of the Jth segment. Although slightly suboptimal, this method
can handle any number of consecutive connected segments. Within the
dynamic-programming method, if b̂′_{J-1} is set in this way then, because
this affects the previous (not the current) segment, it can be recorded
in the main loop as

During the back-trace, if this value is nonzero for the Jth segment then
it overrides the usual value recorded for the (J-1)st, as in the sketch
below.
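The adjustment rules can be sketched as follows (Python; names are
illustrative, and the formulas are those reconstructed above):

    def reconcile_corner(kind_J, a_prev, b_prev, a_cur, e_prev, e_prev2):
        """Adjustment for a change-point that is continuous with a corner.
        kind_J is the type ('trend' or 'constant') of the Jth segment."""
        if kind_J == 'trend':
            # intercept of the Jth segment moved to the end of the
            # (J-1)st regression line
            return a_prev + b_prev * (e_prev - e_prev2), None
        # constant: slope of the (J-1)st segment (necessarily a trend)
        # moved to meet the intercept of the Jth segment; recorded against
        # the previous segment and applied during the back-trace if nonzero
        return None, (a_cur - a_prev) / (e_prev - e_prev2)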

E. Measure of Interest for PLM Change-Points

[0145] In addition to passing the significance test, a potential
change-point must again satisfy the interestingness requirement based on
conditional mutual information (equations (9) and (10)). The present
embodiment now involves four model parameters, as shown below in equation
(30).

The two models for the Jth segment [s_J, e_J] are the one derived for
that segment (â_J, b̂_J) and the one extrapolated from the preceding
segment (â_{J-1}, b̂_{J-1}). If the variables W, M, T are defined as
above, then the joint distribution is now given by:

Here, again, H(•) is the entropy function. The aforementioned
equations are evaluated using the estimated values â_{J-1}, b̂_{J-1},
â_J, b̂_J, and with θ = 1/2. It should be noted that the evaluation
involves six terms (two for each H(•)), all of which have the following
general form:

Σ_{t=s}^{e} (α + βt) log₂(α + βt)

for various values of α and β. Because the sum over t could
degrade the overall algorithm from quadratic time to cubic time, the
present embodiment can eliminate this possibility by applying the
Euler-Maclaurin formula in the following form:

All the terms on the right-hand side are evaluated at the endpoints of
the segment, and in practice the last term is usually negligible. All
that remains is to divide the result by ln(2). This makes it possible to
efficiently compute the conditional mutual information (equation (9)) and
the measure of interest (equation (30)).
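The following sketch contrasts the direct sum with an endpoint-only
evaluation (Python; the standard first-order Euler-Maclaurin form,
integral plus trapezoid plus first-derivative correction, is assumed
because the exact form is not reproduced above):

    import numpy as np

    def plogp_sum_direct(alpha, beta, s, e):
        """Direct O(e - s) evaluation of
        sum_{t=s}^{e} (a + b*t) log2(a + b*t)."""
        u = alpha + beta * np.arange(s, e + 1, dtype=float)
        return float(np.sum(u * np.log2(u)))

    def plogp_sum_endpoints(alpha, beta, s, e):
        """O(1) evaluation at the segment endpoints only."""
        f = lambda t: (alpha + beta * t) * np.log(alpha + beta * t)
        # antiderivative of (a + b*t) ln(a + b*t)
        F = lambda t: ((alpha + beta * t) ** 2
                       * (2 * np.log(alpha + beta * t) - 1) / (4 * beta))
        fp = lambda t: beta * (np.log(alpha + beta * t) + 1)   # f'(t)
        total = (F(e) - F(s)) + (f(s) + f(e)) / 2 + (fp(e) - fp(s)) / 12
        return total / np.log(2)   # convert natural log to log base 2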

[0146] Having the measure of interest consistently defined for both
constant and trend segments brings two major advantages:
[0147] 1) A single threshold value can be used for all change-points,
whether the previous and latest segments are constant or trend.
[0148] 2) The measure can be carried forward into the coordination phase
for weighting events that may extend over several consecutive
change-points of various types.
F. Quadratic-Time Implementation

[0149] Thus far, the following steps in the dynamic-programming
optimization of the piecewise-linear model are based on linear arrays
evaluated at segment ends:
[0150] 3) setting the parameters, assuming the likelihood adjustment step
is omitted,
[0151] 4) both significance tests, and
[0152] 5) the interestingness measure.
If the segment likelihood equation (28) can be similarly treated then a
complete linear-space, quadratic-time formulation results. First recall
the definitions in equations (11) and (12), and similarly define

[0153] This calculation leaves G(s, τ); it is this term alone that
currently keeps the algorithm cubic-time. For short segments the cost of
evaluating it directly is small, but for long segments it may be
burdensome. Let L ≧ 1 be a parameter that essentially governs the maximum
segment length for which the sum in equation (31) is evaluated directly.
The present embodiment can use a Chebyshev polynomial approximation to
ln(1+x) for 0 ≦ x ≦ 1 and the Clenshaw algorithm to convert
this to a regular polynomial, represented in equation (32):

ln(1 + x) ≈ Σ_{k=1}^{K} c_k x^k   (32)

where K = 11, accurate to 1×10⁻⁹ throughout the domain [0,1],
which is sufficient for present purposes.
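Such an approximation can be reproduced with standard tools, as in the
following sketch (Python/NumPy; these are not the disclosure's own
coefficients):

    import numpy as np
    from numpy.polynomial import Chebyshev, Polynomial

    # Degree-11 Chebyshev interpolant of ln(1 + x) on [0, 1], converted to
    # an ordinary power series (the role played by the Clenshaw conversion
    # above). Setting window = domain keeps the coefficients in raw x.
    cheb = Chebyshev.interpolate(np.log1p, 11, domain=[0, 1])
    poly = cheb.convert(kind=Polynomial, window=[0, 1])
    c = poly.coef        # c[1..11] play the role of the c_k in (32)

    x = np.linspace(0.0, 1.0, 2001)
    max_err = np.abs(np.log1p(x) - poly(x)).max()   # on the order of 1e-9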

where x_t = b̂_J(t - s + 1)/(a + b̂_J(s - 1 - w)). Since t ≦ v ≦ u, the
definition of equation (33) guarantees that 0 < x_t ≦ 1. Therefore, the
approximation of equation (32) can be used together with a standard
binomial expansion to obtain equation (35), below.

Although equation (35) involves a sum over 77 terms, there are no
function evaluations, and empirically it turns out to be faster than
direct evaluation of equation (31) for segment lengths greater than 15
(see below).

[0155] If b̂_J < 0 then the present embodiment proceeds in a similar
fashion, and only the result will be quoted. Define

[0156] Because the number of recursive function calls in equations (34)
and (36) depends on the values of â_J, b̂_J and not directly on the
segment time span (and in practice seldom exceeds 2), this completes a
linear-space, quadratic-time formulation. To assess this experimentally,
the inventors used the Magellan search-query corpus, selecting 20 words
that occur regularly throughout the corpus (interact, hotel, jobs, free,
home, software, music, american, games, email, computer, world, page,
school, real, college, state, tv, video, all). FIG. 9 shows the likelihood
computation time for both procedures as a function of segment length,
using a Linux server with a 3.8 GHz CPU. The end-point based method is
faster for segments longer than 15, so the parameter L is set to this
value. FIG. 10 is a log-scale plot of the average per-word CPU time to
optimize the piecewise-linear model as a function of length of data, for
both likelihood computation procedures. The time includes the initial
linear step of creating the arrays (a little larger for the end-point
based method because there are more of them), as well as the
dynamic-programming procedure. Using the end-point based method reduces
the overall time by a factor of two for 300 data bins and three for 1000.

Coordinating Changes

A. Step and Burst Events

[0157] The change-detection method described in previous sections
typically generates a lot of output. For each word/metavalue pair there
can be a sequence of change-points connecting piecewise-linear segments.
Some of these individual changes can be related to similar ones for many
other word/metavalue pairs. It is undesirable to leave it to a human
analyst to synthesize more meaningful events out of all these
elementary changes.

[0158] It is often the case that where a subset of all the change-points
for all word/metavalue combinations have a common cause, the overall
event can be visualized in three exemplary dimensions as follows:
[0159] 1) a subset W of words,
[0160] 2) a subset M of metavalues, and
[0161] 3) an interval T of time.
Ideally, precisely synchronized change-points would be found for the
Cartesian product of the sets of words and metavalues. However, this is
seldom the case in practice. Accordingly, the coordination algorithm can
be designed to tolerate missing word/metavalue combinations and lack of
synchrony (referred to herein below as dis-synchrony) in time.

[0162] It can be helpful to consider a new kind of event that can cover
several consecutive segments and therefore change-points. Each of these
events can have an onset phase, and can also have peak and offset phases.
The onset of an event need not consist of a single change-point. The
profiles illustrated in FIG. 11 show various possible types of step
event, each with an onset phase shown in bold including one or more
change-points. Similarly the profiles illustrated in FIG. 12 show various
possible types of burst event, each with an offset phase shown in bold in
addition to the onset phase. All these examples, except the second and
fourth examples illustrated in FIG. 12, also have a peak phase, in which
the rate is constant, between the onset and offset phases.

[0163] The overall change profile for a word/metavalue combination can, in
general, include several such events in sequence: zero or more bursts
followed by an optional step. An algorithm can post-process the change
profiles for each word/metavalue combination and form an overall list of
these events in the following exemplary form:

φ_j = (w_j, m_j, s_j, e_j, I_j), j = 1, . . . , N   (37)

where:
[0164] w_j is the word,
[0165] m_j is the metavalue,
[0166] s_j is the start-time,
[0167] e_j is the end-time (zero for a step event), and
[0168] I_j is the interestingness.
Because the onset and offset phases of these events can be extended, the
present disclosure can characterize the start-time using the first moment
of area of the profile during the onset phase about the point t = 0, and
similarly for the end-time. The interestingness of the event is based on
the quantity defined in section E. If the span of the event φ_j
consists of the segments i_1 ≦ i ≦ i_2 then define equation (38):

where I_{a_{i-1},b_{i-1};a_i,b_i} is the measure of interest for segment
i compared with the previous segment, as per equation (30). This assigns
a measure of interest in a natural way to the entire event.

[0169] There are various ways in which the present disclosure can measure
the dis-synchrony of two events φ_i, φ_j. A measure using only
|s_j - s_i| + |e_j - e_i| may not be sufficient because of the different
forms the onset and offset phases can take, as illustrated above; an
abrupt step could, for example, get grouped with a long trend. The
present embodiment adopts the simple expedient of also incorporating the
second moments of area of the onset and offset phases of φ_i and φ_j, as
sketched below. The actual definition of the dis-synchrony measure
d(φ_i, φ_j) involves further minor considerations that can be omitted
here.
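An illustrative stand-in for such a measure is sketched below (Python;
this is not the actual definition, which is omitted above):

    import numpy as np

    def phase_moments(t, r):
        """First and second moments of area of an onset (or offset)
        profile about t = 0; t are bin times, r the rate increments."""
        area = r.sum()
        m1 = (t * r).sum() / area               # centroid: start/end time
        m2 = ((t - m1) ** 2 * r).sum() / area   # spread of the phase
        return m1, m2

    def dis_synchrony(s_i, e_i, s2_i, e2_i, s_j, e_j, s2_j, e2_j):
        """Centroid differences plus a penalty for mismatched spreads, so
        an abrupt step does not pair with a long gradual trend."""
        d = abs(s_j - s_i) + abs(e_j - e_i)
        d += abs(np.sqrt(s2_i) - np.sqrt(s2_j))
        d += abs(np.sqrt(e2_i) - np.sqrt(e2_j))
        return d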

[0170] It is logical to separate groups of step events (with e_j = 0)
from groups of burst events (with e_j ≠ 0). The principle is the same
in each case. Events of the form φ_j form groups when words w_j and
metavalues m_j form sets W and M such that the Cartesian product W ⊗ M
is substantially covered with events φ_j that are substantially
synchronous in time.

B. Graph Clustering

[0171] To meet the challenge posed at the end of the previous section, the
present disclosure can use a graph clustering method. In testing, the
inventors determined that metric clustering algorithms did not work as
well as desired, because the space occupied by the events φ_j is a
metric space only in the time dimension. It should be understood,
however, that the use of metric clustering algorithms is not precluded.

[0172] Also, it should be understood that the aforementioned challenge is
not a bi-clustering problem, at least in part because it is possible, and
quite common, for words and/or metavalues to be shared between distinct
groups of events at different times, and sometimes even at the same
times. This is illustrated in FIG. 13. Each point represents an event
φ_j in the m-w plane (the time dimension is ignored, but the events are
assumed to be synchronous). It is natural to form the distinct groups
Φ_1, Φ_2 even though the word w_1 is shared.

[0173] So the imperative is to cluster the events φ_j while placing
emphasis on the Cartesian-product structure across the sets W and M. The
present embodiment can accomplish this by creating an undirected graph
with the events φ_j as nodes. Edges are created between pairs of
nodes (for example, φ_i and φ_j) that satisfy one of the
following three conditions (δ is a threshold):

Edges therefore exist between nodes that are sufficiently synchronous and
that share either the word or the metavalue, or that lie across the
diagonals of rectangular structures in the m-w plane where all four
corners are populated with events that are synchronous as a group (as in
FIG. 13). This third condition turns such a structure into a clique in
the graph. All edges have weights inversely dependent on d(φ_i, φ_j).

[0174] For clustering the nodes in the graph, the present disclosure can
use a procedure that reveals clusters of densely interconnected nodes by
simulating a Markov flow along the graph edges.
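One well-known procedure of this kind is the Markov Cluster (MCL)
algorithm of van Dongen; a minimal sketch follows (Python; the expansion
and inflation parameters are typical defaults rather than values from
this disclosure):

    import numpy as np

    def markov_flow_clustering(adj, expansion=2, inflation=2.0,
                               n_iters=100, tol=1e-8):
        """Cluster a weighted, symmetric adjacency matrix by simulated
        Markov flow; returns a cluster label for each node."""
        M = adj + np.eye(adj.shape[0])        # self-loops stabilize flow
        M = M / M.sum(axis=0, keepdims=True)  # column-stochastic matrix
        for _ in range(n_iters):
            prev = M.copy()
            M = np.linalg.matrix_power(M, expansion)  # flow spreads out
            M = M ** inflation                        # strong flows win
            M = M / M.sum(axis=0, keepdims=True)
            if np.abs(M - prev).max() < tol:
                break
        return M.argmax(axis=0)   # label nodes by dominant attractor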

C. Bigram Check

[0175] 1) Filtering Graph Edges: Despite the additional discriminative
leverage brought by the metadata, it is still possible that changes can
occur for separate words at or about the same time but for different
reasons, in which case groups can be generated that are misleading. Data
sets without metadata are especially prone to this phenomenon. For this
reason, the present embodiment can also perform a bigram check: for a
pair of distinct events φ_i, φ_j such that w_i ≠ w_j, an edge connecting
these events is added to the graph only if the bigram frequency for the
pair w_i, w_j exceeds a required threshold, which may depend on w_i and
w_j.

[0176] The bigram frequency can be defined as the total frequency of
documents containing both w_i and w_j over the range of data
concerned. There is no requirement that the words be adjacent or occur in
a particular order. Imposing this requirement ensures that the two words
co-occur in a sufficient number of the source documents, without regard
to metadata. This is an effective filter against spurious combinations.
It can be expensive to compute the bigram frequency because it may be
impractical to accumulate frequencies for all possible such bigrams
during the original binning. A separate pass over the raw data can be
implemented for this purpose. Requiring a separate pass can be slow and
especially undesirable for the streaming mode, in which case it may be
desirable to process all raw data only once.

[0177] 2) Priority Sampling Scheme: The present embodiment can resolve
the aforementioned challenge by using a priority sampling scheme through
which it is able to efficiently obtain an estimate for the frequency of
an arbitrary bigram post-hoc, without the need for a subsequent pass
through the raw data. The general principle of priority sampling can be
described as follows. Let there be n items i = 1, . . . , n with positive
weights v_i. For each item, define a priority q_i = v_i/r_i, where r_i is
a uniform random number on [0,1]. The priority sample S of size k < n
consists of the k items of highest priority. Let γ be the (k+1)st
priority, and let v̂_i = max{v_i, γ} for each sampled item i ∈ S. Now
consider an arbitrary subset U ⊆ {1, . . . , n} of the original items. It
can be shown that

An unbiased estimate of the total weight of the items in the arbitrary
subset U is therefore obtained from the priority sample by summing
v̂_i for those items that are also in U. This can be done for many
different subsets U after forming the priority sample, as sketched below.
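A minimal sketch of the sampling and estimation steps follows (Python;
names are illustrative). For the bigram check described next, the subset
U is the set of consolidated documents containing both words:

    import random

    def priority_sample(weights, k, rng=random):
        """Priority sample of size k: priority q_i = v_i / r_i with r_i
        uniform on (0, 1]. Returns ({item: v_hat_i}, gamma), where
        v_hat_i = max(v_i, gamma) and gamma is the (k+1)st priority."""
        prio = sorted(((v / (1.0 - rng.random()), i, v)
                       for i, v in enumerate(weights)), reverse=True)
        gamma = prio[k][0] if len(prio) > k else 0.0
        return {i: max(v, gamma) for _, i, v in prio[:k]}, gamma

    def subset_weight_estimate(sample, subset):
        """Unbiased estimate of the total weight of an arbitrary subset U:
        sum v_hat_i over sampled items that fall in U."""
        return sum(v_hat for i, v_hat in sample.items() if i in subset)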

[0178] The present embodiment employs this for the bigram check in three
stages. First, during the binning of the data the present embodiment
forms a list of consolidated documents by filtering out stop words and
words that are excluded from the final dictionary, then re-assembling
each document with the words in word dictionary order. Metadata can be
ignored. This enables the documents to merge as far as possible. The
total weight v_i of each consolidated document is its total frequency
within that bin. From this, the present embodiment can create the
priority sample for that bin as described above and export it along with
the word frequency data. In streaming mode, the priority samples are
carried forward within the summary file until the data drops off the time
horizon.

[0179] The second step is to form a merged priority sample for all
consolidated documents throughout the data, either from all the separate
bins (retrospective mode) or from the summary file together with the
latest data (streaming mode). For time and space economy it may be
necessary or desirable to discard the tail of the sample for each bin. If
this is done, the values of v̂_i can be re-assigned using the revised
value of γ, so that unbiasedness is preserved. The final step is to
estimate the frequency of an arbitrary bigram for a range of time by
summing the values of v̂_i for all the consolidated documents in the
merged priority sample that contain that bigram, over that range of time.
This can be done very quickly. A threshold can then be applied to the
estimated frequencies, as described above, in order to decide which edges
to add to the graph.

[0180] There are not expected to be "false positives" with this scheme. If
an estimated bigram frequency is greater than zero then the true
frequency must also be. However, there is expected to be "false zeros"
where the estimated bigram frequency is zero for a bigram that does
actually occur. The inventors have measured the true frequencies for
these false zeros and found that for a sufficiently large merged priority
sample ˜105 the true frequencies are typically very small and
below the threshold for acceptance.

D. Output of the Coordination Procedure

[0181] The graph clustering forms the nodes (events φ_j) into
groups. From this, the present embodiment can immediately generate a
structured output of the following form:

Φ_k = ({φ_kj}_{1≦j≦n_k}, T_k, W_k, M_k, I_k), k = 1, 2, . . . , K

sorted in decreasing order of I_k, where for each group Φ_k:
[0182] {φ_kj}_{1≦j≦n_k} is the set of either step or burst events, as
appropriate,
[0183] T_k is the time description,
[0184] W_k = ∪_{j=1}^{n_k} {w_kj} is the set of words,
[0185] M_k = ∪_{j=1}^{n_k} {m_kj} is the set of metavalues, and
[0186] I_k = Σ_{j=1}^{n_k} I(φ_kj) is the group measure of interest.

[0187] The time description T_k can take various forms depending on
the type of onset and the presence and type of offset. The group measure
of interest I_k is the total over those of the component events, per
equation (38). All that needs to be presented to the user are the time
T_k, the sets of words W_k and metavalues M_k, and perhaps a small sample
of the documents or a subset of the priority sample. This is information
on a digestible scale, which should enable the user to judge whether this
is an important event.

Results

A. Corpora

[0188] The following description provides some results obtained by
applying the aforementioned exemplary CoCITe procedure to various
corpora. FIG. 14 summarizes the essential statistics of the corpora. The
vocabulary size is the final vocabulary after preselection. There is
often a long vocabulary tail of words that do not occur often enough to
create a change-point, and these are excluded. The timing information
includes model fitting (in retrospective mode) and change-point
coordination but excludes text preprocessing and binning. The inventors
conducted experiments on a Linux server with a 3.8 GHz CPU.

[0189] The time requirements have been found to be roughly proportional
to the numbers of words and metavalues and to the square of the number of
bins. Sparsity also varies from one corpus to another and affects the
running time.

B. CHI Scan IVR Analysis

[0190] The first corpus consists of logs of human/machine automated
dialogs. CHI Scan is a tool for reporting, analysis and diagnosis of
interactive voice response (IVR) systems. IVR systems can operate using
natural language or directed dialog. Natural language allows a caller to
speak naturally. Directed dialog requires a caller to follow a menu
which, in some cases, only permits touch-toned responses. Designing,
monitoring, testing, and improving all IVR systems is predicated on the
availability of tools for data analysis. CHI Scan is a web-based
interactive tool for this purpose. In addition to providing both
high-level and in-depth views of dialogs between callers and automated
systems, CHI Scan provides views of changes occurring over time. Changes
may be either planned (via a new release of the system) or unplanned.

[0191] The CoCITe algorithm can be incorporated into the CHI Scan
software framework, and into similar software, using the streaming mode.
Each document is a complete dialog between a caller and the IVR system.
Changes in relative frequencies of the following are tracked:
[0192] Prompts: messages played to the caller;
[0193] Responses: callers' choices in response to prompts;
[0194] Call outcomes: transfers (to human agents), hang-ups (caller ends
the call), and end-calls (system ends the call); and
[0195] KPIs: key performance indicators of progress made within the
automation.

[0196] These are important metrics for evaluating and tracking IVR
systems over time, and can provide invaluable insight. No call metadata
are used at present for the CoCITe algorithm. However, for tracking the
responses, the relevant prompt is treated as a metavalue. This has the
effect of conditioning each response on a preceding occurrence of the
prompt, thereby ensuring that the distribution of responses is
normalized. This does not preclude the future use of call metadata as
well. Three versions have been implemented, using hourly, daily, and
weekly binning. FIGS. 15 and 16 provide examples of responses to the
initial greeting prompt at the start of each dialog, for two applications
using daily binning.

[0197] FIG. 15 shows two of the responses to the initial greeting prompt
for an IVR application for an electronics company, plotted over a 90-day
period. The dots are the actual data and the lines show the fitted
segment model. The lower plot of the pair shows a pronounced weekly
variation. Two periodic phases are sufficient: weekday and weekend. Both
plots show step changes on Jun. 7 and 28, 2007. Because the responses are
normalized, if one goes up then others must go down, and the remaining
responses (not shown) cover the remainder of the shift in the
distribution during that period. An image map on the CHI Scan web page is
enabled, so the user can get further details and navigate to particular
points just by using the mouse.

[0198] FIG. 16 shows a similar plot for "flight status" requests at the
initial greeting for an airline application. A regular weekly modulation
is superimposed on a four-segment model. The first two segments represent
a gradual increasing trend in such requests during the 2006 holiday
season, followed by a constant phase through Feb. 14, 2007. On this date
there was a snowstorm in the north-eastern United States that caused a
burst in requests for flight status that quickly decayed back to the
normal level. This phenomenon is captured by the final two segments. The
rather noisy signal (sequence of dots) therefore has quite a simple
description in terms of the piecewise-linear model with the periodic
cycle. There are some finer-grained phenomena that account for the
imperfect fit in places, but the threshold settings prevented the fitting
of more fragmentary segments. It should be noted that the illustrated
plots track relative responses. Events such as the snowstorm often cause
an increase in call volume as well as a shift in the distribution of call
intents, which can be tracked separately.

C. Customer Care Agent Notes

[0199] When a customer talks to a human agent, the agent typically makes
notes on the reason for the call and the resolution. These notes are a
mine of information on why customers are calling, but are usually far too
numerous to be read individually. These notes also tend to be rather
unstructured, containing many nonstandard abbreviations and spelling
errors. However, metadata about the customer are generally available.
Detecting and structuring the changes that occur within such streams of
notes can provide useful intelligence to the organization. FIG. 17
illustrates notes made during August and September 2005 by customer
service representatives talking with domestic residential
telecommunications customers. For each note the customer's location is a
useful metavalue. In order to avoid splitting the data into too many
sub-streams, with consequent loss of power, the state is used. FIG. 17
shows the top ten clusters including start date and the numbers of words
and metavalues (states) in each cluster.

[0200] Most of the clusters represent routine traffic, but cluster 6
(Hurricane Katrina) is unusual. Customers in the Gulf Coast region who
were affected by this disaster had special needs. Many change-points
therefore emerge, some involving entirely new words (e.g. Katrina), some
involving pre-existing words which increased in frequency (e.g.
hurricane), and some involving common words being used in new
combinations (e.g. home, destroyed). The coordination procedure groups
these changes as follows: [0201] Metavalues: Louisiana, Mississippi
[0202] Words: hurricane, Katrina, hurrican, house, affected, home,
victim, destroyed

[0203] The word list shown is a subset. Note the mis-spelling "hurricane,"
which occurs often enough to be picked up by the procedure. Tracking this
event over time we see it gradually tail off during the month of
September, 2005,

D. Search Query Data

[0204] Queries made to internet search engines can be treated as documents
for this analysis. Such queries tend to evolve over time, both cyclically
within the 24-hour period, and over a longer time-scale as changing
frequency of search terms reflects evolving interest in diverse topics.
FIG. 18 illustrates data acquired from the Magellan Voyeur service. This
service displayed the last 10 queries to the Magellan search engine, the
list being updated every 20 seconds. The list was sampled and archived at
10-minute intervals from 1997 through 2001 (a total of 1.7 million
queries containing 0.5 million distinct search terms). There are no
metadata because only the query text was revealed. The illustrated
results use both weekly bins, for longer-term changes, and daily bins,
for finer resolution.

[0205] Some rather generic terms (e.g. computer, school, jobs, weather)
show no change in rate throughout. Some show an increase in frequency
(e.g. hotel, Internet, IM), others a decrease (e.g. chatroom, telnet).
Many search terms show bursty behavior, and for grouping these in the
absence of metadata the bigram check is helpful for forming coherent
groups. Some search terms show an increase in frequency at the same time
(e.g. Linux and mall in November 1997) but for different reasons, and the
bigram check helps to prevent these from being grouped together. Some
groups of burst events generated by the coordination procedure are shown
in FIG. 18.

[0206] The profile of the burst event (using daily data) for the death of
Princess Diana is shown in FIG. 19. Note that there were no data for
31st August (the date of the accident) and 1 Sep. 1997 so the event
first appears on 2nd September. The initial burst for the word
"Diana" is followed by a sharp decline modeled by a linear trend, with a
corner on 11th September and a further step down on 9th
October. The profile for the word "princess" is similar. In a situation
such as this, an exponential function can be a better model than the
piecewise-linear one.

E. Enron Email Corpus

[0207] Turning now to FIG. 20, the Enron email dataset consists of
roughly 0.5 million messages belonging to a group of 150 users. For our
purposes the corpus can be considered a set of time-stamped observations
(email messages) along with the meta-variable of document ownership. This
data presents a challenge to analysis for a number of reasons. Most
importantly, email is readily forwarded, posted to lists, and embedded
within replies, among other operations that break assumptions of document
independence. Direct repetitions of message content are common. This
greatly exaggerates topic impact on word-level statistics, as well as
leading to the inclusion of non-topical words that happen to be in the
initial message and are then copied and recopied. Experiments on
automatic foldering of this corpus have revealed similar artifacts.

[0208] Thus, change clusters in the full Enron corpus are typically driven
by corporate mass mailings (all employees receive a copy) or by targeted
advertisements (multiple near-identical messages sent to a particular
user). Such effects are valid changes to the language model, but not
particularly illuminating as to user activity. To eliminate
non-informative "changes" driven by junk mail, we tried various forms of
pre-processing. Each user is associated with a number of online
identities. We report some results from analysis of messages which have
both sender and recipient fields including identities of members of the
user group (distinct members, since self-mailings between two accounts
are common). Junk email is no longer an issue. Repeated messages still
occur; it is difficult to distinguish between identical and
near-identical documents (e.g. a copy in the deleted items folder versus
a reply with a few new words attached to a copy of the old content). FIG.
20 illustrates the top ten clusters from CoCITe on messages with
date-stamps in the year 2000.

[0209] FIG. 21 is a plot illustrating data received from a customer care
IVR. This plot illustrates daily and weekly periodic variation for hourly
data over a 90-day period and 14-day period, respectively. In one
embodiment used to generate the data illustrated in FIG. 21, the CoCITe
tool 202, 302 is used to detect and coordinate patterns within IVR
responses. FIG. 22 is a plot illustrating responses received from a
customer care IVR during a 7-day period during which incoming callers are
prompted with a message, "To pay your bill or get other bill-related
options, Press 1. To check your services, Press 2. To get help with
services, Press 3. To report a lost or stolen device, Press 4. For Sales,
Press 5. For help with other issues including the option to speak with a
customer service professional, Press 0. To repeat these options, press
*." The illustrated responses are a "0" response requesting the call be
transferred to a customer service professional and a hangup response. In
one embodiment used to generate the data illustrated in FIG. 22, the
CoCITe tool 202, 302 is used to detect and coordinate patterns within IVR
responses.

[0210] FIG. 23 is a plot illustrating data received from Botnet activity
via an Internet Relay Chat (IRC) channel. In one embodiment used to
generate the data illustrated in FIG. 23, the CoCITe tool 202, 302 is
used to detect and coordinate patterns within IRC messages that are
characteristic of Botnet activity. The illustrated example shows a burst
of 556 similar messages from 110 different IP addresses (bots) to a
single control channel, coordinating a distributed denial of service
(DDoS) attack on a single target.

Conclusion

[0211] The present disclosure considers the problem of discovering and
coordinating changes occurring within text streams. Typically the volume
of text streams being acquired in many domains is far too large for human
analysts to process and understand by direct inspection, especially in a
timely manner. Therefore, there is a need for tools that can execute
change detection and coordination. Changes can be abrupt, gradual, or
cyclic. Changes can reverse themselves, and can occur in groups that have
a common underlying cause. A tool that is designed to accommodate these
behaviors can be of material assistance to analysts in providing them
with compact summaries of important patterns of change that would
otherwise be hidden in the noise. It is then for the analyst to decide
what priority to give to the discovered events.

[0212] The above description has described a methodology for efficiently
finding step changes, trends, and multi-phase cycles affecting lexical
items within streams of text that can be optionally labeled with
metadata. Multiple change-points for each lexical item are discovered
using a dynamic programming algorithm that ensures optimality. A measure
of interestingness has been introduced that weights each change-point by
how much information it provides, and complements the more conventional
measures of statistical significance. These changes are then grouped
across both lexical and metavalue vocabularies in order to summarize the
changes that are synchronous in time.

[0213] A linear-space, quadratic-time implementation of this methodology
(quadratic in the time span of the data) has been described, and it can
be applied either retrospectively to a corpus of data or in streaming
mode on an ongoing basis. The output of the tool can be a set of ranked
events, each including sets of lexical items and metavalues together with
a description of the timing of the event. This information, perhaps
augmented with a sample of the original documents, can assist a human
analyst in understanding an event and its significance.

[0214] The law does not require and it is economically prohibitive to
illustrate and teach every possible embodiment of the present claims.
Hence, the above-described embodiments are merely exemplary illustrations
of implementations set forth for a clear understanding of the principles
of the disclosure. Variations, modifications, and combinations may be
made to the above-described embodiments without departing from the scope
of the claims. All such variations, modifications, and combinations are
included herein by the scope of this disclosure and the following claims.