
Abstract:

Provided are techniques for generating order statistics and error bounds.
For each of multiple, distributed data sources, a finite number of data
bins are created for each field in that data source. Data values in each
of the multiple, distributed data sources are processed to generate basic
summaries for each of the data bins in a single pass of the data values.
The data bins from each of the multiple, distributed data sources are
sorted. One or more approximate order statistics are computed for a data
set by accumulating counts from a number of the sorted data bins. Lower
and upper error bounds are provided for each of the computed one or more
approximate order statistics, wherein the lower and upper error bounds
are values delimiting an interval containing a true value of an order
statistic.

Claims:

1. A method for generating order statistics and error bounds, comprising:
for each of multiple, distributed data sources, using a processor of a
computer, creating a finite number of data bins for each field in that
data source; processing data values in each of the multiple, distributed
data sources to generate basic summaries for each of the data bins in a
single pass of the data values; sorting the data bins from each of the
multiple, distributed data sources; computing one or more approximate
order statistics for a data set by accumulating counts from a number of
the sorted data bins; providing lower and upper error bounds for each of
the computed one or more approximate order statistics, wherein the lower
and upper error bounds are values delimiting an interval containing a
true value of an order statistic.

2. The method of claim 1, wherein software is provided as a service in a
cloud environment.

3. The method of claim 1, wherein the basic summaries for a data bin
comprise a count, a mean, a lower bound, and an upper bound for that
data bin.

4. The method of claim 1, further comprising calculating a power
transformation parameter for a Box-Cox transformation using the computed
one or more approximate order statistics.

5. The method of claim 1, further comprising: for each of the finite
number of data bins, generating a data bin of zero width; and in response
to receiving a new data value, determining whether the new data value is
to be added to an existing data bin; in response to determining that the
new data value is to be added to the existing data bin, adding the new
data value to the existing data bin; and updating basic summaries of the
existing data bin; in response to determining that the new data value is
not to be added to the existing bin, creating a new data bin for the new
data value; and creating basic summaries for the new data bin.

6. The method of claim 5, further comprising: merging each new data bin
with the existing bins in batches by adjusting the basic summaries of
each data bin involved in a merge when the number of bins exceeds a
preset threshold and while ensuring that width of the merged bins does
not exceed an approximation bound.

7. The method of claim 1, wherein a width of each data bin is maintained
within limits bounded by a range of data values divided by the finite
number of data bins.

Description:

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application is a continuation of and claims the benefit of
U.S. patent application Ser. No. 13/399,838, filed Feb. 17, 2012, which
application is incorporated herein by reference in its entirety.

FIELD

[0002] Embodiments of the invention relate to computing and applying order
statistics for data preparation.

BACKGROUND

[0003] The pervasiveness and quantity of electronic data available today in
all areas of human endeavor call for new approaches in order to extract
timely insights and actionable information based on the very large data
sets encountered in practice. In addition to sheer data volume, research
analysts face methodological challenges when encountering poorly
described or irregular data, such as continuous data with non-normal data
distribution.

[0004] Computation of order statistics and statistical data distributions,
along with the other field summaries, is an important part of robust
assessment of data properties, as well as data preparation for further
analyses. These summaries are useful in supporting data preparation and
diagnostics features, such as outlier detection, histograms, and box
plots that are based on order statistics and statistical data
distribution. Moreover, non-normal data usually require transformation to
normality for exploratory analysis and in preparation for modeling.

[0005] The cost of computing order statistics, statistical distributions,
and straightening transformations is prohibitive for large and
distributed data sets using available computation techniques. It requires
either storage of impermissible amounts of data in the main computer
memory or multiple data passes. Neither approach is efficient for
processing of large distributed data sets. This is in contrast to
available computation techniques for simple summaries, such as means or
standard deviations, that are computed in a single data pass with modest
memory storage requirements.

[0006] Some available computation techniques make the data ready for model
building without the need for prior knowledge of the statistical concepts
involved. Such available computation techniques do not support
computation on distributed data sources and are inefficient for very
large data sets requiring multiple data passes to accomplish several data
transformation steps sequentially.

[0007] Some conventional approaches focus on computing quantiles with
precision in a specified quantile range. Quantiles may be described as
data values taken at regular intervals from a cumulative distribution
function of a random variable. Dividing ordered data into q essentially
equal-sized data subsets is the motivation for q-quantiles; the quantiles
are the data values marking the boundaries between consecutive subsets.
Put another way, the k-th q-quantile marks the boundary at the k/q
fraction of the ranked data values and there are q-1 of the q-quantiles,
one for each integer k satisfying 0<k<q. Here, a more general
φ-quantile specification, where φ is a real number with
0≦φ≦1, is used, and the φ-quantile marks the
boundary at the φ fraction of the ranked data values. When queried
for a φ-quantile whose precise value is x, these conventional
approaches return an element y that is guaranteed to be in the
[φ-ε, φ+ε] quantile range. On the other hand,
there are no guarantees for the precision of y in terms of the x itself.
As a result, there can be uncontrolled errors in the location of the
computed approximate order statistics, thus invalidating location-based
statistical analysis. Moreover, the important information on the tails of
the statistical distribution and their possible anomalies may be lost.
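
As an illustration of the point above, the following minimal Python sketch shows how an answer that satisfies a rank guarantee of ε can still lie far from the true value x. The data set, the value ε = 0.01, and the helper name phi_quantile are assumptions chosen for illustration; they do not come from any particular conventional approach.

```python
import math

def phi_quantile(sorted_values, phi):
    """Return the element at 1-based rank ceil(phi * n), with a minimum rank of 1."""
    n = len(sorted_values)
    rank = max(1, math.ceil(phi * n))
    return sorted_values[rank - 1]

# Hypothetical data: 99 values packed just above 1.0 plus one extreme value.
data = sorted([1.0 + i * 0.001 for i in range(99)] + [1000.0])

phi, eps = 0.995, 0.01
exact = phi_quantile(data, phi)         # the true value x
allowed = phi_quantile(data, phi - eps) # an answer y inside [phi-eps, phi+eps]

# The rank error is at most eps, but the location error is very large.
location_error = abs(exact - allowed)
```

Here the permitted answer differs from the true quantile by almost the whole range of the data, which is the kind of uncontrolled location error in the tail that the paragraph describes.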

SUMMARY

[0008] Provided are a method, computer program product, and system for
generating order statistics and error bounds. For each of multiple,
distributed data sources, a finite number of data bins are created for
each field in that data source. Data values in each of the multiple,
distributed data sources are processed to generate basic summaries for
each of the data bins in a single pass of the data values. The data bins
from each of the multiple, distributed data sources are sorted. One or
more approximate order statistics are computed for a data set by
accumulating counts from a number of the sorted data bins. Lower and
upper error bounds are provided for each of the computed one or more
approximate order statistics, wherein the lower and upper error bounds
are values delimiting an interval containing a true value of an order
statistic.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

[0009] Referring now to the drawings in which like reference numbers
represent corresponding parts throughout:

[0014] FIG. 5 illustrates, in a flow diagram, operations to compute and
apply robust statistics for data preparation in accordance with certain
embodiments.

[0015] FIG. 6 illustrates, in a flow diagram, operations for stream
computation in each data source in accordance with certain embodiments.
FIG. 6 is formed by FIGS. 6A, 6B, and 6C.

[0016] FIG. 7 illustrates, in a flow diagram, operations for computing
approximate order statistics and error bounds from distributed data
sources in accordance with certain embodiments. FIG. 7 is formed by FIGS.
7A and 7B.

[0017] FIG. 8 illustrates, in a flow diagram, operations for generating
transformation rules and metadata in accordance with certain embodiments.

[0020] The descriptions of the various embodiments of the present
invention have been presented for purposes of illustration, but are not
intended to be exhaustive or limited to the embodiments disclosed. Many
modifications and variations will be apparent to those of ordinary skill
in the art without departing from the scope and spirit of the described
embodiments. The terminology used herein was chosen to best explain the
principles of the embodiments, the practical application or technical
improvement over technologies found in the marketplace, or to enable
others of ordinary skill in the art to understand the embodiments
disclosed herein.

[0021] It is understood in advance that although this disclosure includes
a detailed description of cloud computing, implementation of the
teachings recited herein is not limited to a cloud computing
environment. Rather, embodiments of the present invention are capable of
being implemented in conjunction with any other type of computing
environment now known or later developed.

[0022] Cloud computing is a model of service delivery for enabling
convenient, on-demand network access to a shared pool of configurable
computing resources (e.g. networks, network bandwidth, servers,
processing, memory, storage, applications, virtual machines, and
services) that can be rapidly provisioned and released with minimal
management effort or interaction with a provider of the service. This
cloud model may include at least five characteristics, at least three
service models, and at least four deployment models.

[0023] Characteristics are as Follows:

[0024] On-demand self-service: a cloud consumer can unilaterally provision
computing capabilities, such as server time and network storage, as
needed automatically without requiring human interaction with the
service's provider.

[0025] Broad network access: capabilities are available over a network and
accessed through standard mechanisms that promote use by heterogeneous
thin or thick client platforms (e.g., mobile phones, laptops, and PDAs).

[0026] Resource pooling: the provider's computing resources are pooled to
serve multiple consumers using a multi-tenant model, with different
physical and virtual resources dynamically assigned and reassigned
according to demand. There is a sense of location independence in that
the consumer generally has no control or knowledge over the exact
location of the provided resources but may be able to specify location at
a higher level of abstraction (e.g., country, state, or datacenter).

[0027] Rapid elasticity: capabilities can be rapidly and elastically
provisioned, in some cases automatically, to quickly scale out and
rapidly released to quickly scale in. To the consumer, the capabilities
available for provisioning often appear to be unlimited and can be
purchased in any quantity at any time.

[0028] Measured service: cloud systems automatically control and optimize
resource use by leveraging a metering capability at some level of
abstraction appropriate to the type of service (e.g., storage,
processing, bandwidth, and active user accounts). Resource usage can be
monitored, controlled, and reported providing transparency for both the
provider and consumer of the utilized service.

[0029] Service Models are as Follows:

[0030] Software as a Service (SaaS): the capability provided to the
consumer is to use the provider's applications running on a cloud
infrastructure. The applications are accessible from various client
devices through a thin client interface such as a web browser (e.g.,
web-based email). The consumer does not manage or control the underlying
cloud infrastructure including network, servers, operating systems,
storage, or even individual application capabilities, with the possible
exception of limited user-specific application configuration settings.

[0031] Platform as a Service (PaaS): the capability provided to the
consumer is to deploy onto the cloud infrastructure consumer-created or
acquired applications created using programming languages and tools
supported by the provider. The consumer does not manage or control the
underlying cloud infrastructure including networks, servers, operating
systems, or storage, but has control over the deployed applications and
possibly application hosting environment configurations.

[0032] Infrastructure as a Service (IaaS): the capability provided to the
consumer is to provision processing, storage, networks, and other
fundamental computing resources where the consumer is able to deploy and
run arbitrary software, which can include operating systems and
applications. The consumer does not manage or control the underlying
cloud infrastructure but has control over operating systems, storage,
deployed applications, and possibly limited control of select networking
components (e.g., host firewalls).

[0033] Deployment Models are as Follows:

[0034] Private cloud: the cloud infrastructure is operated solely for an
organization. It may be managed by the organization or a third party and
may exist on-premises or off-premises.

[0035] Community cloud: the cloud infrastructure is shared by several
organizations and supports a specific community that has shared concerns
(e.g., mission, security requirements, policy, and compliance
considerations). It may be managed by the organizations or a third party
and may exist on-premises or off-premises.

[0036] Public cloud: the cloud infrastructure is made available to the
general public or a large industry group and is owned by an organization
selling cloud services.

[0037] Hybrid cloud: the cloud infrastructure is a composition of two or
more clouds (private, community, or public) that remain unique entities
but are bound together by standardized or proprietary technology that
enables data and application portability (e.g., cloud bursting for load
balancing between clouds).

[0038] A cloud computing environment is service oriented with a focus on
statelessness, low coupling, modularity, and semantic interoperability.
At the heart of cloud computing is an infrastructure comprising a network
of interconnected nodes.

[0039] Referring now to FIG. 1, a schematic of an example of a cloud
computing node is shown. Cloud computing node 10 is only one example of a
suitable cloud computing node and is not intended to suggest any
limitation as to the scope of use or functionality of embodiments of the
invention described herein. Regardless, cloud computing node 10 is
capable of being implemented and/or performing any of the functionality
set forth hereinabove.

[0040] In cloud computing node 10 there is a computer system/server 12,
which is operational with numerous other general purpose or special
purpose computing system environments or configurations. Examples of
well-known computing systems, environments, and/or configurations that
may be suitable for use with computer system/server 12 include, but are
not limited to, personal computer systems, server computer systems, thin
clients, thick clients, handheld or laptop devices, multiprocessor
systems, microprocessor-based systems, set top boxes, programmable
consumer electronics, network PCs, minicomputer systems, mainframe
computer systems, and distributed cloud computing environments that
include any of the above systems or devices, and the like.

[0041] Computer system/server 12 may be described in the general context
of computer system executable instructions, such as program modules,
being executed by a computer system. Generally, program modules may
include routines, programs, objects, components, logic, data structures,
and so on that perform particular tasks or implement particular abstract
data types. Computer system/server 12 may be practiced in distributed
cloud computing environments where tasks are performed by remote
processing devices that are linked through a communications network. In a
distributed cloud computing environment, program modules may be located
in both local and remote computer system storage media including memory
storage devices.

[0042] As shown in FIG. 1, computer system/server 12 in cloud computing
node 10 is shown in the form of a general-purpose computing device. The
components of computer system/server 12 may include, but are not limited
to, one or more processors or processing units 16, a system memory 28,
and a bus 18 that couples various system components including system
memory 28 to processor 16.

[0043] Bus 18 represents one or more of any of several types of bus
structures, including a memory bus or memory controller, a peripheral
bus, an accelerated graphics port, and a processor or local bus using any
of a variety of bus architectures. By way of example, and not limitation,
such architectures include Industry Standard Architecture (ISA) bus,
Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video
Electronics Standards Association (VESA) local bus, and Peripheral
Component Interconnects (PCI) bus.

[0044] Computer system/server 12 typically includes a variety of computer
system readable media. Such media may be any available media that is
accessible by computer system/server 12, and it includes both volatile
and non-volatile media, removable and non-removable media.

[0045] System memory 28 can include computer system readable media in the
form of volatile memory, such as random access memory (RAM) 30 and/or
cache memory 32. Computer system/server 12 may further include other
removable/non-removable, volatile/non-volatile computer system storage
media. By way of example only, storage system 34 can be provided for
reading from and writing to a non-removable, non-volatile magnetic media
(not shown and typically called a "hard drive"). Although not shown, a
magnetic disk drive for reading from and writing to a removable,
non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk
drive for reading from or writing to a removable, non-volatile optical
disk such as a CD-ROM, DVD-ROM or other optical media can be provided. In
such instances, each can be connected to bus 18 by one or more data media
interfaces. As will be further depicted and described below, memory 28
may include at least one program product having a set (e.g., at least
one) of program modules that are configured to carry out the functions of
embodiments of the invention.

[0046] Program/utility 40, having a set (at least one) of program modules
42, may be stored in memory 28 by way of example, and not limitation, as
well as an operating system, one or more application programs, other
program modules, and program data. Each of the operating system, one or
more application programs, other program modules, and program data or
some combination thereof, may include an implementation of a networking
environment. Program modules 42 generally carry out the functions and/or
methodologies of embodiments of the invention as described herein.

[0047] Computer system/server 12 may also communicate with one or more
external devices 14 such as a keyboard, a pointing device, a display 24,
etc.; one or more devices that enable a user to interact with computer
system/server 12; and/or any devices (e.g., network card, modem, etc.)
that enable computer system/server 12 to communicate with one or more
other computing devices. Such communication can occur via Input/Output
(I/O) interfaces 22. Still yet, computer system/server 12 can communicate
with one or more networks such as a local area network (LAN), a general
wide area network (WAN), and/or a public network (e.g., the Internet) via
network adapter 20. As depicted, network adapter 20 communicates with the
other components of computer system/server 12 via bus 18. It should be
understood that although not shown, other hardware and/or software
components could be used in conjunction with computer system/server 12.
Examples include, but are not limited to: microcode, device drivers,
redundant processing units, external disk drive arrays, RAID systems,
tape drives, and data archival storage systems, etc.

[0048] Referring now to FIG. 2, illustrative cloud computing environment
50 is depicted. As shown, cloud computing environment 50 comprises one or
more cloud computing nodes 10 with which local computing devices used by
cloud consumers, such as, for example, personal digital assistant (PDA)
or cellular telephone 54A, desktop computer 54B, laptop computer 54C,
and/or automobile computer system 54N may communicate. Nodes 10 may
communicate with one another. They may be grouped (not shown) physically
or virtually, in one or more networks, such as Private, Community,
Public, or Hybrid clouds as described hereinabove, or a combination
thereof. This allows cloud computing environment 50 to offer
infrastructure, platforms and/or software as services for which a cloud
consumer does not need to maintain resources on a local computing device.
It is understood that the types of computing devices 54A-N shown in FIG.
2 are intended to be illustrative only and that computing nodes 10 and
cloud computing environment 50 can communicate with any type of
computerized device over any type of network and/or network addressable
connection (e.g., using a web browser).

[0049] Referring now to FIG. 3, a set of functional abstraction layers
provided by cloud computing environment 50 (FIG. 2) is shown. It should
be understood in advance that the components, layers, and functions shown
in FIG. 3 are intended to be illustrative only and embodiments of the
invention are not limited thereto. As depicted, the following layers and
corresponding functions are provided:

[0052] In one example, management layer 64 may provide the functions
described below. Resource provisioning provides dynamic procurement of
computing resources and other resources that are utilized to perform
tasks within the cloud computing environment. Metering and Pricing
provide cost tracking as resources are utilized within the cloud
computing environment, and billing or invoicing for consumption of these
resources. In one example, these resources may comprise application
software licenses. Security provides identity verification for cloud
consumers and tasks, as well as protection for data and other resources.
User portal provides access to the cloud computing environment for
consumers and system administrators. Service level management provides
cloud computing resource allocation and management such that required
service levels are met. Service Level Agreement (SLA) planning and
fulfillment provide pre-arrangement for, and procurement of, cloud
computing resources for which a future requirement is anticipated in
accordance with an SLA.

[0053] Workloads layer 66 provides examples of functionality for which the
cloud computing environment may be utilized. Examples of workloads and
functions which may be provided from this layer include: mapping and
navigation; software development and lifecycle management; virtual
classroom education delivery; data analytics processing; transaction
processing; and statistics computing and application.

[0054] Thus, in certain embodiments, software, implementing statistics
computing and application in accordance with embodiments described
herein, is provided as a service in a cloud environment.

[0056] The memory 440 may be described as a collection of memory reserved
for intermediate computations in all fields. The memory 440 may include
caches 442 and/or buffers 444. In certain embodiments, there is one cache
442 per each numeric field and one buffer 444 per each string field.

[0057] Univariate data summaries 430 encompass all summaries. Univariate
data summaries 430 are based on the whole dataset and include order
statistics, statistical data distributions and other descriptive
statistics, such as mean, variance, skewness, etc. The basic summaries
436 in each data bin (e.g., referenced in blocks 702a, . . . , 702n of
FIG. 7A) are different from the univariate data summaries 430 and include
the lower and upper bounds, count, and mean. Thus, basic summaries are
used with reference to the data bins, and univariate data summaries are
used with reference to the whole dataset.
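
The basic summaries kept per data bin can be pictured as a small record. The following is a minimal Python sketch, assuming a dataclass named Bin with fields lower, upper, count, and mean; the class name, the field names, and the helper from_value are hypothetical and are not taken from the embodiments.

```python
from dataclasses import dataclass

@dataclass
class Bin:
    """Basic summaries for one data bin; the data values themselves are not stored."""
    lower: float   # smallest value placed in the bin
    upper: float   # largest value placed in the bin
    count: float   # number of values (a float so that weighted counts fit)
    mean: float    # mean of the values in the bin

    @classmethod
    def from_value(cls, value, weight=1.0):
        # A bin holding a single value has zero width: lower == upper == mean.
        return cls(lower=value, upper=value, count=weight, mean=value)

    @property
    def width(self):
        return self.upper - self.lower

single = Bin.from_value(3.5)
```

A record of this shape is enough for the later steps, since merging and order-statistic estimation only need the bounds, the count, and the mean.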

[0058] The statistics engine 410 computes univariate data summaries 430
(such as order statistics, statistical data distributions, and other
descriptive statistics) for each field in very large and distributed data
sources. Data values for each field are aggregated into a finite-sized
list of data bins based on their location. Basic summaries 436 (such as
count, mean, minimum, and maximum) are maintained for each data bin,
while the data values are discarded. These basic summaries 436 are
further aggregated when merging data bins from different data sources of
the distributed data. This approach uses a single data pass and is
scalable since only a limited amount of memory is used per field. Final
summaries 438 provide approximate order statistics with limited location
errors, their deterministic error bounds, and an approximate
statistical data distribution for each field. The summaries for
categorical fields having a limited number of distinct values are exact.
The computed approximate order statistics are of interest in data
analysis because they realistically summarize the distribution of data
and are less sensitive to outlying data values than the simple summaries
(such as mean or variance).

[0059] The computed univariate data summaries 430 are used as inputs for
the data preparation component 422 for generating statistical
distribution plots, transformation rules for outlier detection, missing
data handling, and distribution straightening transformations. In order
to transform distributions to symmetry, certain embodiments use Box-Cox
power transformations providing a continuum of transformation functions
through an estimable parameter. The statistics engine 410 introduces a
new approach based on the computed approximate order statistics to
estimate the power transformation parameter. This approach provides a
transformation that is insensitive to extreme values and requires no
additional data access to obtain the robust transformation rule.
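
The text above does not spell out how the power transformation parameter is derived from the order statistics. As a hedged illustration only, the Python sketch below assumes one plausible criterion: pick the Box-Cox parameter for which the transformed quartiles are most nearly symmetric about the transformed median. The function names, the grid of candidate values, and the symmetry criterion are all assumptions, and the inputs must be positive.

```python
import math

def box_cox(x, lam):
    """Standard Box-Cox transform for x > 0."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

def estimate_lambda(q1, median, q3, grid=None):
    """Grid-search lambda so that transformed quartiles are symmetric (assumed criterion)."""
    if grid is None:
        grid = [i / 100 for i in range(-200, 201)]  # candidates from -2.00 to 2.00

    def asymmetry(lam):
        lo, mid, hi = (box_cox(v, lam) for v in (q1, median, q3))
        return abs((hi - mid) - (mid - lo))

    return min(grid, key=asymmetry)

# Quartiles that are already symmetric call for no reshaping (lambda near 1).
lam_symmetric = estimate_lambda(1.0, 2.0, 3.0)
# Quartiles that are symmetric on a log scale call for a log transform (lambda near 0).
lam_log = estimate_lambda(math.exp(-1.0), 1.0, math.exp(1.0))
```

Because only three order statistics enter the search, no further pass over the data is needed, and a few extreme values have no influence on the result.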

[0061] Order statistics depict the features of original data and are
mostly insensitive to extreme values. Computing the data preparation
transformations based on the small number of approximate order statistics
is more efficient than using a large volume of the original data. The
statistics engine 410 generates scalable and robust univariate data
summaries 430 in a single data pass, performs metadata discovery or
conformance rule checking to guarantee measurement level specification,
and generates data preparation rules based on the approximate order
statistics. Also, the transformation component 424 executes generated
transformation rules dynamically, while the data is read by the
subsequent analytic components. The statistics engine 410 serves as the
fundamental module for evaluating data quality and supporting exploratory
data analysis and predictive modeling.

[0062] FIG. 5 illustrates, in a flow diagram, operations to compute and
apply robust statistics for data preparation in accordance with certain
embodiments. Control begins at block 500 with the data summary component
420 receiving input data 460. In block 502, the data summary component
420 uses the input data 460 to generate one or more univariate data
summaries 430. In block 504, the data preparation component 422 uses the
one or more univariate data summaries 430 to generate one or more
transformation rules and metadata 432. In block 506, the transformation
component 424 uses the one or more transformation rules and metadata 432
to generate transformed data, statistics, and metadata 434. From block
506, processing may continue to blocks 508 and/or 510. In block 508, data
exploration is performed by some analytic engine using the transformed
data, statistics, and metadata 434. In block 510, predictive modeling is
performed by some analytic engine using the transformed data, statistics,
and metadata 434.

Data Summary Component

[0063] A metadata specification may be described as a specification of
various properties for each field in the data set. Given a new data set
with an undefined or an incomplete analytic metadata specification, the
data summary component 420 executes the initial data pass for the purpose
of generating univariate data summaries 430. This data pass produces
approximate field distributions and order statistics that provide
univariate statistics and support generating appropriate metadata
specifications. This data pass is for the new data sets and the results
remain available for any subsequent analyses.

[0064] The prior information requirement should be minimized for large and
distributed data sources because extracting basic metadata information
from them can be very expensive. The only information stored is the
storage type of each data field: numeric or string. The data summary
component 420 performs computation that consists of dispatching
operations to each data source, stream computation in each data source,
and a final consolidation stage.

Computation for a Numeric Field

[0065] For a numeric field, the dispatching operation parses computation
requirements from an application, constructs the content and order of the
computations, and dispatches the content and order of the computations to
each data source.

[0066] When performing stream computation in each data source, the data
summary component 420 treats the local data as a stream data source. The
data values of each numeric field are aggregated into a list of data bins
("bin lists"). Each data bin represents a cluster of data values that
specifies the smallest and the largest values, as well as the count and
the mean of the data values within that data bin. All counts in the data
bins are weighted when appropriate. The data bins do not store the data
values. After the data bin lists for the given field from different data
sources are created, the data bin lists are combined together. In certain
embodiments, the combined data bin list is sorted in ascending order
of the lower bounds of the data bins.

[0067] Order statistics (such as median, quartiles or percentiles) may be
approximated from the combined data bin list without sorting the original
field data values. Deterministic error bounds are provided for each of
the one or more order statistics, and it can be proved that the true
value of the related order statistic is within the estimated error
bounds. The width of each data bin is maintained within limits bounded by
the range of the data divided by the number of data bins per data source.
The errors and their bounds can be controlled by the user by
specifying the limit on the number of data bins. A larger number of data
bins will result in smaller errors. The term "smaller" refers to the size
of the error (i.e., the distance from the true values). The bounds on the
error size become smaller with the larger number of data bins. Moreover,
the achieved accuracy may be better due to the procedure exploiting
larger gaps in the range of data values. That is, data bins are not
necessarily adjacent, and larger gaps in data values are preserved as
gaps between data bins. This results in tighter error bounds than
calculated directly from the field range and the number of data bins.
Thus, with embodiments, the maximal location errors are limited.

[0068] The data summary component 420 creates the data bin list for the
field in each data source, and then estimates the error bounds based on
the combined data bin list. Error bounds are based on the observed data
values, and the size of the data bin list for each data source can be
fixed in advance. The error bounds are deterministic and provide overall
accuracy for the computed approximate order statistics.

[0069] Each data bin list is empty at the beginning of the data pass. New
data values are added as new data bins containing a single data value,
where lower bound, upper bound, and the mean are all equal. If a new data
value falls between the lower and the upper bound of an existing data
bin, the count and the mean for that data bin are updated with the data
value.
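The update rule above can be sketched as follows. This is a simplified illustration rather than the embodiment itself: the function name add_value and the dictionary layout are hypothetical, the bounds are assumed to be inclusive, and new data bins are appended directly instead of being cached.

```python
def add_value(bins, x, weight=1.0):
    """Add one data value to a bin list in a single pass.

    If x lies within an existing bin (bounds assumed inclusive), that
    bin's count and mean are updated; otherwise a zero-width bin is added.
    """
    for b in bins:
        if b["lower"] <= x <= b["upper"]:
            total = b["count"] + weight
            # Weighted running mean of the values in the bin.
            b["mean"] = (b["mean"] * b["count"] + x * weight) / total
            b["count"] = total
            return
    # New bin holding a single value: lower bound = upper bound = mean.
    bins.append({"lower": x, "upper": x, "count": weight, "mean": x})


bins = []
for value in [1.0, 3.0, 1.0]:
    add_value(bins, value)
print(bins)
```

After the three values are processed, the list holds two zero-width data bins, one of which has a count of 2.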

[0070] As additional data values arrive, the size of the data bin list
will reach a given threshold designed to conserve the memory usage. In
this case, the data summary component 420 merges some of the data bins to
keep the size of the data bin list limited. Instead of performing a merge
whenever a new data bin is created, the data summary component 420 caches
(i.e., stores in a cache 442 in memory 440) the new data bins to a
temporary data bin list until the number of data bins in the cache 442
reaches a given size threshold. The data summary component 420 performs
the merging procedure on the combined data bin list and the temporary
data bin list in the cache 442.
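The batching idea can be sketched as below. The class name CachedBinList, the cache_limit parameter, and the compact callable are hypothetical; compact stands in for whatever merging procedure is used, and in this example it only sorts the data bins so that the batching behavior is visible.

```python
class CachedBinList:
    """Collect new bins in a cache and merge them in batches."""

    def __init__(self, cache_limit, compact):
        self.bins = []            # main data bin list
        self.cache = []           # temporary list of newly created bins
        self.cache_limit = cache_limit
        self.compact = compact    # hypothetical merging procedure

    def add_new_bin(self, new_bin):
        self.cache.append(new_bin)
        # Merge only when the cache reaches its size threshold.
        if len(self.cache) >= self.cache_limit:
            self.flush()

    def flush(self):
        self.bins = self.compact(self.bins + self.cache)
        self.cache = []


def sort_only(bins):
    # Stand-in for the merging procedure: sorts, does not merge.
    return sorted(bins, key=lambda b: b["lower"])


cached = CachedBinList(cache_limit=3, compact=sort_only)
for v in [5.0, 1.0, 3.0, 2.0]:
    cached.add_new_bin({"lower": v, "upper": v, "count": 1.0, "mean": v})
print(len(cached.bins), len(cached.cache))  # 3 1
```

The fourth value stays in the cache until the threshold is reached again or flush is called at the end of the data pass.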

[0071] Each pair of data bins that have the least distance between their
means are merged together repeatedly until the threshold size of the data
bin list is reached. When data bins are merged, the counts and the means
are aggregated from the contributing data bins. The lower bound is set to
the smaller of the two former lower bounds, while the upper bound is set
to the larger of the two former upper bounds. An additional criterion for
merging data bins is that the width of the newly formed data bin should be
less than twice the range divided by the number of data bins active for
merging. This ensures that the final estimates have limited error bounds.
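A minimal sketch of merging one pair of data bins and checking the width criterion is given below. The function names merge_bins and within_width_limit and the sample numbers are hypothetical; the aggregated mean is assumed to be the count-weighted mean of the two contributing means.

```python
def merge_bins(a, b):
    """Merge two bins: add counts, weight the means, widen the bounds."""
    count = a["count"] + b["count"]
    mean = (a["mean"] * a["count"] + b["mean"] * b["count"]) / count
    return {
        "lower": min(a["lower"], b["lower"]),
        "upper": max(a["upper"], b["upper"]),
        "count": count,
        "mean": mean,
    }


def within_width_limit(merged, data_range, n_active):
    """Width must be less than twice the range divided by the number
    of bins active for merging."""
    return (merged["upper"] - merged["lower"]) < 2 * data_range / n_active


a = {"lower": 0.0, "upper": 1.0, "count": 2.0, "mean": 0.5}
b = {"lower": 2.0, "upper": 3.0, "count": 6.0, "mean": 2.5}
merged = merge_bins(a, b)
print(merged)
print(within_width_limit(merged, data_range=10.0, n_active=5))   # True
print(within_width_limit(merged, data_range=10.0, n_active=10))  # False
```

Here the merged data bin spans 0.0 to 3.0 with a mean of 2.0; it passes the width check when 5 data bins are active (limit 4.0) and fails when 10 are active (limit 2.0).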

[0072] When merging the data bins, the first and the last S data bins are
not involved in finding the nearest data bin pairs. S may be described as
a specified constant that does not vary with the size of the data.
Therefore, the S smallest values and the S largest values are preserved
as potential outliers for a later data preparation. This feature
preserves an accurate representation of the two tails of data
distribution.
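The tail-preserving reduction can be sketched as follows. The function name reduce_bins and its parameters are hypothetical, the data bins are assumed to be ordered by their means, and the width criterion is left out to keep the example short.

```python
def reduce_bins(bins, max_bins, s):
    """Merge the nearest pairs of bins (by mean) until at most max_bins
    remain, never touching the first s or the last s bins."""
    bins = sorted(bins, key=lambda b: b["mean"])
    while len(bins) > max_bins:
        lo, hi = s, len(bins) - s          # active region is bins[lo:hi]
        if hi - lo < 2:
            break                          # nothing left that may be merged
        # With bins sorted by mean, the closest pair is always adjacent.
        i = min(range(lo, hi - 1),
                key=lambda k: bins[k + 1]["mean"] - bins[k]["mean"])
        a, b = bins[i], bins[i + 1]
        count = a["count"] + b["count"]
        bins[i:i + 2] = [{
            "lower": min(a["lower"], b["lower"]),
            "upper": max(a["upper"], b["upper"]),
            "count": count,
            "mean": (a["mean"] * a["count"] + b["mean"] * b["count"]) / count,
        }]
    return bins


values = [0.0, 1.0, 2.0, 2.25, 3.0, 5.0, 9.0, 10.0]
single = [{"lower": v, "upper": v, "count": 1.0, "mean": v} for v in values]
reduced = reduce_bins(single, max_bins=6, s=2)
print([(b["lower"], b["upper"], b["count"]) for b in reduced])
```

With S set to 2, the two smallest values (0.0 and 1.0) and the two largest values (9.0 and 10.0) remain in their own data bins, while 2.0, 2.25, and 3.0 are merged into one data bin.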

[0073] FIG. 6 illustrates, in a flow diagram, operations for stream
computation in each data source in accordance with certain embodiments.
FIG. 6 is formed by FIGS. 6A, 6B, and 6C. Control begins at block 600
with the data summary component 420 receiving input data from one data
source. In block 602, the data summary component 420 determines whether,
in the input data, there is a new data value to be processed. The new
data value represents a new data value for a field. If so, processing
continues to block 604; otherwise, processing continues to block 614
(FIG. 6C).

[0078] In block 612, the data summary component 420 merges data bins with
single data values each in the cache 442 with the existing data bins.
From block 612, processing continues to block 602 (FIG. 6A).

[0079] From block 602, if there is no new data value to be processed in
the input data, processing continues to block 614. In block 614, the data
summary component 420 merges additional data bins with single data values
each in the cache 442 with existing data bins. In block 616, the data
summary component 420 outputs a data bin list for the data source with
the bounds, counts, and means for each data bin.

[0080] Once each data source is processed in accordance with the
processing of FIG. 6, specified quantiles are computed in a final
consolidation stage. The count of every data bin from each data source is
accumulated in the ascending order of the data bins' lower bounds until
the accumulated count exceeds the count corresponding to the desired
quantile. The estimated quantile value is the mean of the last
accumulated data bin. The lower error bound of the estimated order
statistics is the lower bound of the last accumulated data bin, and the
upper error bound of the estimated order statistics is the largest upper
bound of all the accumulated data bins. For example, the approximate
median can be found by accumulating the counts of all data bins in
ascending order of lower bounds until the accumulated count is larger
than 50 percent of the total count. The approximate median equals the
mean of the last accumulated data bin. The lower bound of the last
accumulated data bin and the largest upper bound of all the accumulated
data bins are the lower and the upper error bounds for the approximate
median, respectively.
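The consolidation step can be sketched in Python as below. The function name approximate_quantile and the sample data bins are hypothetical, and the count corresponding to quantile q is assumed to be q times the total count.

```python
def approximate_quantile(bins, q):
    """Return (estimate, lower_error_bound, upper_error_bound) for quantile q."""
    ordered = sorted(bins, key=lambda b: b["lower"])
    target = q * sum(b["count"] for b in ordered)
    accumulated = 0.0
    max_upper = float("-inf")
    for b in ordered:
        accumulated += b["count"]
        max_upper = max(max_upper, b["upper"])  # largest upper bound so far
        if accumulated > target:
            return b["mean"], b["lower"], max_upper
    # q = 1.0 never exceeds the total count; fall back to the last bin.
    last = ordered[-1]
    return last["mean"], last["lower"], max_upper


all_bins = [
    {"lower": 6.0, "upper": 9.0, "count": 3.0, "mean": 7.0},
    {"lower": 0.0, "upper": 2.0, "count": 4.0, "mean": 1.0},
    {"lower": 1.0, "upper": 5.0, "count": 3.0, "mean": 3.0},
]
print(approximate_quantile(all_bins, 0.5))  # (3.0, 1.0, 5.0)
```

In this example the total count is 10, so accumulation stops at the second data bin in sorted order (accumulated count 7 exceeds 5). The approximate median is 3.0, with error bounds 1.0 and 5.0.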

[0082] In blocks 706-712, the data bin lists and basic summaries 436 of
each data source are combined. In block 706, the data summary component
420 collects and sorts (i.e., orders) data bins. In block 708, the data
summary component 420 accumulates counts from a sufficient number of
sorted (i.e., ordered) data bins. In certain embodiments, the term
"sufficient" refers to the accumulated counts exceeding the counts
corresponding to desired quantiles. In block 710, the data summary
component 420 computes approximate order statistics and error bounds
based on the accumulated counts. In block 712, the data summary component
420 outputs the approximate order statistics and error bounds for the
overall data from all of the data sources, wherein the lower and upper
error bounds are values delimiting the interval containing the true value
of an order statistic for each computed approximate order statistic. In
particular, the upper and lower bounds are provided along with each
computed approximate order statistic. The true value lies in the interval
between the lower and upper bound and, therefore, within limited distance
from the computed approximate order statistic.

Computation for a String Field

[0083] For a string field, the dispatching operation parses computation
requirements from an application, constructs the content and order of the
computations, and dispatches the content and order of the computations to
each data source.

[0084] Stream computation in each data source treats the local data as a
stream data source.

[0085] In certain embodiments, a buffer 444 in memory 440 containing
distinct values with corresponding counts is maintained for each string
field. If a given data string value is found in the buffer 444, its count
is updated accordingly. Otherwise, a new distinct value is added to the
buffer 444 and its frequency is set to the case weight. Once the buffer
444 contains M distinct string values, the new distinct values are no
longer added into the buffer 444, and all the other string values are
counted as a single group. M is set large enough to preserve the string
field information and it also depends on the available memory.
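A minimal sketch of such a buffer is shown below. The class name StringSummary and its attribute names are hypothetical; max_distinct plays the role of M, and the weight argument plays the role of the case weight.

```python
class StringSummary:
    """Count distinct string values, keeping at most max_distinct of them."""

    def __init__(self, max_distinct):
        self.max_distinct = max_distinct  # plays the role of M
        self.counts = {}                  # distinct value -> weighted count
        self.other = 0.0                  # all values seen after the cap

    def add(self, value, weight=1.0):
        if value in self.counts:
            self.counts[value] += weight
        elif len(self.counts) < self.max_distinct:
            self.counts[value] = weight
        else:
            # Buffer is full: count the value in the single "other" group.
            self.other += weight


summary = StringSummary(max_distinct=2)
for s in ["a", "b", "a", "c", "b", "d"]:
    summary.add(s)
print(summary.counts, summary.other)  # {'a': 2.0, 'b': 2.0} 2.0
```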

[0086] In the final consolidation stage, the frequencies for the same
distinct string value from all the data sources are added together, and
the distinct string values with the largest L counts are selected to
represent this field. If any of the contributing buffers 444 contains M
distinct values, the final computed counts provide the lower bounds for
the actual counts.
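The consolidation of string counts can be sketched as follows. The function name consolidate_string_counts and its parameters are hypothetical; top_l plays the role of L and max_distinct plays the role of M.

```python
from collections import Counter


def consolidate_string_counts(buffers, top_l, max_distinct):
    """Add up per-source counts and keep the top_l most frequent values.

    Also reports whether any source buffer was full, in which case the
    totals are only lower bounds for the actual counts.
    """
    totals = Counter()
    for buf in buffers:
        totals.update(buf)  # adds the counts of matching keys
    lower_bounds_only = any(len(buf) >= max_distinct for buf in buffers)
    return totals.most_common(top_l), lower_bounds_only


buffers = [{"a": 3, "b": 1}, {"a": 2, "c": 4}]
top, lower_bounds_only = consolidate_string_counts(buffers, top_l=2, max_distinct=2)
print(top, lower_bounds_only)  # [('a', 5), ('c', 4)] True
```

Both buffers in this example are full (two distinct values each with M set to 2), so the reported counts are flagged as lower bounds.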

Data Preparation Component

[0087] To gain the metadata information and assure its validity, the data
preparation component 422 generates appropriate metadata specifications
based on the univariate data summaries 430 obtained from the data summary
component 420. These specifications are either produced by applying the
metadata discovery rules when no metadata information is available from
the data source or updated by applying conformance rules when metadata is
available but possibly mis-specified. Numeric field storage type can be
specified as either integer or real.

[0088] The data preparation component 422 also implements various features
after metadata specification, such as outlier detection and handling,
missing value handling, Box-Cox transformation (which transforms
distributions to symmetry), etc. These features use the approximate order
statistics from the data summary component 420, as well as the metadata
specifications as input. Their output contains transformation rules for
creating new fields. Using the approximate order statistics has two
merits: (1) no additional data pass is required which saves time,
especially for the very large data sets; and (2) it makes the
transformations more robust against the extreme values in data.

[0089] Embodiments use the computed approximate order statistics in
Box-Cox transformation. Specifically, the Box-Cox transformation function
based on the original data y_i, i=1, . . . , N, is specified as
follows:

where sd(g(λ)) is the standard deviation of Box-Cox transformation
of y values.
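For reference, the conventional textbook form of the Box-Cox transformation and its profile log-likelihood is given here. This is an assumption about the form intended and not a quotation of the specification. The symbols g, sd, λ, y_i, and N follow the surrounding text, while the name ℓ for the log-likelihood is introduced here for illustration.

```latex
g(y_i;\lambda) =
\begin{cases}
\dfrac{y_i^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\\[6pt]
\ln y_i, & \lambda = 0,
\end{cases}
\qquad
\ell(\lambda) = -\frac{N}{2}\,\ln\!\left(\mathrm{sd}\big(g(\lambda)\big)^{2}\right)
  + (\lambda - 1)\sum_{i=1}^{N} \ln y_i
```

In this form, the transformation requires positive data values, and the λ that maximizes ℓ(λ) is taken as the transformation parameter.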

[0090] Embodiments use approximate order statistics instead of the
original data to estimate the transformation parameter λ by a grid
search with the maximum log-likelihood value, i.e., replace y_i, i=1,
. . . , N, in the above log-likelihood function with percentiles,
p_i, i=0, 1, . . . , 100, where p_i is the i-th percentile
value so p_0 is the minimum and p_100 is the maximum. Then the
original data y_i, i=1, . . . , N, are transformed by the Box-Cox
transformation function. Embodiments avoid one data pass and improve
speed of the Box-Cox transformation as 101 values may be used no matter
how large the data set is. Moreover, experiments indicate the estimated
transformation parameter, which is the transformation rule, based on the
percentiles is close to that based on the original data.
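The grid search over percentiles can be sketched as follows. This is an illustration under stated assumptions: it uses the conventional Box-Cox profile log-likelihood, the function names boxcox, log_likelihood, and estimate_lambda are hypothetical, and the percentile values and the grid from -2 to 2 in steps of 0.5 are made up for the example.

```python
import math


def boxcox(y, lam):
    """Conventional Box-Cox transformation of a positive value y."""
    return math.log(y) if lam == 0 else (y ** lam - 1) / lam


def log_likelihood(values, lam):
    """Conventional profile log-likelihood evaluated on the given values."""
    n = len(values)
    g = [boxcox(v, lam) for v in values]
    mean = sum(g) / n
    variance = sum((x - mean) ** 2 for x in g) / n  # sd(g(lambda))^2
    return -n / 2 * math.log(variance) + (lam - 1) * sum(math.log(v) for v in values)


def estimate_lambda(percentiles, grid):
    """Pick the grid value with the maximum log-likelihood."""
    return max(grid, key=lambda lam: log_likelihood(percentiles, lam))


# 101 made-up percentile values p_0..p_100 whose logarithms are evenly spaced.
percentiles = [math.exp((i - 50) / 50) for i in range(101)]
grid = [-2 + 0.5 * k for k in range(9)]  # -2.0, -1.5, ..., 2.0
best = estimate_lambda(percentiles, grid)
print(best)
```

Only the 101 percentile values enter the search, so its cost does not depend on the number of records N.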

[0091] FIG. 8 illustrates, in a flow diagram, operations for generating
transformation rules and metadata in accordance with certain embodiments.
Control begins at block 800 with the data preparation component 422
receiving univariate data summaries 430. In block 802, the data
preparation component 422 obtains one or more metadata specifications,
either by applying the metadata discovery rules when no metadata
information is available from the data source or by applying conformance
rules to update metadata that is available but possibly mis-specified.
That is, in block 802, the data preparation component 422 performs
metadata discovery and conformance rule checking. From block 802,
processing continues to blocks 804, 806, and 808.

[0092] In block 804, the data preparation component 422 handles outliers
based on approximate order statistics. In block 806, the data preparation
component 422 handles missing values based on approximate order
statistics. In block 808, the data preparation component 422 performs
Box-Cox transformation based on approximate order statistics. From blocks
804, 806, and 808, processing continues to block 810.

[0095] The transformation component 424 executes any transformations
generated by the data preparation component 422 and passes the
corresponding values to other data consuming components for data
exploration and predictive modeling. The scalable and distributed
mechanism for executing the data transformations is processing the data
in a distributed file system. The transformation component 424 generates
values for the new variables specified by the transformation rules. Both
the original and the transformed fields are presented as data to the
subsequent analytic components. No additional data passes are necessary
for generating the transformed values since the transformation rules are
record-based and can be executed concurrently with data reading required
for the input to the modeling components.

[0096] The transformation component 424 completes the flexible system of
extracting robust statistics from the original data, creating appropriate
transformation rules, and executing them on an as needed basis.

[0097] Thus, the statistics engine 410 provides a solution for delivering
summaries and transformation rules needed for robust data assessment and
preparation. The statistics engine 410 computes univariate data summaries
430, including robust order statistics and statistical distributions, for
analysis of irregular, large and distributed data sources. The statistics
engine 410 computes approximate order statistics with limited location
errors and their deterministic error bounds. Moreover, the statistics
engine 410 generates the data transformations for data exploration and
data preparation for modeling based on the acquired robust summaries. The
statistics engine 410 is useful for an increasing number of large and
distributed data source installations found in business, government, and
industry.

[0098] The statistics engine 410 calculates approximate order statistics
in a single data pass, with limited location error bounds. Also, the
statistics engine 410 calculates the Box-Cox transformation parameter based
on the computed approximate order statistics, rather than on the original
data.

[0099] The statistics engine 410 approximates the order statistics for
each field in a single data pass from distributed data by creating data
bins for each distributed data source, collecting data bins from all
distributed data sources, sorting the data bins, and calculating
approximate order statistics by accumulating counts from a sufficient
number of ordered data bins. The statistics engine 410 reports
deterministic error bounds for each approximate order statistic, and the
errors with their bounds can be controlled through specifying the limit
on the number of data bins as the larger number of data bins will result
in smaller errors.

[0100] The statistics engine 410 uses a predefined number of data bins for
each data source without making any assumptions about data, while the
boundaries of the data bins are dynamically adjusted. The statistics
engine 410 creates a set of data bins for each field and for each data
source such that only a small number of basic summaries 436 are
maintained for each data bin.

[0101] The statistics engine 410 is able to extract robust statistics from
the original data and create appropriate transformation rules in a single
data pass and deliver statistics and transformation rules for data
exploration and predictive modeling. In particular, the statistics engine
410 creates a limited, finite number of data bins for each field and for
each distributed data source such that only a small number of basic
summaries 436 are maintained for each data bin; collects data bins from
each distributed data source and sorts them; calculates approximate
order statistics for the overall data set by accumulating counts from a
sufficient number of ordered data bins; provides error bounds which are
values delimiting the interval containing the true value of an order
statistic for each computed approximate order statistic; discovers or
verifies metadata properties based on the computed summary statistics;
applies the computed approximate order statistics to generate data
transformation rules for outlier and missing value handling; calculates
power transformation parameters for Box-Cox transformation using the
computed approximate order statistics; and generates the corresponding
transformation rule.

[0102] The statistics engine 410 also creates a set of data bins for each
distributed data source by: generating a data bin of zero width for each
of the predetermined number of the initial data values; adding further
data values to the existing data bins and updating basic summaries 436
when any new point values can be placed within the existing data bin
bounds; creating a preset number of additional data bins for the points
whose values cannot be placed within any of the existing data bin bounds;
setting the approximation bound proportional to the range of the values
divided by the predetermined number of data bins; and merging the
additional data bins with the existing data bins in batches by adjusting
the data bin summaries to reflect the merged data bins when the number of
data bins exceeds a preset threshold and ensuring that the closest data
bins are merged first and that the width of the merged data bins does not
exceed the approximation bound.

Additional Embodiment Details

[0103] As will be appreciated by one skilled in the art, aspects of the
present invention may be embodied as a system, method or computer program
product. Accordingly, aspects of the present invention may take the form
of an entirely hardware embodiment, an entirely software embodiment
(including firmware, resident software, micro-code, etc.) or an
embodiment combining software and hardware aspects that may all generally
be referred to herein as a "circuit," "module" or "system." Furthermore,
aspects of the present invention may take the form of a computer program
product embodied in one or more computer readable medium(s) having
computer readable program code embodied thereon.

[0104] Any combination of one or more computer readable medium(s) may be
utilized. The computer readable medium may be a computer readable signal
medium or a computer readable storage medium. A computer readable storage
medium may be, for example, but not limited to, an electronic, magnetic,
optical, electromagnetic, infrared, or semiconductor system, apparatus,
or device, or any suitable combination of the foregoing. More specific
examples (a non-exhaustive list) of the computer readable storage medium
would include the following: an electrical connection having one or more
wires, a portable computer diskette, a hard disk, a random access memory
(RAM), a read-only memory (ROM), an erasable programmable read-only
memory (EPROM or Flash memory), an optical fiber, a portable compact disc
read-only memory (CD-ROM), an optical storage device, a magnetic storage
device, solid state memory, magnetic tape or any suitable combination of
the foregoing. In the context of this document, a computer readable
storage medium may be any tangible medium that can contain, or store a
program for use by or in connection with an instruction execution system,
apparatus, or device.

[0105] A computer readable signal medium may include a propagated data
signal with computer readable program code embodied therein, for example,
in baseband or as part of a carrier wave. Such a propagated signal may
take any of a variety of forms, including, but not limited to,
electro-magnetic, optical, or any suitable combination thereof. A
computer readable signal medium may be any computer readable medium that
is not a computer readable storage medium and that can communicate,
propagate, or transport a program for use by or in connection with an
instruction execution system, apparatus, or device.

[0106] Program code embodied on a computer readable medium may be
transmitted using any appropriate medium, including but not limited to
wireless, wireline, optical fiber cable, RF, etc., or any suitable
combination of the foregoing.

[0107] Computer program code for carrying out operations for aspects of
the present invention may be written in any combination of one or more
programming languages, including an object oriented programming language
such as Java, Smalltalk, C++ or the like and conventional procedural
programming languages, such as the "C" programming language or similar
programming languages. The program code may execute entirely on the
user's computer, partly on the user's computer, as a stand-alone software
package, partly on the user's computer and partly on a remote computer or
entirely on the remote computer or server. In the latter scenario, the
remote computer may be connected to the user's computer through any type
of network, including a local area network (LAN) or a wide area network
(WAN), or the connection may be made to an external computer (for
example, through the Internet using an Internet Service Provider).

[0108] Aspects of the embodiments of the invention are described below
with reference to flowchart illustrations and/or block diagrams of
methods, apparatus (systems) and computer program products according to
embodiments of the invention. It will be understood that each block of
the flowchart illustrations and/or block diagrams, and combinations of
blocks in the flowchart illustrations and/or block diagrams, can be
implemented by computer program instructions. These computer program
instructions may be provided to a processor of a general purpose
computer, special purpose computer, or other programmable data processing
apparatus to produce a machine, such that the instructions, which execute
via the processor of the computer or other programmable data processing
apparatus, create means for implementing the functions/acts specified in
the flowchart and/or block diagram block or blocks.

[0109] These computer program instructions may also be stored in a
computer readable medium that can direct a computer, other programmable
data processing apparatus, or other devices to function in a particular
manner, such that the instructions stored in the computer readable medium
produce an article of manufacture including instructions which implement
the function/act specified in the flowchart and/or block diagram block or
blocks.

[0110] The computer program instructions may also be loaded onto a
computer, other programmable data processing apparatus, or other devices
to cause a series of operational processing (e.g., operations or steps)
to be performed on the computer, other programmable apparatus or other
devices to produce a computer implemented process such that the
instructions which execute on the computer or other programmable
apparatus provide processes for implementing the functions/acts specified
in the flowchart and/or block diagram block or blocks.

[0111] The code implementing the described operations may further be
implemented in hardware logic or circuitry (e.g., an integrated circuit
chip, Programmable Gate Array (PGA), Application Specific Integrated
Circuit (ASIC), etc.). The hardware logic may be coupled to a processor to
perform operations.

[0112] Devices that are in communication with each other need not be in
continuous communication with each other, unless expressly specified
otherwise. In addition, devices that are in communication with each other
may communicate directly or indirectly through one or more
intermediaries.

[0113] A description of an embodiment with several components in
communication with each other does not imply that all such components are
required. On the contrary, a variety of optional components are described
to illustrate the wide variety of possible embodiments of the present
invention.

[0114] Further, although process steps, method steps, algorithms or the
like may be described in a sequential order, such processes, methods and
algorithms may be configured to work in alternate orders. In other words,
any sequence or order of steps that may be described does not necessarily
indicate a requirement that the steps be performed in that order. The
steps of processes described herein may be performed in any order
practical. Further, some steps may be performed simultaneously.

[0115] When a single device or article is described herein, it will be
readily apparent that more than one device/article (whether or not they
cooperate) may be used in place of a single device/article. Similarly,
where more than one device or article is described herein (whether or not
they cooperate), it will be readily apparent that a single device/article
may be used in place of the more than one device or article or a
different number of devices/articles may be used instead of the shown
number of devices or programs. The functionality and/or the features of a
device may be alternatively embodied by one or more other devices which
are not explicitly described as having such functionality/features. Thus,
other embodiments of the present invention need not include the device
itself.

[0116] The illustrated operations of FIGS. 5-8 show certain events
occurring in a certain order. In alternative embodiments, certain
operations may be performed in a different order, modified or removed.
Moreover, operations may be added to the above described logic and still
conform to the described embodiments. Further, operations described
herein may occur sequentially or certain operations may be processed in
parallel. Yet further, operations may be performed by a single processing
unit or by distributed processing units.

[0117] The terminology used herein is for the purpose of describing
particular embodiments only and is not intended to be limiting of the
invention. As used herein, the singular forms "a", "an" and "the" are
intended to include the plural forms as well, unless the context clearly
indicates otherwise. It will be further understood that the terms
"comprises" and/or "comprising," when used in this specification, specify
the presence of stated features, integers, steps, operations, elements,
and/or components, but do not preclude the presence or addition of one or
more other features, integers, steps, operations, elements, components,
and/or groups thereof.

[0118] The terms "an embodiment", "embodiment", "embodiments", "the
embodiment", "the embodiments", "one or more embodiments", "some
embodiments", and "one embodiment" mean "one or more (but not all)
embodiments of the present invention(s)" unless expressly specified
otherwise.

[0120] The enumerated listing of items does not imply that any or all of
the items are mutually exclusive, unless expressly specified otherwise.

[0121] The corresponding structures, materials, acts, and equivalents of
all means or step plus function elements in the claims below are intended
to include any structure, material, or act for performing the function in
combination with other claimed elements as specifically claimed. The
description of embodiments of the present invention has been presented
for purposes of illustration and description, but is not intended to be
exhaustive or limited to the invention in the form disclosed. Many
modifications and variations will be apparent to those of ordinary skill
in the art without departing from the scope and spirit of the invention.
The embodiments were chosen and described in order to best explain the
principles of the invention and the practical application, and to enable
others of ordinary skill in the art to understand the invention for
various embodiments with various modifications as are suited to the
particular use contemplated.

[0122] The flowchart and block diagrams in the figures illustrate the
architecture, functionality, and operation of possible implementations of
systems, methods and computer program products according to various
embodiments of the present invention. In this regard, each block in the
flowchart or block diagrams may represent a module, segment, or portion
of code, which comprises one or more executable instructions for
implementing the specified logical function(s). It should also be noted
that, in some alternative implementations, the functions noted in the
block may occur out of the order noted in the figures. For example, two
blocks shown in succession may, in fact, be executed substantially
concurrently, or the blocks may sometimes be executed in the reverse
order, depending upon the functionality involved. It will also be noted
that each block of the block diagrams and/or flowchart illustration, and
combinations of blocks in the block diagrams and/or flowchart
illustration, can be implemented by special purpose hardware-based
systems that perform the specified functions or acts, or combinations of
special purpose hardware and computer instructions.

[0123] The foregoing description of embodiments of the invention has been
presented for the purposes of illustration and description. It is not
intended to be exhaustive or to limit the embodiments to the precise form
disclosed. Many modifications and variations are possible in light of the
above teaching. It is intended that the scope of the embodiments be
limited not by this detailed description, but rather by the claims
appended hereto. The above specification, examples and data provide a
complete description of the manufacture and use of the composition of the
embodiments. Since many embodiments may be made without departing from
the spirit and scope of the invention, the embodiments reside in the
claims hereinafter appended or any subsequently-filed claims, and their
equivalents.