(Cat? OR feline) AND NOT dog?
Cat? W/5 behavior
(Cat? OR feline) AND traits
Cat AND charact*

This guide provides a more detailed description of the syntax that is supported along with examples.

This search box also supports the look-up of an IP.com Digital Signature (also referred to as Fingerprint); enter the 72-, 48-, or 32-character code to retrieve details of the associated file or submission.

Concept Search - What can I type?

For a concept search, you can enter phrases, sentences, or full paragraphs in English. For example, copy and paste the abstract of a patent application or paragraphs from an article.

Concept search eliminates the need for complex Boolean syntax to inform retrieval. Our Semantic Gist engine uses advanced cognitive semantic analysis to extract the meaning of data. This reduces the chances of missing valuable information, that may result from traditional keyword searching.

Stochastic Identification of Duplicate Computer Files

Publishing Venue

The IP.com Prior Art Database

Abstract

This invention inserts a stochastic filtering procedure before any attempt to compare actual file contents, by calculating K-bit checksums for each of the candidate files and discarding files having unique checksums from further consideration as potential duplicate files. It performs the comparisons of actual file contents only between files having identical checksums, reducing the time required to confirm the identification of actual duplicates.

Country

Undisclosed

Language

English (United States)

This text was extracted from a Microsoft Word document.

This is the abbreviated version, containing approximately
53% of the total text.

Stochastic Identification of Duplicate Computer Files

Proliferation
of computer files often causes redundancy.
That is, two or more files are identical. The problem is to identify the redundant files so that they may
be removed.

The
traditional means of identifying duplicate files in a collection has been to
compare the contents of each pair of files for equivalence, with occasional
optimizations in which, for example, the sizes of the files are compared first,
because the files cannot be duplicates if their lengths differ. Such techniques are too slow to be of
practical use for large collections. As
a result, attempts to discover and remove redundancy among computer files are
infrequent, leading to proliferation.

This invention
inserts a stochastic filtering procedure before any attempt to compare actual
file contents, by calculating K-bit checksums for each of the candidate files
and discarding files having unique checksums from further consideration as
potential duplicate files. It performs
the comparisons of actual file contents only between files having identical
checksums, reducing the time required to confirm the identification of actual
duplicates.

This invention
is expressed as an algorithm using mathematical notation, but it is intended to
be implemented in software.

1. Initial definitions.

1.1. Define A as the set of files among which duplicates are to
be sought.

1.2. The symbol N is defined as the number of files in A (i.e.,
the cardinality of A).

1.3. Then we denote the ith file in A as A[i], for 0
< i ≤ N.

2. Filter by file size.

2.1. Define the function S(f) as the size, in octets, of file f.

2.2. For each i such that 0 < i ≤ N, calculate s = S(A[i]),
and place A[i] in set B[s].

2.3. For each set B[s] that contains fewer than 2 files, discard
the set from further consideration; the files in B[s] do not duplicate any
other files in A, because they have unique lengths.

3. Filter by checksum.

3.1. Define the function T(f) as the K-bit checksum of the
contents of file f. For purposes of
this algorithm, T(f) is expected to take all of the contents of file f into
account to produce the same checksum value when applied to identical file
contents and to distribute its results uniformly over the 2K possible values of
the checksum.

3.2. For each remaining B[s], …

3.2.1. Denote the ith file in B[s] as B[s][j].

3.2.2. For each file B[s][j] in B[s], calculate t = T(B[s][j]), and
place B[s][j] in set C[t].

3.3. For each set C[t] that contains fewer than 2 files, discard
the set from further consideration; the files in C[t] do not duplicate any
other files in A, because they have unique checksums.

3.4. At this point, the probability that the contents of any 2
files in any given C[t] differ is 2-K.

4. Verify by file content comparison.

4.1. If the certainty given by the above probability assertion is
sufficient for a given application, then …

4.1.1. Each set C[t] is deemed an equivalency set. That is, the files contained in C[t] are
de...