One of the problems with a web
page-based corpus is the duplicate text that you will find on
different pages, and even within the same page. For example, there
might be 10-15 pages from the same website that include a copyright
notice (e.g. ...you are not permitted to copy this text...).
Or there might be a web page with reader comments, in which a
comment at the top of the page gets repeated two or three times
later on that page.

We have used several methods to remove
these duplicates:

1. As we created lists of web pages from Google searches, we only
used each web page once, even if it was generated by multiple
searches.

2. JusText removed most boilerplate material (e.g. headers, footers,
sidebars), which contains a lot of duplicate material on pages from
the same website.

3. Once we had downloaded all 25 million web pages, we then
searched for duplicate n-grams (primarily 11-grams, in our case),
looking for long strings of words that are repeated, such as "This
newspaper is copyrighted by Company_X. You are not permitted..."
(= 11 words, including punctuation). We ran these searches many
times, in many different ways, trying to find and eliminate
duplicate texts, and also duplicate strings within different texts.
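The n-gram search in step 3 can be sketched as follows. This is a minimal illustration, not the actual scripts used on the corpus: the whitespace tokenization and the in-memory index are assumptions made for the example.

```python
from collections import defaultdict

N = 11  # length of the word n-grams to compare (11-grams, as in step 3)

def ngrams(text, n=N):
    """Yield successive n-word sequences from a text.
    Punctuation attached to a word counts as part of that token here;
    the real scripts may tokenize differently."""
    words = text.split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

def find_shared_ngrams(pages):
    """Map each n-gram to the set of page ids it occurs on,
    then keep only the n-grams shared by two or more pages."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for gram in ngrams(text):
            index[gram].add(page_id)
    return {gram: ids for gram, ids in index.items() if len(ids) > 1}
```

Any n-gram returned by `find_shared_ngrams` is a candidate duplicate string (such as a boilerplate copyright notice) that can then be reviewed and removed.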

Even with these steps, however, there
are still duplicate texts and (more commonly) duplicate portions of
text in different pages, especially since the corpus is so big (1.9
billion words, in 1.8 million web pages). It will undoubtedly be impossible to eliminate
every single one of these duplicates. But at this point, we are
continuing to do the following:

4. In the Keyword in Context (KWIC) display, you will see a number in
parentheses (e.g. (1)) after web pages where there was a duplicate.

5. As these duplicates are found -- one by one as KWIC displays are
generated for thousands of corpus users -- they will get logged in the database.
Every month or so, we will run scripts to eliminate these duplicate
texts / strings. In this way, the corpus will continue to get
"cleaner and cleaner" over time.
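The workflow in #5 -- logging duplicates as users encounter them, then purging them in a periodic batch -- could be sketched like this. The schema, table names, and use of SQLite are all hypothetical; the text does not describe the actual database.

```python
import sqlite3

# In-memory database standing in for the corpus database (schema is hypothetical).
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE texts (page_id TEXT PRIMARY KEY, body TEXT);
    CREATE TABLE dup_log (page_id TEXT, dup_string TEXT);
""")

def log_duplicate(page_id, dup_string):
    """Record a duplicate string whenever a KWIC display reveals one (#5)."""
    db.execute("INSERT INTO dup_log VALUES (?, ?)", (page_id, dup_string))

def monthly_cleanup():
    """Periodic batch pass: remove every logged duplicate string
    from the page it was flagged on, then clear the log."""
    for page_id, dup_string in db.execute(
            "SELECT page_id, dup_string FROM dup_log").fetchall():
        db.execute(
            "UPDATE texts SET body = replace(body, ?, '') WHERE page_id = ?",
            (dup_string, page_id),
        )
    db.execute("DELETE FROM dup_log")
    db.commit()
```

Because the cleanup runs as a batch, the interactive KWIC display stays fast; each pass leaves the corpus a little cleaner than the last.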

One final issue: what to do about
intra-page duplicates, i.e. cases where the same text is copied
on the same web page. As was mentioned above, there might be a web
page with reader comments, in which a comment at the top of the page
gets repeated two or three times later on that page. Our approach at
this point is to log these in the database as users do KWIC displays
(#5 above), but to not delete the duplicates at this point. If a
comment is copied on a page, it may be because the comment is an
important one, and perhaps it deserves to be preserved twice in the
corpus. We're still debating this, however.

If you have feedback on any of these
issues, please feel free to email us.
Thanks.