On Aug 17, 2010, at 11:51 PM, Ketil Malde wrote:
> Yitzchak Gale <gale at sefer.org> writes:
>> I don't think the genome is typical text.
> I think the typical *large* collection of text is text-encoded data, and
> not, for lack of a better word, literature. Genomics data is just an
> example.
I have a collection of 100,000 patents I'm working with.
5.5GB of XML, most of it (US-)English text.
After stripping out the XML markup, it's 4GB of text.
It's a random sample from some 14 million patents I could
have access to, but 100,000 was more than enough.