Using all the Terabytes

To the Editors:

Brian Hayes forecasts the development of terabyte-capacity disk drives within the next 10 years (Computing Science, May–June), and asks where we are going to find enough data to make it worthwhile to fill them. A subsidiary question is how we are going to index terabyte databases so we can know what the disk drives are filled with.

As a former academic biologist who graduated at the tail end of the Baby Boom, I recall that the jobs available to me after graduation gave me no access to the research libraries whose resources I needed to write up my research. Web technology combined with cheap terabyte storage completely changes that picture. Within the next few years, anyone should be able to retrieve the totality of scholarly knowledge from anywhere.

Peter Lyman and Hal Varian of the University of California, Berkeley, have estimated that scholarly journals publish between 0.2 and 9 terabytes of information a year worldwide. Combined with the Web's capacity to index and retrieve current publications, this suggests that by early in the next decade it will be possible for libraries anywhere in the world to provide their members with local access to the totality of humankind's scholarly knowledge. Given bandwidth limitations, it seems reasonable to expect that disk drives will be sold pre-loaded with this kind of information, the same way CDs are sold today. With such technology, individual scholars may even be able to afford to own the entire recorded knowledge of their disciplines. This will allow many more people to do thorough scholarly work than is possible today.

The other question—how we can practically index, manage and retrieve such a large volume of knowledge—has already been answered by software I am currently using in a much more modest application. Google does a creditable job of indexing huge volumes of information, but RMIT University in Melbourne, Australia, has developed an even better application, now marketed under the name TeraText, which can concurrently index and retrieve terabyte volumes of text. TeraText's major use appears to be in the U.S. and Australian defense intelligence communities, where it is already applied to text repositories of several hundred gigabytes. The program excels with data that are already structured, or where semantic information can be inferred from the text's structure. Indexing and serving the world's scholarly knowledge would seem to be well within TeraText's capacity.

I agree with the idea that data should be free; with tools like terabyte disk drives and TeraText indexing engines, even generating the metadata needed to find the data should be affordable.

William P. Hall
Tenix Defence
Williamstown, Australia

Mr. Hayes replies:

The vision of a world where all scholarship is available on every laptop is a grand one. I do hope it comes true. It certainly seems a better use of the technology than storing thousands of hours of television reruns. But when I try to imagine how we get from where we are now to where Dr. Hall suggests we might be in a decade, I get stuck on the economic question. Many publishers view scholarly information as a highly saleable commodity, for which they charge rates so high that even the largest of libraries can subscribe only to a small subset of journals. Making all scholarship available to people of ordinary means implies a revolution in the economics of publishing.