
The "surface" Web consists of approximately 2.5 billion documents [1, 5], up from 1 billion pages at the beginning of the year [3], and is growing at a rate of 7.3 million pages per day [1]. Estimates of the average "surface" page size range from 10 kbytes [1] to 20 kbytes [4] per page. The total amount of information on the "surface" Web is therefore somewhere between 25 and 50 terabytes [HTML-included basis]. To obtain a figure for textual information alone, we apply a factor of 0.4 [4], which yields an estimate of 10 to 20 terabytes of textual content. At 7.3 million new pages added every day, the rate of growth is [taking an average page size] about 0.1 terabytes of new information [HTML included] per day.
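
The arithmetic above can be sanity-checked with a short Python sketch (the figures are the cited estimates; the variable names are ours):

```python
# Surface-Web size estimate, using the cited figures.
TB = 1e12                                 # bytes per terabyte

pages = 2.5e9                             # surface-Web documents [1, 5]
size_low, size_high = 10e3, 20e3          # bytes per page [1], [4]

total_low = pages * size_low / TB         # 25 TB, HTML included
total_high = pages * size_high / TB       # 50 TB

text_factor = 0.4                         # textual share of a page [4]
text_low = total_low * text_factor        # 10 TB of text
text_high = total_high * text_factor      # 20 TB

new_pages = 7.3e6                         # new pages per day [1]
avg_size = (size_low + size_high) / 2     # mid-range page size
daily_growth = new_pages * avg_size / TB  # ~0.11 TB/day, "0.1 TB" rounded

print(total_low, total_high, text_low, text_high, daily_growth)
```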

If we take into account all web-accessible information - web-connected databases, dynamic pages, intranet sites, and so on, collectively known as the "deep" Web - there are 550 billion web-connected documents with an average page size of 14 kbytes, and 95% of this information is publicly accessible [2]. Storing all of it in one place would require 7,500 terabytes, 150 times the storage needed for the entire "surface" Web even at the highest estimate of 50 terabytes. 56% of this information is actual content [HTML excluded], which gives an estimate of 4,200 terabytes of high-quality data. Two of the largest "deep" Web sites - the National Climatic Data Center and the NASA databases - contain 585 terabytes of information, or 7.8% of the "deep" Web, and the 60 largest sites together hold 750 terabytes, or 10%.
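
The same check for the "deep" Web figures (note the raw product of the cited document count and page size comes out slightly above the 7,500 TB total reported in [2], presumably due to rounding there):

```python
# Deep-Web size estimate, using the cited figures.
deep_docs = 550e9                        # web-connected documents [2]
avg_page = 14e3                          # bytes per page [2]
raw_total = deep_docs * avg_page / 1e12  # ~7,700 TB; [2] reports 7,500

deep_tb = 7500.0                         # the rounded figure used in the text
content_tb = deep_tb * 0.56              # 4,200 TB of HTML-excluded content
vs_surface = deep_tb / 50                # 150x the highest surface estimate

two_sites = 585 / deep_tb                # NCDC + NASA: ~7.8% of the deep Web
top_60 = 750 / deep_tb                   # 60 largest sites: 10%
```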

Email & Mailing Lists

Email has become one of the most widespread forms of communication in today's society. A white-collar worker receives about 40 email messages at the office every day [8]. In aggregate, depending on the estimate, between 610 billion [9] and 1,100 billion [10] messages will be sent this year alone. With the average email message at 18,500 bytes [11] and growing, the total flow is enormous: somewhere between 11,285 and 20,350 terabytes. Of course, not all of this email gets stored. Mail.com has 14.5 million email boxes and uses 27 terabytes of storage; with approximately 500 million mailboxes worldwide, the required storage space is more than 900 terabytes, which means that only about one in 17 messages is kept for any period of time.
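
The flow-versus-storage comparison above can be reproduced like this (cited figures; the "one in 17" ratio follows from scaling Mail.com's storage up to the worldwide mailbox count):

```python
# Email flow vs. email storage, using the cited figures.
TB = 1e12
msgs_low, msgs_high = 610e9, 1100e9   # messages this year [9], [10]
avg_msg = 18_500                      # bytes per message [11]

flow_low = msgs_low * avg_msg / TB    # 11,285 TB of email flow
flow_high = msgs_high * avg_msg / TB  # 20,350 TB

mailcom_boxes = 14.5e6                # Mail.com mailboxes
mailcom_tb = 27                       # Mail.com storage, TB
world_boxes = 500e6
world_tb = world_boxes / mailcom_boxes * mailcom_tb  # ~931 TB, "more than 900"

avg_flow = (flow_low + flow_high) / 2
kept_one_in = avg_flow / world_tb     # ~17: roughly one in 17 messages stored
```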

Mailing lists can be viewed as a subcategory of email. It is hard to determine the number of mailing lists in existence, but we can approximate it from the available statistics. One of the most frequently used mailing list managers, LISTSERV, sends 30 million messages per day across approximately 150,000 mailing lists [12]. A sample of mailing lists showed that 30% of them are managed with LISTSERV. From this, we estimate the total number of mailing list messages at 36.5 billion per year, with an aggregate volume of 675 terabytes.
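
The extrapolation step, spelled out (we assume the 18,500-byte average email size from above also applies to list messages, as the text does):

```python
# Extrapolating total mailing-list traffic from the LISTSERV numbers.
listserv_daily = 30e6                  # messages/day via LISTSERV [12]
listserv_share = 0.30                  # share of sampled lists on LISTSERV
total_daily = listserv_daily / listserv_share  # 100 million messages/day
yearly = total_daily * 365                     # 36.5 billion messages/year
volume_tb = yearly * 18_500 / 1e12             # ~675 TB at the average size
```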

The distribution of mailboxes follows the same pattern as the distribution of web sites. While in 1984, 90% of the world's e-mailboxes were located in the U.S., by the end of 1999 this share had dropped to 59%, and it is expected to decrease further.

Usenet

Most of the statistics in this category are vague, so these numbers should be regarded with some skepticism. Cidera, the 14th biggest news provider on the Internet [14], receives approximately 0.150 terabytes of Usenet feeds per day. We estimate the total amount of original news feeds at 0.2 terabytes per day, which works out to 73 terabytes of original Usenet postings per year, redistributed by local ISPs and news servers countless times.
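
The yearly figure, and the share of the estimated total that Cidera alone carries:

```python
# Usenet feed estimate: yearly total and Cidera's share.
cidera_daily = 0.150                   # TB/day of Usenet feeds to Cidera [14]
total_daily = 0.2                      # TB/day, estimated total original feeds
yearly = total_daily * 365             # 73 TB of original postings per year
cidera_share = cidera_daily / total_daily  # Cidera alone carries ~75%
```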

FTP

We lack significant data on this sector, but we know that the Walnut Creek CD-ROM archive contains a total of 0.412 terabytes of data on two servers [ftp.cdrom.com and ftp.freesoftware.com], and its storage has been doubling every year for the past 6 years [15]. It should be noted that the distinction between FTP and HTTP is becoming blurred, as more and more file archives become available over HTTP.
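
The doubling claim implies the archive was tiny six years ago; a quick sketch (our arithmetic, not a figure from [15]):

```python
# Implied history of the Walnut Creek archive under 100%/year growth.
current_tb = 0.412                            # archive size today [15]
years_doubling = 6
start_tb = current_tb / 2 ** years_doubling   # ~0.0064 TB six years ago
```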

IRC, Messaging Services, Telnet...

These categories mostly represent a flow of information rather than a stock. Liszt.com has one of the biggest directories of IRC channels: 37,750 channels on 27 networks, with 150,000 users, all of them typing text as fast as they can.

Yes, that's one awesome task for the search engines. Let's just hope some better markup language kicks in, so that we can actually find the information we're looking for in this unbelievably big pile of data. I don't know about you guys, but most people I know aren't capable of finding the information they're looking for. You need quite a bit of experience to do a good search nowadays.

0.1 Terabytes wooooooooah

Neat post bagz, the figure that stands out for me the most is the one about the net expanding by 0.1 terabytes a day. You don't need me to tell you that's a lot! Proactive hit the nail on the head about the search engines too; it sometimes is hard to find the info you're after. Even if you use the Holy Grail of search engines, you still often trawl through pages of search results to find what you want.