The 8 most popular document formats on the web

17 Feb 20

This post updates my (approximately) twice-annual tracking of document file-format popularity, as measured by Google’s “filetype:” search. Here’s the previous survey, posted in January 2013.

New for 2014, I’ve decided to start tracking EPUB, OpenOffice files (ODT, ODP, ODS), TXT and RTF as well. In case you want all the numbers, I’ve provided a table representing this chart’s data.

What about HTML?

I’ve been asked: “Why don’t you include HTML in this survey?” There are three reasons.

The number of HTML and HTM files Google reports is vastly greater (20 – 50 times) than the total of document files Google reports, so you’d learn nothing beyond the fact that HTML files are the primary substance of the web. Big deal; we knew that.

I’m deliberately restricting scope to those documents which may (generally speaking) be abstracted (downloaded or otherwise captured) from the host website without changes to appearance or utilization. Of course, many PowerPoint files can’t even survive abstraction from the author’s computer, which is one of the reasons why we need PDF in the first place.

For the purposes of this survey, an HTML page can’t really be considered a document in any event. Why?
A PDF might be a single-page invoice, a 40-page catalog, a 500-page annual report or a 5,000-page building plan complete with oversize drawings, layers and 3D models. It might include pages from five different sources, including scanned pages. By contrast, an HTML page is usually... an HTML page, containing some text that may or may not be a document. In any event, with my humble methodology, I’d have no way to separate login pages and other scraps of text from the “content” HTML pages that might be candidates for consideration as a “document”.

Accordingly, I decided it’s just not meaningful to count each individual static .HTML and .HTM file (if that’s what Google is doing) and compare it to the number of .PDF and .DOC files. If you disagree, feel free to start your own survey. You are also welcome to smirk at the fact that this post is in HTML rather than PDF. You won’t be the first, I assure you!

Why is PDF so dominant?

Born before the web to facilitate the exchange of hardcopy documents, PDF is the format people use when they need an electronic “hard copy” document. Many business, publishing and records-keeping applications require a reliable, flexible and capable analog for paper. Some love their TIFF files, but those are pictures, not documents. For the vast majority, PDF remains the only game in town.

Look around. You may be surprised by how large a proportion of your important (and unimportant) content is in PDF. And don’t just count files, as I’ve done in this survey. Organizations that study their online content are often surprised to find that their PDF files, which may include dozens or hundreds of pages, contain far more actual content than their web pages.

Are you leveraging PDF technology?

What does your ECM / SharePoint / CMS / WCM or other system do to help you manage PDF files? PDF technology is far more than electronic paper. The format’s features include:

Archival quality control

Extensible document and content-level metadata

Annotations and fillable forms

Security and authenticity

Accessibility

Attached content

Content re-use

Redaction

Watermarking

Page management

3D, video and other rich content

Scripting

Collation

and more…

Many vendors have yet to accept that PDF files play a key role in many of their customers’ organizations, and that better use of PDF might lead to new efficiencies and opportunities.

Ask your content management vendors how their software can support your needs.

Chart data

These data are proportional, not absolute. The actual search results (counts by file-type) change violently over time due (I guess) to search algorithm changes, or the day of the month – who knows? While the raw numbers fluctuate, there’s (relative) consistency in the proportions, which makes me think the data are reasonable regardless of whatever search model Google is offering on my irregular test days.
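To illustrate why proportions can stay steady while raw counts swing, here is a minimal Python sketch of the normalization step. It is not the script used for this survey; the function name `to_percentages` and the counts for the two test days are hypothetical, invented only to show the arithmetic.

```python
def to_percentages(counts):
    """Convert raw per-format hit counts into percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no results to normalize")
    # Round to one decimal place, as in the 2014 row of the table.
    return {fmt: round(100 * n / total, 1) for fmt, n in counts.items()}


# Hypothetical raw counts from two test days (NOT real search results).
# The second day reports twice as many hits for every format.
day_one = {"PDF": 800, "DOCx": 150, "XLSx": 50}
day_two = {"PDF": 1600, "DOCx": 300, "XLSx": 100}

print(to_percentages(day_one))
print(to_percentages(day_two))
```

In this made-up case the raw numbers double between the two days, yet both days normalize to the same shares, which is the kind of consistency described above.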

The above chart’s data is provided in tabular form:

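| Date          | PDF   | DOCx | XLSx | PPTx | EPUB | ODx  | TXT/RTF |
|---------------|-------|------|------|------|------|------|---------|
| 2011 April    | 81%   | 13%  | 3%   | 3%   | ?    | ?    | ?       |
| 2012 January  | 86%   | 10%  | 3%   | 1%   | ?    | ?    | ?       |
| 2012 August   | 83%   | 15%  | 1%   | 1%   | ?    | ?    | ?       |
| 2013 January  | 79%   | 17%  | 2%   | 1%   | ?    | ?    | ?       |
| 2013 June     | 83%   | 9%   | 5%   | 2%   | ?    | ?    | ?       |
| 2014 February | 77.3% | 5.5% | 6.0% | 6.1% | 1.4% | 0.8% | 2.9%    |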

Searches are conducted on the following file-types:

PDF, DOC, DOCX, PPT, PPTX, XLS, XLSX, EPUB, ODT, ODP, ODS, TXT, RTF

All searches were conducted on Mac OS / Chrome from Cambridge, Mass. in the USA. Of course, your mileage may vary – I’ve noticed different results in different countries, and indeed, on different days of a given week.
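For anyone who wants to start their own survey, the sketch below shows one way the file types listed above could be turned into “filetype:” queries and rolled up into the table’s columns. The grouping in `COLUMN_GROUPS` is an assumption inferred from the column names (for example, that DOCx covers both DOC and DOCX); this post doesn’t spell it out. The helper names are hypothetical too, and the code only builds query strings; it doesn’t contact Google.

```python
# Assumed mapping of searched extensions to table columns (inferred from the
# column names, not stated in the post).
COLUMN_GROUPS = {
    "PDF": ["pdf"],
    "DOCx": ["doc", "docx"],
    "XLSx": ["xls", "xlsx"],
    "PPTx": ["ppt", "pptx"],
    "EPUB": ["epub"],
    "ODx": ["odt", "odp", "ods"],
    "TXT/RTF": ["txt", "rtf"],
}


def build_queries(groups):
    """Return one "filetype:" query string per extension."""
    return [f"filetype:{ext}" for exts in groups.values() for ext in exts]


def group_counts(raw_counts, groups):
    """Sum per-extension counts into per-column totals (missing ones count as 0)."""
    return {
        column: sum(raw_counts.get(ext, 0) for ext in exts)
        for column, exts in groups.items()
    }


queries = build_queries(COLUMN_GROUPS)
print(queries)
```

The counts reported for each query would be recorded by hand and passed to `group_counts` as a dictionary keyed by extension.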