When I first came across the field of information retrieval in the 80's and early 90's (back when TREC began), vectors were all the rage, and the key units were terms, texts, and corpora. Through the 90's, with the advent of hypertext and later the explosion of the Web, that metaphor shifted to pages, sites, and links, and approaches like HITS and PageRank leveraged hyperlinking between documents and sites as key proxies for authority and relevance.
Today we're at a crossroads, as the nature of the content we seek to leverage through search and discovery has shifted once again, with a specific gravity now defined by entities, structured metadata, and (social) connections. In particular, and based on my work with Common Crawl data specifically, content has shifted in three critical ways:
First, publication and authorship have now been completely democratized. And I'm not just talking about individuals writing blogs, but the way in which any of us can (and do) elect to "comment", "post", "like", "pin", "tag", "recommend", or perform any of fifty other interaction events, thereby contributing to the vast corpus of interconnected data at any time, from any device. All of these tiny acts signify, but their meanings in aggregate are not yet fully understood.
Secondly, whereas the Web throughout its growth and development represented a vast public repository of usable information, we're now seeing the locus of ownership shift speedily away from publicly accessible repositories, to highly guarded--and valued--walled gardens. The "deep Web" was nothing compared to the "social Graph" that's now growing rampant. Want to understand why social is such a great priority at the formerly all-seeing eye of Google? Just look at the robots.txt files at facebook.com and graph.facebook.com. The latter is hair-raisingly stark:
Unlike throughout the broader Web, the owners of the great human Graph, in which roughly 1/6 of the world population is participating, have no need for SEO.
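For readers who haven't dealt with robots.txt directly, here is a minimal sketch of how a well-behaved crawler consults one before fetching a page. The rules below are a hypothetical blanket disallow written for illustration, not a copy of any real site's file, and the `ExampleBot` user agent and example.com URL are likewise made up. The parsing itself uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, for illustration only: every crawler is barred
# from every path. This is NOT the contents of any real robots.txt.
HYPOTHETICAL_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    allowed = is_allowed(
        HYPOTHETICAL_ROBOTS_TXT,
        "ExampleBot",  # assumed crawler name
        "https://example.com/some/profile",  # assumed URL
    )
    print("allowed" if allowed else "blocked")
```

Under rules like these, a crawler that honors robots.txt sees nothing at all, which is the sense in which a walled garden stays out of the public index.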
Finally, the content that's now making its way online is radically different from the Web pages, articles, and local business listings we're used to seeing. It's highly structured, thanks to well-promoted models for metadata decoration like the Open Graph and Schema.org, and socially inflected to a degree that's astonishing. Examples include the songs we listen to on Pandora, the games we play online, the books we download to our Kindles, our training runs, hikes with the kids, recipes and their outcomes, and a wide variety of newly forged kinds of socially-vectored entities and activities.
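To give a feel for what that metadata decoration looks like in practice, here is a small sketch that pulls Open Graph properties out of a page's `<meta>` tags using only Python's standard library. The sample HTML, its values, and the helper names (`OpenGraphParser`, `extract_open_graph`) are invented for illustration; this is one plausible way to read the tags, not a description of any particular crawler's pipeline.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> pairs from HTML."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        content = attr_map.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value seen for each property.
            self.properties.setdefault(prop, content)


def extract_open_graph(html: str) -> dict:
    """Return a dict of Open Graph properties found in the given HTML."""
    parser = OpenGraphParser()
    parser.feed(html)
    parser.close()
    return parser.properties


# Made-up page describing a song as a typed, addressable entity.
SAMPLE_HTML = """
<html><head>
<meta property="og:type" content="music.song" />
<meta property="og:title" content="An Example Song" />
<meta property="og:url" content="https://example.com/songs/1" />
</head><body></body></html>
"""

if __name__ == "__main__":
    for name, value in extract_open_graph(SAMPLE_HTML).items():
        print(name, "=", value)
```

The point is that a page decorated this way declares what kind of thing it is (here, a song) rather than leaving a crawler to infer it from the text.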
Between Web search--which today by necessity includes reference (Wikipedia) and common entity search--on the one hand, and the long-scrolling Wall on the other, there's an undeveloped axis for a new model of social discovery. It's very reminiscent of the shift from textual IR to Web search that we saw in the late 90's.
All of this brings me back around to the Common Crawl mission and data set. Up until very recently, if you wanted to study the Web and its deeper nature and evolution, unless you were among a privileged few, access to sufficient Web crawl content was almost prohibitively expensive. The twin specters of storage and bandwidth alone, not to mention the computational horsepower required to study the data gathered, were more than enough to discourage almost anyone.
But today, thanks to groups like Common Crawl and Amazon Web Services, data and computational muscle are free and/or affordable in ways that make innovation--or even new understanding--possible at Web scale, by almost anyone. In the past few months, I've been leveraging these new tools to dig far deeper into the problems I laid out above than I ever imagined (see "Study of ~1.3 Billion URLs: ~22% of Web Pages Reference Facebook" and "Data Mining the Web: $100 Worth of Priceless"). And this is just the beginning....
My hope is that access to this data and these tools, including broader exposure to the kinds of work possible (see here), will inspire more and more companies, groups, and even tinkering engineers to push the envelope, and to make that ever greater graph of human knowledge and interaction ever more accessible, discoverable, and ultimately useful.
In that spirit, we welcome any questions about what we're doing with the data, how we're doing it, or what we aim to solve at Lucky Oyster; just send a friendly note to audacity at lucky oyster dot com. Or if you're in Las Vegas for the upcoming re:Invent show, I'll be presenting more of this material along with Lisa from Common Crawl.