October 30, 2012

Towards Social Discovery - New Content Models; New Data; New Toolsets

This is a guest blog post by Matthew Berk, Founder of Lucky Oyster. Matthew has been on the front lines of search technology for the past decade.

Matthew Berk

Matthew Berk is a founder at Bean Box and Open List, and has worked at Jupiter Research and Marchex.

When I first came across the field of information retrieval in the 80's and early 90's (back when TREC began), vectors were all the rage, and the key units were terms, texts, and corpora. Through the 90's, with the advent of hypertext and later the explosion of the Web, that metaphor shifted to pages, sites, and links, and approaches like HITS and PageRank leveraged hyperlinking between documents and sites as key proxies for authority and relevance.

Today we're at a crossroads, as the nature of the content we seek to leverage through search and discovery has shifted once again, with a specific gravity now defined by entities, structured metadata, and (social) connections. In particular, and based on my work with Common Crawl data specifically, content has shifted in three critical ways:

First, publication and authorship have now been completely democratized. And I'm not just talking about individuals writing blogs, but about the way in which any of us can (and do) elect to "comment", "post", "like", "pin", "tag", "recommend", or perform any of fifty other interaction events, thereby contributing to the vast corpus of interconnected data at any time, from any device. All of these tiny acts signify, but their meanings in aggregate are not yet fully understood.

Second, whereas the Web throughout its growth and development represented a vast public repository of usable information, we're now seeing the locus of ownership shift speedily away from publicly accessible repositories to highly guarded--and valued--walled gardens. The "deep Web" was nothing compared to the "social Graph" that's now growing rampant. Want to understand why social is such a great priority at the formerly all-seeing eye of Google? Just look at the robots.txt files at facebook.com and graph.facebook.com. The latter is hair-raisingly stark:

    User-agent: *
    Disallow: /

Unlike publishers throughout the broader Web, the owners of the great human Graph, in which roughly 1/6 of the world population is participating, have no need for SEO.
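To see what those two lines mean for a crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt text is the snippet quoted above; the helper name `is_allowed`, the user agent `ExampleBot`, and the sample URL path are assumptions made purely for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above: every user agent, every path disallowed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper for illustration; it parses the rules from a
    string rather than fetching them over the network.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" and the path are made-up values.
    url = "https://graph.facebook.com/some/path"
    print(is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

Under these rules, any crawler that honors robots.txt is shut out of every path on the host, whatever its user agent.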

Finally, the content that's now making its way online is radically different from the Web pages, articles, and local business listings we're used to seeing. It's highly structured, thanks to well-promoted models for metadata decoration like the Open Graph and Schema.org, and socially inflected to a degree that's astonishing. Examples include the songs we listen to on Pandora, the games we play online, the books we download to our Kindles, our training runs, hikes with the kids, recipes and their outcomes, and a wide variety of newly forged kinds of socially-vectored entities and activities.
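As a rough illustration of what "highly structured" means in practice, the sketch below pulls Open Graph properties (`<meta property="og:..." content="...">` tags) out of a page using only Python's standard `html.parser`. The sample HTML, the class name `OpenGraphParser`, and the function `extract_open_graph` are hypothetical; this is not the tooling used in the studies mentioned in this post.

```python
from html.parser import HTMLParser

# Made-up page for illustration: a song decorated with Open Graph tags.
SAMPLE_HTML = """
<html><head>
<meta property="og:type" content="music.song" />
<meta property="og:title" content="Example Song" />
<meta name="description" content="not Open Graph" />
</head><body></body></html>
"""


class OpenGraphParser(HTMLParser):
    """Collects og:* properties found in <meta> tags."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        content = attr_map.get("content")
        # Keep only Open Graph properties that carry a value.
        if prop.startswith("og:") and content is not None:
            self.properties[prop] = content


def extract_open_graph(html: str) -> dict:
    """Return a dict mapping og:* property names to their content."""
    parser = OpenGraphParser()
    parser.feed(html)
    parser.close()
    return parser.properties


if __name__ == "__main__":
    print(extract_open_graph(SAMPLE_HTML))
```

Each page then reduces to a small typed record (here a `music.song` with a title) instead of free text, which is what makes entity-level search and discovery tractable.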

Between Web search--which today by necessity includes reference (Wikipedia) and common entity search--on the one hand, and the long-scrolling Wall on the other, there's an undeveloped axis for a new model of social discovery. It's very reminiscent of the shift from textual IR to Web search that we saw in the late 90's.

All of this brings me back around to the Common Crawl mission and data set. Up until very recently, if you wanted to study the Web and its deeper nature and evolution, unless you were among a privileged few, access to sufficient Web crawl content was almost prohibitively expensive. The twin specters of storage and bandwidth alone, not to mention the computational horsepower required to study the data gathered, were more than enough to discourage almost anyone.

But today, thanks to groups like Common Crawl and Amazon Web Services, data and computational muscle are free and/or affordable in ways that make innovation--or even new understanding--possible at Web scale, by almost anyone. In the past few months, I've been leveraging these new tools to dig far deeper into the problems I laid out above than I ever imagined (see "Study of ~1.3 Billion URLs: ~22% of Web Pages Reference Facebook" and "Data Mining the Web: $100 Worth of Priceless"). And this is just the beginning...

My hope is that access to this data and these tools, including broader exposure to the kinds of work possible (see here), will inspire more and more companies, groups, and even tinkering engineers to push the envelope, and to make that ever greater graph of human knowledge and interaction ever more accessible, discoverable, and ultimately useful.

In that spirit, we welcome any questions about what we're doing with the data, how we're doing it, or what we aim to solve at Lucky Oyster; just send a friendly note to audacity at lucky oyster dot com. Or if you're in Las Vegas for the upcoming re:Invent show, I'll be presenting more of this material along with Lisa from Common Crawl.

Erratum:

Content is truncated

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
