March 26, 2015

5 Good Reads in Big Open Data: March 26 2015

Note: this post has been marked as obsolete.

Analyzing the Web For the Price of a Sandwich - via Yelp Engineering Blog: a Common Crawl use case from the December 2014 Dataset finds 748 million US phone numbers “I wanted to explore the Common Crawl in more depth, so I came up with a (somewhat contrived) use case of helping consumers find the web pages for local businesses…”

Common Crawl Foundation

Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

Analyzing the Web For the Price of a Sandwich – via Yelp Engineering Blog: a Common Crawl use case from the December 2014 Dataset finds 748 million US phone numbers
I wanted to explore the Common Crawl in more depth, so I came up with a (somewhat contrived) use case of helping consumers find the web pages for local businesses. Yelp has millions of businesses in its index and we like to provide links back to a business’s own web page wherever possible, but there are plenty of cases where we just don’t have that information.
Let’s try to use mrjob and the Common Crawl to help match businesses from Yelp’s database to the possible web pages for those businesses on the Internet.
Open Source does NOT mean a lack of security -via Information Age: Businesses are increasingly moving to Open Source platforms to reduce costs and improve efficiency; however, many mistakenly believe that Open Source means a tradeoff in security.
Utility Companies should use Machine Learning– via Intelligent Utility: Machine learning can have a huge impact on energy efficiency, customer usage incentive programs and personalizing the customer experience around energy usage

Load Curve graph (via Intelligent Utility) demonstrates "Energy Personalities" of customers — Load Curve graph (via Intelligent Utility) demonstrates “Energy Personalities” of customers

QVC loses lawsuit against Resultly in Web Crawl case via Forbes: under the Computer Fraud & Abuse Act, the courts ruled that Resultly did not demonstrate any intent to damage QVC’s systems, but their overloading of QVC’s servers was a result of “wrinkles in its business operations.”
Can Data Science actually predict the perfect March Madness bracket?– via Sport Techie: (Apparently not)
Cukierski explains: “It is hard to say how well machine learning has improved forecasts prior to Kaggle; allow people to predict before the beginning of the tournament–make a prediction for every single game that could ever occur in the tournament. However, last year we had around ten teams who beat Vegas odds, which are considered to be state-of-the-art.”
“So there is something there.”
Still, they have plenty of people producing predictions, which, statistically, that means some of these teams bound to get lucky. The volume exceeds the propensity for the result to be actualized. Over a short interval of time, though, the execution doesn’t necessarily earmark for these data scientists to be deemed experts in any fashion.
In the end, the odds of forecasting a perfect bracket are slim to none as it gets–predicated on as much luck as it does data science.

Follow us @CommonCrawl on Twitter for the latest in Big Open Data. If you value Open Data, please make a donation to the Common Crawl Foundation.

This release was authored by:

No items found.

Erratum:

Content is truncated

Originally reported by:

Permalink

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.

For more details, see our truncation analysis notebook.

5 Good Reads in Big Open Data: March 26 2015

Erratum:

Content is truncated

The Data

Overview

Web Graphs

Latest Crawl

Crawl Stats

Graph Stats

Errata

Resources

Get Started

AI Agent

Blog

Examples

Use Cases

CCBot

Infra Status

FAQ

Community

Research Papers

Mailing List Archive

Hugging Face

Discord

Collaborators

About

Team

Jobs

Mission

Impact

Privacy Policy

Terms of Use