March 26, 2015

5 Good Reads in Big Open Data: March 26 2015

Common Crawl Foundation
  1. Analyzing the Web For the Price of a Sandwich – via Yelp Engineering Blog: a Common Crawl use case from the December 2014 Dataset finds 748 million US phone numbers
     “I wanted to explore the Common Crawl in more depth, so I came up with a (somewhat contrived) use case of helping consumers find the web pages for local businesses. Yelp has millions of businesses in its index and we like to provide links back to a business’s own web page wherever possible, but there are plenty of cases where we just don’t have that information.
     Let’s try to use mrjob and the Common Crawl to help match businesses from Yelp’s database to the possible web pages for those businesses on the Internet.”
  2. Open Source does NOT mean a lack of security – via Information Age: Businesses are increasingly moving to Open Source platforms to reduce costs and improve efficiency; however, many mistakenly believe that Open Source means a tradeoff in security.
  3. Utility Companies should use Machine Learning – via Intelligent Utility: Machine learning can have a huge impact on energy efficiency, customer usage incentive programs and personalizing the customer experience around energy usage.
     Load Curve graph (via Intelligent Utility) demonstrates “Energy Personalities” of customers
  4. QVC loses lawsuit against Resultly in Web Crawl case – via Forbes: under the Computer Fraud & Abuse Act, the courts ruled that Resultly did not demonstrate any intent to damage QVC’s systems; rather, its overloading of QVC’s servers was a result of “wrinkles in its business operations.”
  5. Can Data Science actually predict the perfect March Madness bracket? – via Sport Techie: (Apparently not)
     Cukierski explains: “It is hard to say how well machine learning has improved forecasts prior to Kaggle; allow people to predict before the beginning of the tournament–make a prediction for every single game that could ever occur in the tournament. However, last year we had around ten teams who beat Vegas odds, which are considered to be state-of-the-art.”
     “So there is something there.”
     Still, they have plenty of people producing predictions, which, statistically, means some of these teams are bound to get lucky. The volume exceeds the propensity for the result to be actualized. Over a short interval of time, though, the execution doesn’t necessarily earmark for these data scientists to be deemed experts in any fashion.
     In the end, the odds of forecasting a perfect bracket are as slim as it gets, predicated as much on luck as on data science.
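
For readers curious how the Yelp use case might look in code, here is a minimal sketch of the general idea: pull US phone numbers out of page text, then join them against a list of businesses. It is not the Yelp post's actual code. That post uses mrjob over Common Crawl data, while this sketch uses plain Python functions shaped like a mapper and a reducer. The regular expression, the function names and the sample data are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Loose, illustrative pattern for US phone numbers such as
# (415) 555-0134, 415-555-0134, 415.555.0134 or 1-415-555-0134.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})(?!\d)"
)


def map_page(url, text):
    """Mapper step: yield (ten_digit_phone, url) for each number found in a page."""
    for area, prefix, line in PHONE_RE.findall(text):
        yield area + prefix + line, url


def reduce_phones(pairs):
    """Reducer step: collect the set of URLs seen for each phone number."""
    pages_by_phone = defaultdict(set)
    for phone, url in pairs:
        pages_by_phone[phone].add(url)
    return pages_by_phone


def match_businesses(businesses, pages):
    """Join hypothetical business records (name -> phone) against crawled pages."""
    pairs = (pair for url, text in pages for pair in map_page(url, text))
    pages_by_phone = reduce_phones(pairs)
    return {
        name: sorted(pages_by_phone.get(phone, ()))
        for name, phone in businesses.items()
    }


if __name__ == "__main__":
    # Made-up sample data, not taken from Yelp or Common Crawl.
    businesses = {"Sandwich Shop": "4155550134", "No Site Diner": "2125550199"}
    pages = [
        ("http://example.com/a", "Call us at (415) 555-0134 today"),
        ("http://example.com/b", "Phone: 415.555.0134"),
    ]
    print(match_businesses(businesses, pages))
```

A shared phone number is only a candidate match. A real pipeline would need further checks, such as comparing business names or addresses, before linking a page to a business.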

Follow us @CommonCrawl on Twitter for the latest in Big Open Data. If you value Open Data, please make a donation to the Common Crawl Foundation.
