March 20, 2015

5 Good Reads in Big Open Data: March 20 2015

Startup Orbital Insight uses deep learning and finds financially useful information in aerial imagery - via MIT Technology Review: “To predict retail sales based on retailers’ parking lots, humans at Orbital Insights use Google Street View images to pinpoint the exact location of the stores’ entrances. Satellite imagery is acquired from a number of commercial suppliers, some of it refreshed daily. Software then monitors the density of cars and the frequency with which they enter the lots.”

Common Crawl Foundation

Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

Startup Orbital Insight uses deep learning and finds financially useful information in aerial imagery– via MIT Technology Review:
To predict retail sales based on retailers’ parking lots, humans at Orbital Insights use Google Street View images to pinpoint the exact location of the stores’ entrances. Satellite imagery is acquired from a number of commercial suppliers, some of it refreshed daily. Software then monitors the density of cars and the frequency with which they enter the lots.
Crawford’s company can also use shadows in a city to gather information on rates of construction, especially in secretive places like China. Satellite images could also predict oil yields before they’re officially reported because it’s possible to see how much crude oil is in a container from the height of its lid. Scanning the extent and effects of deforestation would be useful to both investors and environmental groups.
Goodbye to Google Code -via eweek.com: Google is closing it’s open source project. With hosts like GitHub and BitBucket, users have migrated and Google Code is no longer needed.
Trends in Big Data Vs Hadoop Vs Business Intelligence– via Hadoop 360: Visualizing how interest has changed over the years

Screen Shot 2015-03-19 at 12.26.02 PM — Image via Hadoop360

Analysis of Common Crawl PDF metadata via PDFinfo.net

Open Data should be the new Open Source– via Computerworld:
But the lack of open data still seriously holds innovation back, and as data becomes more critical, the problem becomes worse.
For example, think about how hard it is for innovative predictive analytics companies to get off the ground. It’s not that they don’t have the software; it’s that they don’t have the data. There are plenty of excellent open source projects to build on top of (Sci-Py, R, etc.). But the lack of usable data is a huge issue when it comes to testing and training the algorithms in any domain.
The same exact thing would be true when an entrepreneur starts an e-commerce company. A high quality search engine is crucial in e-commerce and there plenty of great tools to build the search infrastructure such as Lucene, but no good datasets to test and train the ranking and relevance algorithms.
Which is to say this: There are smart, creative data scientists out there who don’t have the tools to do valuable work.

Follow us @CommonCrawl on Twitter for the latest in Big Open Data. If you value Open Data, please make a donation to the Common Crawl Foundation.

This release was authored by:

No items found.

Erratum:

Content is truncated

Originally reported by:

More details

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.

5 Good Reads in Big Open Data: March 20 2015

Erratum:

Content is truncated

The Data

Overview

CDXJ Index

URL Index

Web Graphs

Latest Crawl

Crawl Stats

Graph Stats

Errata

Resources

Get Started

AI Agent

Blog

Examples

CCBot

Infra Status

Opt-Out Registry

FAQ

Community

Research Papers

Mailing List Archive

Hugging Face

Discord

Collaborators

About

About

Team

Jobs

Privacy Policy

Terms of Use