5 Good Reads in Big Open Data: March 26 2015

  1. Analyzing the Web For the Price of a Sandwich – via Yelp Engineering Blog: a Common Crawl use case that extracts 748 million US phone numbers from the December 2014 dataset

    I wanted to explore the Common Crawl in more depth, so I came up with a (somewhat contrived) use case of helping consumers find the web pages for local businesses. Yelp has millions of businesses in its index and we like to provide links back to a business’s own web page wherever possible, but there are plenty of cases where we just don’t have that information.

    Let’s try to use mrjob and the Common Crawl to help match businesses from Yelp’s database to the possible web pages for those businesses on the Internet.
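
    As a flavor of what that pipeline looks like, here is a minimal mrjob sketch (not Yelp’s production code; the regex and the normalization are deliberately simplified) that counts US-formatted phone numbers in plain-text input such as Common Crawl’s extracted-text records:

```python
import re

from mrjob.job import MRJob

# Matches common US formats: (555) 123-4567, 555-123-4567, 555.123.4567
PHONE_RE = re.compile(r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}')


class MRPhoneNumbers(MRJob):
    def mapper(self, _, line):
        for raw in PHONE_RE.findall(line):
            # Normalize to bare digits so formatting variants collapse
            yield re.sub(r'\D', '', raw), 1

    def reducer(self, number, counts):
        yield number, sum(counts)


if __name__ == '__main__':
    MRPhoneNumbers.run()
```

    Run it locally with `python phone_numbers.py input.txt`, or hand the same script to Elastic MapReduce with mrjob’s `-r emr` runner to scale it out.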


  2. Open Source does NOT mean a lack of security – via Information Age: Businesses are increasingly moving to Open Source platforms to reduce costs and improve efficiency; however, many mistakenly believe that Open Source means a tradeoff in security.


  3. Utility Companies should use Machine Learning– via Intelligent Utility: Machine learning can have a huge impact on energy efficiency, customer usage incentive programs, and personalization of the customer experience around energy usage

    Load Curve graph (via Intelligent Utility) demonstrates “Energy Personalities” of customers


  4. QVC loses lawsuit against Resultly in web crawl case – via Forbes: under the Computer Fraud & Abuse Act, the court found that Resultly showed no intent to damage QVC’s systems; its overloading of QVC’s servers was instead a result of “wrinkles in its business operations.”


  5. Can Data Science actually predict the perfect March Madness bracket?– via Sport Techie: (Apparently not)

    Cukierski explains: “It is hard to say how well machine learning has improved forecasts prior to Kaggle; [we] allow people to predict before the beginning of the tournament–[they] make a prediction for every single game that could ever occur in the tournament. However, last year we had around ten teams who beat Vegas odds, which are considered to be state-of-the-art.”

    “So there is something there.”

    Still, plenty of people are producing predictions, and statistically that means some of these teams are bound to get lucky: the sheer volume of entries outweighs the long odds against any single bracket. A good run over one short tournament, though, doesn’t establish these data scientists as experts in any fashion.

    In the end, the odds of forecasting a perfect bracket are about as slim as it gets–predicated on as much luck as on data science.
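
    For a sense of the scale (illustrative figures, not the article’s): a bracket has 63 games, so even strong per-game accuracy leaves a perfect bracket vanishingly unlikely, while a large field of entrants almost guarantees a few lucky standouts.

```python
# Back-of-the-envelope bracket math; all rates here are hypothetical.
games = 63

p_guess = 0.5 ** games         # pure coin-flipping: ~1.1e-19
print(f"1 in {2 ** games:,}")  # 1 in 9,223,372,036,854,775,808

p_skilled = 0.7 ** games       # 70% accuracy per game is still ~1.7e-10

# Beating a single-year benchmark (e.g. Vegas odds) is far easier to
# stumble into: with many entrants, a few lucky "winners" are expected.
entrants = 400                  # hypothetical field size
p_lucky_year = 0.02             # hypothetical per-team chance of a lucky year
print(entrants * p_lucky_year)  # 8 teams expected to look brilliant by chance
```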

Follow us @CommonCrawl on Twitter for the latest in Big Open Data. If you value Open Data, please make a donation to the Common Crawl Foundation.

5 Good Reads in Big Open Data: March 20 2015

  1. Startup Orbital Insight uses deep learning to find financially useful information in aerial imagery– via MIT Technology Review:

    To predict retail sales based on retailers’ parking lots, humans at Orbital Insight use Google Street View images to pinpoint the exact location of the stores’ entrances. Satellite imagery is acquired from a number of commercial suppliers, some of it refreshed daily. Software then monitors the density of cars and the frequency with which they enter the lots.

    Crawford’s company can also use shadows in a city to gather information on rates of construction, especially in secretive places like China. Satellite images could also predict oil yields before they’re officially reported because it’s possible to see how much crude oil is in a container from the height of its lid. Scanning the extent and effects of deforestation would be useful to both investors and environmental groups.
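
    The parking-lot signal boils down to a simple regression: count cars per image, fit the counts against previously reported sales, then nowcast the current quarter from fresh imagery. A toy sketch with made-up numbers (not Orbital Insight’s model):

```python
import numpy as np

# Hypothetical history: mean cars counted per satellite pass, and the
# retailer's reported quarterly sales (in $M) for the same quarters.
cars  = np.array([112.0, 98.0, 140.0, 155.0, 121.0, 167.0])
sales = np.array([310.0, 275.0, 398.0, 442.0, 336.0, 471.0])

# Least-squares fit: sales ~ a * cars + b
a, b = np.polyfit(cars, sales, 1)

# Nowcast this quarter from fresh imagery, before earnings are reported.
current_cars = 150.0
print(f"estimated sales: ${a * current_cars + b:.0f}M")
```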


  2. Goodbye to Google Code – via eweek.com: Google is shutting down its open source project-hosting service. With hosts like GitHub and Bitbucket available, users have migrated and Google Code is no longer needed.


  3. Trends in Big Data vs. Hadoop vs. Business Intelligence– via Hadoop 360: Visualizing how interest has changed over the years

    Image via Hadoop360

  4. Analysis of Common Crawl PDF metadata – via PDFinfo.net


  5. Open Data should be the new Open Source– via Computerworld:

    But the lack of open data still seriously holds innovation back, and as data becomes more critical, the problem becomes worse.

    For example, think about how hard it is for innovative predictive analytics companies to get off the ground. It’s not that they don’t have the software; it’s that they don’t have the data. There are plenty of excellent open source projects to build on top of (SciPy, R, etc.). But the lack of usable data is a huge issue when it comes to testing and training the algorithms in any domain.

    The exact same thing is true when an entrepreneur starts an e-commerce company. A high-quality search engine is crucial in e-commerce, and there are plenty of great tools to build the search infrastructure, such as Lucene, but no good datasets to test and train the ranking and relevance algorithms.

    Which is to say this: There are smart, creative data scientists out there who don’t have the tools to do valuable work.

Follow us @CommonCrawl on Twitter for the latest in Big Open Data. If you value Open Data, please make a donation to the Common Crawl Foundation.

5 Good Reads in Big Open Data: March 13 2015

  1. Jürgen Schmidhuber – Ask Me Anything – via Reddit: Jürgen has pioneered self-improving general problem solvers and Deep Learning Neural Networks for decades. He is the recipient of the 2013 Helmholtz Award of the International Neural Networks Society.

    Many think that intelligence is this awesome, infinitely complex thing. I think it is just the product of a few principles that will be considered very simple in hindsight, so simple that even kids will be able to understand and build intelligent, continually learning, more and more general problem solvers. Partial justification of this belief: (a) there already exist blueprints of universal problem solvers developed in my lab, in the new millennium, which are theoretically optimal in some abstract sense although they consist of just a few formulas.  (b) The principles of our less universal, but still rather general, very practical, program-learning recurrent neural networks can also be described by just a few lines of pseudo-code.


  2. An abridged list of Machine Learning topics – via Startup.ML: a great presentation of research, software, talks and more on Deep Learning, Graphical Models, Structured Prediction, Hadoop/Spark, Natural Language Processing and all things Machine Learning.


  3. Deeper Content Analysis with Aspects: Interest Graph Grows Beyond Topics– via Prismatic Blog: Prismatic opens up its Interest Graph with an aspect-tagging API that classifies URLs by aspect (structural content), not just topic

    Via Dave Golland – Prismatic Blog

  4. Wikipedia’s open letter to the NSA: stop spying on our users! – via The New York Times: the NSA tracks your every view and edit of a Wikipedia page, on top of your location and (if it can) who you are. Open knowledge collaboration shouldn’t come at the cost of your privacy, especially when that cost can be as high as prosecution.


  5. Generational Performance of Amazon EC2’s C3 and C4 families– via GigaOm:

    The results presented here indicate that the C4 virtual machines had 10 to 20 percent higher vCPU performance and approximately 6 GB/s more memory throughput than the C3 VMs across different machine sizes. However, after factoring in the price increases, the price-performance values of the C4 VMs averaged the same as the C3 VMs. Both vCPU performance levels and network throughput displayed high stability over time and across all tested machines. The results highlight Amazon’s effort to provide highly predictable performance outputs and to match its C4 family’s price-performance with that of its earlier generation C3 family.
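
    The price-performance conclusion is easy to sanity-check with hypothetical numbers: if C4 charges roughly the same premium as its performance gain, the ratio is unchanged.

```python
# Illustrative arithmetic only; these are not GigaOm's measured figures.
c3_price, c3_perf = 0.420, 1.00   # $/hr and normalized vCPU performance
c4_price, c4_perf = 0.483, 1.15   # ~15% pricier, ~15% faster

print(c3_perf / c3_price)  # ~2.38 performance units per dollar-hour
print(c4_perf / c4_price)  # ~2.38 -- price-performance comes out a wash
```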

Follow us @CommonCrawl on Twitter for the latest in Big Open Data

5 Good Reads in Big Open Data: March 6 2015

  1. 2015: What do you think about Machines that think?– via Edge: A.I. isn’t so artificial

    With these kinds of software challenges, and given the very real technology-driven threats to our species already at hand, why worry about malevolent A.I.? For decades to come, at least, we are clearly more threatened by the likes of trans-species plagues, extreme resource depletion, global warming, and nuclear warfare.

    Which is why malevolent A.I. rises in our Promethean fears. It is a proxy for us, at our rational peak, confidently killing ourselves.


  2. What would you do with 139TB of big open data? – via Common Crawl: We’ve just released 1.82 billion web pages for you to discover, build on, and innovate with. Check it out and please email [email protected] to share your work!

  3. Google Makes Overriding Redirection More Difficult – via Search Engine Land: Google says this move is to improve local user experience, but is The Right To Be Forgotten the true reason?

  4. Less than 40 percent of the world has ever connected to the internet – via Slate: the problems are “infrastructure, affordability and relevance,” according to Facebook’s Internet.org. This information may be disheartening to some, but it also shows what tremendous potential the web still has if we can connect the world.

  5. Hadoop gamechangers– via Opensource.com:

    Hadoop, an open source software framework with the funny sounding name, has been a game-changer for organizations by allowing them to store, manage, and analyze massive amounts of data for actionable insights and competitive advantage.

    But this wasn’t always the case.

    Initially, Hadoop implementation required skilled teams of engineers and data scientists, making Hadoop too costly and cumbersome for many organizations. Now, thanks to a number of open source projects, big data analytics with Hadoop has become much more affordable and mainstream.

    Here’s a look at how three open source projects—Hive, Spark, and Presto—have transformed the Hadoop ecosystem.
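
    As a taste of why these projects broadened Hadoop’s audience, here is a minimal sketch with Spark’s Python API (the file path and column layout are hypothetical) that performs an aggregation which once meant hand-writing a Java MapReduce job:

```python
from pyspark import SparkContext

sc = SparkContext(appName="category-revenue")

# Hypothetical CSV on HDFS with rows of the form: category,amount
lines = sc.textFile("hdfs:///data/sales.csv")

totals = (lines
          .map(lambda row: row.split(","))
          .map(lambda cols: (cols[0], float(cols[1])))
          .reduceByKey(lambda a, b: a + b))  # sum revenue per category

for category, revenue in totals.collect():
    print(category, revenue)

sc.stop()
```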

Follow us @CommonCrawl on Twitter for the latest in Big Open Data

5 Good Reads in Big Open Data: February 27 2015

  1. Hadoop is the Glue for Big Data – via StreetWise Journal: Startups trying to build a successful big data infrastructure should “welcome…and be protective” of open source software like Hadoop. The future and innovation of Big Data depend on it.


  2. Topic Models: Past, Present, and Future – via O’Reilly Data Show Podcast:

    You might analyze a bunch of New York Times articles for example, and there’ll be an article about sports and business, and you get a representation of that article that says this is an article and it’s about sports and business. Of course, the ideas of sports and business were also discovered by the algorithm, but that representation, it turns out, is also useful for prediction. My understanding when I speak to people at different startup companies and other more established companies is that a lot of technology companies are using topic modeling to generate this representation of documents in terms of the discovered topics, and then using that representation in other algorithms for things like classification or other things.
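
    For readers who want to try building that representation themselves, here is a minimal sketch using gensim’s LDA on a toy corpus (gensim is one common choice, not necessarily the tooling discussed in the episode):

```python
from gensim import corpora, models

# Toy tokenized "articles"; a real corpus would be far larger.
docs = [
    ["game", "team", "season", "coach"],
    ["market", "company", "stock", "deal"],
    ["team", "deal", "company", "season"],  # mixes sports and business
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# The per-document topic mixture is the representation fed downstream,
# e.g. [(0, 0.72), (1, 0.28)] for a mostly-"sports" article.
print(lda.get_document_topics(corpus[2]))
```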


  3. Border disputes on Europe’s Right To Be Forgotten – via Slate: Is the angle of debate (disruptors vs. regulators) wrong? Should we be thinking of more custom solutions to this global issue?


  4. FlashGraph can analyze massive graphs, to the proven tune of 129 billion edges – via the Common Crawl Blog (FlashGraph on GitHub):

    You may ask why we need another graph processing framework while we already have quite a few…FlashGraph seeks performance, capacity, flexibility and ease of programming at the moment when it was created. We hope FlashGraph can have performance comparable to the state-of-the-art in-memory graph engines while scaling to graphs with hundreds of billions of edges or even trillions of edges. We also hope that FlashGraph can express varieties of algorithms and hide the complexity of accessing data on SSDs and parallelizing graph algorithms.


  5. The future of the internet is NOT all decided by net neutrality – via The Atlantic: A wonderfully curated net neutrality reading list, including one article where Justice Antonin Scalia tells us the Internet is a pizzeria (he’s right)


Follow us @CommonCrawl on Twitter for the latest in Big Open Data

5 Good Reads in Big Open Data: Feb 20 2015

  1. Why The Open Data Platform Is Such A Big Deal for Big Data– via Pivotal P.O.V:

    A thriving ecosystem is the key for real viability of any technology. With lots of eyes on the prize, the technology becomes more stable, offers more capabilities, and importantly, supports greater interoperability across technologies, making it easier to adopt and use, in a shorter amount of time. By creating a formal organization, the Open Data Platform will act as a forcing function to accelerate the maturation of an ecosystem around Big Data.


  2. Machine Learning Could Upend Local Search – via Streetfight: From the Chairman of Common Crawl’s Board of Directors (and Factual CEO) Gil Elbaz on the future of search


  3. On opening up libraries with linked data – via Library Journal: While the rest of the web is turning into the “Web of Data,” libraries and catalogs are (partially for reasons of a closed culture) struggling to keep up


  4. Interactive map: where are we driving, busing, cabbing, walking to work? – via Flowing Data:

    Interactive: How Americans Get to Work
    Image via Flowing Data

  5. On the ongoing debate over the possible dangers of Artificial Intelligence– via Scientific American:

    Current efforts in areas such as computational ‘deep-learning’ involve algorithms constructing their own probabilistic landscapes for sifting through vast amounts of information. The software is not necessarily hard-wired to ‘know’ the rules ahead of time, but rather to find the rules or to be amenable to being guided to the rules – for example in natural language processing. It’s incredible stuff, but it’s not clear that it is a path to AI that has equivalency to the way humans, or any sentient organisms, think. This has been hotly debated by the likes of Noam Chomsky (on the side of skepticism) and Peter Norvig (on the side of enthusiasm). At a deep level it is a face-off between science focused on underlying simplicity, and science that says nature may not swing that way at all.

Follow us @CommonCrawl on Twitter for the latest in Big Open Data

5 Good Reads in Big Open Data: Feb 13 2015

  1. What does it mean for the Open Web if users don’t know they’re on the internet? – via QUARTZ:

    This is more than a matter of semantics. The expectations and behaviors of the next billion people to come online will have profound effects on how the internet evolves. If the majority of the world’s online population spends time on Facebook, then policymakers, businesses, startups, developers, nonprofits, publishers, and anyone else interested in communicating with them will also, if they are to be effective, go to Facebook. That means they, too, must then play by the rules of one company. And that has implications for us all.


  2. Hard Drive Data Sets – via Backblaze: Backblaze provides online backup services, storing data on over 41,000 hard drives ranging from 1 terabyte to 6 terabytes in size. They have released an open, downloadable dataset on the reliability of these drives.
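
    The release is a set of daily per-drive CSV snapshots with a failure flag, so a rough annualized failure rate per model takes only a few lines of pandas. A sketch (the file path is illustrative, and a real analysis would concatenate a full year of daily files):

```python
import pandas as pd

# One daily snapshot: one row per drive; 'failure' is 1 on the day it fails.
df = pd.read_csv("2014/2014-01-01.csv")

by_model = df.groupby("model").agg(
    drive_days=("failure", "size"),  # rows observed = drive-days
    failures=("failure", "sum"),
)

# Annualized failure rate: failures per drive-year of observation.
by_model["afr"] = by_model["failures"] / (by_model["drive_days"] / 365.0)
print(by_model.sort_values("afr", ascending=False).head())
```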


  3. The Open Source Question: critically important web infrastructure is woefully underfunded – via Slate: on the strange dichotomy of Silicon Valley: a “hypercapitalist steamship powered by its very antithesis”


  4. February 21st is Open Data Day – via Spatial Source: use this interactive map to find an Open Data event near you (or add your own)

    International Open Data Hackathon
    Image Source: opendataday.org/map

  5. Security is at the heart of the web – via O’Reilly Radar:

    …we want to be able to go to sleep without worrying that all of those great conversations on the open web will endanger the rest of what we do.

    Making the web work has always been a balancing act between enabling and forbidding, remembering and forgetting, and public and private. Managing identity, security, and privacy has always been complicated, both because of the challenges in each of those pieces and the tensions among them.

    Complicating things further, the web has succeeded in large part because people — myself included — have been willing to lock their paranoias away so long as nothing too terrible happened.

Follow us @CommonCrawl on Twitter for the latest in Big Open Data

5 Good Reads in Big Open Data: Feb 6 2015

  1. The Dark Side of Open Data – via Forbes:

    There’s no reason to doubt that opening to the public of data previously unreleased by governments, if well managed, can be a boon for the economy and, ultimately, for the citizens themselves. It wouldn’t hurt, however, to strip out the grandiose rhetoric that sometimes surrounds them, and look, case by case, at the contexts and motivations that lead to their disclosure.


  2. Bigger Data; Same Laptop – via Frank McSherry: throwing more machines at a problem isn’t necessarily the best approach; a laptop can outperform clusters when used effectively. This post uses the Web Data Commons 128-billion-edge Hyperlink Graph, created using Common Crawl data, to showcase that.
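
    As a toy illustration of the single-machine approach (in Python rather than McSherry’s Rust, and over a five-edge graph rather than 128 billion edges), a few PageRank iterations need nothing more than arrays:

```python
# Toy single-machine PageRank over an in-memory edge list.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]  # (src, dst) pairs
n = 4
damping = 0.85

out_degree = [0] * n
for src, _ in edges:
    out_degree[src] += 1

rank = [1.0 / n] * n
for _ in range(20):
    incoming = [0.0] * n
    for src, dst in edges:
        incoming[dst] += rank[src] / out_degree[src]
    rank = [(1 - damping) / n + damping * r for r in incoming]

print(rank)  # the most mass should land on well-linked node 2
```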


  3. Fixing Verizon’s permacookie – via Slate: 9 lines of code could make Verizon’s controversial user-tracking system slightly less invasive and much less creepy.


  4. Interact with the Committee to Protect Journalists’ data – via Reuters Graphics: an interactive map of journalists killed, over time and by location

    Source: Committee to Protect Journalists
    Graphic by Matthew Weber/Reuters Graphics


  5. The EU wants the rest of the world to forget too – via The New York Times:

    Countries have different standards for acceptable speech and for invasions of privacy. American libel laws, for example, are much more permissive than those in Britain. That’s why authors sometimes find it easier to have some books published in the United States than in Britain. There is no doubt that the Internet has made it harder for governments to enforce certain rules and laws because information is not easily contained within borders. But that does not justify restricting the information available to citizens of other countries.

Follow us @CommonCrawl on Twitter for the latest in Big Open Data