Data on Pocket

This started out on Twitter and I expected to have to write up maybe 50 or so, but then it made the rounds. I don’t think I have 1200+ papers that I’d recommend reading, and even if I did, I don’t think a reading list of that length is actually useful to anyone.

Many aspiring data scientists focus on doing Kaggle competitions as a way to build their portfolios. Kaggle is an excellent way to practice, but it should only be one of many avenues you use to work on data science projects.

WASHINGTON — The Justice Department is trying to force an internet hosting company to turn over information about everyone who visited a website used to organize protests during President Trump’s inauguration, setting off a new fight over surveillance and privacy limits.

Xu Li’s software scans more faces than perhaps any other on earth. He has the Chinese police to thank. Xu runs SenseTime Group Ltd., which makes artificial intelligence software that recognizes objects and faces, and counts China’s biggest smartphone brands as customers.

The finance ministry’s Economic Survey used Big Data for policy analysis in the first volume, while the second volume used machine learning to extract information from satellite images. The view from above as your plane lands in Mumbai during the monsoon months is a revelation.

It’s in our nature to compare things. What’s better? What’s worse? What falls in the middle? However, make sure you don’t end up in an apples and oranges situation where the comparisons don’t even make sense or are completely useless.

For years, SEOs have faced multiple paths when it comes to career development options. For some, the general options involving web development or traditional marketing roles have dominated the conversation, leaving the data wonks out in the cold.

SAN FRANCISCO (Reuters) - A U.S. federal judge on Monday ruled that Microsoft Corp's (MSFT.O) LinkedIn unit cannot prevent a startup from accessing public profile data, in a test of how much control a social media site can wield over information its users have deemed to be public.

Today we’re excited to announce the general availability of AWS Glue. Glue is a fully managed, serverless, and cloud-optimized extract, transform and load (ETL) service. Glue is different from other ETL services and platforms in a few very important ways.

I joined LinkedIn about six years ago at a particularly interesting time. We were just beginning to run up against the limits of our monolithic, centralized database and needed to start the transition to a portfolio of specialized distributed systems.

Data is ubiquitous — but sometimes it can be hard to see the forest for the trees, as it were. Many companies of various sizes believe they have to collect their own data to see benefits from big data analytics, but it’s simply not true.

I joined Facebook in 2011 as a business intelligence engineer. By the time I left in 2013, I was a data engineer. I wasn’t promoted or assigned to this new role. Instead, Facebook came to realize that the work we were doing transcended classic business intelligence.

As I was browsing the web and catching up on some sites I visit periodically, I found a cool article from Tom Hayden about using Amazon Elastic Map Reduce (EMR) and mrjob to compute some statistics on win/loss ratios for chess games he downloaded from the millionbase archive.

For a company like Slack that strives to be as data-driven as possible, understanding how our users use our product is essential. We knew when we started building this system that we would need flexibility in choosing the tools to process and analyze our data.

In the interest of getting back into writing, I want to break the seal with a simple “what have I been up to and thinking about lately” style post. Hopefully future topics will be more focused and frequent. For the past year, I have been working on data and analytics at GitHub.

There are three elements to our "big data" efforts, or unhyped normal data efforts: Data Collection, Data Reporting, and Data Analysis. We are all aware that the best companies in the world have an optimal DC-DR-DA allocation when it comes to time/money/people: 15%-20%-65%.

We’ve entered the age of big data, in which more and more companies are seeing the value and importance of data in many different areas of their business, from market and customer research, to internal sales figures and HR analytics.

Data Science, Machine Learning, Big Data Analytics, Cognitive Computing … well, all of us have been avalanched with articles, skills-demand infographics, and points of view on these topics (yawn!). One thing is for sure: you cannot become a data scientist overnight.

You don’t need to be a seasoned data scientist or have a degree in graphic design in order to create incredible data visualisations. It has become a lot simpler to mine your data and interpret your insights in an engaging, attractive, and most importantly easy to understand way.

Data science projects offer you a promising way to kick-start your analytics career. Not only do you get to learn data science by applying it, you also get projects to showcase on your CV. Nowadays, recruiters evaluate a candidate’s potential by his/her work, not so much by certificates and resumes.

In preparing this talk I decided to check out the data landscape, since I hadn't seen it for a while. The terminology around Big Data is surprisingly bucolic. Data flows through streams into the data lake, or else it's captured in logs.

The goal of this tutorial is to introduce the steps for building an interactive visualization of geospatial data. To do this, we will use a dataset from a Kaggle competition to build a data visualization that shows the distribution of mobile phone users in China.

How can organizations leverage data as a strategic asset? Data comes at a high price. Businesses must pay for data collection and cleansing, hosting and maintenance, salaries of data engineers, data scientists and analysts, risk of breach and so on. The line items add up.

More than a million people have now used our Wolfram|Alpha Personal Analytics for Facebook. And as part of our latest update, in addition to collecting some anonymized statistics, we launched a Data Donor program that allows people to contribute detailed data to us for research purposes.

This post is part of a series covering the exercises from Andrew Ng's machine learning class on Coursera. The original code, exercise text, and data files for this post are available here. One of the pivotal moments in my professional development this year came when I discovered Coursera.

AN OIL refinery is an industrial cathedral, a place of power, drama and dark recesses: ornate cracking towers its gothic pinnacles, flaring gas its stained glass, the stench of hydrocarbons its heady incense.

Many conversations about data and analytics (D&A) start by focusing on technology. Having the right tools is critically important, but too often executives overlook or underestimate the significance of the people and organizational components required to build a successful D&A function.

“Learn to code.” Around the world, it has been a familiar refrain. And the world has duly learned. Time was that the standard in-school computer education was little more than word processing, spreadsheets and some basic programming.

Big Data, Data Sciences, and Predictive Analytics are the talk of the town, and it doesn’t matter which town you are referring to; it’s everywhere, from the White House hiring DJ Patil as the first chief data scientist to the United Nations using predictive analytics to forecast bombings on schools.

Update: This article discusses the lower half of the stack. For the rest, see Part II: The Edge and Beyond. Uber’s mission is transportation as reliable as running water, everywhere, for everyone. To make that possible, we create and work with complex data.

Data Science is an ever-growing field; there are numerous tools and techniques to remember. It is not possible for anyone to remember all the functions, operations, and formulas of each concept. That’s why we have cheat sheets.

This blogpost is an excerpt of Springboard's free guide to data science jobs and originally appeared on the Springboard blog. Most data scientists use a combination of skills every day, some of which they have taught themselves on the job or otherwise. They also come from various backgrounds.

Dear Lifehacker, I've been hearing more and more about "big data." What is it, and is it something I should be worried about? Is this another way companies harvest my data and sell it? Dear Bewitched by Buzzwords, "Big data" is the latest tech industry buzz-phrase.

In a tech startup industry that loves its shiny new objects, the term “Big Data” is in the unenviable position of sounding increasingly “3 years ago”. While Hadoop was created in 2006, interest in the concept of “Big Data” reached fever pitch sometime between 2011 and 2014.

It feels good to be a data geek in 2017. Last year, we asked “Is Big Data Still a Thing?”, observing that since Big Data is largely “plumbing”, it has been subject to enterprise adoption cycles that are much slower than the hype cycle.

The software industry today is in need of a new kind of designer: one proficient in the meaning, form, movement, and transformation of data. I believe this Data Designer will turn out to be the most important new creative role of the next five years.

Our new Keystone data pipeline went live in December of 2015. In this article, we talk about the evolution of Netflix’s data pipeline over the years. This is the first of a series of articles about the new Keystone data pipeline. Netflix is a data-driven company.

Service-Oriented Architecture has a well-deserved reputation amongst Ruby and Rails developers as a solid approach to easing painful growth by extracting concerns from large applications. These new, smaller services typically still use Rails or Sinatra, and use JSON to communicate over HTTP.

As the big data analytics market rapidly expands to include mainstream customers, which technologies are most in demand and promise the most growth potential? The answers can be found in TechRadar: Big Data, Q1 2016, a new Forrester Research report evaluating the maturity of big data technologies.

Daniel (not his real name) was a VP human resource manager at a Fortune 500 company. I asked him whether he had collected any data that could provide him with insights into systematic patterns. “I made sure we get exit interviews done with every single employee who is leaving us,” he replied.

A year ago, I dropped out of one of the best computer science programs in Canada. I started creating my own data science master’s program using online resources. I realized that I could learn everything I needed through edX, Coursera, and Udacity instead.

Alarmed that decades of crucial climate measurements could vanish under a hostile Trump administration, scientists have begun a feverish attempt to copy reams of government data onto independent servers in hopes of safeguarding it from any political interference.

What shall we make? This talk is about what’s coming next for us as designers. What’s the work ahead for us, and what’s our role and responsibility in this future? Designing for what’s next is what my studio Big Medium does.

While working on my statistical analysis of 142 million Reddit submissions last year, I had a surprising amount of trouble setting things up. It took a few hours to download the 40+ gigabytes of compressed data, and another few hours to parse the data and store it in a local database.

Imagine if a company’s three highly valued data scientists can happily work together without duplicating each other’s efforts and can easily call up the ingredients and results of each other’s previous work. That day has come.

Functional programming has been on the rise the last few years. Languages such as Clojure, Scala and Haskell have brought to the eyes of imperative programmers interesting techniques that can provide significant benefits in certain use cases. Immutable data structures are one such technique.
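The excerpt above gestures at functional techniques without showing one; here is a minimal Python sketch of immutability, the idea it mentions. The `Account` type and `deposit` function are illustrative names of my own, not from the article:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)  # instances cannot be mutated after creation
class Account:
    owner: str
    balance: int

def deposit(account: Account, amount: int) -> Account:
    # Instead of mutating in place, return a new value;
    # the old one stays valid, which makes code easier to reason about.
    return replace(account, balance=account.balance + amount)

a = Account("ada", 100)
b = deposit(a, 50)
```

Because `a` is never modified, any code holding a reference to it is unaffected by the deposit, which is the core benefit imperative programmers borrow from functional languages.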

Before MongoDB, before Cassandra, before “NoSQL”, there was Lucene. Did you know that Doug Cutting wrote the first versions of Lucene in 1999? To put things in context, this was around the time Google was more a research project than an actual trusted application.

You need analytics. I’m very confident of that, because today, everyone needs analytics. Not just product, not just marketing, not just finance… sales, fulfillment, everyone at a startup needs analytics today.

In the last few years I have spent significant time reading books about Data Science. I found these 7 books the best. Together, they are a very valuable source for learning the basics, driving you through everything you need to know.

You know that old saying, “If it seems too good to be true, it probably is?” We technologists should probably apply that saying to database vendor claims pretty regularly. In the summer of 2014, the Parse.ly team finally kicked the tires on Apache Cassandra.

When dealing with data, it helps to have a well-defined workflow. Specifically, whether we want to perform an analysis with the sole intent of "telling the story" (Data Visualisation/Journalism) or build a system that relies on data to model a certain task (Data Mining), process matters.

Big data! If you don’t have it, you better get yourself some. Your competition has it, after all. Bottom line: If your data is little, your rivals are going to kick sand in your face and steal your girlfriend.

Interested in landing a job as a data scientist? You’re in good company – a recent article by Thomas Davenport and D.J. Patil in the Harvard Business Review calls ‘data scientist’ the sexiest job of the 21st century.

Data is essential to us at Airbnb. We characterize data as the voice of our users at scale. Thus, data science plays the role of an interpreter — we use data and statistics to understand our users and translate that understanding into a voice that people or machines can understand.

A BK-tree is a tree data structure specialized to index data in a metric space. A metric space is essentially a set of objects which we equip with a distance function \(d(a, b)\) for every pair of elements \((a, b)\).
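To make the structure concrete, here is a minimal BK-tree sketch in Python, using Levenshtein edit distance as the metric \(d\). All names are illustrative; the pruning rule in `search` follows directly from the triangle inequality:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, distance):
        self.distance = distance
        self.root = None  # (item, {distance: child_node})

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            d = self.distance(item, node[0])
            child = node[1].get(d)
            if child is None:
                node[1][d] = (item, {})
                return
            node = child

    def search(self, query, radius):
        # Return every stored item within `radius` of `query`.
        if self.root is None:
            return []
        results, stack = [], [self.root]
        while stack:
            item, children = stack.pop()
            d = self.distance(query, item)
            if d <= radius:
                results.append(item)
            # Triangle inequality: only children whose edge label k
            # satisfies d - radius <= k <= d + radius can match.
            for k, child in children.items():
                if d - radius <= k <= d + radius:
                    stack.append(child)
        return results
```

For example, after inserting `["book", "books", "cake", "boo", "cape"]`, `tree.search("book", 1)` visits only the subtrees the triangle inequality cannot rule out, returning `"book"`, `"books"` and `"boo"` while never computing the distance to `"cape"`.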

A public dataset is any dataset that is stored in BigQuery and made available to the general public. This page lists a special group of public datasets that Google BigQuery hosts for you to access and integrate into your applications.