Innovation in data processing and machine learning technology

Big data sessions by customers at GCP NEXT 2016

Wednesday, March 30, 2016

That was fun. Many thanks to all of you who came to San Francisco last week for our GCP NEXT 2016 event. My colleagues and I are grateful for all the discussions we shared with current and future users of our data processing services. It was at the same time energizing and exhausting (once it was over I felt like putting on my favorite t-shirt for a day).

But many of you did not get to make the trip to San Francisco, and even those of you who were there in person probably didn’t get to attend all the sessions you wanted, with three tracks going on in parallel. Fortunately, all the sessions were recorded and the videos have now been posted.

Since this blog is focused on big data, you’ll probably be most interested in the Data & Analytics track. Within that track, I want to quickly highlight a few great sessions focused on data processing and analytics (there are other sessions in the track, focused on storage and machine learning, which you really should also watch, but I’m not covering them here).

While a few of us Googlers occasionally show up in the sessions, they are all primarily delivered by customers, describing how they use GCP’s serverless system to solve real problems quickly, efficiently and at scale. For starters, you can watch the portion of the keynote about Spotify’s move to Google Cloud, where Nicholas Harteau, Spotify’s VP of infrastructure, explains that “the thing that drove us to Google, over other cloud providers, was the data platform.” This statement was echoed by Spotify engineers in two sessions later in the conference.

In the first one, Kenny Ning and Emily Samuels described how they now use BigQuery as a much faster and simpler way to run ad-hoc queries at Spotify. For example, Kenny and Emily explain that the queries they use to calculate KPIs used to take 20 minutes on Hive and now take 10 to 20 seconds on Google BigQuery.

In another session, Igor Maravić described how his team rebuilt, on Google Cloud Pub/Sub and Google Cloud Dataflow, the infrastructure that ingests and processes in streaming mode the hundreds of thousands of events Spotify users generate every second. Later in that same session, Neville Li discussed his open source project Scio, a new Scala SDK on top of Google Cloud Dataflow, which is primarily Java based (the Python version, now in limited preview, was announced at NEXT).

If you want to know even more, Igor posted a three-part series on the Spotify engineering blog, which covers their journey to Google Cloud: part 1, part 2, and part 3.

In the session, Igor demonstrates handling 2 million events per second in Cloud Pub/Sub and Cloud Dataflow without breaking a sweat.

Two million you say? In their session, Neil Palmer and Todd Ricker from FIS Global decided to raise that by an order of magnitude, simply because that’s what they need to handle U.S. stock exchange market events as they happen. And so they show how they use Cloud Dataflow, Cloud Bigtable and BigQuery to handle 34 million event writes per second and 22 million event reads per second.

In another session, Pablo Caif from Shine Technologies described how he and his colleagues used Cloud Dataflow and BigQuery to build the data analytics platform for Telstra, Australia’s largest ISP, on Google Cloud.

By the way, Pablo and his colleague Graham Polley are some of the most entertaining writers in the big data space, and have written several blogs about their experience using GCP products. I especially recommend:

Finally, at the risk of breaking my promise to focus on customer-centric sessions, I’ll recommend one last session which, in truth, is split in half between a Googler and a customer. But the Googler in question is one of the main BigQuery engineers, Jordan Tigani, who describes some of the under-the-hood technical improvements in BigQuery, which are producing up to 10x improvements in query performance, with no interruption of service or any other change needed for customers. Jordan also announces a 50% price drop on stored tables older than 90 days, and demonstrates running ad-hoc SQL queries on a petabyte table. In the same session, Costas Piliotis from Kabam describes Kabam’s transition from Amazon Redshift to BigQuery for game analytics. Suffice it to say that it’s a lot harder to cheat and avoid detection in Kabam games now that BigQuery’s on the job.
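If you want a feel for what that storage price drop could mean for your own bill, here is a rough back-of-the-envelope sketch in Python. It only encodes the rule as stated in the session (50% off for stored tables older than 90 days). The function name, the per-GB base price and the table sizes are all made-up placeholders, and "age" is simplified to a plain number of days, so treat it as an illustration rather than a pricing calculator.

```python
def monthly_storage_cost(tables, base_price_per_gb, discount=0.5, age_threshold_days=90):
    """Estimate a monthly storage bill under the announced price drop.

    tables: iterable of (size_gb, age_days) pairs (hypothetical inputs).
    base_price_per_gb: placeholder monthly price per GB, not an official rate.
    Tables older than age_threshold_days are billed at the discounted rate.
    """
    total = 0.0
    for size_gb, age_days in tables:
        if age_days > age_threshold_days:
            rate = base_price_per_gb * (1 - discount)  # discounted rate for older tables
        else:
            rate = base_price_per_gb  # regular rate for recent tables
        total += size_gb * rate
    return total


if __name__ == "__main__":
    # Hypothetical dataset: one recent 1 TB table, one 4 TB table older than 90 days.
    example_tables = [(1000, 30), (4000, 200)]
    print(monthly_storage_cost(example_tables, base_price_per_gb=0.02))
```

With these placeholder numbers, the older 4 TB table is billed at half the rate of the recent one, which is where most of the savings would come from in a dataset dominated by historical data.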

The best part of my job, without any doubt, is hearing customers describe their experience with GCP services, and explain in their own words the value of mixing the power of Google technology with the ease, convenience and efficiency of a serverless architecture. We also believe it’s the most interesting and efficient way for those of you not yet on the platform to learn about its unique value. This is why we’re so grateful to the customers who gave so much of their time to prepare and deliver these sessions at GCP NEXT. Thank you! We hope you find the sessions interesting.