NABDConf (Not Another Big Data Conference) is back on June 1st, 2017.

NABDConf (Not Another Big Data Conference) is back for 2017! The conference by developers, for developers once again brings together engineers who have spent their careers solving difficult problems at scale with engineers who are either facing similar problems or are just plain curious about what’s going on behind the scenes at web-scale companies like Criteo, Spotify, Uber and Google. It takes place on June 1st, 2017, at the Criteo headquarters, 32 rue Blanche, Paris.

This year we will be discussing topics like moving PB-scale Hadoop operations from bare metal to Google Cloud, building billion-node, billion-edge graphs, visualizing tremendous volumes of data in a browser, keeping track of PBs of data, raw database performance, and pretty much everything you would ever want to know about building large-scale data pipelines.

In addition to spending a riveting day with world-class engineers at a small-format conference, you’ll get to hobnob with them at the after party on Criteo’s world-famous (I exaggerate, but only mildly) rooftop deck in the heart of South Pigalle (that’s in Paris, for those who don’t know!).

Registration

Book your place now to avoid missing out. The price is 50€ and all proceeds go to Techfugees. You can’t go wrong.

At that time, Spotify had 3000+ machines dedicated to running the services that power data processing: from Kafka, to Hadoop, to Spark, Storm and more. In this talk, Josh Baer, a product owner at Spotify managing the data processing migration, will walk through how the migration of these services has gone so far, outlining which services have translated to the cloud easily, which have been difficult to move and which have been replaced entirely by GCP offerings.

Speaker: Bruno Roggeri, Criteo

Title: Building a billion node / billion edge graph

At Criteo, besides our own cookie ids, we have access to billions of other identifiers:

– Mobile device ids

– Customer ID from thousands of merchants

– Email address hashes

Each of those identifiers can be associated with one another whenever we notice a joint activity.

Yup, this is a graph!

Let’s see how we started computing the groups of connected ids 2 years ago – and the problems we’re starting to face just now as we leverage more and more associations of ids.
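
To make "groups of connected ids" concrete, here is a minimal single-machine sketch using a union-find (disjoint-set) structure. The abstract does not say which algorithm Criteo uses, so treat this as a textbook illustration only: the names `DisjointSet` and `connected_groups` and the sample identifiers are made up for the example, and a billion-node graph would need a distributed approach rather than an in-memory dictionary.

```python
from collections import defaultdict


class DisjointSet:
    """Union-find over arbitrary hashable identifiers (illustrative only)."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        # Register identifiers the first time we see them.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
            return x
        # Walk up to the root of x's group.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Union by size: attach the smaller group under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def connected_groups(associations):
    """Group identifiers that are linked, directly or transitively."""
    ds = DisjointSet()
    for a, b in associations:
        ds.union(a, b)
    groups = defaultdict(set)
    for ident in list(ds.parent):
        groups[ds.find(ident)].add(ident)
    return list(groups.values())


# Hypothetical associations observed through joint activity.
pairs = [
    ("cookie:1", "device:A"),
    ("device:A", "email:h1"),
    ("cookie:2", "merchant:42"),
]
groups = connected_groups(pairs)
```

With these sample pairs, the cookie, device and email hash end up in one group because they are linked through `device:A`, while `cookie:2` and `merchant:42` form a second group.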

Speakers: Francois Visconte, Criteo, Mathieu Chataigner, Criteo

Title: DataDisco – One schema to rule them all and kill your data legacy

Big Data begets Big Legacy.

Five years of Hadoop and Kafka in production and you get geological layers of data (90 PB of data and 100 TB/day), jobs (20,000 jobs running every day), and code (300k+ LOC).

At some point, the ad-hoc, “anything goes” approach stopped working, and we had to find a way to make the whole thing data-format agnostic, to remove hardcoded datacenter locations and paths, and to move from the obligatory JSON stringly typed behemoth to a binary columnar format, all without converting any data, with no downtime, and with as little impact as possible on the code.

From a schema-less approach we moved to using schemas as the source of truth for both data and infrastructure configuration.

We will see how this approach allows us to:

– Describe data formats and locations as code

– Make Hadoop development format agnostic

– Create observable data flows

– Describe infrastructure as code

...and how this actually plays out on a large production system, in terms of infrastructure, code, and data.

Speaker: Rafal Wojdyla, Spotify

Title: Data pipeline at Spotify – from the inception to the production

We all use the same tools and frameworks to process data, but the environment and best practices differ from one company to another. In this talk Rafal, an engineer at Spotify, will present the full journey an idea has to travel from inception to a full-fledged data pipeline at Spotify. We will cover the tools and frameworks we use to ease the process of bootstrapping, testing, validating and productionizing a new data pipeline. You will hear about some of the open source tools like scio, ratatool, gcs-tools and styx, as well as some internal ones. This talk will give you a sense of how it feels to be a data engineer at Spotify, including all the struggles, and you will see that we still have a long way to go.

Data is at the core of Uber’s business and is fundamental for making informed decisions. The mission of the Visualization team at Uber is to deliver intelligence through the crafting of visual exploratory data analysis tools. To meet these needs, the team developed an open source visualization stack. In this talk Nicolas will give a brief overview of the Visualization team, their history, mission, and the most challenging problems they tackle. Then he’ll do a deep dive into their core open source components and libraries that power most data products at Uber. He’ll present their abstract and scientific data visualization stack, focusing on deck.gl, a WebGL framework for high-performance visualizations.

We’ve all heard about HyperLogLog by now (if you haven’t, don’t worry, there’s an intro!) and how it can approximate billions of distinct values in data structures on the order of a few kilobytes. While quite a lot has been published on its accuracy, there is not a ton available on its performance. More to the point, comparing hundreds of millions of 2KB data structures, even when already located in main memory, is expensive. As a result, we asked ourselves, “when should we aggregate HLL synopses and when should we aggregate raw event level data?”

In this presentation we review the performance of HLL in Vertica versus raw event level data (also in Vertica). Additionally, to get an idea of the “raw” performance of HLL we use Druid, a popular in-memory datastore with native HLL support, as a baseline.
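
For readers who have not met HyperLogLog yet, the following is a minimal pure-Python sketch of the idea, not the Vertica or Druid implementation. It assumes 2048 one-byte registers (roughly the 2KB synopsis size mentioned above), a SHA-1 based hash chosen only for reproducibility, and made-up helper names (`new_sketch`, `add`, `merge`, `estimate`). It shows why aggregating synopses is possible at all: merging two sketches is an element-wise maximum over their registers.

```python
import hashlib
import math

P = 11        # 2**11 = 2048 registers of one byte each, roughly 2 KB
M = 1 << P


def _hash64(value):
    # 64-bit hash taken from SHA-1 (an assumption, picked for determinism).
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def new_sketch():
    return bytearray(M)


def add(sketch, value):
    h = _hash64(value)
    idx = h >> (64 - P)                      # top P bits pick the register
    rest = h & ((1 << (64 - P)) - 1)         # remaining 53 bits
    rank = (64 - P) - rest.bit_length() + 1  # position of the leftmost 1-bit
    if rank > sketch[idx]:
        sketch[idx] = rank


def merge(a, b):
    # Aggregating two synopses: keep the larger value of each register.
    return bytearray(max(x, y) for x, y in zip(a, b))


def estimate(sketch):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        # Small-range correction: fall back to linear counting.
        return M * math.log(M / zeros)
    return raw


# Two hypothetical days of user ids with an overlap of 20,000 users.
day1, day2 = new_sketch(), new_sketch()
for user in range(0, 60_000):
    add(day1, user)
for user in range(40_000, 100_000):
    add(day2, user)

both_days = merge(day1, day2)
approx_distinct = estimate(both_days)   # the true answer is 100,000
```

Every merge touches all 2048 registers regardless of how many events the sketch summarizes, which is the cost the talk weighs against aggregating raw event level data.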

Speakers: Guillaume Bort, Criteo, Justin Coffey, Criteo

Title: Time-series workflow scheduling with Scala in Langoustine

There are many workflow schedulers available today, both in the FOSS and proprietary worlds. Langoustine is a new offering in this space. It is close in spirit to Airflow, but written in Scala, with a Scala DSL and a specific focus on scheduling time-series data set pipelines.

Langoustine is in production, running the vast majority of Criteo’s data pipelines on the largest Hadoop cluster in Europe. In this talk we will discuss our hits and misses in building it, as well as its future as an open source project.

Speaker: Yann Schwartz, Criteo

Title: Something Wicked This Way Comes: Detecting Fraud With Delight

Ad fraud is as widespread today as it is a touchy topic. Displaying ads and counting clicks sounds straightforward, right? Approaching it from the system perspective, you’re no longer reasoning in user interactions; you just see a huge amount of discrete HTTP requests (only loosely correlated over time). Actual fraudsters, but also spiders, browser bugs, proxies, our own bugs, and all the cute quirks of the Internet at large make it mandatory to distinguish the Legit from the Dubious.

In this talk, we’ll go through a broad bestiary of anomalies we’ve detected at Criteo. We’ll describe some of the systems and techniques we use and the privacy and scalability constraints we had to consider in designing them.

We’ll see how we tried to squeeze the latency of the detection, aggregation and mitigation feedback loop, and how sometimes the first step of Threat Modelling is just knowing what your system does in the first place.

People love or hate Graphite (https://graphiteapp.org/, https://github.com/graphite-project), and whatever your take on it might be, it’s a huge part of the open source monitoring ecosystem.

Criteo stores millions of metrics per minute, and the built-in clustering system wasn’t scalable and reliable enough for us anymore. Because of that, we started an effort to simplify the architecture and to let a proper database take care of the data. BigGraphite was born from this effort.

BigGraphite (https://github.com/criteo/biggraphite) is a set of Python plugins that glue Graphite and Cassandra together, providing perfect compatibility with the previous system and leveraging the features provided by Cassandra for high-availability and scalability.

During this talk we will take a deep dive into BigGraphite’s design choices and look at how Criteo uses Graphite internally.

We look forward to having you around for our developer event of the year!