Accelerating the SamKnows data pipeline

A key component of any platform that deals with large amounts of data is its data pipeline. We have invested significantly in ours over the years, ensuring that it’s robust and can scale to the ever increasing demands of our customers. Our latest iteration of the pipeline adds speed as a requirement. This latest pipeline allows us to deliver measurement results almost instantly to our customers and provides a firm grounding for future products that will act upon real-time data.

Before we talk in detail about what makes up our new data pipeline, we thought it would be interesting to share how it has evolved over the years in the face of increasing usage of our platform.

In 2010, the Federal Communications Commission (FCC) launched their Measuring Broadband America program, using our testing infrastructure. This was an important project that quickly amassed up to 10,000 Whiteboxes. And in 2012, we started two further large-scale projects: one for a prominent UK ISP, and another for the European Commission. Suddenly, we were supporting tens of thousands of Whiteboxes.

To handle this increase, we began sharding our data pipeline. Our data collection server would steer some customers’ data to MySQL server A, others to server B, and so on. This allowed us to scale up to support 70,000 devices on our largest customer, which was sufficient for the time. But this also had downsides in terms of maintenance and fragmentation of our platform.

2008-2011. This approach scaled to around 10,000 Whiteboxes

In 2014, a large ISP asked whether we could integrate our software directly into their routers (commonly known as CPE in the industry). This would have huge benefits for everyone: a much larger testing capacity, quicker deployment, and far fewer hardware test agents (Whiteboxes).

This project was very successful and in 2017 we had multiple ISPs wanting to integrate our Router SDK into many millions of routers! While this was exciting, it also presented us with scaling challenges in a number of areas, including the data pipeline.

This led us to embark upon a total platform overhaul - not just of the data pipeline but of our reporting platform as well. Enter, SamKnows One.

2011-2017: This pipeline handled around 70,000 Whiteboxes per customer at peak

Testing at scale

Our intention was to create a new pipeline that could support many millions of devices in a single logical database. This would allow us to deliver a unified portal (SamKnows One) to all of our customers.

After trialling a few different options, we settled on basing our solution around Presto, a Hadoop-based big data store, originally authored by Facebook. This is also the same database that underpins AWS’s Athena data store. This resolved all our scaling issues. In early 2020, we even had one ISP customer running 160 million DNS tests per day, all being stored in our Presto cluster!

However, like all of our earlier pipelines, everything operated using batch processing. This meant that it could take up to 30 minutes from a measurement being executed to the result appearing in SamKnows One. In the grand scheme of things, this is actually quite fast - it can take Google Analytics a day to show data. But when you’re a large ISP under pressure to respond to reports of network outages or under-performance, 30 minutes is too long. Moreover, a 30 minute delay made it impractical to implement some of our new product ideas!

Detailed results in seconds

Our new data pipeline sought to resolve a number of pain points we had with our Presto-based pipeline. Chief among these was the delay in data reaching SamKnows One.

We’ve re-architected the entire pipeline to remove batch processing; we are now streaming data. But we still retain the ability to buffer data at any point in the pipeline if connectivity to the subsequent part is temporarily lost.

Our new pipeline uses Apache Flume, Kafka, and Google BigQuery. The entire stack is deployed inside Google’s Cloud Platform, rather than in our own datacenters, which allows us to scale up much more easily as demand grows.

2020-Present

Apache Flume replaces our home-grown data collection server, which was showing its age after ten years, and is one less thing for us to maintain.

Kafka provides us with the ability to stream our data in real-time to multiple consuming applications in our environment. Right now this is limited to BigQuery and a few other internal tools. But this event stream will be a key component of some future products that rely on a truly real-time data flow.

BigQuery provides us long term data storage and an SQL interface for SamKnows One and other on-demand systems to interrogate our measurement data. BigQuery has improved significantly from when we looked at it a few years ago, and now supports streaming ingestion of data. Moreover, we can choose to host different parts of our database in different regions, which helps with meeting compliance and regulatory requirements.

Our new pipeline has dramatically improved the time between us receiving data at our edge and making it available to our customers in SamKnows One. Historically, this has been around 30 minutes, and with the new pipeline this is now around 5 seconds. This figure is even lower for applications that consume a real-time feed of our results directly from Kafka. We are carrying out a staged migration to the new data pipeline, during which time both our Presto and BigQuery based pipelines will operate in parallel. This migration will be completely transparent.

And since we’re committed to “always improving” (one of our company values), it was perhaps apt that the first question our CEO asked when Sam presented our new pipeline to the team was, “Can we make it faster?!”