I’m excited to introduce you to a project I’ve been working on recently, which I am tentatively naming Snowplow Serverless: an implementation of (a minimal subset of features of) the Snowplow Collector and Enrich components entirely as functions for AWS Lambda, using the Serverless framework.

To give a bit of background, most of my posts on here are based on my work leading the data architecture at Property Finder Group, where we are heavy users of the Snowplow streaming stack.

However, I’ve worked in the charity sector in the past and continue to do occasional pro-bono advisory work with small charities and social enterprises. For organisations like these, even the most basic Snowplow infrastructure is prohibitively expensive: a minimal real-time Snowplow deployment with a relational database costs on the order of hundreds of dollars a month, which immediately places it out of reach.

(Snowplow Mini goes part of the way but serves the distinct use case of experimentation for new users, rather than production cost-saving.)

In contrast, a Lambda-based deployment such as this makes it possible to process several million events per month for just a few dollars.

I’m a long way from being a Serverless crusader, and Lambda certainly isn’t for everyone. Nonetheless, cost-saving aside, there are undeniably other benefits of this approach, such as:

- One-click deployment
- Seamless scaling
- Reduced sysadmin overhead
- Reduced code complexity – much of the functionality of the current Snowplow code, such as concurrency, retries, and so on, is delegated to the Lambda execution engine
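On the one-click deployment point: with the Serverless framework, the whole stack can be described in a single `serverless.yml` and pushed with one `serverless deploy`. The sketch below is purely illustrative – the service name, handler paths, runtime, and stream ARN are my assumptions, not the actual project layout:

```yaml
# Hypothetical sketch of a serverless.yml wiring a collector function
# to an HTTP endpoint and an enrich function to a Kinesis stream.
service: snowplow-serverless

provider:
  name: aws
  runtime: nodejs6.10        # assumption; whatever runtime the project targets
  region: eu-west-1

functions:
  collector:
    handler: collector.handle
    events:
      - http:
          path: i            # Snowplow's pixel endpoint path
          method: get
  enrich:
    handler: enrich.handle
    events:
      - stream:              # Kinesis trigger on the raw events stream
          type: kinesis
          arn: arn:aws:kinesis:eu-west-1:123456789012:stream/snowplow-raw
```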

This code is currently extremely experimental, implements a very basic set of Snowplow functionality, and is almost definitely not for production use. In particular, the following Snowplow features are not yet supported:

- Custom Iglu schemas (only Iglu Central events are supported)
- Custom enrichments
- GeoIP enrichment
- Webhooks
- Graceful handling of bad collector requests
- Graceful handling of Kinesis failures
- Snowplow monitoring
- Third-party cookies (network_userid)
- Redirects
- Any sinks other than Kinesis

Nonetheless, I’m pretty happy with it, and think this approach has huge potential, particularly when paired with other serverless AWS features. For example, by forwarding the enriched events stream to Kinesis Firehose, events could be stored in S3 and queried using Amazon Athena for a fraction of a cent per query.
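To make the Athena idea concrete, here is a hypothetical sketch. The table and partition column are made up (they would depend on how Firehose lays out the S3 bucket), though `event_name` is a real field in the Snowplow enriched event model; the cost figure uses Athena's published $5-per-TB-scanned pricing and an assumed scan size:

```python
# Hypothetical Athena query over enriched events landed in S3 via
# Firehose, plus a cost estimate at Athena's $5-per-TB-scanned price.
# Table name, partition column, and scan size are assumptions.

query = """
SELECT event_name, COUNT(*) AS events
FROM snowplow_enriched          -- hypothetical external table over S3
WHERE collector_date = '2017-08-01'
GROUP BY event_name
"""

# Assume date partitioning limits each query to ~100 MB scanned.
bytes_scanned = 100 * 1024**2
cost_per_tb = 5.00
query_cost = bytes_scanned / 1024**4 * cost_per_tb

print(f"approx. ${query_cost:.4f} per query")
```

At that scan size the query does indeed cost a fraction of a cent; larger unpartitioned scans would cost proportionally more.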

There are, no doubt, other angles on this I haven’t considered, and I’d love to get feedback and thoughts from others in the Snowplow community. This is EXTREMELY experimental at the moment (see the README for details) but I’m happy to take it forward if there is an appetite for it.

This is very interesting. I think it could be useful for much more than just charities. Many companies have cost and complexity concerns as well. If you can get these other features working, it would be revolutionary!

@arikfr correct - I’m personally not a big Scala fan either, but the Snowplow shared libraries are written in Scala and make heavy use of Scalaz and functional paradigms which don’t convert well to Java at all (I tried, it wasn’t pretty)

This is really cool. I had a crack at refactoring the stream collector in Node.js a while ago (so it could run on Azure Functions, GCP Cloud Functions, and AWS Lambda) and got most, though not all, of the way there.

I think you’ve hit the nail on the head regarding the utility of Lambda/serverless - the main things I noticed when building the cloud function (at least for the collector) were:

- Some security limitations – the maximum concurrency limit for Lambda functions means the collector is open to very simple denial-of-service attacks, and because of the way API Gateway performs throttling, high load on one API can impact the latency of other, unrelated APIs.

Once serverless has dealt with a few of these growing pains, I think the collector could be well suited to eventually becoming serverless. I suspect the enricher will go serverless too, but it is likely to move towards running on something like Apache Beam/Dataflow, where a warm cache will be a requirement for running a variety of enrichments at lower latencies.