I’m working on revamping an existing analytics flow. We use Snowplow. The pain point is keeping events consistent across different client implementations. For example, a screen may be called RegistrationActivity on one client and LoginController on another, or an event may be button_edit_click or edit_button_click. So, naturally, one might wish to enumerate the acceptable values for events in a schema.

From my initial understanding, it appears Snowplow supports custom schema-validated events. However, I’d like to add stricter validation to the built-in events, e.g. screen and track. Is this possible?

I guess the alternative is to track the world and build more complex queries to sift through all the data (the approach Snowplow advocates). It’s more enticing the more I think about this…

It’s not possible to add stricter validation to the built-ins. Our structured event is modeled after a Google Analytics event, and these are deliberately very “loosely typed”. The screen view event is similarly permissive. But if you look at the screen view event’s schema:
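For reference, the built-in screen view schema (iglu:com.snowplowanalytics.snowplow/screen_view/jsonschema/1-0-0) looks roughly like the following (reproduced from memory, so check Iglu Central for the canonical version) - note how permissive it is, with name just a free-form string:

```json
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Schema for a screen view event",
  "self": {
    "vendor": "com.snowplowanalytics.snowplow",
    "name": "screen_view",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "id": { "type": "string" }
  },
  "required": ["name"],
  "additionalProperties": false
}
```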

From the schema you can see that it would be possible to create your own com.dcow version of the screen view, which you could make much more strongly typed. For example, instead of allowing the name of the screen to be a free-form string, you could make it a JSON Schema enum and thus enforce that the screen name comes from a pre-agreed list of legal values.
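As a concrete sketch (the vendor, screen names, and version here are illustrative):

```json
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Strongly typed screen view, restricted to a pre-agreed list of screens",
  "self": {
    "vendor": "com.dcow",
    "name": "screen_view",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "name": { "enum": ["registration", "login", "settings"] },
    "id": { "type": "string" }
  },
  "required": ["name"],
  "additionalProperties": false
}
```

Any event whose name falls outside the enum would then fail validation in the pipeline and land in your bad rows, rather than silently polluting your data.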

Validation only happens in the ETL layer. This ensures that all validation failures are captured within the Snowplow pipeline - if a client were to do validation before sending, then there would be nowhere for a validation failure to go…

This said, in a strongly typed environment like Android or Obj-C, there’s no reason why you couldn’t mandate that all self-describing events and contexts should be created via pre-defined classes/structs (with a helper method to convert them into JSON dicts). This would give you compile time guarantees around all of your entities. It’s not something we’ve tried - let us know how you get on if you give it a go!
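A minimal sketch of what that could look like in Java on Android - the class, enum values, and schema URI are made up for illustration, and the actual tracker call is elided:

```java
// Hypothetical strongly typed screen-view event: the compiler, not the
// pipeline, guarantees the screen name comes from the pre-agreed list.
public class ScreenViewEvent {

    // The pre-agreed list of legal screen names.
    public enum Screen {
        REGISTRATION("registration"),
        LOGIN("login"),
        SETTINGS("settings");

        private final String jsonName;

        Screen(String jsonName) {
            this.jsonName = jsonName;
        }
    }

    private final Screen screen;

    public ScreenViewEvent(Screen screen) {
        this.screen = screen;
    }

    // Helper that converts the event into a self-describing JSON string,
    // ready to hand to the tracker.
    public String toSelfDescribingJson() {
        return "{\"schema\":\"iglu:com.dcow/screen_view/jsonschema/1-0-0\","
             + "\"data\":{\"name\":\"" + screen.jsonName + "\"}}";
    }

    public static void main(String[] args) {
        // A typo like Screen.REGISTRATON simply doesn't compile,
        // so the invalid event can never be sent in the first place.
        System.out.println(new ScreenViewEvent(Screen.LOGIN).toSelfDescribingJson());
    }
}
```

The JSON-building here is deliberately crude; in practice you’d use a JSON library, but the point is that the set of constructible events is fixed at compile time.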

It’s an idea I’ve been juggling, but I’m not sure it’s smart, for exactly the reasons you mention. At most you could add client-side logging when an invalid JSON appears. The real win would be compile-time validation of event structures, as you suggest. But once you start going down that road you want to generate those data classes from your JSON schemas anyway. A tool that lets you feed self-describing JSON into existing JSON code-generator utilities might be useful.

Exactly - if you do runtime validation, you still have to somehow get the validation failures to a back-end for analysis, which would involve adding some other kind of “logging side-pipe”. It’s easier just to pass them un-validated to the Snowplow pipeline and get all the failure reporting in one place. Plus it means you have the option of recovering the failures, using Hadoop Event Recovery.

dcow:

But once you start going down that road you want to generate those data classes from your JSON schemas anyway.

Yes - we have plans for the Iglu registry to handle that auto-generation itself - e.g. for Android/Java/Scala, your Iglu registry would also host a Maven-compatible repository containing POJOs/Scala case classes for all your entities. The ticket to follow here is Placeholder for Maven repo inside Iglu Server #88. This is useful both for enforcing correctness in your tracking instrumentation and for making analytics at the other end easier (e.g. writing AWS Lambda functions that operate on the data).