The SQL Query Enrichment lets you perform dimension widening on a Snowplow event via a SQL query against a JDBC-capable relational database.

This is an introductory tutorial to help you get started with this enrichment. Our example use case is deliberately simple, but it should be enough to showcase how the enrichment works and give you an idea of how you could apply it in real-world scenarios.

Business goal

Let’s assume that we have a web form containing an email address field. Updates to the form, including form submissions, all result in an event being sent to Snowplow.

We would like to enrich these events with some details related to the owner of the email address; you could say that we want to dimension widen the event using the email address as a key. We are in luck - our company operates an in-house CRM system which stores data in its own MySQL database.

Let’s start by understanding this data we want to join.

Part 1: Understanding our data for joining

Our company’s database has a table called users which contains records for almost all relevant email addresses. This is the query that was used to create the users table in MySQL:
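The original CREATE TABLE statement is not reproduced here; below is a minimal sketch of what it might have looked like. Only the email lookup key and the date_of_birth column are referenced elsewhere in this tutorial - the remaining column names are illustrative assumptions:

```sql
-- Hypothetical recreation of the users table in MySQL.
-- Only email and date_of_birth are referenced elsewhere in this tutorial;
-- the other columns are illustrative assumptions.
CREATE TABLE users (
  id            INT          NOT NULL AUTO_INCREMENT,
  email         VARCHAR(255) NOT NULL,
  first_name    VARCHAR(255),
  last_name     VARCHAR(255),
  date_of_birth DATE,
  PRIMARY KEY (id),
  UNIQUE KEY (email)
);
```

The unique key on email matters later: it is what lets us expect at most one matching row per lookup.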

The SQL Query Enrichment is quite powerful - it can support retrieving data from a join across multiple tables. In our case, however, all the data we are after is contained within this single table - here’s a row:
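A record in this table might look like the following (the values are invented purely for illustration):

```
id | email            | first_name | last_name | date_of_birth
---+------------------+------------+-----------+--------------
 1 | jdoe@example.com | Jane       | Doe       | 1984-07-10
```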

Unfortunately, we can’t use Form Tracking’s submit_form event, because the data from all of the form’s fields is populated into a single value field. This is not something we can work with yet.

Instead, we’ll use the change_form event, which is fired for every changed form field and populates the value of the changed field in its value field, per the JSON schema for this event.

Note that if we had lots of fields in our form, we could trigger many fruitless lookups of non-email fields in our users table; for the purposes of this tutorial, let’s assume our form consists only of the email field.

Enabling Snowplow’s Form Tracking turns on both submit_form and change_form event tracking. After processing, the change_form event will end up in its own dedicated table in Redshift, called com_snowplowanalytics_snowplow_change_form_1.

Great, we now know the data that we are looking up, and where our source events are coming from; next let’s put together a schema to store the data we are going to look up.

Part 3: Preparing our schema

Snowplow will pass change_form events to the SQL Query Enrichment to look up the email address in each event against our users table in MySQL. If a match is found, the matching record from MySQL will be converted into JSON and added as an additional context to the derived_contexts field of the enriched event.

For this to go smoothly, we need to define the JSON Schema for our MySQL record. Basing this closely on the underlying users MySQL table, here is our JSON Schema:
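The schema itself is not shown above; assuming the vendor and name implied by the igluctl output later in this tutorial (ca.properweb/user, version 1-0-0) and the illustrative users columns, it might look like this:

```json
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Schema for a user record looked up in our CRM (property names beyond email and date_of_birth are illustrative assumptions)",
  "self": {
    "vendor": "ca.properweb",
    "name": "user",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "email": { "type": "string" },
    "first_name": { "type": ["string", "null"] },
    "last_name": { "type": ["string", "null"] },
    "date_of_birth": { "type": ["string", "null"] }
  },
  "additionalProperties": false
}
```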

Now, we are going to use igluctl to generate the corresponding Redshift and JSON Paths file to allow us to load these contexts into Redshift.

We run the following command:

$ igluctl static generate --with-json-paths schemas/

On a successful run, the output looks like this:

File [/Users/ihor/workshop/tutorial/sql-enrichment/./sql/ca.properweb/user_1.sql] was written successfully!
File [/Users/ihor/workshop/tutorial/sql-enrichment/./jsonpaths/ca.properweb/user_1.json] was written successfully!

Let’s take a look at the generated files. First, here is the Redshift DDL to create the table for our looked-up users:
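The generated file is not reproduced here; below is a simplified sketch of the kind of DDL igluctl typically emits for such a schema. The shredded-table metadata columns (schema_vendor through ref_parent) are standard for Snowplow shredded tables; the data columns mirror our assumed users columns, and column encodings are omitted for brevity:

```sql
-- Simplified sketch of igluctl-generated Redshift DDL for ca.properweb/user 1-0-0.
-- Data columns mirror the assumed users columns from earlier.
CREATE TABLE IF NOT EXISTS atomic.ca_properweb_user_1 (
  "schema_vendor"  VARCHAR(128)  NOT NULL,
  "schema_name"    VARCHAR(128)  NOT NULL,
  "schema_format"  VARCHAR(128)  NOT NULL,
  "schema_version" VARCHAR(128)  NOT NULL,
  "root_id"        CHAR(36)      NOT NULL,  -- joins back to atomic.events.event_id
  "root_tstamp"    TIMESTAMP     NOT NULL,
  "ref_root"       VARCHAR(255)  NOT NULL,
  "ref_tree"       VARCHAR(1500) NOT NULL,
  "ref_parent"     VARCHAR(255)  NOT NULL,
  "email"          VARCHAR(255),
  "first_name"     VARCHAR(255),
  "last_name"      VARCHAR(255),
  "date_of_birth"  VARCHAR(255)
)
DISTSTYLE KEY
DISTKEY (root_id)
SORTKEY (root_tstamp);
```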

The enrichment’s configuration is made up of several sections: input, specifying how to extract the lookup key(s) from the enriched event; database, holding the connection details; query, containing the SQL statement to run; output, allowing us to tune how we convert the returned row(s) into one or more self-describing JSONs ready to be attached to our Snowplow event; and cache, controlling how lookups are cached.

Let’s look at each in turn.

input

Each input field is specified as either pojo, if the datapoint comes from the Snowplow enriched event POJO, or json, if the datapoint comes from a self-describing JSON inside one of the three JSON fields, namely contexts, unstruct_event or derived_contexts.

Any field from the enriched event can be used as a key to query the database; in other words, any field can drive dimension widening with the SQL Query Enrichment.

In our scenario, the value field of the change_form event can be found in the unstruct_event field of the enriched event. We can express this in our configuration using the following coordinates:
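The configuration snippet is not included above; assuming the standard inputs format for this enrichment, the coordinates would look something like the following sketch (the 1-0-* version range in schemaCriterion is an assumption):

```json
"inputs": [
  {
    "placeholder": 1,
    "json": {
      "field": "unstruct_event",
      "schemaCriterion": "iglu:com.snowplowanalytics.snowplow/change_form/jsonschema/1-0-*",
      "jsonPath": "$.value"
    }
  }
]
```

The placeholder index (1) ties this input to the first ? in the SQL query, as described in the query section below.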

database

This section holds the connection details for the database we are querying. We strongly recommend that the username have minimal read-only permissions on just the entities required to execute the SQL query.

If your database server has additional authentication in place, such as IP whitelisting, you will need to configure this security to permit access from all of your servers running the Snowplow Enrichment process.
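For reference, the database section of the configuration might look like the following sketch; the host, database name, username and password are placeholder assumptions, not values from this tutorial:

```json
"database": {
  "mysql": {
    "host": "crm.example.internal",
    "port": 3306,
    "sslMode": false,
    "database": "crm",
    "username": "snowplow_ro",
    "password": "..."
  }
}
```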

query

This is where the placeholder parameter from the last section comes into play. The ? symbol is a placeholder for the key(s) defined in the input section. The query is implemented as a prepared statement: a precompiled query used to execute the same or similar database statements repeatedly with high efficiency. It takes the form of a template into which certain constant values are substituted during each execution.

In our example, we have only one placeholder, but there could be more than one; you would then have additional entries in the inputs array.
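Putting this together for our use case, the query section might look like this sketch (the column list is assumed from the illustrative users table earlier):

```json
"query": {
  "sql": "SELECT email, first_name, last_name, date_of_birth FROM users WHERE email = ? LIMIT 1"
}
```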

Some further notes on the behaviour of sql:

If a placeholder index required in the sql prepared statement is not found when searching the enriched event for the inputs, then the query will not proceed, but this will not be flagged as a failure

A final ; is optional

This enrichment makes no attempt to sanitize the SQL statement, nor to verify that the SQL statement does not have harmful side effects (such as SQL injection)

output

The enrichment adds the returned row(s) into the derived_contexts field within a Snowplow enriched event. Because all JSONs in the derived_contexts field must be self-describing JSONs, we use the schema field to specify the Iglu schema URI of the schema that describes the new contexts (which we showed you earlier in Part 3: Preparing our schema).

The expectedRows property dictates how many rows we expect the query to return, and how to treat the event if that expectation is not met:

EXACTLY_ONE - exactly one row is expected. 0 rows, or 2 or more rows, will throw an error, causing the entire event to fail to process

AT_MOST_ONE - either one or zero rows is expected. 2 or more rows will throw an error

AT_LEAST_ZERO - between 0 and N rows are expected - i.e. we are dealing with an array of results

AT_LEAST_ONE - between 1 and N rows are expected - i.e. a non-empty array of results. 0 rows will throw an error

For our purposes, we use AT_MOST_ONE because we are expecting to find either a single matching user in our MySQL users table or no match.

The describes property dictates whether the schema is the self-describing schema for all rows returned by the query, or whether the schema should be attached to each of the returned rows:

ALL_ROWS means that the schema should encapsulate all returned rows - i.e. one context will always be added to derived_contexts, regardless of how many rows are returned. Each returned row will be an array element within the self-describing JSON

EVERY_ROW means that the schema should be attached to each returned row - so e.g. if 3 rows are returned, 3 contexts with this same schema will be added to derived_contexts

In our case, the JSON Schema we created represents a single user, not an array of users, so we use EVERY_ROW.

The propertyNames property supports reformatting of the returned columns to fit the JSON Schema’s conventions better. Supported options are:

AS_IS - preserve the column names exactly as they are

CAMEL_CASE - so date_of_birth becomes dateOfBirth

PASCAL_CASE - so date_of_birth becomes DateOfBirth

SNAKE_CASE - so dateOfBirth becomes date_of_birth

LOWER_CASE - changes all characters to lowercase

UPPER_CASE - changes all characters to uppercase

If these options aren’t bespoke enough, remember that you can use column aliasing in your SQL statement to tweak individual column names.

For the purposes of our tutorial, the regular MySQL column names are fine.
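Combining the choices above, the output section of our configuration might look like the following sketch (the placement of expectedRows alongside the json block reflects the enrichment’s standard configuration layout):

```json
"output": {
  "expectedRows": "AT_MOST_ONE",
  "json": {
    "schema": "iglu:ca.properweb/user/jsonschema/1-0-0",
    "describes": "EVERY_ROW",
    "propertyNames": "AS_IS"
  }
}
```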

cache

A Snowplow enrichment can run many millions of times per hour, effectively launching a DoS attack on a data source if we are not careful! The cache configuration attempts to minimize the number of lookups performed.

The cache is an LRU (Least Recently Used) cache, where less frequently accessed values are evicted to make space for new values. The enrichment uses the query, with its placeholders filled in, as the cache key, and stores the query results as the cached values. You can configure the cache as follows:

size is the maximum number of entries to hold in the cache at any one time

ttl is the number of seconds that an entry can stay in the cache before it is forcibly evicted. This is useful to prevent stale values from being retrieved in the case that your database can return different values for the same key over time
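Putting the two settings together, a cache configuration might look like this sketch (the values are illustrative, not recommendations):

```json
"cache": {
  "size": 1000,
  "ttl": 60
}
```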

Part 5: Running and testing our enrichment

Thanks for sticking with this tutorial so far - we are now ready to put our enrichment into action.

To recap, we are dimension widening our Snowplow enriched events by using the value property in our change_form self-describing events to look up rows from a users table we are keeping in MySQL.

The email address expressed in the value property in our change_form will be used as the key to query the MySQL database during the Snowplow Enrichment process. The new data we want to obtain could be retrieved with the following SQL query:
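That query, assuming the illustrative users columns from earlier, might be:

```sql
-- The ? placeholder is filled with the email address extracted
-- from the change_form event's value field
SELECT email, first_name, last_name, date_of_birth
FROM users
WHERE email = ?
LIMIT 1
```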

We do not expect more than one record to be returned because each email address should be unique within the users table. The new data will be added as a self-describing JSON to the derived_contexts array within our event.

We put our SQL Query Enrichment configuration live in our Snowplow installation and wait for a successful run (if batch) or for a few minutes to pass (if real-time).

Looking in Redshift, we are expecting to see the change_form event in the atomic.events table, the actual value for the captured email address in the atomic.com_snowplowanalytics_snowplow_change_form_1 table and finally the new (enriched) data obtained from our MySQL database in the atomic.ca_properweb_user_1 table.

We should be able to link the last two to the atomic.events table by means of the event_id = root_id relation:
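A sketch of such a query; the user columns selected here are the assumed ones from earlier in this tutorial:

```sql
-- Join the change_form context and the enriched user context
-- back to the parent event via event_id = root_id
SELECT
  ev.event_id,
  cf.value AS email_entered,
  u.first_name,     -- assumed column
  u.date_of_birth   -- assumed column
FROM atomic.events ev
JOIN atomic.com_snowplowanalytics_snowplow_change_form_1 cf
  ON ev.event_id = cf.root_id
JOIN atomic.ca_properweb_user_1 u
  ON ev.event_id = u.root_id;
```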


Finally, as always, do please give this a go and share your experience and use cases with us in the comments.