Most App developers are familiar with Firebase – a powerful platform and Software Development Kit backed by Google, which provides a secure database, analytics, messaging, and everything else needed for app development, all in one seamless package.

Firebase provides convenient user profile and event logging functionality, which facilitates app analytics, user attribution, and churn analysis. A distinct feature of Firebase is its ability to log an unlimited number of events comprised of 500 user-defined kinds, and it also provides free analytics.

Firebase also generates predictions and organizes user segmentation based on an event stream (event occurrences): all with full integration.

However, when we base predictions only on event occurrences, it’s possible to miss the context of each event or user. Ideally, we need to analyze both event occurrences and aggregated (‘meta’) event data (i.e., engineer new features).

Additionally, the nested storage format of event and user parameters in Firebase makes it challenging to construct efficient aggregation queries in any database and makes data extraction even more complicated. Luckily, Google’s BigQuery can speed-up feature engineering and analytics, which makes the overall process much more manageable.

Understanding BigQuery

BigQuery is Google’s enterprise analytical data warehouse, which runs blazing-fast SQL queries on gigabytes and petabytes of data. Firebase can export complete event data in its original format to BigQuery daily, where it is processed and analyzed on a massive scale. BigQuery can handle huge feature aggregation queries as long as they have efficient joins, and as long as analysts de-normalize and unnest repeated Firebase data.

Understanding Firebase Data

After setting up a daily export to BigQuery, Firebase creates a new table in specified BigQuery dataset, or in separate ones, if you have Android and iOS versions of your app. As previously mentioned, Firebase exports all data to BigQuery in its native format. This provides user data, such as demographic and campaign data, in a nested structure named user_dim and event data in a structure named event_dim.

The user_dim-event_dim tuples display as repeated rows in small batches per user, which we can easily unroll. However, Firebase also supports custom app events, which can be associated with up to 25 parameters as key-value pairs. Unpacking and analyzing the parameters is a desired target for app analysts, but it’s also a complicated task.

Let’s take a look at the example row of event data for one user, loaded in a single batch from Firebase, shown in Figure 2:

Figure 2: sample event data from Firebase

In the above figure, we see that each (sub)row of event data is associated with a set of repeated key-value pair rows. Additionally, all rows in a batch are associated with a contemporary user profile that has a set of its own parameter key-value records (not shown). This requires writing rather inefficient SQL constructs when you need to unroll and aggregate specific event parameters with the UNNEST function (see Figure 3).

Figure 3: sample aggregate query

Efficient Feature Engineering: Step-By-Step

First, we unpack all user and event properties from repeated rows to serialized JSON. Next, we detach repeated static user data, such as country and language, into a separate lookup table, to focus on events and profile aggregation.

Then, we extract and store all events and associated profiles from batch records on the same denormalized row structure, which contains a JSON field that stores each property. This row structure creates query event data for users and analyzed features on-demand. It also streams user events and constructs features continually.

BigQuery’s UDF forms the basis for the corresponding transform query. This allows us to define the JavaScript function that transforms repeated key-value rows into JSON. Figure 4 shows this query:

Figure 4: Denormalizing transform query, aggregating repeated rows

After this transformation, we define aggregation queries that construct features for every user in the database and their relevant events – all in one single pass.

Since the source table stores the complete set of data, the massive parallelism of BigQuery is, in this case, extremely efficient. For example, we can execute complex queries, (shown in Figure 5), by creating complete user and event features, readied for Data Science or other types of advanced analytics:

DELVE Experts

Efficient Feature Engineering of Event Data in BigQuery

Efficient Feature Engineering of Event Data in BigQuery

Mikalai Tsytsarau, PhD GCP Professional Data Engineer

Most App developers are familiar with Firebase – a powerful platform and Software Development Kit backed by Google, which provides a secure database, analytics, messaging, and everything else needed for app development, all in one seamless package.

Firebase provides convenient user profile and event logging functionality, which facilitates app analytics, user attribution, and churn analysis. A distinct feature of Firebase is its ability to log an unlimited number of events comprised of 500 user-defined kinds, and it also provides free analytics.

Firebase also generates predictions and organizes user segmentation based on an event stream (event occurrences): all with full integration.

However, when we base predictions only on event occurrences, it’s possible to miss the context of each event or user. Ideally, we need to analyze both event occurrences and aggregated (‘meta’) event data (i.e., engineer new features).

Additionally, the nested storage format of event and user parameters in Firebase makes it challenging to construct efficient aggregation queries in any database and makes data extraction even more complicated. Luckily, Google’s BigQuery can speed-up feature engineering and analytics, which makes the overall process much more manageable.

Understanding BigQuery

BigQuery is Google’s enterprise analytical data warehouse, which runs blazing-fast SQL queries on gigabytes and petabytes of data. Firebase can export complete event data in its original format to BigQuery daily, where it is processed and analyzed on a massive scale. BigQuery can handle huge feature aggregation queries as long as they have efficient joins, and as long as analysts de-normalize and unnest repeated Firebase data.

Understanding Firebase Data

After setting up a daily export to BigQuery, Firebase creates a new table in specified BigQuery dataset, or in separate ones, if you have Android and iOS versions of your app. As previously mentioned, Firebase exports all data to BigQuery in its native format. This provides user data, such as demographic and campaign data, in a nested structure named user_dim and event data in a structure named event_dim.

The user_dim-event_dim tuples display as repeated rows in small batches per user, which we can easily unroll. However, Firebase also supports custom app events, which can be associated with up to 25 parameters as key-value pairs. Unpacking and analyzing the parameters is a desired target for app analysts, but it’s also a complicated task.

Let’s take a look at the example row of event data for one user, loaded in a single batch from Firebase, shown in Figure 2:

Figure 2: sample event data from Firebase

In the above figure, we see that each (sub)row of event data is associated with a set of repeated key-value pair rows. Additionally, all rows in a batch are associated with a contemporary user profile that has a set of its own parameter key-value records (not shown). This requires writing rather inefficient SQL constructs when you need to unroll and aggregate specific event parameters with the UNNEST function (see Figure 3).

Figure 3: sample aggregate query

Efficient Feature Engineering: Step-By-Step

First, we unpack all user and event properties from repeated rows to serialized JSON. Next, we detach repeated static user data, such as country and language, into a separate lookup table, to focus on events and profile aggregation.

Then, we extract and store all events and associated profiles from batch records on the same denormalized row structure, which contains a JSON field that stores each property. This row structure creates query event data for users and analyzed features on-demand. It also streams user events and constructs features continually.

BigQuery’s UDF forms the basis for the corresponding transform query. This allows us to define the JavaScript function that transforms repeated key-value rows into JSON. Figure 4 shows this query:

Figure 4: Denormalizing transform query, aggregating repeated rows

After this transformation, we define aggregation queries that construct features for every user in the database and their relevant events – all in one single pass.

Since the source table stores the complete set of data, the massive parallelism of BigQuery is, in this case, extremely efficient. For example, we can execute complex queries, (shown in Figure 5), by creating complete user and event features, readied for Data Science or other types of advanced analytics:

Privacy Overview

This website uses cookies to improve your experience while you navigate through the website. Out of these cookies, the cookies that are categorized as necessary are stored on your browser as they are essential for the working of basic functionalities of the website. We also use third-party cookies that help us analyze and understand how you use this website. These cookies will be stored in your browser only with your consent. You also have the option to opt-out of these cookies. But opting out of some of these cookies may have an effect on your browsing experience.

Necessary cookies are absolutely essential for the website to function properly. This category only includes cookies that ensures basic functionalities and security features of the website. These cookies do not store any personal information.

Any cookies that may not be particularly necessary for the website to function and is used specifically to collect user personal data via analytics, ads, other embedded contents are termed as non-necessary cookies. It is mandatory to procure user consent prior to running these cookies on your website.