Introduction

Before we can predict one's Beast Mode Quotient, we need an easy way to ingest data from multiple sources, in this case flat files and APIs. Luckily, HDF, and specifically NiFi, is designed to handle multiple data inputs and outputs extremely efficiently, as I will detail in this tutorial.

A few notes about the choices of technologies selected for this tutorial and series of articles:

Target - MySQL: I selected MySQL because this is what my BMQ app is going to use, but the target could be anything depending on your use case.

Sources - Fitbit, Strava, MyFitnessPal: There are plenty of options out there to track health. Fitbit and Strava are the only ones offering a comprehensive, public API (looking at you, Garmin and Apple Health: neither has a public API, and both make it hard to extract data via file, either requiring you to parse .fit files or only exporting the full data set). As for MyFitnessPal, I could have used the Fitbit API (after syncing with MFP) to get calorie data, but I figured it would be nice to demonstrate file ingestion.

Notions you will learn in this tutorial

Here are some high-level concepts highlighted in this article that you can re-apply to any implementation:

Running MySQL queries including create, upsert and selection of date series

Section 1: Setting up a MySQL DB to host health and fitness data

Step 1: Create the MySQL DB and users

Before being able to do anything, you will have to create a MySQL instance. There are plenty of tutorials out there explaining how to do it; here is one for CentOS 7, for instance. Once your instance is set up, connect as root and run the following database creation script:
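The original script is not reproduced here; a minimal sketch of what it might look like follows. The database name, user, table, and column names are assumptions, inferred from the attributes used later in this tutorial (the primary key on day and meal is what makes the later upsert work):

```sql
-- Hypothetical sketch: database, NiFi user, and one table for MyFitnessPal data.
-- All names are assumptions; adapt them to your own schema.
CREATE DATABASE IF NOT EXISTS BMQ;

CREATE USER 'nifi'@'%' IDENTIFIED BY '[A_STRONG_PASSWORD]';
GRANT ALL PRIVILEGES ON BMQ.* TO 'nifi'@'%';
FLUSH PRIVILEGES;

USE BMQ;

CREATE TABLE IF NOT EXISTS MFP_DIARY (
    DIARY_DAY         DATE          NOT NULL,
    MEAL              VARCHAR(50)   NOT NULL,
    TOTAL_CALORIES_IN INT,
    CARB              DECIMAL(10,2),
    FAT               DECIMAL(10,2),
    PROT              DECIMAL(10,2),
    PRIMARY KEY (DIARY_DAY, MEAL)
);
```

Similar tables would hold the Fitbit and Strava data ingested in the later sections.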

Section 2: Creating a NiFi flow to consume MyFitnessPal CSV exports

Step 1: Survey folder for new CSV Files

List Files in MFP Folder: ListFile Processor

This processor monitors a folder for new files based on timestamp, and maintains the last file read in its state, which you can see by right-clicking the processor and selecting View state:

Here is a list of properties to configure that are not default:

Input directory: [YOUR_LOCAL_PATH]

File filter: .*

Run Schedule: 12 h

Get Latest File: FetchFile Processor

This processor gets the file from the list generated previously and sends it in a flow file. It is configured using default properties.

Step 2: Convert CSV to JSON

Update Schema Name: UpdateAttribute Processor

This processor updates the schema name attribute (later used by next processors and controller services).

HortonworksSchemaRegistry Controller Service

This controller service is pointing to your HDF Schema Registry. The only non-default property to configure in the controller service before enabling is:

Schema Registry URL: http://[YOUR_SCHEMA_REGISTRY_HOST]:7788/api/v1/

CSVReader Controller Service

This controller service is going to read your CSV and use your Hortonworks Schema Registry to parse its schema. The following non-default properties should be configured in the controller service before enabling it:

Schema Access Strategy: Use 'Schema Name' Property

Schema Registry: HortonworksSchemaRegistry

Treat First Line as Header: true

Ignore CSV Header Column Names: true

JsonRecordSetWriter Controller Service

This controller service is going to write your JSON output based on your Hortonworks Schema Registry. The following non-default properties should be configured in the controller service before enabling it:

Schema Write Strategy: Set 'schema.name' Attribute

Schema Access Strategy: Use 'Schema Name' Property

Schema Registry: HortonworksSchemaRegistry

Convert CSV to JSON: ConvertRecord Processor

This processor does the record conversion. Configure it with your enabled controllers as such:

Record Reader: CSVReader

Record Writer: JsonRecordSetWriter

Step 3: Get fields needed for insert

SplitJson Processor

This processor splits each meal in your export into its own flow file, using the following non-default property:

JsonPath Expression: $.[*]

Extract Relevant Attributes: EvaluateJsonPath Processor

This processor extracts the relevant elements of your split JSON into attributes. It is configured using the following:

Destination: flowfile-attribute

carb: $.carbohydrates_g

diary_day: $.date

fat: $.fat_g

meal: $.meal

prot: $.protein_g

total_calories_in: $.calories
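For reference, a split flow file at this point would contain a single record shaped roughly like the following (the field values are illustrative, but the field names match the JsonPath expressions above):

```json
{
  "date": "2018-01-15",
  "meal": "breakfast",
  "calories": 420,
  "carbohydrates_g": 55,
  "fat_g": 12,
  "protein_g": 21
}
```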

Step 4: Insert Data to MySQL

Create Upsert Query: ReplaceText Processor

This processor creates the query to be executed on your MySQL server to upsert the different lines. Here is the ReplacementValue you should use:
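The original ReplacementValue is not reproduced here; a sketch of what it could look like follows, combining the attributes extracted above with MySQL's INSERT ... ON DUPLICATE KEY UPDATE (the table and column names are assumptions, and the ${...} references are NiFi Expression Language reading the flow file attributes):

```sql
INSERT INTO MFP_DIARY (DIARY_DAY, MEAL, TOTAL_CALORIES_IN, CARB, FAT, PROT)
VALUES ('${diary_day}', '${meal}', ${total_calories_in}, ${carb}, ${fat}, ${prot})
ON DUPLICATE KEY UPDATE
    TOTAL_CALORIES_IN = VALUES(TOTAL_CALORIES_IN),
    CARB = VALUES(CARB),
    FAT  = VALUES(FAT),
    PROT = VALUES(PROT);
```

Because the table's primary key is the day/meal pair, re-running the flow on the same export updates rows instead of duplicating them.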

Upsert into MySQL: PutSQL Processor

This processor executes the query generated in the ReplaceText processor, and relies on your DBCPConnectionPool controller service, configured as such:

JDBC Connection Pool: DBCPConnectionPool

Section 3: Creating a NiFi flow to consume Fitbit health & sleep data

The goal is to set up the following flow (separated into two screenshots, so you know it's going to be fun). It assumes that you registered an application with the Fitbit API and have a refresh/bearer token ready with the right permissions (see details in the Fitbit documentation).

Step 1: Get Delta parameters

Get Latest Date: ExecuteSQL Processor

This processor runs a query to get the series of dates not yet in the BMQ DB, in order to call the Fitbit API for each of them. It relies on the DBCPConnectionPool configuration detailed above (set in the Database Connection Pooling Service property, with a 12 h Run Schedule). Here is the query it should run:
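The query itself is not reproduced here; one common MySQL pattern for selecting a series of dates newer than the latest day already stored is sketched below. The FITBIT_SUMMARY table, its DIARY_DAY column, and the 100-day lookback are assumptions; the list_date alias matches the JsonPath used further down.

```sql
-- Sketch: generate the last 100 calendar days with a digit cross join,
-- then keep only the days newer than the latest one already in the DB.
SELECT d.list_date
FROM (
    SELECT CURDATE() - INTERVAL (u.i + t.i * 10) DAY AS list_date
    FROM (SELECT 0 AS i UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4
          UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) u
    CROSS JOIN
         (SELECT 0 AS i UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4
          UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) t
) d
WHERE d.list_date > (SELECT COALESCE(MAX(DIARY_DAY),
                                     CURDATE() - INTERVAL 100 DAY)
                     FROM FITBIT_SUMMARY)
ORDER BY d.list_date;
```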

I realize this is not the only method to generate such a list (GenerateTableFetch, etc.), but it makes sure I only run what the DB needs, even if something else changes the DB.

ConvertAvroToJSON Processor

This processor converts the Avro response of the previous processor to JSON. Make sure you configure the following property to avoid splitting errors when only one record is returned:

Wrap Single Record: true

SplitJson Processor

This processor splits each record of the query result into its own flow file, using the following non-default property:

JsonPath Expression: $.*

Extract Relevant Attributes: EvaluateJsonPath Processor

This processor extracts the relevant elements of your split JSON into attributes. It is configured using the following:

Destination: flowfile-attribute

DIARY_DAY: $.list_date

Step 2: Handle Token Refresh

DistributedMapCacheServer Controller Service

This controller service is the server that will run the distributed cache. It should be configured using all default properties before being enabled.

DistributedMapCacheClientService Controller Service

This controller service connects to the DistributedMapCacheServer configured previously. Before enabling it, just configure the following:

Server Hostname: [YOUR_HOSTNAME]

Get Token From Cache: FetchDistributedMapCache Processor

This processor will try to fetch a refresh token from the DistributedMapCacheServer. If it is not found, we send the flow file to an UpdateAttribute processor to set a default value (e.g. for the first run); otherwise, we send it to the InvokeHTTP processor to get a new token. Here are the non-default properties to be configured:

Cache Entry Identifier: refresh-token-fitbit

Distributed Cache Service: DistributedMapCacheClientService

Put Cache Value In Attribute: CurrentRefreshToken

A few notes: this configuration requests a refresh token on every run, which works with my 12 h run schedule. Ideally, you would store the expiration in the distributed cache as well and only request a refresh token when needed, but I didn't want to overcomplicate things. Moreover, in a production environment, I would recommend persisting these tokens on disk, not only in memory, in case of failure.

Use Default Value if Not Found: UpdateAttribute Processor

This processor updates the CurrentRefreshToken if FetchDistributedMapCache returns not-found. Configure it using this property:

CurrentRefreshToken: [A_VALID_REFRESH_TOKEN]

Refresh Fitbit Token: InvokeHTTP Processor

This calls the Fitbit API to get a new valid bearer token and a refresh token. Configure the following non-default properties to run it:
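The original property list is not reproduced here; based on Fitbit's OAuth 2.0 documentation, the configuration could look roughly like the following sketch. Passing the form parameters on the URL is just one way to do it with InvokeHTTP, and the Basic credential is the base64 encoding of your own registered application's client_id:client_secret:

```
HTTP Method: POST
Remote URL: https://api.fitbit.com/oauth2/token?grant_type=refresh_token&refresh_token=${CurrentRefreshToken}
Content-Type: application/x-www-form-urlencoded
Authorization (dynamic property): Basic [BASE64_OF_CLIENT_ID:CLIENT_SECRET]
```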

Extract Token: EvaluateJsonPath Processor

This processor extracts the bearer and refresh tokens from the HTTP response, then feeds two routes: one storing the refresh token back in the cache, the other using the bearer token to call the Fitbit API. It is configured using the following:

Destination: flowfile-attribute

BEARER_TOKEN: $.access_token

REFRESH_TOKEN: $.refresh_token

Put Refresh Token in Flowfile: ReplaceText Processor

This processor puts the refresh token in the flow file content so it can be consumed by the next processor. Here is the ReplacementValue to use:
${REFRESH_TOKEN}

PutDistributedMapCache Processor

This processor takes the refresh token in the flowfile and stores it under refresh-token-fitbit for next time the flow is executed:

Cache Entry Identifier: refresh-token-fitbit

Distributed Cache Service: DistributedMapCacheClientService

Step 3: Call Fitbit APIs (Health and Sleep)

For this step, I will only detail one processor. First because, as you can see on the screenshot above, both Get FitBit Daily Summary processors are very similar (they just call different endpoints of the Fitbit API). Second, because the flow Extract Relevant Attributes > Create SQL Query > Upsert to MySQL has been described above. Finally, because I will keep Calculate Averages and Sums for the next flow, which I run for Strava. All detailed processors can be found here:
feed-fitbit-anonymized.xml

Get FitBit Daily Summary: InvokeHTTP Processor
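The original property list is not reproduced here; based on the Fitbit Web API, a sketch of this processor's configuration could look like the following. The BEARER_TOKEN attribute name is an assumption; use whichever attribute your Extract Token processor populates from $.access_token:

```
HTTP Method: GET
Remote URL: https://api.fitbit.com/1/user/-/activities/date/${DIARY_DAY}.json
Authorization (dynamic property): Bearer ${BEARER_TOKEN}
```

The sleep variant would call https://api.fitbit.com/1.2/user/-/sleep/date/${DIARY_DAY}.json instead, with the same authorization header.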

Section 4: Creating a NiFi flow to consume Strava activity data

As in the previous section, the goal is to set up a flow calling Strava's activity API. It assumes that you registered an application with the Strava API and have a refresh/bearer token ready with the right permissions (see details in the Strava documentation). Unlike Fitbit, the refresh token remains the same, so there is no need for a distributed cache. I'm only going to detail the Convert Units processor, to talk about the NiFi Expression Language. The rest of the flow can be found here: feed-strava-anonymized.xml

Convert Units: UpdateAttribute Processor

This processor uses some of the extracted data to convert units before putting them into attributes later used in your upsert query. Here are the calculations executed:

AVG_HR: ${AVG_HR:isEmpty():ifElse(0.0, ${AVG_HR})} (makes sure we have a value when the API does not return HR, i.e. when HR is not present in the activity)
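Beyond the HR default above, the same processor can normalize Strava's metric units with Expression Language math functions (Strava reports distance in meters and time in seconds). The attribute names below are assumptions for illustration:

```
DISTANCE_MI: ${DISTANCE:toDecimal():divide(1609.34)}
MOVING_TIME_MIN: ${MOVING_TIME:toDecimal():divide(60)}
```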