Subscribe to the Stitch Newsletter

Google Sheets + Stitch: An easy way to sync your small data to Redshift

A few months back, I found myself in a situation I’m sure someone at your company has been in. We had some goal data in Google Sheets that we wanted to bring into Redshift to join up to our sales data and trend against. This is small data – maybe a few dozen rows. Small as it is, it was important and we needed to join it with the rest of our data.

It would have been easy to populate a table with those goals directly in Redshift, but then our CFO would have had to manually update it in SQL, which would have been a harder task. So I spent a few hours coding up a Google Sheets integration for Stitch that makes loading to Redshift trivial. I’ll explain how it works.

Feel free to skip ahead to the repo. Please note: I am not a programmer by trade and pull requests are welcome!

I knew that there were a few key things that this script needed to do:

Allow for nontechnical users to set up and use

Grab all the data from a sheet and format as JSON

Conform to the Stitch API for JSON formatting, datatyping, and records per request

Step 1: Prepare your sheet to load data to your data warehouse

The easiest way for a general user to set up and use the script is to add a dropdown to the menu. Google Sheets makes this easy with the addMenu function.

I added a menu that has a two options that execute functions:

Sync to Stitch

Set up spreadsheet for sync

The onOpen function makes sure the menu dropdown loads every time the spreadsheet loads, and it contains the options for the other functions.

The onInstall function has prompts to set up the spreadsheet and current sheet for syncing. The Stitch API requires an API token and the client ID of the Stitch account. You will also need to define the primary key(s) for the sheet, so Stitch can determine what makes each row unique. Users can just click cancel on the API token and CID fields if they’re already added for additional sheets.

In the code below I use the normalizeHeaders function, which is defined later on, but basically makes sure that the sheet keys are formatted for the API and Redshift.

Step 2: Grab data from your sheet

In Google Scripts, you can define specific ranges of cells based on coordinates of the sheet. Below, you can see how we can get all of the data for a sheet into a JavaScript object. I took this code from a section of Google Scripts documentation about sheets-to-JSON that seems to no longer be available. I edited it a bit to account for spaces in headers and to make sure the headers are alphabetical.

Before we can push all of the data, we need to account for the 10K record limit per request that the Stitch API enforces. While most spreadsheets I’ve worked with are not larger than that limit, you never know. The push function includes a simple conditional statement that takes care of the initial breaking apart of large docs vs. small docs, while getting the relevant information staged for the other functions.

I’m only going to focus on the smalldoc section, because most of the code is the same, and it doesn’t need to be repeated.

There’s one function that handles the push, and one that handles the formatting of the record for the push, both shown below. The code makes one call out to transform the JavaScript object to Transit format, which is like JSON enriched with data type encoding. The Stitch API can accept Transit, and will use the encoding to have data show up in your data destination with encoding intact.

At the top of the script, there’s a large minified library, which is the JavaScript Transit library. We use a simple function to transform the data to Transit.

Additional notes

I haven’t explained the sequence_id and why it’s so important. You could theoretically push two versions of the same record within milliseconds of each other. Stitch, which is a distributed system in AWS, has to make sure that your latest update is the one of record. The chart below illustrates why this matters.

The easiest way to handle sequencing is to assign a Unix timestamp as the sequence ID, which is shown in the insertKeys function above. Later syncs will have a higher number, which Stitch will account for.

The full script, which accounts for sheets larger than 10K records, can be found here. Please let me know if you’d like any clarification on anything above, and please open an issue on Github if you have any problems using this script.