Connecting a Jupyter Notebook to Snowflake through Python (Part 3)

In part two of this four-part series, we learned how to create a Sagemaker Notebook instance. In part three, we’ll learn how to connect that Sagemaker Notebook instance to Snowflake.

If you’ve completed the steps outlined in part one and part two, the Jupyter Notebook instance is up and running and you have access to your Snowflake instance, including the demo dataset. Now, you’re ready to connect the two platforms.

The first step is to open the Jupyter service using the link on the Sagemaker console.

There are two options for creating a Jupyter Notebook. You can create the notebook from scratch by following the step-by-step instructions below, or you can download sample notebooks here. If you decide to build the notebook from scratch, select the conda_python3 kernel. Alternatively, if you decide to work with a pre-made sample, make sure to upload it to your Sagemaker notebook instance first.

There are several options for connecting Sagemaker to Snowflake. The simplest way to get connected is through the Snowflake Connector for Python. By the way, the connector doesn’t come pre-installed with Sagemaker, so you will need to install it through the Python Package manager. (Note: As of the writing of this post, the Snowflake Python connector has a dependency to C foreign function interface (CFFI), so it requires a minimum version of 1.11). Sagemaker, on the other hand, comes preinstalled with libcffi 1.10, which unfortunately causes the Snowflake Python connector to fail.

The step outlined below automatically detects the failure and, if necessary, triggers de-installation and re-installation of the most recent version of libcffi. Note:If re-installation of the CFFI package is necessary, you must also restart the Kernel so the new version is recognized.

This problem is persistent and needs to be resolved each time the Sagemaker server is shut down and restarted. The Jupyter Kernel also needs to be restarted after each shutdown. The good news is that the cffi package will eventually be updated on the Sagemaker AMI, and the step indicated above will begin to automatically recognize the update.

The next step is to connect to the Snowflake instance with your credentials.

Here you have the option to hard code all credentials and other specific information, including the S3 bucket names. However, for security reasons, it’s advisable to not store credentials in the notebook. Another option is to enter your credentials every time you run the notebook.

Rather than storing credentials directly in the notebook, I opted to store a reference to the credentials. The actual credentials are automatically stored in a secure key/value management system called AWS Systems Manager Parameter Store (SSM).

With most AWS systems, the first step requires setting up permissions for SSM through AWS IAM. Please ask your AWS security admin to create another policy with the following Actions on KMS and SSM.

Adhering to the best-practice principle of least permissions, I recommend limiting usage of the “Actions by Resource”. Also, be sure to change the region and accountid in the code segment shown above or, alternatively, grant access to all resources (i.e. “*”).

In the code segment shown above, I created a root name of “SNOWFLAKE”. This is only an example. You’re free to create your own, unique naming convention.

Next, check permissions for your login. Assuming the new policy has been called SagemakerCredentialsPolicy, permissions for your login should look like the example shown below.

With the SagemakerCredentialsPolicy in place, you’re ready to begin configuring all your secrets (i.e. credentials) in the S3 bucket locations. Be sure to take the same namespace that you used to configure the credentials policy and apply it to the prefixes of your secrets.

After setting up your key/value pairs in SSM, use the following step to read the key/value pairs into your Jupyter Notebook.

Instead of hard coding the credentials, you can reference key/value pairs via the variable param_values. In addition to the credentials (account_id, user_id, password), also I stored the warehouse, database and schema.

Now, you’re ready to read data from Snowflake. To illustrate the benefits of using data in Snowflake, we will read semi-structured data from the database I named “SNOWFLAKE_SAMPLE_DATABASE”.

When data is stored in Snowflake, you can use the Snowflake JSON parser and the SQL engine to easily query, transform, cast and filter JSON data data before it gets to the Jupyter Notebook.

From the JSON documents stored in WEATHER_14_TOTAL, the following step shows the minimum and maximum temperature values, a date and timestamp and the latitude/longitude coordinates for New York City.

The final step converts the result set into a Pandas DataFrame, which is suitable for machine learning algorithms.

Conclusion

Now that we’ve connected a Jupyter Notebook in Sagemaker to the data in Snowflake using the Snowflake Connector for Python, we’re ready for the final stage: Connecting Sagemaker and a Jupyter Notebook to both a local Spark instance and a multi-node EMR Spark cluster. I’ll cover how to accomplish this connection in the fourth and final installment of this series — Connecting a Jupyter Notebook to Snowflake via Spark.