Pandas and Jupyter Integration

Arm Treasure Data provides a cloud-based analytics infrastructure accessible via SQL. Interactive engines like Presto let you crunch billions of records easily. However, writing SQL queries can be painful for data scientists, and you still need external tools such as Excel or Tableau to visualize the results. Instead, you can use Treasure Data with the Python-based data analysis library Pandas and visualize the data interactively in Jupyter (formerly IPython Notebook).

Prerequisites

Basic knowledge of Python.

Basic knowledge of Treasure Data.

Set Treasure Data API Key

Set your master API key as an environment variable before launching Jupyter. You can retrieve the master API key from your profile in the TD Console.

$ export TD_API_KEY="1234/abcde..."

Jupyter inherits environment variables from the shell that launches it, so TD_API_KEY must be exported before Jupyter starts.

Install Jupyter, Pandas, and Pandas-TD

We recommend that you use Miniconda to install all required packages for Pandas-TD.
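With Miniconda (or any Python environment) in place, you can install Jupyter and Pandas with conda and install Pandas-TD from PyPI with pip (the package name is pandas-td). A minimal first notebook cell might look like the following sketch; the endpoint shown is the default US endpoint and is an assumption if your account lives in another region:

%matplotlib inline

import os
import pandas as pd
import pandas_td as td

# Connect to Treasure Data using the TD_API_KEY environment variable
con = td.connect(apikey=os.environ['TD_API_KEY'],
                 endpoint='https://api.treasuredata.com')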

If you get a "KeyError: 'TD_API_KEY'" error, try apikey='your apikey' instead of apikey=os.environ['TD_API_KEY']. If that works, Jupyter did not pick up the TD_API_KEY environment variable from the OS; confirm that TD_API_KEY is exported and relaunch Jupyter.

Explore Data

There are two tables in sample_datasets. Let’s explore the nasdaq table.

We’ll use Presto as the query engine for this session. To see what the table looks like, retrieve a few rows with read_td_table:
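A minimal sketch, reusing the connection con from the setup cell above; create_engine selects Presto and the sample_datasets database, and read_td_table pulls a few rows:

# Create a Presto query engine for the sample_datasets database
engine = td.create_engine('presto:sample_datasets', con=con)

# Retrieve the first few rows of the nasdaq table
df = td.read_td_table('nasdaq', engine, limit=3)
df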

You can also use the time_range parameter to retrieve data within a specific time range:
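For example, the following sketch pulls one month of data; the time column name and the parse_dates unit ('s' for Unix seconds) are assumptions based on the sample nasdaq schema:

# Retrieve rows from January 2010, indexed by a parsed time column
df = td.read_td_table('nasdaq', engine,
                      time_range=('2010-01-01', '2010-02-01'),
                      index_col='time', parse_dates={'time': 's'})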

Now your data is stored in the local variable df as a DataFrame. Because the data is held in your computer's local memory, you can analyze it interactively using the power of Pandas and Jupyter. See Time Series / Date functionality for details on working with time-series data.
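For instance, with the time-indexed DataFrame from the previous step, ordinary Pandas operations run entirely on your machine; a short sketch, assuming the table has a close column:

# Weekly average closing price, computed in local memory
df['close'].resample('W').mean().plot()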

Sample Data

As your data set grows, the method from the previous step does not scale well. We don't recommend retrieving more than a few million rows at a time, due to memory limitations and slow network transfer. If you're analyzing a large amount of data, you need to limit the amount that gets transferred.

There are two ways to do this:

1. Sample the data. For example, the nasdaq table has 8,807,278 rows. Setting a limit of 100000 retrieves 100,000 rows, which is a reasonable size to transfer (see the first sketch below).

2. Write SQL and limit the data on the server side. For example, since we are interested only in data related to “AAPL”, count the number of matching records with read_td_query (see the second sketch below).
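The first approach is a single read_td_table call with an explicit limit; a sketch, reusing the engine from above:

# Retrieve 100,000 rows instead of the full table
df = td.read_td_table('nasdaq', engine, limit=100000)

The second approach pushes the filter into SQL; a sketch that counts the AAPL rows with read_td_query:

# Count the AAPL records on the server side
td.read_td_query("""
select count(1) as cnt
from nasdaq
where symbol = 'AAPL'
""", engine)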

It’s small enough, so we can retrieve all the rows and start analyzing data:
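A sketch of that final query; the column names are assumptions based on the sample nasdaq schema:

# Pull every AAPL row, indexed by time for time-series analysis
df = td.read_td_query("""
select time, open, high, low, close, volume
from nasdaq
where symbol = 'AAPL'
""", engine, index_col='time', parse_dates={'time': 's'})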