Introduction

One of my favorite things about Python is that users get the benefit of observing the R community and then emulating the best parts of it. I'm a big believer that a language is only as helpful as its libraries and tools.

This post is about pandasql, a Python package we (Yhat) wrote that emulates the R package sqldf. It's a small but mighty library comprised of just 358 lines of code. The idea of pandasql is to make Python speak SQL. For those of you who come from a SQL-first background or still "think in SQL", pandasql is a nice way to take advantage of the strengths of both languages.

In this introduction, we'll show you to get up and running with pandasql inside of Rodeo, the integrated development environment (IDE) we built for data exploration and analysis. Rodeo is an open source and completely free tool. If you're an R user, its a comparable tool with a similar feel to RStudio. As of today, Rodeo can only run Python code, but last week we added syntax highlighting for a bunch of other languages to the editor (markdown, JSON, julia, SQL, markdown). As you may have read or guessed, we've got big plans for Rodeo, including adding SQL support so that you can run your SQL queries right inside of Rodeo, even without our handy little pandasql. More on that in the next week or two!

Downloading Rodeo

ps If you download Rodeo and encounter a problem or simply have a question, we monitor our discourse forum 24/7 (okay, almost).

A bit of background, if you're curious

Behind the scenes, pandasql uses the pandas.io.sql module to transfer data between DataFrame and SQLite databases. Operations are performed in SQL, the results returned, and the database is then torn down. The library makes heavy use of pandaswrite_frame and frame_query, two functions which let you read and write to/from pandas and (most) any SQL database.

locals() vs. globals()

pandasql needs to have access to other variables in your session/environment. You can pass locals() to pandasql when executing a SQL statement, but if you're running a lot of queries that might be a pain. To avoid passing locals all the time, you can add this helper function to your script to set globals() like so:

Final thoughts

pandas is an incredible tool for data analysis in large part, we think, because it is extremely digestible, succinct, and expressive. Ultimately, there are tons of reasons to learn the nuances of merge, join, concatenate, melt and other native pandas features for slicing and dicing data. Check out the docs for some examples.

Our hope is that pandasql will be a helpful learning tool for folks new to Python and pandas. In my own personal experience learning R, sqldf was a familiar interface helping me become highly productive with a new tool as quickly as possible.

We hope you'll check out pandasql and Rodeo; if you do, please let us know what you think!