pyODBC and SQL

Getting Data with pyODBC

Moving data from a structured database to a local machine, via either a SQL query or a HiveQL query, is often desirable for deeper exploratory analysis. For small and medium-sized data sets, transferring the data straight to RAM is ideal, without the intermediate step of saving the query results to a file. pyODBC is a Python library that enables database connections; it requires that the ODBC driver is installed, along with any other required database-specific drivers.

First we create a database connection object by instantiating the pyodbc.connect class with all the connection details. Here server_address is the actual address of your server, and database_name is the actual name of the database for the connection.
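A minimal sketch of that step is below. The server name, database name, and driver string are placeholders for your own environment, and the connect call itself is shown as a comment since it requires pyodbc and a working ODBC driver:

```python
# Hypothetical connection details; replace with your own.
server_address = "myserver.example.com"
database_name = "analytics"

# DSN-less connection string for a SQL Server ODBC driver.
conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    f"SERVER={server_address};"
    f"DATABASE={database_name};"
    "Trusted_Connection=yes;"
)

# With pyodbc and the driver installed, the connection is then:
#   import pyodbc
#   connection = pyodbc.connect(conn_str)
```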

Then we create a cursor object and use its execute method to run our queries. To return query results, we call the fetchall method, which returns a list of row objects (each behaving like a tuple) back to Python, i.e. query_results = [(row0), (row1), (row2), …].
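pyODBC follows the Python DB-API 2.0 convention, so the cursor/execute/fetchall flow can be sketched with the standard-library sqlite3 module, which shares the same interface; the table and data here are made up for illustration:

```python
import sqlite3

# In-memory stand-in for the pyodbc connection object.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Build a toy table so the query has something to return.
cursor.execute("CREATE TABLE users (id INTEGER, name TEXT)")
cursor.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "bob")])

# execute runs the query; fetchall pulls every result row into memory.
cursor.execute("SELECT id, name FROM users ORDER BY id")
query_results = cursor.fetchall()  # [(1, 'ada'), (2, 'bob')]
connection.close()
```

Because fetchall loads every row at once, this is exactly the step where a large result set lands in RAM.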

Warning: It is possible to crash your computer by completely filling the RAM. When working with new data, I tend to watch RAM utilization in the Activity Monitor during data pulls, killing the process if necessary.

Simple Database Interface Class

Below is a database interface class that I personally used for SQL Server queries before finding out that pandas has built-in support for SQL queries. Nearly all my exploratory work is done in Python using pandas, so this class was written to convert the returned rows into a pandas DataFrame by default. The current version parses the HiveQL query to get the correct column names, and does not yet support 'select * …' queries when the data_frame flag is set to True.
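The general shape of such a class can be sketched as follows. All names here (DBInterface, _column_names, the query method) are hypothetical, sqlite3 stands in for pyodbc so the sketch is self-contained, and the column-name parsing shows why 'select * …' cannot be supported, since there is no column list in the query text to parse:

```python
import re
import sqlite3


class DBInterface:
    """Sketch of a query-to-rows interface (hypothetical names).

    Column names are parsed out of the SELECT clause, which is why
    'SELECT * ...' queries are not supported when building a frame.
    """

    def __init__(self, conn_str):
        # With pyodbc this would be pyodbc.connect(conn_str);
        # sqlite3 keeps the sketch runnable without a server.
        self.connection = sqlite3.connect(conn_str)

    @staticmethod
    def _column_names(query):
        # Pull 'a, b, c' out of 'SELECT a, b, c FROM ...'.
        select_clause = re.search(
            r"select\s+(.*?)\s+from", query, re.IGNORECASE | re.DOTALL
        ).group(1)
        if select_clause.strip() == "*":
            raise ValueError("'SELECT * ...' queries are not supported")
        # Keep the last token of each item so aliases ('x AS y') work.
        return [item.strip().split()[-1] for item in select_clause.split(",")]

    def query(self, sql):
        cursor = self.connection.cursor()
        cursor.execute(sql)
        rows = cursor.fetchall()
        columns = self._column_names(sql)
        # A pandas version would instead end with
        # pd.DataFrame.from_records(rows, columns=columns).
        return columns, rows
```

Today, pandas' own read_sql function covers most of this (pass it a query and a connection and it returns a DataFrame directly), which is why I eventually retired the class.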