Tutorial: Creating a Pandas DataFrame from a Shapefile

ArcPy doesn´t have an option to export shapefile attribute tables to pandas DataFrame objects. Fortunately, there are number of workarounds available to make this happen.

Pandas DataFrame objects are comparable to Excel spreadsheet or a relational database table. They come from the R programming language and are the most important data object in the Python pandas library. They are handy for data manipulation and analysis, which is why you might want to convert a shapefile attribute table into a pandas DataFrame. Unfortunately, ArcMap offers no such functionality and pandas itself doesn´t know what to do with spatial data. However, there are various workarounds available to create DataFrame objects from shapefiles attribute tables. A few of them are covered below.

To follow the instructions below, you can download the MTBS wildfire data (national datasets) and store them on your pc. The file that will be analyzed here is the mtbs_fod_pts_20170501 shapefile’s attribute table, that counts 20,340 rows and 30 columns.

Option 1: convert a shapefile’s attribute table to an Excel table

If you have ArcMap available, head over to the System Toolboxes in ArcCatalog and choose “Conversion Tools” -> “Excel” -> “Table to Excel”. Choose an input file, name the Excel file that will be created and mark the “use field alias as column header” box. The same can be done using arcpy as follows:

import arcpy

# Set local variables

in_table = “C:\data\mtbs_fod_pts_data\mtbs_fod_pts_20170501.dbf”

out_xls = “excel_table.xls”

# Execute TableToExcel

arcpy.TableToExcel_conversion(in_table, out_xls)

The resulting 8+ megabyte Excel file can be imported as a pandas DataFrame as follows:

import pandas as pd

df = pd.read_excel(“C:\data\mtbs_fod_pts_data\excel_table.xls”)

df(head) # this will print out the first 5 rows and corresponding columns

df.shape #this will print out a list with the total amount of rows and columns.

Option 2: use Arcpy to convert a Numpy Array to a pandas DataFrame

The second option uses arcpy functionality to create NumPy arrays from attribute tables. The following code converts a pandas DataFrame from an NumPy array with attribute table data:

import arcpy

import pandas as pd

input = “C:\data\mtbs_fod_pts_data\mtbs_fod_pts_20170501.shp”

arr = arcpy.da.TableToNumPyArray(input, (‘FIRE_ID’, ‘FIRENAME’))

df = pd.DataFrame(arr)

As you can see, only two columns are used for the creation this DataFrame, as selecting all columns will give an memory allocation error. In other words, this solution is not scalable. However, this solution can come in handy when you only want a subset of the data returned as pandas DataFrame.

A similar, but more efficient approach uses Search Cursors to select the columns of interest:

df_2 = pd.DataFrame(arcpy.da.FeatureClassToNumPyArray(input,

[‘FIRE_ID’, ‘FIRENAME’],

skip_nulls=False,

null_value=-99999))

Option 3: Use The GeoPandas Library´s to Create a GeoPandas DataFrame

The quickest and easiest option to create a DataFrame from a shapefile is by using GeoPandas, a Python library for working with geospatial data. GeoPandas inherits the standard pandas methods for indexing and selecting data and adds geographical operations as spatial joins and merges. Geopandas can read almost any vector-based spatial data format, including Esri shapefile so that with only two lines of code, you can place all rows and columns into a GeoDataFrame, the library´s data object that is modeled after the pandas DataFrame. This GeoDataFrame also lists a Geometry column containing points for each row (feature):

Eric van Rees is a freelance writer and editor. His specialty is GIS technology. He has more than eight years of proven expertise in editing, writing and interviewing as editor and editor-in-chief for the international geospatial publication GeoInformatics, as well as GIS Magazine and CAD Magazine, both published in Dutch. Currently, he writes about geospatial technology for various clients, publications and blogs.