Using Data Packages with Pandas

Frictionless Data is about making it effortless to transport high
quality data among different tools and platforms for further analysis.
We obviously ♥ data science, and pandas is one of the most
popular Python libraries for advanced data analysis and modeling.
This post highlights our most recent community
contribution, pandas integration for Data Packages: what it
means, and how you can contribute.

Pandas

pandas is a Python package providing fast, flexible, and expressive
data structures designed to make working with “relational” or
“labeled” data both easy and intuitive. It aims to be the
fundamental high-level building block for doing practical, real
world data analysis in Python.

One of the primary data structures in pandas is the DataFrame. The
DataFrame, similar to R’s data
frame, stores the kind of 2-dimensional, tabular data common across
various data analysis use cases. While pandas has extremely powerful
tools for importing, exporting, and manipulating data, the process of
loading data from, say, a single CSV file, often requires some trial
and error to do optimally. For instance, one might need to manually
specify CSV dialect parameters, index columns, datetime fields, etc.
pandas has automatic type and encoding guessing, but guessing often
fails, requiring manual intervention to accurately describe and load
your data. (See
my recent post on R
for an example of this.)
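To illustrate the sort of manual intervention involved, here is a
hypothetical CSV that uses a semicolon delimiter, decimal commas,
and an id column that should serve as the index. None of this is
guessed automatically; each parameter has to be discovered and
specified by hand:

```python
import io

import pandas as pd

# A small made-up CSV with a non-default dialect
csv = 'id;date;amount\n1;2015-01-05;"1,50"\n2;2015-02-07;"2,75"\n'

df = pd.read_csv(
    io.StringIO(csv),
    sep=';',               # semicolon-separated, not comma
    parse_dates=['date'],  # otherwise dates load as plain strings
    decimal=',',           # European decimal comma
    index_col='id')        # use the id column as the index
```

With a schema in hand, all of these choices can instead be read off
the Data Package descriptor.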

A Tabular Data Package consists of one or more CSV resources, each
containing a schema (indicating type, constraints, and other
metadata useful for validation and analysis) and, optionally, a
dialect (specifying characters for separating or quoting values).
See our
JSON Table Schema guide
and the CSVDDF specification
for more information. Given that a single Tabular Data Package can
consist of multiple tables, pandas integration means loading multiple
DataFrames—with appropriately set types, encodings, indexes and
dialects—at once. And once your Tabular Data Package is loaded into
pandas DataFrames, you get all the power pandas provides to reshape,
explore, and visualise data, as well as access to its wide variety
of export formats.
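For concreteness, here is a minimal, hypothetical datapackage.json
descriptor for a one-resource Tabular Data Package, with a schema
typing each field and a dialect describing the CSV's delimiter (see
the specifications linked above for the full set of properties):

```json
{
  "name": "example-package",
  "resources": [
    {
      "name": "data",
      "path": "data.csv",
      "dialect": {
        "delimiter": ";",
        "quoteChar": "\""
      },
      "schema": {
        "fields": [
          {"name": "year", "type": "integer"},
          {"name": "cpi", "type": "number",
           "constraints": {"minimum": 0}}
        ]
      }
    }
  ]
}
```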

jsontableschema-pandas

The newly developed
pandas plugin
allows users to generate and load pandas DataFrames based on JSON
Table Schema descriptors. To use it, first
install the datapackage and jsontableschema-pandas libraries:

pip install datapackage
pip install jsontableschema-pandas

You can load a Data Package into your environment with the
datapackage.push_datapackage function. We pass it a path to the
descriptor file (datapackage.json) and choose pandas as
our backend:

Contributing

The Python library
jsontableschema-py
provides the core set of utilities for working with Tabular Data
Package tables, and it implements a plugin-based system for adding
different
storage
backends. In a
recent post,
I highlighted the first two of these storage integrations:
SQL and
BigQuery.
These libraries, like the pandas plugin, were written as drivers
implementing the jsontableschema.storage.Storage interface.
If you have another storage backend you’d like to use with Data
Packages in Python, consider writing a
plugin.
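To get a feel for what a driver involves, here is a hedged,
in-memory sketch of such a storage class. The method names mirror
those used by the existing drivers, but this is illustrative only;
check the jsontableschema-py source for the authoritative
interface:

```python
class InMemoryStorage(object):
    """Toy storage backend keeping each bucket (table) in memory as
    a list of rows; a real driver would talk to a database, file
    format, or service instead."""

    def __init__(self):
        self._tables = {}  # bucket name -> (descriptor, rows)

    @property
    def buckets(self):
        """Names of the tables this storage currently holds."""
        return list(self._tables)

    def create(self, bucket, descriptor):
        """Create an empty table from a JSON Table Schema descriptor."""
        self._tables[bucket] = (descriptor, [])

    def delete(self, bucket):
        self._tables.pop(bucket)

    def describe(self, bucket):
        """Return the table's schema descriptor."""
        return self._tables[bucket][0]

    def write(self, bucket, rows):
        """Append rows; a real driver would cast them per the schema."""
        self._tables[bucket][1].extend(rows)

    def read(self, bucket):
        return list(self._tables[bucket][1])
```

A real plugin would add type casting between JSON Table Schema
types and the backend's native types, which is where most of the
work in the SQL, BigQuery, and pandas drivers lives.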

We’re also looking to support other integrations beyond Python. You
can find user stories we’re looking to support on the
User Stories section of
the Frictionless Data site. Do you have a library, tool, or platform
that you’d like to see support importing and exporting Data Packages?
Let us know by voting and commenting on what you’d like to see! If
you have any questions about how to contribute, jump into the
Frictionless Data chat or
post in the forum.