Training

Data Management with pandas (Python)

Instructor

Chang Chung worked as a Statistical Programmer and Data Archivist at the Office of Population Research at Princeton University. He earned his Ph.D. in Sociology from the University of South Carolina and his M.S.E. in Systems Engineering from the University of Pennsylvania.

pandas is an open-source Python package for handling table-like ("relational") and key-value paired ("labeled") data, large and small, easily, intuitively, and quickly. Designed for practical, real-world data handling and analysis in Python, pandas is considered one of the killer apps for Python in the Big Data era. It is one of the six packages of the SciPy core stack, which is itself rapidly gaining popularity among scientific communities. Specific data management topics that we will discuss include: handling missing data; fast insertion and deletion; (automatic) data alignment; group by (like SQL) or split-apply-combine (like plyr); efficient slicing, indexing, and subsetting of larger data based on hierarchical labels; intuitive merge and join operations on multiple datasets; and robust and extensive I/O tools that interact well with many data formats, including CSV, Excel, SQL databases, HDF5, JSON, and even Stata.
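As a rough preview of a few of these topics, the sketch below touches on missing data, group by, merging, and automatic alignment. The tables, column names, and values are made up for illustration and are not the workshop datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical toy tables (not the workshop data).
people = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "state": ["NJ", "NJ", "PA", "PA"],
    "income": [52000.0, np.nan, 61000.0, 48000.0],
})
states = pd.DataFrame({
    "state": ["NJ", "PA"],
    "region": ["Northeast", "Northeast"],
})

# Handling missing data: count the gaps, then fill them with the column mean.
n_missing = int(people["income"].isna().sum())
filled = people["income"].fillna(people["income"].mean())

# Group by (split-apply-combine): mean income per state; NaN is skipped.
mean_income = people.groupby("state")["income"].mean()

# Merge: attach the region of each person's state, keeping every person.
merged = people.merge(states, on="state", how="left")

# Automatic alignment: arithmetic matches on labels, not on position.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "z"])
total = a + b  # "x" has no partner in b, so its result is NaN

print(n_missing)
print(mean_income)
print(merged)
print(total)
```

The left merge is only one of several join types; it is used here so that no row of the hypothetical `people` table is dropped.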

Audience

Attendees are assumed to be proficient enough in Python to feel comfortable writing and running 5 to 10 lines of Python code in an IPython Notebook environment. No previous pandas experience is required. Knowledge of the NumPy ndarray object is recommended, but a pre-workshop handout will be provided for those who are not familiar with the basics of NumPy.

Format

Presentation and hands-on exercises are combined around wrangling a couple of interesting and sizable datasets, in order to provide a better understanding of data management problems and the tools for dealing with them. Attendees will have an opportunity to work with two powerful yet intuitive-to-use objects that pandas provides: Series (1-dimensional) and DataFrame (2-dimensional).
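For readers who have not yet seen the two objects, here is a minimal sketch of how a Series and a DataFrame relate. The labels and values are invented for illustration.

```python
import pandas as pd

# A Series is a 1-dimensional labeled array (hypothetical scores).
scores = pd.Series([3.5, 4.0, 2.5], index=["a", "b", "c"], name="score")

# A DataFrame is a 2-dimensional table whose columns share one index.
table = pd.DataFrame({"score": scores, "passed": scores >= 3.0})

# Selecting one column of a DataFrame gives back a Series.
score_column = table["score"]

# Label-based lookup of a single cell: row "b", column "score".
b_score = table.loc["b", "score"]

print(scores.ndim, table.ndim)
print(table)
```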