Friday, May 2, 2014

loading MATLAB mat-file into Pandas

Pandas is a great tool for analyzing large data sets, especially time-series data. It quickly and easily imports most basic data files: Excel, comma-separated values, etc., but not MATLAB mat-files. However, SciPy does import MATLAB mat-files, so combining packages gets the job done.

Here's an example of a mat-file that has a single variable, called measuredData, containing a MATLAB structure. The structure has a timeStamps field, several time-series fields (voltage, current and temperature) and some other fields that are irrelevant here. A field called numIntervals holds the number of intervals in the time series. The struct itself has only one element.
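If you don't have such a file handy, a stand-in can be built from Python with scipy.io.savemat, which writes a dict as a MATLAB struct. This is only a sketch: the values, the description field, the temporary path, and the assumption that timeStamps holds one [year, month, day, hour, minute, second] row per interval are all made up for illustration.

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat

n = 4  # hypothetical number of intervals

# A dict is written by savemat as a 1x1 MATLAB struct.
measured = {
    # assumed layout: one date-vector row per interval (N x 6)
    'timeStamps': np.array([[2014, 5, 2, 12, m, 0] for m in range(n)]),
    # time series stored as N x 1 column vectors
    'voltage': np.linspace(11.5, 12.5, n).reshape(-1, 1),
    'current': np.linspace(0.5, 2.0, n).reshape(-1, 1),
    'temperature': np.linspace(20.0, 25.0, n).reshape(-1, 1),
    'numIntervals': n,
    'description': 'bench test',  # an "irrelevant" metadata field
}

path = os.path.join(tempfile.mkdtemp(), 'measured_data.mat')
savemat(path, {'measuredData': measured})

# Load it back and look at how SciPy represents the struct.
mdata = loadmat(path)['measuredData']
print(mdata.shape)                    # (1, 1): one struct element, always 2-D
print(mdata.dtype.names)              # the struct's field names
print(mdata['voltage'][0, 0].shape)   # (4, 1): the field's own shape
```

Note that the shape of mdata describes the struct array, not the data inside it. Each field has to be pulled out with [0, 0] before its own shape is visible.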

import numpy as np
from scipy.io import loadmat # this is the SciPy module that loads mat-files
import matplotlib.pyplot as plt
from datetime import datetime, date, time
import pandas as pd
mat = loadmat('measured_data.mat') # load mat-file
mdata = mat['measuredData'] # variable in mat file
mdtype = mdata.dtype # dtypes of structures are "unsized objects"
# * SciPy reads in structures as structured NumPy arrays of dtype object
# * The size of the array is the size of the structure array, not the number
#   of elements in any particular field. The shape defaults to 2-dimensional.
# * For convenience make a dictionary of the data using the names from dtypes
# * Since the structure has only one element, but is 2-D, index it at [0, 0]
ndata = {n: mdata[n][0, 0] for n in mdtype.names}
# Reconstruct the columns of the data table from just the time series
# Use the number of intervals to test if a field is a column or metadata
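# numIntervals loads as a 1x1 array, so .item() extracts the scalar;
# dict.items() works in both Python 2 and Python 3
columns = [n for n, v in ndata.items() if v.size == ndata['numIntervals'].item()]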
# now make a data frame, setting the time stamps as the index
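# each time stamp is assumed to be a row of [year, month, day, hour, minute,
# second]; MATLAB stores doubles by default, so cast to int for datetime
# (this drops any fractional seconds)
df = pd.DataFrame(np.concatenate([ndata[c] for c in columns], axis=1),
                  index=[datetime(*(int(x) for x in ts)) for ts in ndata['timeStamps']],
                  columns=columns)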
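The steps above can be wrapped into a small helper. The version below is a sketch, not a general mat-file reader: struct_to_frame and its parameters are names invented here, it assumes a 1x1 struct whose time stamps are date-vector rows, and it builds its own small test file with savemat so that it runs on its own.

```python
import os
import tempfile
from datetime import datetime

import numpy as np
import pandas as pd
from scipy.io import loadmat, savemat


def struct_to_frame(path, variable, time_field='timeStamps',
                    count_field='numIntervals'):
    """Load a 1x1 MATLAB struct and return its time series as a DataFrame.

    Hypothetical helper: the field names are assumptions taken from the
    example struct described in this post.
    """
    mdata = loadmat(path)[variable]
    # the struct array is 2-D with a single element, so index it at [0, 0]
    fields = {name: mdata[name][0, 0] for name in mdata.dtype.names}
    n = int(fields[count_field].item())
    # a field counts as a column if it is numeric with one value per interval
    columns = [name for name, value in fields.items()
               if name not in (time_field, count_field)
               and value.size == n and value.dtype.kind in 'iuf']
    # assumed layout: one [year, month, day, hour, minute, second] row each
    index = [datetime(*(int(x) for x in row)) for row in fields[time_field]]
    data = {name: fields[name].ravel() for name in columns}
    return pd.DataFrame(data, index=index, columns=columns)


# Build a small stand-in file (made-up values) and load it.
sample = {
    'timeStamps': np.array([[2014, 5, 2, 12, m, 0] for m in range(3)]),
    'voltage': np.array([[11.5], [12.0], [12.5]]),
    'current': np.array([[0.5], [1.0], [1.5]]),
    'temperature': np.array([[20.0], [21.0], [22.0]]),
    'numIntervals': 3,
    'description': 'bench test',
}
path = os.path.join(tempfile.mkdtemp(), 'measured_data.mat')
savemat(path, {'measuredData': sample})

frame = struct_to_frame(path, 'measuredData')
print(frame)
```

Building the frame from a dict of flattened arrays, rather than with np.concatenate, means row vectors and column vectors are both accepted. The dtype check keeps string fields out even when their size happens to match the interval count.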