Wickham, a major contributor of open source packages for the R statistical programming language, was addressing the needs of statisticians in his paper. The paper gives examples in R, and my former IBM colleague, Jean-François Puget, has written a blog post demonstrating how to create tidy data in Python, which is my preferred programming environment along with the pandas library. When solving mathematical optimization problems, it is useful to work with data in a tidy form. However, the definition given by Wickham is not well suited for optimization data, so I have taken the liberty of creating my own definition of Tidy Data for Optimization. The two definitions are contrasted below, with the key differences in the wording.

Wickham Definition for Statistics

"Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table."

Lustig Definition for Optimization

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with records, parameters and types. In tidy data:

1. Each parameter forms a column.
2. Each record forms a row.
3. Each type of data unit sharing an index forms a table.
4. A MultiIndex of each record implicitly defines multiple sets.

An interesting aspect of optimization that differs from the statistical world is the need for tables that have indexes. In mathematical programming languages such as AMPL, OPL, GAMS, AIMMS and MPL, the user is responsible for defining the indexes as sets, and then defining different data structures indexed by those sets. When I receive data from clients, it is most often output from a database, so the index and the resulting data are both defined within the data itself. If the data is in a spreadsheet, the indices are often row names and column names, which may need to be reshaped into a tidy form for use in an optimization model. Working with data in a tidy form when developing an optimization model is useful because it allows the model to scale naturally as the dimensions of the different index sets grow.

One of my main issues with the mathematical programming languages is that they require the developer to declare the schema of the data, and then use different methods to import that data into the modeling language environment. I've always been frustrated by this, as I'd receive data in CSV files, spreadsheets and databases, each implicitly defining the data schema within the data itself. I would spend lots of effort recoding the data schema that was given in the source data files into modeling language declarations. Now I use Python and pandas, which eliminates the need to define a data schema in my modeling environment. pandas makes it natural to start with the source data files and let the data itself define the schema. I will discuss this in future blog entries, where I will show the advantages of using pandas to read in data from various sources, transform that data in preparation for an optimization model, and manage the entities of an optimization model.
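As a small illustration of letting the data define its own schema, consider reading a CSV export with pandas. The file contents below are made up for the example (two rows shaped like the steel data discussed later); pandas infers the column names and types from the data, with no separate schema declarations.

```python
import io

import pandas as pd

# Hypothetical CSV contents, standing in for a client's database export.
csv_data = io.StringIO(
    "PROD,T,market,revenue\n"
    "bands,1,6000,25\n"
    "coils,1,4000,30\n"
)

# read_csv discovers the schema from the file itself: the header row
# supplies the column names, and the values determine the dtypes.
df = pd.read_csv(csv_data)
print(list(df.columns))  # the inferred column names
print(df.dtypes)         # the inferred types (string for PROD, integers elsewhere)
```

No declarations were recoded by hand; the same call works unchanged if the client adds columns or rows.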

Here are two examples that illustrate tidy data for optimization. The first is an excerpt from an AMPL model provided with the AMPL distribution. One file, steelT.mod, has the declarations of the data model:

This data is not in a tidy form. The tidy form of the same data could come from a CSV file, a database or a spreadsheet, and would be formatted as follows:

PROD    T   market  revenue
bands   1     6000       25
coils   1     4000       30
bands   2     6000       26
coils   2     2500       35
bands   3     4000       27
coils   3     3500       37
bands   4     6500       27
coils   4     4200       39

Here, the first two columns (named "PROD" and "T") define the index for each row. In this particular case, because the index is defined using two values, I borrow the term MultiIndex from pandas to describe a multi-dimensional index. The two parameters, "market" and "revenue", each define a column. Finally, the index implicitly defines three sets:

1. The unique indices of the table:
   { ('bands', 1), ('coils', 1), ('bands', 2), ('coils', 2),
     ('bands', 3), ('coils', 3), ('bands', 4), ('coils', 4) }
2. The set "PROD": {'bands', 'coils'}
3. The set "T": {1, 2, 3, 4}

Each of those sets is useful when creating decision variables for an optimization model.
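All three sets can be recovered directly from a pandas DataFrame whose index is the MultiIndex just described. A minimal sketch, building the tidy table from the values shown above:

```python
import pandas as pd

# The tidy steel data from the table above, indexed by (PROD, T).
df = pd.DataFrame(
    {
        "PROD": ["bands", "coils", "bands", "coils",
                 "bands", "coils", "bands", "coils"],
        "T": [1, 1, 2, 2, 3, 3, 4, 4],
        "market": [6000, 4000, 6000, 2500, 4000, 3500, 6500, 4200],
        "revenue": [25, 30, 26, 35, 27, 37, 27, 39],
    }
).set_index(["PROD", "T"])  # this creates the MultiIndex

# The three sets implicitly defined by the MultiIndex:
prod_t = set(df.index)                         # unique (PROD, T) pairs
prod = set(df.index.get_level_values("PROD"))  # the set "PROD"
t = set(df.index.get_level_values("T"))        # the set "T"
```

`prod_t`, `prod` and `t` can then be used directly as the index sets over which decision variables are created.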

Our second example comes from the IBM CPLEX Optimization Studio distribution, using OPL. Part of the model file knapsack.mod reads:

In this example, the data for the matrix Use is sparse, as most of the values are 0. A challenge with this representation is that as the number of resources and items grows, the size of data in this representation grows much faster than the number of nonzeros in the Use matrix. The tidy representation takes care of that. Here is an excerpt from that representation:

Resources  Items  Use
R1         I1      19
           I10      1
           I11      1
           I12      1
           I2       1
           I3      10
           I4       1
           I5       1
           I6      14
           I7     152
...        ...    ...
R6         I3      40
           I4      70
           I5       4
           I6      63
           I9      60
R7         I10    660
           I12      9
           I2      32
           I6       5
           I8       3

Here, the records are grouped by the names of the resources, and only the nonzero items are specified. As a preview of some future blog entries, let's assume the data for the AMPL model was given in a tidy form in a CSV file named 01.tidysteel.csv.
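To see why the tidy form stays compact, note that only the nonzero (Resources, Items, Use) records are stored as rows, and the dense matrix is reconstructed only when a solver interface needs it. A sketch using a small subset of the values from the excerpt above, with pandas `unstack` doing the expansion:

```python
import pandas as pd

# A few of the nonzero records from the Use table above.
tidy = pd.DataFrame(
    {
        "Resources": ["R6", "R6", "R6", "R7", "R7"],
        "Items":     ["I3", "I4", "I5", "I10", "I12"],
        "Use":       [40, 70, 4, 660, 9],
    }
)

# Expand to the dense matrix: unstored (resource, item) pairs
# become explicit zeros via fill_value.
dense = (
    tidy.set_index(["Resources", "Items"])["Use"]
        .unstack(fill_value=0)
)
# dense.loc["R6", "I10"] is 0, even though that zero was never stored.
```

The tidy storage grows only with the number of nonzeros, while the dense matrix grows with the product of the set sizes.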