SqlPlotTools - Gnuplot and Pgfplots from SQL Statements

Summary

SqlPlotTools is a tool to process data series from algorithm experiments using SQL statements and embed the results in gnuplot datafiles or pgfplots LaTeX files. Using SQL to generate plots can be see as cracking a nut with a sledgehammer, but it really works well in practice.

The data series can be read by parsing simple text files for RESULT lines, which are automatically imported into SQL tables. A RESULT line contains one row of data in form of key=value pairs.

To generate plots, one then writes (arbitrarily complex) SQL statements embedded in Gnuplot or LaTeX files as comments. The embedded SQL is then executed by SqlPlotTools against the imported data, and the results are formatted as Gnuplot or Pgfplots pictures. The toolset can also create LaTeX tabular's in a similar way.

There are many advantages of this approach:

No need for SQL in the experimental code, it just needs to write RESULT text lines. All extra debug output is ignored during import, thus one can save the complete output of an experimental program run for future reference.

The plotted values are embedded in the LaTeX or Gnuplot code, thus after processing by SqlPlotTools the full data is available and can be made public.

Furthermore, the SQL statement completely specifies how the results are generated from the input. Thus even when complex expressions are needed to generate the plots, they are still available inside the LaTeX file as a comment and can be verified by other authors.

The declarative power of SQL is available for generating plots. One can use subqueries, JOINs between datasets and more to generate data series.

One streamlined process to generate plots from experiments. Thus real-time plotting is possible!

The only slight disadvantage is that the SQL statement processing requires an SQL database. The current SqlPlotTools version supports SQLite3, PostgreSQL, and MySQL on Linux. I recommend to start with SQLite3, because it is embedded in the SqlPlotTools binaries, and no extra database server is needed. For more complex applications, I recommend Postgresql, since it is the most flexible and advanced database.

Compilation

SqlPlotTools is written in C++ and requires the Boost.Regex and SQLite3 development libraries. Furthermore, it can optionally be compiled with PostgreSQL and MySQL support, to function as a client for these database server. Install these libraries using your distribution's package manager.

Tutorial

The SqlPlotTools package contains a very simple C++ example experiment in examples/sorting-speed, which measures the speed of sorting integer items using std::sort, std::stable_sort and STL's heap sort. The snippets in the following tutorial are largely taken from this example.

Creating RESULT data lines

When performing experiments, usually many data points are generated, and each data point has many parameters. It is often unclear how the final plots are going to be created from the parameter spaces. Likewise, experimental programs are usually already very complex, and adding SQL libraries or similar is out of the question.

With SqlPlotTools one only has to interleave the debug output of the program with RESULT lines, which look like this:

No extra libraries are needed, and all other lines outputted by the program will be ignored during the data import. The data import is very fast, thus even large sets of results can be processed conveniently.

Note: If you need to have spaces in text fields in a RESULT, then you must use tabs as key=value delimiters. If a RESULT line contains any TAB character, then the line is split by tabs instead of spaces. Quoted values are currently not supported.

This will import RESULT rows from the included stats.txt file into the table ex1 in an SQLite3 database called test.db. The types of the different columns are automatically detected during import, thus there is no need to specify a CREATE TABLE directive. Without the -D sqlite:test.db, the table would be created in a temporary in-memory database, and thus discarded after the program ends. However, since we saved the database, we can manually select from the data. The imported table looks as follows:

Generating GnuPlots using SQL Statements

Automatically importing data is already pretty neat, but now we will generate a Gnuplot from the data.

Regard the file examples/sorting-speed/speed.plot. This file contains boilerplate Gnuplot code, except for two comments. These two comments are directives, which are processed by SqlPlotTools (other comments are ignored). Plot directives have to be in CAPITALS!

The first comment is equivalent to calling the program with import-data:

# IMPORT-DATA stats stats.txt

This line imports stats.txt into the table stats, except that the data imported into a temporary table! To make tables permanent add -P as a "command line" parameter (IMPORT-DATA -P stats stats.txt).

The second comment is a multi-line SqlPlotTools directive, due to the double # at the beginning of each line:

This is already a pretty complex SQL statement, and the MULTIPLOT directive specifies that a plot with multiple lines will be generated from it. To simplify writing the SQL statement, all occurrences of "MULTIPLOT" is replaced by the content in the parenthesis, in the example with "algo". We first focus on the result of the SQL statement:

The SQL statement already suggests how the data rows are transformed by SqlPlotTools into plot lines. The parenthesised argument "algo" (in general "col1,col2,col3") is used to group multiple rows into a plot line. The plot line is automatically labelled using the values of the group columns.

To generate the plot data from the stats and update the Gnuplot file, simply run sqlplot-tools in the examples/sorting-speed directory, followed by gnuplot:

sqlplot-tools speed.plot
gnuplot speed.plot

The sqlplot-tools call will parse speed.plot for SQL directives, execute them, and modified the plot file. The lines after these directives are replaced with the corresponding results, and in the case of Gnuplot, and additional speed-data.txt file is generated, which contains the actual data points of the plot. Since the current tarball already contains speed-data.txt and speed.pdf, we suggest deleting these two files and recreating them with the command above.

Generating LaTeX Pgfplots and Tabulars using SQL Statements

Generating Gnuplots is fast, however, for publications in LaTeX the Pgfplots package renders plots much nicer using TikZ, all natively in LaTeX. The main problem of Pgfplots is to get the data into the plot. And this what SqlPlotTools was originally intended to do: to generate high-quality Pgfplots directly from data.

While Pgfplots does have facilities to read datafiles like Gnuplot, the more general workflow is to embedd the data points directly in LaTeX. And this is what SqlPlotTools does: it executes SQL statements against the datasets and replaces the lines after the directive with corresponding Pgfplots lines. While replacing the lines, it tries hard to just replace the coordinates clause, and leaving additional formatting in tact.

In the example paper.tex you find the following pgfplot, in which the \addplot lines were generated by SqlPlotTools:

Note that the MULTIPLOT directive is identical to the one in the Gnuplot example, except for being written as LaTeX comments. This example creates a high-quality plot, within LaTeX, using matching fonts and sizes. For more information on how to format the plot, see the Pgfplots manual.

Additionally to generating plots, SqlPlotTools can also generate tabular data. The paper.tex contains a reasonably complex example of such tabular. In general, SqlPlotTools will just output the result of an SQL query as a row/column tabular, just like the query is defined. Column headers are ignored, and you must provide the tabular itself including column numbers and formatting.

The included example is already pretty complex: it first selects all the sizes tested by the experiment, and then selects the median of the test result for the three algorithms using a subquery. Formatting the results of TABULAR can be done in two ways: one can either use the SQL database facilities (e.g. by defining arbitrary formatting functions using SQL procedures), or the experimental REFORMAT() subclause of TABULAR. As REFORMAT is not finished, please looks into the reformat.cpp source file for available directives.

Another additional feature of SqlPlotTools in LaTeX is to generate plain text result tables. The main application of this is to calculate summary values, which are then contained inside the text. For example: "the total runtime of all experiments in this paper is 12345 seconds". To actually calculate the "12345", we again use an SQL statement (probably involving a SUM()). SqlPlotTools allows one to use a TEXTTABLE command, which outputs a formatted SQL result similar to what an SQL command line tool would output. This text table is embedded inside the LaTeX file, and usually wrapped inside a comment area.