The problem

Engineers often need to work with data sets that have been published in text or source code from different projects. Often, these data sets are formatted or embedded in such a way that manual intervention is required to extract them. This slows down the process of reusing the data in a different program or in a spreadsheet.

As an example, consider the following table of constants (written as a list of individual constants) from a C program (taken from the uClibc source code):

Admittedly, this pipeline is not easy to understand or maintain. Furthermore, although the text is extracted, using it immediately (and automatically) in a calculation would be much more difficult, as the text would need to be parsed a second time by a custom program.

In some cases, the data is buried deep within a text file, in a format that makes it difficult to build a pipeline that extracts the desired information. This is the case for the following (simple) example:

In this case, suppose we want a list of the twenties, one per line, recast in hexadecimal representation. Extraction would be complicated for the following reasons:

There is an arbitrary number of items on a line instead of a single item

There are several arrays with the same format in the set

We want an output representation that requires arithmetic on the extracted text before it can be produced (goodbye, sed!)

We want an output on each line

The array size (“twenties[12]”) is of the same format as the data to extract

Of course, this problem could be solved with awk, but at this point, I think a different approach can be taken.

The solution: the extract_lines Python module

extract_lines is a general-purpose text extraction tool that can be used within Python scripts.

Prerequisites

To use the extract_lines module, basic knowledge of Python is required. Additionally, knowledge of regular expressions (regexps) is essential for anyone wishing to do advanced text processing. Here are some resources to get you started.

The extract_lines function returns a list of lines where substitution has been done according to the following parameters:

lines: a list of lines (strings) upon which to operate

startPat: regexp pattern to match before starting the extraction and substitution phase.

endPat: regexp pattern to match before ending the extraction and substitution phase.

extractPat: regexp pattern to match for substitution.

subPat: substitution string to replace every match of extractPat. This parameter can also be a string-returning function which will then be called for every substitution with the re.Match object of the extractPat match (see examples below).

removeStart: if True (the default), the initial match from startPat is removed from the lines before starting substitution. In most cases, this should remain True.
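The two forms of subPat mirror the two kinds of replacement accepted by Python's standard re.sub. The sketch below calls re.sub directly rather than extract_lines, on the assumption (not confirmed by the parameter list) that the module passes subPat through to it. The sample line and the to_hex helper are made up for illustration.

```python
import re

# Hypothetical input line, used only for this illustration.
line = "coeffs = { 10, 20, 30 };"

# Form 1: the substitution is a plain string (backreferences such as \1 allowed).
as_string = re.sub(r"(\d+)", r"<\1>", line)


# Form 2: the substitution is a function that receives the re.Match object
# of every match and returns the replacement text.
def to_hex(match):
    return hex(int(match.group(1)))


as_function = re.sub(r"(\d+)", to_hex, line)

print(as_string)    # coeffs = { <10>, <20>, <30> };
print(as_function)  # coeffs = { 0xa, 0x14, 0x1e };
```

The function form is what makes arithmetic on the extracted text possible, since the replacement is computed from each match rather than fixed in advance.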

Even though the function returns a list of lines, some easy tricks can be used to extract data in arbitrarily complex data structures. This will be shown in the examples below. The number of lines need not be more than one, either.

How it works

The extract_lines module uses a simple state machine, driven by regular expressions. The algorithm is the following:

FOR EACH line IN lines:
    IN CASE OF “wait for start match” state (initial state):
        Try to match startPat against the line;
        IF startPat is matched THEN
            IF removeStart is True THEN
                Eliminate start match from line;
            END IF
            Switch to “extract” state;
        ELSE
            Remain in “wait for start match” state;
        END IF
    IN CASE OF “extract” state:
        Substitute subPat for every occurrence of extractPat in the current line;
        Save the current line to the output list;
        Try to match endPat against the line;
        IF endPat is matched THEN
            Stop iterating through lines;
        ELSE
            Remain in “extract” state;
        END IF
END: Return the output list of matched lines with substitutions done
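The pseudocode above can be sketched in Python as follows. This is an illustrative re-implementation, not the module's actual source. It keeps the parameter names listed earlier, but the name extract_lines_sketch, the sample data and three choices are assumptions: patterns are searched anywhere in the line, the line that matches startPat is also processed by the extract step, and endPat is tested against the line as it was before substitution.

```python
import re


def extract_lines_sketch(lines, startPat, endPat, extractPat, subPat,
                         removeStart=True):
    """Illustrative state machine; not the real extract_lines source."""
    start_re = re.compile(startPat)
    end_re = re.compile(endPat)
    extract_re = re.compile(extractPat)

    output = []
    extracting = False  # False: "wait for start match", True: "extract"

    for line in lines:
        if not extracting:
            start_match = start_re.search(line)
            if start_match is None:
                continue  # remain in the "wait for start match" state
            if removeStart:
                # Drop the start match so it cannot be mistaken for data.
                line = line[:start_match.start()] + line[start_match.end():]
            extracting = True
            # Assumption: the start line falls through to the extract step.

        # "extract" state: substitute, save, then look for the end pattern.
        output.append(extract_re.sub(subPat, line))
        # Assumption: endPat is tested on the line before substitution.
        if end_re.search(line):
            break

    return output


# Hypothetical input, loosely modelled on the "twenties" example.
sample = [
    "int other[2] = { 1, 2 };",
    "int twenties[4] = { 20, 21,",
    "    22, 23 };",
    "int more[1] = { 99 };",
]
hex_lines = extract_lines_sketch(
    sample, r"twenties\[\d+\]", r"\};", r"(\d+)",
    lambda m: hex(int(m.group(1))))
print(hex_lines)
```

Because the start match "twenties[4]" is removed first, the array size is not converted along with the data, which is the reason removeStart defaults to True.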

An interesting aspect of the state machine is that substitution can happen several times within a single line. If subPat is a function, it is called for every match. The algorithm is thus line-based for weeding out undesired data, but pattern-based for extraction.

The caveat of this algorithm is that patterns cannot span more than a single line. This makes it difficult to extract data that spans several lines, such as data within long expressions in source code. For most applications, this algorithm is sufficient. For languages such as C, where each statement ends with a semicolon (“;”), the limitation can be worked around with a preprocessing step that removes newlines and splits the text on semicolons.
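A minimal sketch of such a preprocessing step is shown below. The function name split_statements and the sample text are hypothetical, and the approach is deliberately naive: it does not handle semicolons inside string literals, comments or for(;;) headers.

```python
def split_statements(source_text):
    """Return one string per semicolon-terminated statement."""
    # Collapse all whitespace, including newlines, into single spaces.
    flattened = " ".join(source_text.split())

    pieces = flattened.split(";")
    # Every piece except the last one was followed by a semicolon.
    statements = [p.strip() + ";" for p in pieces[:-1] if p.strip()]

    # Keep any trailing text that had no terminating semicolon.
    trailing = pieces[-1].strip()
    if trailing:
        statements.append(trailing)
    return statements


# Hypothetical C fragment spanning several lines.
c_source = "int a[3] = {\n  1, 2,\n  3 };\nint b = 4;\n"
statements = split_statements(c_source)
print(statements)  # ['int a[3] = { 1, 2, 3 };', 'int b = 4;']
```

Each resulting string can then be treated as a single line, so a pattern that previously straddled a line break now fits within one item.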

Examples

Example 1: Basic usage

The following code uses the extract_lines function to perform the same task as the pipeline shown earlier. The full example source code is in file ex_coeffs.py in the archive.

The complete source of this example is in file ex_twenties1.py in the archive.

Here, we are using an anonymous function (Python lambda function) on line 6 to convert the first matching group to an integer and append it to a list. Notice that we did not use the return value of extract_lines. In fact, if we look at it, we will notice that all the numbers have been deleted (replaced with nothing), since the lambda function does not return anything!

We used the substitution function’s side-effects to extract data in a new structure.
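The side-effect trick can be tried in isolation with Python's re.sub, which accepts a replacement function in the same way subPat does. The sketch below is not taken from the example files and the input line is made up. The `or ""` makes the empty replacement explicit instead of relying on how a None return value is handled.

```python
import re

# Hypothetical data line, used only for this illustration.
line = "    20, 21, 22, 23,"

values = []
# list.append returns None, so `or ""` yields an empty replacement string.
leftover = re.sub(
    r"(\d+)",
    lambda m: values.append(int(m.group(1))) or "",
    line)

print(values)          # [20, 21, 22, 23]
print(repr(leftover))  # '    , , , ,'
```

The substituted text is a throwaway here: the useful result is the values list filled in while the substitution runs.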

Instead of using a lambda function, we could have used the reference to a full function. There still needs to be a lambda function as an adapter in order to pass other local parameters along with the match object. Otherwise, the list receiving the data would need to be in global scope. The following snippet shows this case (see file ex_twenties2.py in archive for full example source):
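As an illustration of the adapter idea (this is not the content of ex_twenties2.py), the sketch below uses re.sub directly. The names store_number and collect_numbers are hypothetical. The lambda only forwards the match object together with a local list, so nothing has to live in global scope.

```python
import re


def store_number(match, destination):
    """Full function doing the work: needs the match AND a target list."""
    destination.append(int(match.group(1)))
    return ""  # the substituted text itself is not used


def collect_numbers(lines):
    numbers = []  # local list, not a global
    for line in lines:
        # The lambda adapts the one-argument callback expected by re.sub
        # to the two-argument store_number function.
        re.sub(r"(\d+)", lambda m: store_number(m, numbers), line)
    return numbers


print(collect_numbers(["20, 21,", "22 };"]))  # [20, 21, 22]
```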

Example 3: Extracting binary vectors from a VHDL testbench of a FIR filter

This example illustrates how the extract_lines module can be used for a typical automated data extraction task.

In this case, we want to extract the coefficients for an FIR digital filter from VHDL source code. The VHDL code was written by an automated tool (HDL Coder in Filter Design and Analysis tool of MATLAB). The source file in this example contains 16 constants for coefficients within a source of about 250 lines. The desired data is encoded as follows:

We want to extract the values of coeff1 through coeff16, which are stored as signed decimal numbers of arbitrary size. When writing the data output, we want to store them as two’s complement binary coefficients with the proper width, as specified in the to_signed() conversions.
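The conversion to two's complement can be sketched as follows. The helper names, the regular expression and the sample VHDL line are assumptions made for illustration (the line only imitates the general to_signed(value, width) form), and the functions are not the to_std_logic() and subst_coeff() functions of the original example.

```python
import re


def to_binary_string(value, width):
    """Two's complement representation of value on width bits."""
    low, high = -(1 << (width - 1)), (1 << (width - 1)) - 1
    if not low <= value <= high:
        raise ValueError("%d does not fit in %d bits" % (value, width))
    # Masking maps negative values onto their two's complement pattern.
    return format(value & ((1 << width) - 1), "0{}b".format(width))


def subst_to_signed(match):
    """Substitution function: group 1 is the value, group 2 the width."""
    return to_binary_string(int(match.group(1)), int(match.group(2)))


# Hypothetical line imitating a generated VHDL coefficient constant.
vhdl_line = "constant coeff1 : signed(15 downto 0) := to_signed(-52, 16);"
match = re.search(r"to_signed\(\s*(-?\d+)\s*,\s*(\d+)\s*\)", vhdl_line)
binary = subst_to_signed(match)
print(binary)  # 1111111111001100
```

Capturing the width as a second group lets each coefficient carry its own bit width, rather than hard-coding one width for the whole file.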

function to_std_logic(), lines 4-20: takes an integer and a bit width and outputs a binary representation of it.

function subst_coeff(), lines 22-28: used by extract_lines() at line 42. Takes a match containing 2 groups and returns the string from a call to to_std_logic() with the groups extracted.

function extract_coeffs(), lines 30-54: reads the lines from the filter VHDL source code, extracts the coefficient data and stores the result in a binary table file.

Notice how straightforward the task becomes when the extract_lines module is used.

Conclusion

The extract_lines module can be used to automate a significant number of different text extraction and substitution cases. While it is not a universal solution, it is highly applicable to engineering data in tabular or machine-generated human-readable formats. The simplicity of the solution and availability of the source code allow you to modify it to suit more advanced needs.

Downloads

The extract_lines module is in the public domain. You may do what you want with it. This code is offered without any warranty of fitness for any purpose.