proof creates data processing pipelines by defining “analyses”, each of which is a stage in the process. These analyses naturally flow from one to another. For instance, the first analysis of a process might load a CSV of data. From there you might select a subset of the rows in the table, then group the results and finally take the median of each group. More complex pipelines might also diverge at some point, with several analyses that take the same input but each produce a different output.

proof contains a single class, Analysis, which is used for creating processes like these. Each analysis is constructed by instantiating Analysis with a function that does some work:

Analyses which depend on the result of this stage can then be created using the Analysis.then() method.

```python
def select_rows(data):
    # Select relevant rows from the table
    pass

def calculate_average(data):
    # The average of a value in the rows is taken
    pass

data_loaded.then(select_rows)
data_loaded.then(calculate_average)
```

In the previous example, both select_rows and calculate_average depend on the result of load_data. If instead we wanted our average to be based only on the selected rows, we would instead do:

Each analysis function must accept a data argument, which is a dict of data to be persisted between analyses. Modifications made to data in the scope of one analysis will be propagated to all dependent analyses. For example, the three functions we saw before might be implemented like this:

```python
import csv

def load_data(data):
    # Load the data
    with open('example.csv') as f:
        reader = csv.DictReader(f, fieldnames=['name', 'salary'])
        next(reader)  # skip the header row
        data['table'] = list(reader)

def select_rows(data):
    # Select relevant rows from the table
    data['low_income'] = [r for r in data['table'] if int(r['salary']) < 25000]

def calculate_average(data):
    # The average of a value in the rows is taken
    mean = sum(int(r['salary']) for r in data['low_income']) / len(data['low_income'])
    print(mean)
```

We can see here how the data dictionary gets passed from function to function with its state intact. You can also modify values that already exist in data, and those changes will be propagated forward.
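To see the mechanics without proof in the way, the same three stages can be simulated by hand on a shared dict (the sample rows are made up for illustration):

```python
def load_data(data):
    # Stand-in for the CSV loader above: seed the table directly
    data['table'] = [
        {'name': 'ann', 'salary': '22000'},
        {'name': 'bob', 'salary': '60000'},
        {'name': 'carol', 'salary': '18000'},
    ]

def select_rows(data):
    data['low_income'] = [r for r in data['table'] if int(r['salary']) < 25000]

def calculate_average(data):
    data['mean'] = sum(int(r['salary']) for r in data['low_income']) / len(data['low_income'])

# Simulate what proof does when the pipeline runs: every stage
# receives the same dict, already mutated by its ancestors.
data = {}
for stage in (load_data, select_rows, calculate_average):
    stage(data)

print(data['mean'])  # 20000.0
```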

Finally, we run the analysis, starting at the beginning, by calling Analysis.run():

When Analysis.run() is invoked, the analysis function runs, followed by each of the dependent analyses created with Analysis.then(). These in turn invoke their own dependent analyses, allowing a hierarchy to be created. Within each of those functions you can do whatever you want: print to the console, import other dependencies, save to disk. proof doesn’t care how you analyze your data.

After each analysis the value of data is cached to disk, along with a “fingerprint” describing the source code of the analysis function at the time it was invoked. If you run the same analysis twice without modifying the code, the cached data will be used for its dependents. This allows you to experiment with a dependent analysis without constantly recomputing the results of its parent. For example, if I rerun the previous script, every analysis will report that it was loaded from the cache.

This indicates that the results of each analysis will be loaded from disk if they are needed. proof tries to be very smart about how much work it does: if you modify the middle analysis in this process, select_rows, only it and the analyses that depend on it will be rerun. Try lowering the low_income threshold to 20000 and rerunning the script; you should see that load_data is loaded from the cache while select_rows and calculate_average are run again.

There is one very important caveat to this automated dependency resolution. The fingerprint generated for each analysis function is not recursive, which is to say, it does not include the source of any functions invoked by that function. If you modify the source of a function invoked by the analysis function, you will need to refresh the analysis manually, either by passing refresh=True to Analysis.run() or by deleting the cache directory (.proof by default).

Sometimes there are analysis functions you always want to run, even if they are up to date. This is most commonly the case when you simply want to print your results. proof allows you to flag that an analysis function should always run by using the never_cache decorator. Let’s modify our previous example to move the print statement into a separate analysis:

```python
def calculate_average(data):
    # The average of a value in the rows is taken
    data['mean'] = sum(int(r['salary']) for r in data['low_income']) / len(data['low_income'])

@proof.never_cache
def print_results(data):
    print(data['mean'])

data_loaded = proof.Analysis(load_data)
rows_selected = data_loaded.then(select_rows)
average_calculated = rows_selected.then(calculate_average)
average_calculated.then(print_results)
data_loaded.run()
```

An Analysis is a function whose source-code fingerprint and output can be
serialized to disk. When it is invoked again, if its code has not changed,
the serialized output will be used rather than executing the code again.

Implements a callback-like API so that Analyses can depend on one another.
If a parent analysis changes, then it and all of its children will be
refreshed.

Parameters:

func – A callable that implements the analysis. Must accept a data
argument that is the state inherited from its ancestor analyses.

Execute this analysis and its descendants. There are four possible
execution scenarios:

1. This analysis has never been run. Run it and cache the results.

2. This analysis is the child of a parent analysis which was run, so it
must be run because its inputs may have changed. Cache the result.

3. This analysis has been run, its parents were loaded from the cache and
its fingerprints match. Load the cached result.

4. This analysis has been run and its parents were loaded from the cache,
but its fingerprints do not match. Run it and cache updated results.

On each run this analysis will clear any unused cache files from the
cache directory. If you have multiple analyses running in the same
location, specify separate cache directories for them using the
cache_dir argument to the Analysis constructor.

Parameters:

refresh – Flag indicating if this analysis must refresh because
one of its ancestors did.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the “Software”), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.