Sunday, July 10, 2016

I have some mining drill hole data that I need to merge into an old vendor FORTRAN input format. Basically I do a series of SQL pulls from the drillhole database to csv files, then merge the data. My methodology has been a bit brute force in matching the separate parts of the drill hole data (lists, opening and closing of files to find matching holes, etc.). My thought was that I could do this more elegantly and efficiently by iterating through the files with generators.

The ability of generators to communicate with each other via the send() method intrigued me. I had always been a bit shy about using this language feature. My csv problem gave me a justification for checking it out.

The reference I used was Dr. Dave Beazley's 2009 PyCon tutorial. He does a nice job of explaining things as well as dispensing good advice. (I disobeyed the good advice in the interest of shoehorning coroutines into my solution; I'll cover this below.) In the context of generators, Beazley describes a coroutine as a generator function in which "yield" is used more generally: as an expression that receives values, rather than only as a statement that produces them. That is the sense in which I'm using the word "coroutine" in this post.
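To make the send() mechanics concrete, here is a minimal sketch of a coroutine in that sense. It is an illustration only, not code from the tutorial or from my solution; the name running_total is made up for the example.

```python
def running_total():
    """Coroutine: receives numbers via send() and yields the running sum."""
    total = 0
    while True:
        # yield hands back the current total, then pauses until the
        # caller resumes the coroutine with send(value) or next().
        value = yield total
        if value is not None:
            total += value


acc = running_total()
next(acc)             # prime the coroutine: run it up to the first yield
first = acc.send(10)  # resumes at the yield with value == 10
second = acc.send(5)  # resumes again with value == 5
```

The priming call to next() is the part that is easy to forget: send() with a real value only works once the generator is paused at a yield.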

Given my problem of a one (drill hole start survey) to many (drill hole interval values) relationship, I attempted a very simple (perhaps oversimplified) toy program demo of what I wanted to do with real data:

def coroutinex(subgenerator):
    """
    Generator function that consumes a key value sent from
    a higher level generator. This generator yields
    two-tuples of the form (<boolean>, data). The boolean
    value indicates whether the key matches the data.

Back to Dr. Beazley's advice: he recommends against this. Even though "yield" is the keyword in both cases, it means two different things in two different contexts, so his rule is not to mix generator and coroutine functionality. I'm going ahead in this post and doing it anyway. I don't have an excuse. It does remind me of some old Bob Dylan lyrics:

Now the rainman gave me two cures
Then he said, "Jump right in"
The one was Texas medicine
The other was just railroad gin
An' like a fool I mixed them
An' it strangled up my mind

It's OK, Bob, some of us just need to learn things the hard way.
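For the record, here is roughly what the mixing looks like in miniature. This is a simplified sketch under my own assumptions, not the program from this post: the names matcher and merge, and the tiny in-memory data, are hypothetical. One function both produces data (generator behavior) and consumes a key through send() (coroutine behavior), which is exactly the combination Beazley warns about.

```python
def matcher(rows):
    """
    Hybrid generator/coroutine. Receives a key via send() and yields
    (True, row) while the next row matches it, else (False, None).
    Assumes rows are sorted on the key held in their first element.
    """
    rows = iter(rows)
    current = next(rows, None)
    key = yield                      # priming yield; first send() gives a key
    while True:
        if current is not None and current[0] == key:
            matched = current
            current = next(rows, None)
            key = yield (True, matched)
        else:
            key = yield (False, None)


def merge(keys, rows):
    """Drive the matcher: collect the 'many' rows for each 'one' key."""
    m = matcher(rows)
    next(m)                          # prime before the first send()
    merged = []
    for key in keys:
        hits = []
        found, row = m.send(key)
        while found:
            hits.append(row)
            found, row = m.send(key)
        merged.append((key, hits))
    return merged


# Made-up sample data: two hole keys and three interval rows.
result = merge(["A", "B"], [("A", 1), ("A", 2), ("B", 3)])
```

Every value the caller gets back arrives as the return value of send(), so the two meanings of "yield" are tangled together in one loop. It works, but it is harder to reason about than a plain generator.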

Onward.

A brief diversion on drill holes - the data for a small scale (about 2,000 feet or less) geotechnical or geologic drill hole come back in three parts:

1) collar - where the hole starts in space (coordinates).

2) surveys - where the hole ends up going in space relative to the collar (drill pipe has proven to be amazingly flexible when passing through rock).

3) assays - usually the hole is sampled along intervals and chemically or physically analyzed. The assay intervals may or may not coincide with survey intervals.

Clear as (drilling) mud? Great - back to Python.

The problem:

Three tabular csv dumps from SQL - a collar file, a survey file, and an assay file. Each has a unique key in the first column that matches across files (the drill hole key). On the SQL side I have ensured that there are no orphan key rows in any of the three files and that all three are sorted on the key.
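Because the files are sorted on the key and have no orphans, the three streams can be walked in lockstep, one hole at a time. The sketch below shows that idea using itertools.groupby and plain generators rather than the send()-based approach of this post; the function names, column layouts, and sample values are all made up for illustration.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter


def grouped(handle):
    """Yield (key, rows) per drill hole from a csv stream sorted on column 0."""
    for key, rows in groupby(csv.reader(handle), key=itemgetter(0)):
        yield key, list(rows)


def merge_holes(collar_f, survey_f, assay_f):
    """Yield (key, collar_row, survey_rows, assay_rows) for each hole."""
    streams = zip(grouped(collar_f), grouped(survey_f), grouped(assay_f))
    for (ck, collar), (sk, surveys), (ak, assays) in streams:
        if not ck == sk == ak:
            # Only reachable if the sorted / no-orphan assumption is broken.
            raise ValueError("key mismatch: %s, %s, %s" % (ck, sk, ak))
        yield ck, collar[0], surveys, assays


# Hypothetical stand-ins for the three SQL dumps.
collar_csv = "DH001,1000.0,2000.0,350.0\nDH002,1010.0,2005.0,352.0\n"
survey_csv = "DH001,0,0,-90\nDH001,100,5,-88\nDH002,0,10,-85\n"
assay_csv = "DH001,0,10,0.5\nDH001,10,20,0.7\nDH002,0,10,0.2\n"

merged = list(merge_holes(io.StringIO(collar_csv),
                          io.StringIO(survey_csv),
                          io.StringIO(assay_csv)))
```

With real files, the io.StringIO objects would be replaced by file handles opened with newline="" as the csv module recommends. Only one hole's rows are held in memory at a time.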

I present the sanitized output here first - it will give some context to the domain-specific parts of the code:

    formats is a list of two-lists of namedtuple attributes and
    numeric string formats to be applied to each attribute's value.
    """
    return [formatassay(getattr(record, pairx[0]), pairx[1])
            for pairx in formats]
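To show how a formats list of that shape gets applied, here is a small self-contained sketch. The Assay namedtuple, its field names, the format specs, and this stand-in version of formatassay are assumptions made for the example; they are not the definitions from my program.

```python
from collections import namedtuple

# Hypothetical record layout for one assay interval.
Assay = namedtuple("Assay", ["hole", "from_depth", "to_depth", "grade"])


def formatassay(value, fmt):
    """Stand-in helper: apply a numeric format spec to one csv string value."""
    return format(float(value), fmt)


def formatrecord(record, formats):
    """Apply each [attribute, format] pair in formats to the record."""
    return [formatassay(getattr(record, name), fmt) for name, fmt in formats]


record = Assay("DH001", "0", "10", "0.5")
formats = [["from_depth", "8.2f"], ["to_depth", "8.2f"], ["grade", "10.4f"]]
fields = formatrecord(record, formats)
```

Fixed-width format specs like these are what a column-oriented FORTRAN input file needs: every field comes out the same width regardless of the value.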

The bad news: this was more difficult with a real world dataset than I anticipated. Beazley's admonition was an apt one.

The good news: it does perform better than my previous brute force implementations. From the standpoint of iterating through datasets and not wasting resources (even with the polling or interrupting or whatever facilitates the generator communication closer to the metal), this is a better implementation. Also, I learned a bit more about the "yield" keyword.