2) The user wants the data in Excel and, further, wants to play with it there (pivot tables and so on). The problem is that the grouping column labels are not in every record; they appear only in the one row that begins each group's block of records (sanitized screenshot below):

But I don't WANT to copy and paste all those groupings for 30,000 records :*-(

I had this assignment come in recently as a remote request. It took about four rounds of e-mail exchange to figure out that it really wasn't a data problem, but a formatting one that needed solving.

It is possible to do the whole thing in Python. I did the Excel part by hand in order to get a handle on the data:

1) In Excel, delete the extra rows on top of the report leaving just the headers and the data.

2) In Excel, select everything on the data page and format the cells correctly by deselecting the Merge Cells and Wrap Text options.

3) In Excel, at this point you should be able to see if there are extra empty columns as space fillers; delete them. Save the worksheet as a csv file.

At this point you've got your data in csv format; the fill-down of the group labels is a small Python job (sketched below), and then you can open the result in Excel and go to work.
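Here's a minimal sketch of the fill-down (Python 3), assuming the group label lives in the first column, a blank cell there means "same group as the row above," and the file names are made up:

    import csv

    # Walk the csv, carrying the last-seen group label down into the
    # blank cells beneath it. The header row passes through untouched.
    with open("report.csv", newline="") as src, \
         open("report_filled.csv", "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))      # header row
        last_group = ""
        for row in reader:
            if row[0].strip():             # a new group starts here
                last_group = row[0]
            else:                          # blank cell: inherit the group above
                row[0] = last_group
            writer.writerow(row)

Open report_filled.csv in Excel and every record has its grouping; pivot away.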

There may be a free or COTS (commercial off the shelf) utility that does all this somewhere in the Microsoft "ecosystem" (I think that's their fancy enviro-friendly word for vendor-user community) but I don't know of one.

Wednesday, February 15, 2017

1) recreate an equivalent calculation in a different format for a different purpose and check the results against the original calculation.

2) shepherd a calculation process from one vendor system through a transition to another (an upgrade, for example) by hacking a set of provisional scripts together.

3) implement a bunch of linear regressions in calculations. If I recall correctly, linear regression functionality has been in Excel for ages (since the early 90's?); it is the tried and (maybe) true tool of data fitters/forcers everywhere. Conceivably you could accurately, if not precisely, model just about any curve with enough linear segments (a minimal sketch of the idea follows this list). Mercifully, the ones I show below have only two segments per data set.
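Just to pin down what I mean by segments, here's a minimal sketch of a two-segment piecewise linear model; the breakpoint, slopes, and intercepts are invented for illustration:

    # Two-segment piecewise linear model: pick the segment by where x falls.
    # All numbers are invented for illustration.
    BREAK_X = 50.0
    SLOPE_LO, INTERCEPT_LO = 0.8130, 12.0500    # applies for x <  BREAK_X
    SLOPE_HI, INTERCEPT_HI = 0.2115, 42.1250    # applies for x >= BREAK_X

    def two_segment(x):
        if x < BREAK_X:
            return SLOPE_LO * x + INTERCEPT_LO
        return SLOPE_HI * x + INTERCEPT_HI

More segments means a closer fit; two keeps the bookkeeping sane.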

This problem embodies all three bullets above. I've sanitized the code, which makes it a little ridiculous but no less voluminous (sorry).

Here's what we have in the vendor's system - it is Python (2.7) code, but it runs inside special, à la carte purchased software that my department doesn't have. Also, it's full of a bunch of constants that I'm not really comfortable recognizing or maintaining:
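I can't paste the real thing here, so below is an invented stand-in with the same general shape: one little function per location or area, each with its own hard-coded breakpoint, slopes, and intercepts (every number and name is made up):

    # Invented stand-in for the vendor style: one function per area,
    # constants buried inline. All numbers and names are made up.
    def calc_area_one(x):
        if x < 61.0:
            return 0.9270 * x + 10.1100
        return 0.3364 * x + 46.1152

    def calc_area_two(x):
        if x < 48.5:
            return 1.1050 * x + 3.2200
        return 0.4410 * x + 35.4240

Multiply that by dozens of areas and you get the voluminous part.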

My code is less multiple-function-based and more a single function with a bunch of lookup dictionaries rolled into one big dictionary. I'm not arguing my approach is necessarily better. For instance, I implemented my x variable ranges with lower bounds based on the precision of my data, which isn't very portable.
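Sketched with the same invented numbers as the stand-in above, the shape of my version is roughly this: each location maps to a list of (lower bound of x, slope, intercept) rows, and one function walks the rows to find the segment that applies:

    # One big lookup dictionary: location -> segments, where each segment
    # is (lower bound of x, slope, intercept), sorted by lower bound.
    # All numbers and names are made up.
    SEGMENTS = {
        "area_one": [( 0.0, 0.9270, 10.1100),
                     (61.0, 0.3364, 46.1152)],
        "area_two": [( 0.0, 1.1050,  3.2200),
                     (48.5, 0.4410, 35.4240)],
    }

    def calc(location, x):
        # Keep the last segment whose lower bound x meets or exceeds.
        slope, intercept = None, None
        for lower_bound, seg_slope, seg_intercept in SEGMENTS[location]:
            if x >= lower_bound:
                slope, intercept = seg_slope, seg_intercept
        return slope * x + intercept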

The need to lock down my results and keep them in line with the original led me to the assert statement and to writing a little walk of my dictionary against both my function and the vendor's. This way, when I get a new "vendor function" (actually a snippet of code for a particular location or area), I can paste it into this crude ersatz test suite and see what needs changing.
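A minimal sketch of that walk, pairing each dictionary entry with the corresponding (made-up) vendor function from the stand-in above and asserting they agree at a handful of x values:

    # Crude ersatz test suite: compare calc() against each pasted-in
    # vendor function at a few sample x values (values are made up).
    VENDOR_FUNCS = {"area_one": calc_area_one, "area_two": calc_area_two}

    for location, vendor_func in sorted(VENDOR_FUNCS.items()):
        for x in (0.0, 10.0, 48.5, 61.0, 75.0, 120.0):
            mine, theirs = calc(location, x), vendor_func(x)
            assert abs(mine - theirs) < 1e-9, \
                "%s at x=%s: mine=%r vendor=%r" % (location, x, mine, theirs)
    print("all locations match")

When an assert fires, it names the location and the x value, which is usually enough to spot the missed decimal place.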

I caught a few missed decimal places, typos, transposed digits, and plain old omissions in my code using this approach. It is possible I've gone overboard with constants. I don't care. I have to read them and the only way I can keep them straight is by lining up the decimal places and locking them down as named constants (programmatically they are variables, but I'm not changing them).

"But why don't they and why don't you use scientific notation?"

As we used to say in the Navy years ago, "There is the right way, there is the wrong way, and the Navy way." Guess which one the vendor uses? Onward.