Friday, December 11, 2015

Last post I brute forced the storage of binary image (jpeg) data as text in a Postgresql database, and likewise brute forced the data's display in the Unix image viewer feh from the output of a psql query. It was hackish, and I received some pointed but constructive criticism on how to improve it:

1) use Python's base64 module instead of the binascii one.

2) use bytea as a storage type in Postgresql instead of text.

Marius Gedminus made the base64.b64encode suggestion for text. It does make for a little less storage space. Ultimately we won't go with this solution because we want to go with bytea, the Postgresql data type intended for this type of data. But for completeness, here is what a base64.b64encode text solution would look like:
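Roughly sketched below (a minimal sketch - the actual database round trip is left out): read the jpeg, base64 encode it into a string that could go into a text column, then decode that string back to binary and write it out as test2.jpg for comparison.

>>> import base64, os
>>> # Encode the original jpeg as base64 text - this string is what
>>> # would be stored in, and later pulled from, a text column.
>>> # (Decoding to ascii avoids the b'...' quote characters that would
>>> # otherwise need stripping.)
>>> with open('test.jpg', 'rb') as fle:
...     b64text = base64.b64encode(fle.read()).decode('ascii')
...
>>> # Decode the text back to binary and write a second jpeg to compare.
>>> with open('test2.jpg', 'wb') as fle:
...     nbytes = fle.write(base64.b64decode(b64text))
...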

>>> # Works in both cases.
>>> os.system('feh --geometry 400x300+200+200 test.jpg')
0
>>> os.system('feh --geometry 400x300+200+200 test2.jpg')
0
>>>

The results for both commands in the last lines (show picture with feh) look the same:

Storing the base64 string in a Postgresql text column works the same as storing the hex one like I did in the last post. The main thing to look out for is properly stripping extra characters from the Python-generated string - single quotes are OK as long as they are matched on either end of the string. As I mentioned in the code comments above, knowing what I know now, I would strip those out too, even before storing the string in a database.

On to the Postgresql bytea storage part of the post. Someone I respect asked me on Facebook, "Why didn't you just use bytea (for storage)?" I had to sheepishly own up to just not being used to working with binary data (as opposed to strings) so I went with what I knew. Shame drove me to at least attempt to do things the right way - binary storage for binary data, in this case a jpeg image.

Postgresql 9.4 uses a hex based representation (hex format) for the bytea data type by default. It is possible to mess this up - it is covered in the doc but I didn't read it carefully enough:

If you preface your hexadecimal string with \x (single backslash) you will end up with an octal representation of your binary data (digits 0 through 7). \\x prior to the hexadecimal string will give you what you - or at least I - want: a hexadecimal representation of your binary data on output. The SQL expression I used for processing my string data (already in the database from my work on the last blog post):

/* Postgresql SQL code */
CAST('\\x' || <hexadecimal string> AS bytea)

The || operator is for string concatenation (this is probably obvious to users of Postgresql and most other databases, but MSSQL uses a + symbol, so it was a little new to me).

To deal with transitioning all my text picture columns to bytea I did the following:

1) create a new set of tables in the same database, identical to the ones I had, with relations between them identical to those among the old tables.

2) fill the new tables with the old data, converting the former text columns holding binary data to bytea (a rough sketch of this and the remaining steps follows the list).

3) delete the old tables once the new ones are filled in.

4) rename the new tables to match the names of the old ones (how I wanted the database schema to look in the first place).
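Here is a rough sketch of what steps 2 through 4 might look like. This is not the SQL I actually ran - the table and column names are stand-ins - and it leans on psycopg2 from Python purely for illustration; in practice I ran plain SQL files from psql with \i. The backslash prefix on the hex string is the same issue discussed above and depends on your string settings.

# Sketch only - stand-in table/column names, psycopg2 used just for illustration.
import psycopg2

conn = psycopg2.connect(dbname='trains', user='carl')
with conn, conn.cursor() as cur:
    # Step 2: copy rows across, converting the hex text column to bytea.
    # (Whether '\\x' or '\x' is right here depends on the string settings
    # discussed earlier.)
    cur.execute(r"""
        INSERT INTO traincars_new (carid, carname, picture)
        SELECT carid, carname, CAST('\\x' || picture AS bytea)
        FROM traincars;""")
    # Step 3: drop the old table.
    cur.execute("DROP TABLE traincars;")
    # Step 4: give the new table the old name.
    cur.execute("ALTER TABLE traincars_new RENAME TO traincars;")
conn.close()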

Postgresql is different from MSSQL in that each database is more its own autonomous entity that needs to be connected to other databases by some additional mechanism. In MSSQL, databases on the same server can reference each other in queries by default. I started looking into the Postgresql fdw (foreign data wrapper) plugin, then realized I could do this more easily with the path I took above.

It's not necessary to post all the SQL code. I used a psql variable in my SQL for the hexadecimal data predicate to make sure I got it right each time. From inside psql I executed the SQL files with the \i metacommand. Here is a snippet with the variable:

/* Postgresql SQL code to be used with the Postgresql psql interpreter */

The variable thing in psql takes a little getting used to but the Postgresql documentation is good about explaining when and how to use the single quote marks and where to put them. It worked out.

The most important part: getting the picture to show up from a psql metacommand through the use of a Python script. Here is my modified script, similar to the one in my last post:

#!/usr/local/bin/python3.5

An important change I made from last time is fixing the call to the image viewer feh to eliminate all that hacky intermediate writing of a jpeg file that took forever (in computer time). It turns out feh accepts binary input from a pipe or stdin just fine - I just needed to read the man page more thoroughly.
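Sketched out, the fix looks something like this (a sketch, not the full script - jpegbytes stands in for the bytea value pulled from the database, and the trailing - argument is my reading of the man page's "read the image from stdin" option):

import subprocess

def showpic(jpegbytes):
    """Pipe raw jpeg bytes straight to feh - no intermediate file."""
    # Assumes 'feh -' reads the image data from stdin, per the man
    # page reading described above.
    proc = subprocess.Popen(['feh', '--geometry', '400x300+200+200', '-'],
                            stdin=subprocess.PIPE)
    proc.communicate(jpegbytes)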

Saturday, October 17, 2015

Last post I set up a toy (literally) Postgresql database for my model train car collection. A big part of the utility of the database is its ability to store images (pictures or photos) of the real life prototype and model train cars. Postgresql (based on my google research) offers a couple methods of doing that. I'll present how I accomplished this here. The method I chose suited my home needs. For a commercial or large scale project, something more efficient in the way of storage and speed of retrieval may be better. Anyway, here goes.

I chose to store my photos as text representations of binary data in Postgresql database table columns with the text data type. This decision was mainly based on my level of expertise and the fact that I am doing this for home use as part of a learning experience. Storing the binary data as text inflates its size by a factor of two - very inefficient for storage. For home use in a small database like mine, storage is hardly an issue. At work I transfer a lot of binary data (three-dimensional mesh mined solids) to a distant server in text format using MSSQL's bcp. Postgresql is a little different, but I am familiar with the general idea of stuffing a lot of text in a database column.
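The factor of two comes straight from the hex encoding - two hex characters per byte:

>>> import binascii
>>> data = bytes(range(256))                  # any binary payload
>>> len(binascii.hexlify(data)) / len(data)
2.0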

In order to get the data into comma delimited rows without dealing with a long, unwieldy string of text from the photos, I wrote a Python script to do it:

#!python3.4

"""Prepare multiple rows of datathat includes a hexlify'dpicture for a column ina table in the model traindatabase."""

The basic gist of the script is to get the data for each photo named into a file that can later be imported into a table in Postgresql. The paths in the capitalized "constants" would have to be adjusted for your situation (I tend to go overboard on capitalized constants because I'm a lousy typist and want to avoid screwing up and then having to debug my typos). The INFOFILE referred to in the script has roughly the following format:

<column1data>, <column2data>, . . . , <photofilename>

So the idea is to take a comma-delimited file, encode it in UTF-8, and stuff the binary data from the (correct) photo onto the end of each row as text. I designed my database tables with photos (I use the column name "picture") so that the text data column comes last - this is kind of a hack, but it made scripting this easier.
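The full script is longer than I want to reproduce here, but a stripped-down sketch of the idea looks roughly like this (the paths in the capitalized "constants" are placeholders, and the csv handling is simplified):

import binascii

# Placeholder "constants" - the real paths belong to my machine.
INFOFILE = '/home/carl/trains/traincarinfo.txt'
PHOTODIR = '/home/carl/trains/photos/'
OUTPUTFILE = '/home/carl/trains/rowsforimport.txt'

with open(INFOFILE, encoding='utf-8') as infofile, \
        open(OUTPUTFILE, 'w', encoding='utf-8') as outputfile:
    for line in infofile:
        # The last comma delimited field is the photo file name.
        *columns, photoname = [fieldx.strip() for fieldx in line.split(',')]
        with open(PHOTODIR + photoname, 'rb') as photofile:
            # hexlify the binary jpeg data so it can live in a text column.
            hextext = binascii.hexlify(photofile.read()).decode('ascii')
        # The picture column is last in the table, so tack it on the end.
        outputfile.write(','.join(columns + [hextext]) + '\n')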

An example of importing one of these "row" files into the database table from within psql:

My Python script for actually displaying a photo or image is a little hacky in that it requires checking the size of the output against the size of the information pulled from the Postgresql database table. My original script would show the picture piped to the lightweight UNIX image viewer feh as only partially complete. To get around this I put a timed loop in the script to check that the image data were about half the size of the text data pulled. It works well enough, if slowly at times:

#!/usr/local/bin/python3.4

"""Try to mimic processing of imagecoming out of postgresql queryas a stream."""

Depending on which directory I've logged into psql under, I may have to type the full paths of the output and Python file.

There is more I could do with this, but for now I'm OK with it. Writing to a file and then checking on its size is slow. There is probably a way to write to memory and check what's there, but I got stuck on that and decided to go with the less efficient solution.

That ftp3.usa.openbsd.org server is the one in Boulder, Colorado - that's the one I usually use. I'm in Tucson, Arizona in the Mountain timezone, so it kind of makes sense to use that one. My understanding is that, in general, you want to use a mirror away from the main one to spread out the bandwidth and server use for the OpenBSD (or any other open source) project.

Now to install the package - this has to be done as root. I use sudo for this (sudo's replacement, as I understand it, in OpenBSD 5.8 will be doas(1) although you'll still be able to get sudo(1) as a package).

$ sudo pkg_add postgresql-server
quirks-2.54 signed on 2015-03-09T11:04:08Z
No change in quirks-2.54
postgresql-server-9.4.1p1 (extracting)1%1%2%3% ********<etc.>100%
postgresql-server-9.4.1p1 (installing)0% useradd: Warning: home directory `/var/postgresql' doesn't exist, and -m was not specified
postgresql-server-9.4.1p1 (installing)|1%1%2%3% ********<etc.>100%
postgresql-server-9.4.1p1: ok
The following new rcscripts were installed: /etc/rc.d/postgresql
See rcctl(8) for details.
Look in /usr/local/share/doc/pkg-readmes for extra documentation.
$

Given an internet connection with decent speed, this all goes pretty quickly. The first set of percent numbers is the download of the gzipped tar package, the second is the unpacking and installation of the Postgresql binaries in the proper location in the operating system file hierarchy.

For years I had some trouble getting my head around setting up users for Postgresql and running the daemon. Much of my database experience is as an application user at work using Microsoft SQL Server. We use Windows Authentication there primarily. Working on my own UNIX-based (OpenBSD) home system is a little different.

Most of the problems I've had overcoming this user/security hump related to my lack of a strong grasp of UNIX users and permissions (a "could do it in my sleep" strong grasp). OpenBSD is a bit unique in that it has a special name for the Postgresql unprivileged user: _postgresql. The leading underscore is an OpenBSD convention for this general class of user - usually one associated with a daemon that runs on startup or gets started by root, and that doesn't have a login (nor a password). Michael Lucas spends several pages on a good summary of the rationale behind this, its history and conventions, in his authoritative OpenBSD book.

There is no data directory there (just . and .. in the /var/postgresql directory - the 2 in the ls output). This is typically where I would get stuck in the past. I ended up doing it manually . . . and wrong, or at least in a way that was more difficult than necessary. Anyway, I recorded it that way, so I'll blog it as executed.

What I had difficulty understanding before was the whole unprivileged user concept. Basically you need to use su to log on as root, then further su to log on as _postgresql:

1) There's a line in the output about fixing permissions on the existing data directory (this will show up as highlighted on the blog, possibly not in the planetpython blog feed) - had I done this correctly (just let initdb make the directory itself), that line would look something like this (I created another cluster while writing the blog just so I would understand how to do it right):

creating directory /var/postgresql/data4 ... ok

Right there in the initdb(1) man page: "Creating a database cluster consists of creating the directories in which the database data will live . . ." The man page goes on to explain how to get around permission problems, etc. in this process. Note to self: read the man page . . . carefully.

2) What I also learned is that you can make as many database clusters as you want, all with different data directories. postgres is the superuser name you see in the documentation and /var/postgresql/data is the directory, but, as demonstrated above in the output, you could put your data in a folder called data4. If you gave a different name with the -U switch in the initdb command, the superuser name would be different too. Or you could have more than one cluster, each with a superuser named postgres but with different passwords.

All that said, one cluster per physical box and the conventional names are plenty for me - I'm just trying to get used to the Postgresql environment and get started.

At this point I need to start up the Postgresql daemon. In the package install above, the output mentions an rc script /etc/rc.d/postgresql. This is run by root - below is a demo of using it manually with su (instead of using it as part of an rc startup sequence at boot):

All I did there was start the Postgresql daemon with the installed rc script, check to see that its associated processes are running, then stop the daemon with the same script.

Me being me, I can't leave well enough alone. I want control over starting and stopping the daemon when I decide to (I am running this on a laptop). As I understand it, pg_ctl is a wrapper program provided with the Postgresql install for lower level commands and functionality. I use pg_ctl to run the daemon, starting it under the _postgresql user account:

I asked pg_ctl to make a specific log file for this session (firstlog - this will go in directory /var/postgresql/). The logs are human readable and I wanted to study them later to see what's going on (there's all kinds of stuff in there about autovacuum and what not - sorry, we're not covering that in this blog post - but I'll have it available later).

Shutting down (stopping) the daemon is pretty simple with pg_ctl - just a few more keystrokes than if I had done it from root with the rc script:

I created a user/role carl with SUPERUSER capabilities within this instance of Postgresql. It's a bit ugly and I'm not sure I've done this correctly or the easiest way. Also, and of importance, I have given Postgresql user carl (not OpenBSD user carl) all permissions on everything. Really, carl only needs permissions to work on the database he's working on. Josh Drake (@linuxhiker on twitter) pointed this out to me. I am grateful for this. He is right. I am lazy.

Now to create my database. I got into model trains around Christmas of 2015 and went crazy collecting stuff and setting up a layout. I needed to somehow keep track of all the cars before it all got too unwieldy.

The command line entry to start psql is something I'm a bit fuzzy on - postgres is, as I understand it, a stock database created along with the cluster, and the one to connect to when you don't want to designate another database to work on.

I'm not going to post the full database code for the sake of brevity - it's only 11 tables but that's a bit much for a blog post. Instead I'll post a graphic schema I made and talk to it a little bit before posting one related SQL code snippet.

Disclaimer: I'm not a designer. This schema diagram I did with Dia, a fairly lightweight Linux/UNIX desktop tool for flowcharts and stuff. I've never met a color palette or font choice I could simply let be. Asking me to do a flowchart with a lot of leeway on design is like leaving a two year old home alone with a Crayola 64 pack of crayons and the 300 year old family Bible - it can't end well.

All that said, I find schema diagrams helpful for showing relationships between tables and having an ugly one is better than none at all. I've embedded an svg version of it below; hopefully it shows up on the planetpython feed:

The focus of my crude toy database design was the use of foreign keys to maintain consistency in naming things I want to track: rail line names, for example. I went with "Santa Fe" where I could have gone with (and probably should have) "ATSF." It doesn't matter as long as it's consistent and I know what it means.

Years ago I was called in to do some work on a blasting database at the mine. There weren't any constraints on the entry of names of blasting materials, but what could go wrong? There were only three or four products with four digit designators and "None." Well . . . it was a mess. I didn't want to take any chances on having a situation like that again, even, or especially, if I was doing all of the data entry. Foreign keys it was!

Here's a quick dump of the code I used to create the validsidenumbers table. The idea is to make sure the rail line or company name is consistent in all the side number records (yes, I did actually purchase some identical rolling stock with the exact same side numbers - it's a long story):
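The real table definitions run a bit longer, but the shape of the idea is below. The column and table names here are stand-ins rather than the actual schema, and psycopg2 is used just for illustration - the real tables were created from plain SQL in psql:

# Sketch of the foreign key idea - stand-in names, psycopg2 for illustration.
import psycopg2

DDL = """
CREATE TABLE validrailline (
    railline text PRIMARY KEY);

CREATE TABLE validsidenumbers (
    railline   text NOT NULL REFERENCES validrailline (railline),
    sidenumber text NOT NULL,
    PRIMARY KEY (railline, sidenumber));
"""

conn = psycopg2.connect(dbname='trains', user='carl')
with conn, conn.cursor() as cur:
    cur.execute(DDL)
conn.close()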

Saturday, September 26, 2015

A couple months back I had a one-off assignment to dump some data from a vendor-provided relational database to a csv file and then from there to Excel (essentially a fairly simple ETL - extract, transform, load - exercise). It was a little trickier than I had planned. Disclaimer: this may not be the best approach, but it worked . . . at least twice . . . on two different computers, and that was sufficient.

Background:

Database: the relational database provided by the vendor is the back end to a graphic mine planning application. It does a good job of storing geologic and mine planning data, but requires a little work to extract the data via SQL queries.

Weighted Averages: specifically, the queries are required to do tonne-weighted averages and binning. Two areas that I've worked in, mine planning and mineral processing (mineral processing could be considered a subset of metallurgy or chemical engineering), require a lot of work with weighted averages. Many of the database programming examples online deal with retail and focus on sales in the form of sums of sales by location. A weighted average by tonnes or gallons of flow requires a bit more SQL code.

Breaking Up the SQL and the CSV Dump Problem: in order to break the weighted average and any associated binning into smaller, manageable chunks of functionality, I used MSSQL (Microsoft SQL Server) global temporary tables in my queries. Having my final result set in one of these global temporary tables allowed me to dump it to a csv file using the MSSQL bcp utility. There are other ways to get a result set and produce a csv file from it with Python. I wanted to isolate as much functionality within the MSSQL database as possible. Also, the bcp utility gives some feedback when it fails - this made debugging or troubleshooting the one off script easier, for me, at least.

As far as the SQL goes, I may have been able to do this with a single query without too much trouble. There are tools within Transact-SQL for pivoting data and doing the sort of things I naively and crudely do with temporary tables. That said, in real life, the data are seldom this simple and this clean. There are far more permutations and exceptions. The real life version of this problem has fourteen temporary tables versus the four shown here.

Sanitized Mock-Up Scenario: there's no need to go into depth on our vendor's database schema or the specific technical problem - both are a tad complicated. I like doing tonne-weighted averages with code, but it's not everyone's cup of tea. In the interest of simplifying this whole thing and making it more fun, I've based it on the old Star Trek episode "The Devil in the Dark," about an underground mine on a distant planet.

Mock Data: we're modeling mined out areas and associated tonnages of rock bearing pergium, gold, and platinum in economic concentrations. (I don't know what pergium is, but it was worth enough that going to war with Mother Horta seemed like a good idea). Here is some code to create the tables and fill in the data (highly simplified schema - each mined out area is a "cut").

SQL Server 2008 R2 (Express) - table creation and mock data SQL code. I'm not showing the autogenerated db creation code - it's lengthy - suffice it to say the database name is JanusVIPergiumMine. Also, there are no keys in the tables for the sake of simplicity.

Python Code to Run the Dump/ETL to CSV: this is essentially a series of os.system calls to MSSQL's sqlcmd and bcp. What made this particularly brittle and hairy is the manner in which the lifetime of temporary tables is determined in MSSQL. To get the temporary table with my results to persist, I had to wrap its creation inside a process. I'm ignorant as to the internal workings of buffers and memory here, but the MSSQL sqlcmd commands do not execute or write to disk exactly when you might expect them to. Nothing is really completed until the process hosting sqlcmd is killed.
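A stripped-down sketch of that wrapping (server, database, table, and column names are stand-ins, the query is far simpler than the real fourteen-temp-table version, and bcp's -c character mode stands in for the format file discussed below):

import os
import subprocess as subx
import time

SERVER = r'MYLAPTOP\SQLEXPRESS'      # stand-in server name
DATABASE = 'JanusVIPergiumMine'

# The ## global temp table only lives as long as the session that made it,
# so build it inside a sqlcmd process that stays open.
RESULTSQL = ("SET NOCOUNT ON; "
             "SELECT cutid, SUM(tonnes) AS tonnes "
             "INTO ##cutsummary "
             "FROM tonnagefactors "          # stand-in source table
             "GROUP BY cutid;")

sqlcmdproc = subx.Popen(['sqlcmd', '-S', SERVER, '-d', DATABASE, '-E'],
                        stdin=subx.PIPE, universal_newlines=True)
sqlcmdproc.stdin.write(RESULTSQL + '\nGO\n')
sqlcmdproc.stdin.flush()
time.sleep(5)               # crude - give the temp table time to be built

# While that session is still open, bcp can see ##cutsummary and dump it.
os.system('bcp "SELECT * FROM ##cutsummary" queryout cutsummary.csv '
          '-c -t, -S ' + SERVER + ' -T')

# Killing the hosting process ends the session and drops the temp table.
sqlcmdproc.stdin.close()
sqlcmdproc.terminate()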

At work I actually got the bcp format file generated on the fly - I wasn't able to reproduce this behavior for this mock exercise. Instead, I generated a bcp format file for the target table dump "by hand" and put the file in my working directory.

As I show further on, this SQL data dump will be run from a button within an Excel spreadsheet.

Mr. Spock, or better said, Horta Mother says it best:

Subprocesses, sqlcmd, bcp, Excel . . .

PAAAAAIIIIIIIN!

#!C:\Python34\python

# blogsqlcmdpull.py

# XXX
# Changed my laptop's name to MYLAPTOP.

# Yours will be whatever your computer

# name is.

import os
import subprocess as subx
import shlex
import time

import argparse

# Need to make sure you are in proper Windows directory.
# Can vary from machine to machine based on
# environment variables.

# Use Peoplesoft/company id so that more than
# one user can run this at once if necessary
# (note: will not work if one user tries to
# run multiple instances at the same
# time - theoretically <not tested>
# tables will get mangled and data
# will be corrupt.)
USER = os.getlogin()

# Get month and year into query dictionary.
QUERYDICT['month'] = QUERYDICT['month'].format(monthx)
QUERYDICT['year'] = yearx

getdataincsvformat()

print('done')

It's ugly, but it works.

Keeping with the Horta theme, this would be a good spot for an image break:

Damnit, Jim, I'm a geologist not a database programmer.

You're an analyst, analyze.

Load to Excel: this is fairly straightforward - COM programming with Mark Hammond and company's venerable win32com. The only working version of the win32com library I had on my laptop on which I am writing this blog entry was for a Python 2.5 release that came with an old version of our mine planning software (MineSight/Hexagon) - the show must go on!
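The load itself boils down to something like this sketch (the csv file name and the new-workbook approach are stand-ins for whatever the dump above actually produced):

import csv
import win32com.client

CSVFILE = 'cutsummary.csv'    # stand-in - the csv produced by the dump above

def loadtoexcel(csvfile):
    """Push the csv rows into a new Excel workbook via COM."""
    excel = win32com.client.Dispatch('Excel.Application')
    excel.Visible = True
    workbook = excel.Workbooks.Add()
    sheet = workbook.Worksheets(1)
    with open(csvfile) as fle:
        for rownum, row in enumerate(csv.reader(fle), start=1):
            for colnum, value in enumerate(row, start=1):
                # Let Excel do its own type conversion on the raw text.
                sheet.Cells(rownum, colnum).Value = value

loadtoexcel(CSVFILE)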

End Sub

I always use Option Explicit in my VBA code - that's not particularly pythonic, but being pythonic inside the VBA interpreter can be hazardous. As always, YMMV.

Lastly, a rough demo and a data check. We'll run the SQL dump from the top button on the Excel worksheet:

And now we'll run the lower button to put the data into the spreadsheet. It's probably worth noting here that I did not bother doing any type conversions on the text coming out of the SQL csv dump in my Python code. That's because Excel handles that for you. It's not free software (Excel/Office) - might as well get your money's worth.

We'll do a check on the first row for tonnes and a pergium grade. Going back to our original data:

Cuts 1 and 2 belong to the drift Level23East.

Tonnes:

VALUES (1, 28437.0), (2, 13296.0),

Total: 41733

Looks good, we know we got the sum of tonnes right. Now the tonne-weighted average:
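A quick check of that in Python - the individual pergium grades for cuts 1 and 2 aren't reproduced here, so this just verifies the tonnes total and shows the form of the tonne-weighted average rather than the final number:

# Tonnes for cuts 1 and 2, from the VALUES rows above.
tonnes = [28437.0, 13296.0]
print(sum(tonnes))            # 41733.0 - matches the total above

def tonneweightedavg(tonnes, grades):
    """Tonne-weighted average grade: sum(t * g) / sum(t)."""
    return sum(t * g for t, g in zip(tonnes, grades)) / sum(tonnes)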

Notes:
This is a messy one-off mousetrap. That said, this is often how the sausage gets made in a non-programming, non-professional development environment. We do have an in-house Python developer, Lori. Often she's given something like this and told to clean it up and make it into an in-house app. That's challenging. Ideally, the mining professional writing the one-off and the dev get together and cross-educate vis-à-vis the domain space (mining) and the developer space (programming, good software design and practice). It's a lot of fun, but the first go-around is seldom pretty.