A long, long, long time ago I started using interpreter optimizations to help organize code. If a block of code is within a constant (or an expression of constants) that evaluates to False, then the block (or line) is optimized away.

if (false) {
    # this is optimized away
}

Perl was very generous, and would allow for constants, true/false, and operations between the two to work.

Thankfully, Python has some optimizations available via the PYTHONOPTIMIZE environment variable (which can also be adjusted via the -O and -OO flags). They control the __debug__ constant, which is True (by default) or False (when optimizations are enabled), and will omit documentation strings when the level is 2 (-OO) or higher. Using these flags in a production environment allows for very verbose documentation and extra debugging logic in a development environment.

Unfortunately, I implemented this incorrectly in some places. To fine-tune some development debugging routines I had some lines like this:

if __debug__ and DEBUG_CACHE:
    pass

Python’s interpreter doesn’t optimize that away when DEBUG_CACHE is False, or even when __debug__ itself is False (I tested under 2.7 and 3.5). I should have realized this (or at least tested for it). I didn’t notice it until I profiled my app and saw a bunch of additional statistics logging that should not have been compiled in.

The correct way to write the above is:

if __debug__:
    if DEBUG_CACHE:
        pass

Here is a quick test
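A minimal sketch of such a test.py (DEBUG_CACHE is a hypothetical flag, standing in for a real settings constant):

```python
# test.py -- compare the two guard styles under PYTHONOPTIMIZE
import dis

DEBUG_CACHE = False  # hypothetical debug flag


def combined():
    # NOT stripped under -O: the compound test survives compilation
    if __debug__ and DEBUG_CACHE:
        print("combined guard ran")


def nested():
    # stripped under -O: a bare `if __debug__:` is a compile-time constant
    if __debug__:
        if DEBUG_CACHE:
            print("nested guard ran")


print("__debug__ =", __debug__)
print("combined:", len(list(dis.get_instructions(combined))), "bytecode ops")
print("nested:  ", len(list(dis.get_instructions(nested))), "bytecode ops")
```

Comparing the bytecode counts between the two runs makes the difference visible: under optimization the nested version compiles down to almost nothing, while the combined version still loads and tests DEBUG_CACHE.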

Try running it with optimizations turned on:

export PYTHONOPTIMIZE=2
python test.py

and with them off:

unset PYTHONOPTIMIZE
python test.py

As far as the interpreter's optimization phase is concerned, only a bare `if __debug__:` is treated as a constant test; the compound expression `__debug__ and DEBUG_CACHE` is not folded down to False, so the block stays in the compiled code.

The (forthcoming) Aptise platform features a web-indexer that tabulates a lot of statistical data each day for domains across the global internet.

While we don’t currently need this data, we may need it in the future. Keeping it in the “live” database is not really a good idea – it’s never used and just becomes a growing stack of general crap that automatically gets backed up and archived every day as part of standard operations. It’s best to keep production lean and move this unnecessary data out-of-sight and out-of-mind.

An example of this type of data is a daily calculation on the relevance of each given topic to each domain that is tracked. We’re primarily concerned with conveying the current relevance, but in the future we will address historical trends.

Let’s look at the basic storage requirements of this as a PostgreSQL table:
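Something along these lines; the table and column names here are hypothetical, but the shape is what matters:

```sql
-- hypothetical schema: the date is repeated on every row
CREATE TABLE topic_relevance (
    date DATE NOT NULL,
    topic_id INTEGER NOT NULL,
    domain_id INTEGER NOT NULL,
    relevance SMALLINT NOT NULL
);

-- the same data, with the date moved into the table's name
CREATE TABLE topic_relevance_20160101 (
    topic_id INTEGER NOT NULL,
    domain_id INTEGER NOT NULL,
    relevance SMALLINT NOT NULL
);
```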

This structure conceptually stores the same data, but instead of repeating the date in every row, we record it only once within the table’s name. This simple shift will lead to a nearly 40% reduction in size.

In our use-case, we don’t want to keep this in PostgreSQL because the extra data complicates automated backups and storage. Even if we wanted this data live, having it spread across hundreds of tables is overkill. So for now, we’re going to export the data from a single date into a new file.

I skipped a lot of steps above because I do this in Python — for reasons I’m about to explain.

As a raw csv file, my date-specific table is still pretty large at 7556804 bytes — so let’s consider ways to compress it:

Using standard zip compression, we can drop that down to 2985257 bytes. That’s ok, but not very good. If we use xz compression, it drops to a slightly better 2362719.

We’ve already shrunk the data to 60% of the original size by eliminating the date column, so these numbers are a pretty decent overall improvement. But considering the type of data we’re storing, the compression is just not very good. We’ve got to do better, and we can.

Doing better is actually pretty easy (!). All we need to do is understand compression algorithms a bit.

Generally speaking, compression algorithms look for repeating patterns. When we pulled the data out of PostgreSQL unsorted, the rows arrived in essentially random order, so the compressor had little to work with.

We can help the compression algorithm do its job by giving it better input. One way is through sorting:
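The export looked roughly like this (names are hypothetical); note that the ASC is written out explicitly:

```sql
-- sorting groups repeated values together, which compressors love
COPY (
    SELECT topic_id, domain_id, relevance
    FROM topic_relevance_20160101
    ORDER BY topic_id ASC, domain_id ASC, relevance ASC
) TO '/archive/topic_relevance-20160101.csv' WITH CSV;
```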

As a raw csv, this sorted date-specific table is still exactly the same 7556804 bytes.

Look what happens when we try to compress it:

Using standard zip compression, we can drop that down to 1867502 bytes. That’s pretty good – we’re at 24.7% the size of the raw file AND about 62% the size of the non-sorted zip. That is a huge difference! If we use xz compression, we drop down to 1280996 bytes. That’s even better at 17%.
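The effect is easy to reproduce in miniature with Python's zlib and lzma modules. The sample data below is made up, so the exact byte counts are only illustrative:

```python
# demonstrate that sorting the same data improves its compressibility
import lzma
import random
import zlib

random.seed(0)  # deterministic sample data
rows = ["%d,%d" % (random.randint(1, 50), random.randint(1, 1000))
        for _ in range(20000)]

unsorted_blob = "\n".join(rows).encode()
sorted_blob = "\n".join(sorted(rows)).encode()

for name, blob in (("unsorted", unsorted_blob), ("sorted", sorted_blob)):
    print("%-8s zip: %6d  xz: %6d"
          % (name, len(zlib.compress(blob)), len(lzma.compress(blob))))
```

Same bytes in, same bytes out; only the ordering changes, and the sorted version compresses noticeably smaller under both algorithms.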

17% compression is honestly great, and remember — this is compressing the data that is already 40% smaller because we shifted the date column out. If we consider what the filesize with the date column would be, we’re actually at 10% compression. Wonderful.

I’m pretty happy with these numbers, but we can still do better — and without much more work.

As I said above, compression software looks for patterns. Although the sorting helped, we’re still a bit hindered because our data is in a “row” storage format. Consider this example:
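A sketch of the transposition in Python; the sample rows here are tiny stand-ins for the real file:

```python
# turn a row-store csv into a column-store csv, so that values from the
# same column (which repeat heavily) sit next to each other
import csv
import io

rows_csv = "1,100,5\n1,101,5\n1,102,6\n2,100,5\n"  # made-up sample rows

rows = list(csv.reader(io.StringIO(rows_csv)))
columns = list(zip(*rows))  # transpose: one output row per input column

out = io.StringIO()
csv.writer(out).writerows(columns)
print(out.getvalue())
```

Each output line now holds one column of the original table, so runs of identical or near-identical values end up adjacent, which is exactly the input pattern compressors exploit.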

As a raw csv, my transposed file is still the exact same size at 7556804 bytes.

However, if I zip the file – it drops down to 1425585 bytes.

And if I use xz compression… I’m now down to 804960 bytes.

This is a HUGE savings without much work.

* The raw data in postgres was probably about 12594673 bytes (estimated from the savings; the original file was deleted).
* Stripping out the date information and storing it in the filename generated a 7556804 byte csv file – a 40% savings.
* Without thinking about compression, just lazily “zipping” the file created a 2985257 byte file.
* But when we thought about compression (sorting the input, transposing that data into a column store, and applying xz compression) we ended up with a filesize of 804960 bytes – 10.7% of the csv size and 6.4% of the estimated size in PostgreSQL.

This considerably smaller amount of data can now be archived onto something like Amazon’s Glacier and worried about at a later date.

This may seem like a trivial amount of data to worry about, but keep in mind that this is a DAILY snapshot, and one of several tables. At 12MB a day in PostgreSQL, one year of data takes over 4GB of space on a system that is slated for high-priority data backups. This strategy turns that year of snapshots into under 300MB of information that can be warehoused on 3rd party systems. This saves us a lot of time and even more money! In our situation, this strategy is applied to multiple tables. Most importantly, the benefits cascade across our entire system as we free up space and resources.

These numbers could be improved upon further by finding an optimal sort order, or even using custom compression algorithms (such as storing the deltas between columns, then compressing). This was written to illustrate a quick way to easily optimize archived data storage.

The results:

| Format | Compression | Sorted? | Size (bytes) | % of csv+date |
|----------|-------------|---------|--------------|---------------|
| csv+date | | | 12594673 | 100 |
| row csv | | | 7556804 | 60 |
| row csv | zip | | 2985257 | 23.7 |
| row csv | xz | | 2362719 | 18.8 |
| row csv | | Yes | 7556804 | 60 |
| row csv | zip | Yes | 1867502 | 14.8 |
| row csv | xz | Yes | 1280996 | 10.2 |
| col csv | | Yes | 7556804 | 60 |
| col csv | zip | Yes | 1425585 | 11.3 |
| col csv | xz | Yes | 804960 | 6.4 |

Note: I purposefully wrote out the ASC on the sql sorting above, because the sort order (and column order) does actually factor into the compression ratio. On my dataset, this particular column and sort order worked the best — but that could change based on the underlying data.

I remembered going through this before, but couldn’t find my notes. I ended up fixing it before finding my notes (which point to this StackOverflow question: http://stackoverflow.com/questions/2088569/how-do-i-force-python-to-be-32-bit-on-snow-leopard-and-other-32-bit-64-bit-quest ), but I want to share this with others.

psycopg2 was showing compilation errors in relation to some PostgreSQL libraries.

I needed to upgrade from Python 2.6 to 2.7 and ran into a few issues along the way. Learn from my encounters below.

# Upgrading Python
Installing the new version of Python is really quick. Python.org publishes a .dmg installer that does just about everything for you. Let me repeat: “just about everything”.

You’ll need to do 2 things once you install the dmg; however, the installer only mentions the first item below:

1. Run “/Applications/Python 2.x/Update Shell Command”. This will update your .bash_profile to look in “/Library/Frameworks/Python.framework/Versions/2.x/bin” first.

2. After you run the app above, in a NEW terminal window do the following:

* Check to see you’re running the new python with `python --version` or `which python`
* Once you’re running the new python, re-install anything that installed an executable in bin. THIS INCLUDES SETUPTOOLS, PIP, and VIRTUALENV

It’s that second thing that caught me. I make use of virtualenv a lot, and while I was building new virtualenvs for some projects I realized that my installs were building against `virtualenv` and `setuptools` from the stock Apple install in “/Library/Python/2.6/site-packages” , and not the new Python.org install in “/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages”.

It’s worth noting that if you install setuptools once you’ve upgraded to the Python.org distribution, it just installs into the “/Library/Frameworks/Python.framework” directory — leaving the stock Apple version untouched (basically, you can roll back at any time).

# Installing Mysql-Python (or the ruby gem) against MAMP

I try to stay away from Mysql as much as I can [ I <3 PostgreSQL ], but occasionally I need to run it: when I took over at TheDailyBeast.com, they were midway through a relaunch on Rails, and I have a few consulting clients who are on Django.
I tried to run cinderella a while back ( http://www.atmos.org/cinderella/ ) but ran into too many issues. Instead of going with MacPorts or Homebrew, I've opted to just use MAMP ( http://www.mamp.info/en/index.html ).
There's a bit of a problem though: the people responsible for the MAMP distribution decided to clear out all the mysql header files, which you need in order to build the Python and Ruby modules.
You have 2 basic options:
1. Download the "MAMP_components" zip (155MB), and extract the mysql source files. I often did this, but the Python module needed a compiled lib and I was lazy, so...
2. Download the tar.gz version of Mysql compiled for OSX from http://dev.mysql.com/downloads/mysql/
Whichever option you choose, the next steps are generally the same:
## Copy The Files
### Where to copy the files ?
mysql_config is your friend. At least the MAMP one is.
Make sure you can call the right mysql_config, and it'll tell you where the files you copy should be stashed. Since we're building against MAMP, we need to make sure we're referencing MAMP's mysql_config:
iPod:~jvanasco$ which mysql_config
/Applications/MAMP/Library/bin/mysql_config

Note: if you’re installing for a virtualenv, this needs to be done after it’s been activated.

Set the archflags on the commandline:

export ARCHFLAGS="-arch $(uname -m)"

### Python Module

I found that the only way to install the module is by downloading the source ( off sourceforge! ).

I edited site.cfg to have this line:

mysql_config = /Applications/MAMP/Library/bin/mysql_config

Basically, you just need to tell the build to use the MAMP version of mysql_config, and it will figure everything else out itself.

The next steps are simply:

python setup.py build
python setup.py install

If you get any errors, pay close attention to the first few lines.

If you see something like the following within the first 10-30 lines, it means the various files we placed in the step above are not where the installer wants them to be:

_mysql.c:36:23: error: my_config.h: No such file or directory
_mysql.c:38:19: error: mysql.h: No such file or directory
_mysql.c:39:26: error: mysqld_error.h: No such file or directory
_mysql.c:40:20: error: errmsg.h: No such file or directory

Note the include directory the compiler searches; on my machine it should be “/Applications/MAMP/Library/include/mysql”. When I quickly followed some instructions online that had all the files in /include (and not in that subdir) this error popped up. Once I changed the directory structure to match what my mysql_config wanted, the package installed perfectly.

Basically what’s happening is that as you run it, mysql_python drops a shared object in your userspace. That shared object is referencing the location that the Mysql.org distribution placed all the library files — which differs from where we placed them in MAMP.

There’s a quick fix — add this to your bash profile, or run it before starting mysql/your app:
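A sketch of that fix, assuming MAMP keeps its mysql client libraries under /Applications/MAMP/Library/lib/mysql (check your install; the exact path is an assumption here):

```shell
# point the dynamic linker at MAMP's copy of the mysql client library,
# instead of the location the Mysql.org distribution would have used
export DYLD_LIBRARY_PATH="/Applications/MAMP/Library/lib/mysql:$DYLD_LIBRARY_PATH"
```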

There are too many posts on this subject matter to thank. A lot of people posted variations of this method, to which I’m very grateful; however, no one addressed troubleshooting the Python process, which is why I posted this.

I also can’t stress the simple fact that if the MAMP distribution contained the header and built library files, none of this would be necessary.