~ Science! Culture! Computational Engines!

Tag Archives: programming

I have always had a problem with the concept of intellectual property. The great western tradition of post-enlightenment values has always placed the free flow of art and ideas on a pedestal, as a sacrosanct cornerstone of a just society. That the ideas living in our heads and flowing from our lips are the domain of no king, pope, or policeman is one of the most important cultural norms to have emerged from the Enlightenment into modern liberal democracies. The legal constructs associated with intellectual property, in my evaluation, cannot be reconciled with this. A corpuscle of information cannot be at once free to be spoken or expressed and also be the property of some individual or corporation. Information theory, the fantastic work pioneered by Claude Shannon, only swells my distaste for intellectual property. We know now that with simple coding, all information is reducible to a common binary form. Film, print, music, photography: all is merely a collection of ordered bits. This makes the idea of owning information all the more ridiculous, as the process can just as easily be reversed: a song can be represented by a string of Shakespeare quotations, a movie can be rendered as a musical score. As an illustration of this, I’ve written a short program that takes any file and converts it to a long, rambling nonsense-poem. Poetry as Piracy.

Making the Wordlists

The first step is building a set of words to use in our poems, categorized by their grammatical type. To do this, I downloaded the English Wiktionary. I then used grep, sed, and awk to split it into plain lists of words: nouns, past-tense verbs, present-participle verbs, and adjectives. I then shuffled these lists, and trimmed them down so that their length was a power of 2. I didn’t need to do this, but it simplified the work slightly: a list of 2^n words stores exactly n bits per word. In the end, I was left with 17 bits’ worth of information stored in each noun (131,072 words), 13 bits in each past-tense verb (8,192), 13 bits in each present-participle verb (8,192), and 15 bits in each adjective (32,768).
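The shuffle-and-trim step can be sketched in a few lines of Python. This is only an illustration, not the original grep/sed/awk pipeline: the function name `trim_to_power_of_two`, the fixed seed, and the placeholder word list are all assumptions made for the example.

```python
import random

def trim_to_power_of_two(words, seed=0):
    """Shuffle a word list and keep the largest power-of-two prefix.

    Returns the trimmed list and the number of bits one word can carry.
    The seed is an assumption: encoder and decoder must share the same
    shuffled order, so the shuffle has to be reproducible.
    """
    words = list(words)
    if not words:
        raise ValueError("need at least one word")
    random.Random(seed).shuffle(words)
    bits = len(words).bit_length() - 1  # floor(log2(len(words)))
    return words[: 1 << bits], bits

# Placeholder vocabulary standing in for the Wiktionary noun list.
nouns, noun_bits = trim_to_power_of_two(f"noun{i}" for i in range(150_000))
print(len(nouns), noun_bits)
```

With 150,000 candidate nouns, the largest power of two that fits is 2^17, which matches the noun count quoted above.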

Sentence Skeletons

I then decided on two rough sentence skeletons:

The ADJECTIVE NOUN PAST-VERBED the ADJECTIVE NOUN.

ADJECTIVE NOUN is PRESENT-VERBING the ADJECTIVE NOUN.

Each of those sentences can store 77 bits of information: two adjectives (15 bits each), two nouns (17 bits each), and one verb (13 bits). A 1 Mb file, read as one megabit (125 kB), will require roughly 13,000 sentences, or about a novel’s worth of words. If that 1 Mb file was a copyrighted song, you would not in fact have the freedom to print and distribute your nice new novel (not that you would want to; it would be random nonsense).
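The capacity figure is easy to check. The sketch below just adds up the per-word bit counts from the word lists and estimates how many sentences a file needs; treating "1 Mb" as one megabit (125,000 bytes) is an assumption, and the names `BITS`, `SKELETON`, and `sentences_needed` are made up for the example.

```python
# Bits carried by each word type, taken from the word-list sizes in the text.
BITS = {"adjective": 15, "noun": 17, "verb": 13}

# Both skeletons use the same slots: ADJECTIVE NOUN VERB ADJECTIVE NOUN.
SKELETON = ("adjective", "noun", "verb", "adjective", "noun")

def bits_per_sentence():
    """Total payload of one sentence, ignoring type and punctuation."""
    return sum(BITS[slot] for slot in SKELETON)

def sentences_needed(n_bytes, bits_each):
    """Number of sentences needed for n_bytes, rounded up."""
    n_bits = 8 * n_bytes
    return (n_bits + bits_each - 1) // bits_each

print(bits_per_sentence())  # 77
print(sentences_needed(125_000, bits_per_sentence()))
```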

Encoding the File

Now, 77 bits is a bit awkward. Just choosing between the two sentence types gives me 1 more bit of information. I also get punctuation at the end. If I end each sentence with either a period, an exclamation mark, two exclamation marks, or three exclamation marks, that gets me an extra two bits of information. This gets me up to 80 bits per sentence, or 10 bytes. I can now easily encode my data as nonsense poetry! I use the first bit to select which tense of verb, the next two decide if I get a period or exclamation series, and the rest determine the words of the sentence itself. If my file’s length isn’t a multiple of 10 bytes, I simply add an additional line at the end:

All that remains are NUM memories and NUM regrets.

Here the first NUM is the base-10 representation of the remaining bytes, and the second NUM is the number of bytes remaining. The count is needed because a long string of leading zeros gets truncated in the conversion to decimal.
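Here is a rough sketch of the encoding step, under several assumptions the text does not settle: the bytes are read big-endian, the 77 word bits are laid out in sentence order (adjective, noun, verb, adjective, noun), and the vocabularies are placeholders like `adj42` rather than real Wiktionary words. Capitalizing the first word of the second skeleton is skipped to keep the example short.

```python
ADJ_BITS, NOUN_BITS, VERB_BITS = 15, 17, 13
ENDINGS = [".", "!", "!!", "!!!"]  # 2 bits of punctuation

# Placeholder vocabularies (assumption): stand-ins for the real word lists.
ADJECTIVES = [f"adj{i}" for i in range(1 << ADJ_BITS)]
NOUNS = [f"noun{i}" for i in range(1 << NOUN_BITS)]
PAST = [f"verbed{i}" for i in range(1 << VERB_BITS)]
PRESENT = [f"verbing{i}" for i in range(1 << VERB_BITS)]

# Field widths, most significant first: tense, ending, then the five words.
WIDTHS = (1, 2, ADJ_BITS, NOUN_BITS, VERB_BITS, ADJ_BITS, NOUN_BITS)

def encode_chunk(chunk):
    """Turn exactly 10 bytes (80 bits) into one sentence."""
    if len(chunk) != 10:
        raise ValueError("a sentence holds exactly 10 bytes")
    value = int.from_bytes(chunk, "big")
    fields = []
    for width in reversed(WIDTHS):  # peel fields off the low end
        fields.append(value & ((1 << width) - 1))
        value >>= width
    tense, ending, adj1, noun1, verb, adj2, noun2 = reversed(fields)
    subject = f"{ADJECTIVES[adj1]} {NOUNS[noun1]}"
    obj = f"the {ADJECTIVES[adj2]} {NOUNS[noun2]}"
    if tense == 0:
        return f"The {subject} {PAST[verb]} {obj}{ENDINGS[ending]}"
    return f"{subject} is {PRESENT[verb]} {obj}{ENDINGS[ending]}"

def encode(data):
    """Encode a whole byte string as a list of lines."""
    full = len(data) - len(data) % 10
    lines = [encode_chunk(data[i:i + 10]) for i in range(0, full, 10)]
    tail = data[full:]
    if tail:  # leftover bytes go into the closing line
        number = int.from_bytes(tail, "big")
        lines.append(
            f"All that remains are {number} memories and {len(tail)} regrets."
        )
    return lines

print("\n".join(encode(b"Poetry as Piracy")))
```

Sixteen bytes of input give one full sentence plus a closing line that carries the last six bytes.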

Decoding the File

Decoding the file is as simple as just reading in each line, checking what sentence type it is, and what the punctuation at the end is, and returning it to the original binary form!
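A matching decoder might look like the sketch below. It rests on the same assumptions as before (big-endian bytes, fields in sentence order, placeholder vocabularies such as `adj42`), plus two more: no word contains a space or ends in punctuation, and no adjective is literally "The". Real word lists would need more careful parsing.

```python
import re

ADJ_BITS, NOUN_BITS, VERB_BITS = 15, 17, 13
ENDINGS = [".", "!", "!!", "!!!"]

# Reverse lookups (word -> index) over placeholder vocabularies (assumption).
ADJ_INDEX = {f"adj{i}": i for i in range(1 << ADJ_BITS)}
NOUN_INDEX = {f"noun{i}": i for i in range(1 << NOUN_BITS)}
PAST_INDEX = {f"verbed{i}": i for i in range(1 << VERB_BITS)}
PRESENT_INDEX = {f"verbing{i}": i for i in range(1 << VERB_BITS)}

TAIL = re.compile(r"All that remains are (\d+) memories and (\d+) regrets\.")

def decode_line(line):
    """Recover the bytes stored in one line of the poem."""
    tail = TAIL.fullmatch(line)
    if tail:
        # The second number restores leading zero bytes lost in decimal.
        return int(tail.group(1)).to_bytes(int(tail.group(2)), "big")
    body = line.rstrip("!.")
    ending = ENDINGS.index(line[len(body):])
    words = body.split()
    if words[0] == "The":  # past-tense skeleton
        tense = 0
        _, adj1, noun1, verb, _, adj2, noun2 = words
        verb_index = PAST_INDEX[verb]
    else:  # "... is VERBING ..." skeleton
        tense = 1
        adj1, noun1, _, verb, _, adj2, noun2 = words
        verb_index = PRESENT_INDEX[verb]
    fields = (
        (1, tense), (2, ending),
        (ADJ_BITS, ADJ_INDEX[adj1]), (NOUN_BITS, NOUN_INDEX[noun1]),
        (VERB_BITS, verb_index),
        (ADJ_BITS, ADJ_INDEX[adj2]), (NOUN_BITS, NOUN_INDEX[noun2]),
    )
    value = 0
    for width, field in fields:  # rebuild the 80-bit number
        value = (value << width) | field
    return value.to_bytes(10, "big")

def decode(lines):
    """Concatenate the bytes of every line."""
    return b"".join(decode_line(line) for line in lines)

print(decode(["The adj0 noun0 verbed0 the adj0 noun1."]))
```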

I’ve been warned that I sometimes veer too far in the direction of toolmaker, away from the standard path followed by most scientists. Try as I might, I cannot seem to avoid finding the process of doing science nearly as interesting as the goal of getting that science done. And so, my mind has been orbiting around a problem I suspect is endemic amongst all physicists, if not all scientists. That problem, captured so nicely by this PhD comic, is that of filesystem cruft. Science, being at its core an experimental art, produces for every successful idea a whole panoply of failed experiments, mistakes, and generally messed-up crap. Being paranoid creatures consumed by our own fears, along with the awareness that serendipity has been a cornerstone of great work, we are loath to sweep these ill-fated children of the mind into the trash where they (mostly) belong. And so those of us who rely on computers for most of our day-to-day work end up with home directories filled to the brim with old scripts, corrupted data files, a dozen different versions of the same list of values, and other digital detritus. And this situation makes for errors, confusion, thousand-yard stares, anal leakage, and other evils too foul to discuss in polite company. Just looking at my /home directory on my workstation at the University, I have more than 100,000 files sitting around, waiting for me to stare at them for a quarter hour trying to remember what they were for.

I’ve been thinking of writing an astronomical toolkit for Arduino, to help users build their own go-to telescope mounts, satellite trackers, heliostats, and other cool amateur astronomy equipment. As anyone who has ever worked with astronomical coordinate systems knows, since astronomers treat the sky as a two-dimensional spherical surface, most calculations involving positions on this surface involve a good deal of trigonometry. While the Arduino library included built-in functions for sine and cosine, it lacked the inverse trigonometric functions arcsine and arccosine. These are necessary if you ever need to convert a length ratio into an angle. It is impossible to convert between different sky coordinate systems (like Horizontal and Equatorial, the two most common) without access to these functions. In today’s post, I’ll show you how I wrote my own arcsin() and arccos() functions using Taylor polynomials.
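To give a flavour of the approach, here is a minimal Python sketch of arcsine and arccosine built on the Taylor series arcsin(x) = x + x^3/6 + 3x^5/40 + ... It is not the Arduino code from the post. The range reduction for |x| > 0.75 (using arcsin(x) = pi/2 - arcsin(sqrt(1 - x^2)), because the series converges slowly near 1), the 20-term default, and the function names are all assumptions of this example.

```python
import math

def _arcsin_series(x, terms):
    """Sum the first `terms` terms of the arcsin Taylor series about 0."""
    total = 0.0
    coeff = 1.0   # (2n)! / (4^n (n!)^2), starting at n = 0
    power = x     # x^(2n + 1)
    for n in range(terms):
        total += coeff * power / (2 * n + 1)
        coeff *= (2 * n + 1) / (2 * n + 2)
        power *= x * x
    return total

def arcsin_taylor(x, terms=20):
    """Approximate arcsin(x) in radians for -1 <= x <= 1."""
    if not -1.0 <= x <= 1.0:
        raise ValueError("arcsin is only defined on [-1, 1]")
    if abs(x) > 0.75:
        # Near +/-1 the series converges slowly, so fold the argument
        # back toward zero before summing.
        reduced = math.sqrt(1.0 - x * x)
        angle = math.pi / 2 - _arcsin_series(reduced, terms)
        return math.copysign(angle, x)
    return _arcsin_series(x, terms)

def arccos_taylor(x, terms=20):
    """Approximate arccos(x) via arccos(x) = pi/2 - arcsin(x)."""
    return math.pi / 2 - arcsin_taylor(x, terms)

print(arcsin_taylor(0.5), math.asin(0.5))
```

On a microcontroller, fewer terms would trade accuracy for speed; the cut-over point of 0.75 is just one reasonable choice that keeps the reduced argument below about 0.66.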

Continuing my series on using python and matplotlib to generate common plots and figures, today I will be discussing how to make histograms, a plot type used to show the frequency across a continuous or discrete variable. Histograms are useful in any case where you need to examine the statistical distribution over a variable in some sample, like the brightness of radio galaxies, or the distance of quasars.

Continuing my series on using matplotlib and python to generate figures, I’d like to get now to the meat of the topic: actually making a figure or two. I’ll be starting with the simplest kind of figure: a line plot, with points plotted on an X-Y Cartesian plane.

I’m sure many of my fellow scientists spend a relatively large chunk of their time making plots, graphs, and figures of one sort or another. There is a plethora of cool tools out there for doing this, from proprietary tools like Mathematica or IDL to free software kits like GNUplot. While GNUplot is useful and handy (and IDL is powerful and expensive), I’m a python guy primarily, so I like my tools to interface well with my existing code and to have a more pythonic interface. For this, I turn to matplotlib, a powerful suite for generating all sorts of plots from python.

I had one of those frustrating days where you spend hours and hours searching around for what should be a simple coding solution, to no avail. Finally I was able to patch together enough disparate knowledge to achieve my goal: namely, storing and retrieving a Java BitSet in a MySQL database. Below is the solution, which I hope might help any other unfortunate souls looking for this answer.