Month: April 2008

Two of my favorite cmdlets from the PowerShell Community Extensions are get-clipboard and out-clipboard. These cmdlets let you read from and write to the Windows clipboard from PowerShell. For example, the following code will grab the contents of the clipboard, replace every block of white-space with a comma, and paste the result back to the clipboard.

(get-clipboard) -replace '\s+(?!$)', ',' | out-clipboard

I saved this to a file comma.ps1 in my path and run it when I get a list of numbers from one program delimited by newlines or tabs and need to make it the input to another program expecting comma-delimited values. For example, turning a column of numbers into an array for R. I copy one format, run comma.ps1, and paste in the new format.

In case you’re curious about the mysterious characters in the script, \s+(?!$) is a regular expression describing where I want to substitute a comma. The \s refers to white-space characters (tabs, spaces, newlines) and the + says this is repeated one or more times. So the pattern matches one or more consecutive white-space characters. That would be enough by itself, but it would replace trailing white-space with a comma too, so I might get an unwanted comma at the end. The sequence (?!$) fixes that. The $ matches the end of the input. The (?! before and the ) after form a negative lookahead, meaning “except when the thing inside matches.” So taken all together, the regular expression matches chunks of white-space except at the end of the input.

About a month ago I wrote a series of four blog posts on innovation (Innovation I, II, III, IV). Seth Godin just posted an article on his blog entitled The fibula and the safety pin that fits in with that series. From Godin’s post:

Just about everything has a strike against it. It’s either already been done or it’s never been done. Ignore both conditions. Pushing an idea through the dip of acceptance is far more valuable than inventing something that’s never existed… and then walking away from it.

His phrase “pushing an idea through the dip of acceptance” is a good definition of innovation. (It also contains a passing reference to his excellent little book The Dip.) It goes right along with Michael Schrage’s statement that “innovation is not what innovators do but what customers adopt.” Too often we romanticize the inventor and fail to appreciate the toil of the innovator.

Sweave is a tool for making statistical analyses more reproducible by using literate programming in statistics. Sweave embeds R code inside LaTeX and replaces the code with the result of running the code, much like web development languages such as PHP embed code inside HTML.

Sweave is often launched from an interactive R session, but this can defeat the whole purpose of the tool. When you run Sweave this way, the Sweave document inherits the session’s state. Here’s why that’s a bad idea.

Say you’re interactively tinkering with some plots to make them look like you want. As you go, you’re copying R code into an Sweave file. When you’re done, you run Sweave on your file, compile the resulting LaTeX document, and get beautiful output. You congratulate yourself on having gone to the effort to put all your R code in an Sweave file so that it will be self-contained and reproducible. You forget about your project then revisit it six months later. You run Sweave and to your chagrin it doesn’t work. What happened? What might have happened is that your Sweave file depended on a variable that wasn’t defined in the file itself but happened to be defined in your R session. When you open up R months later and run Sweave, that variable may be missing. Or worse, you happen to have a variable in your session with the right name that now has some unrelated value.

I recommend always running Sweave from a batch file. On Windows you can save the following two lines to a file, say sw.bat, and process a file foo.Rnw with the command sw foo.

R.exe -e "Sweave('%1.Rnw')"
pdflatex.exe %1.tex

This assumes R.exe and pdflatex.exe are in your path. If they are not, you could either add them to your path or put their full paths in the batch file.

Running Sweave from a clean session does not ensure that your file is self-contained. There could still be other implicit dependencies. But running from a clean session improves the chances that someone else will be able to reproduce the results.

See Troubleshooting Sweave for some suggestions for how to prevent or recover from other possible problems with Sweave.

Update: See the links provided by Gregor Gorjanc in the first comment below for related batch files and bash scripts.

Finding the percentiles of a list of samples is trivial if you can read all the samples into memory and sort the list. If the list is too big to fit into memory, it’s still not terribly difficult, but there are a couple subtleties if you haven’t thought about it before.

The application that motivated the code presented in the article generates a large amount of data. You could think of the software as generating an enormous matrix one row at a time, then computing percentiles of each column. For example, in one problem the matrix had 800,000 columns and 2,000 rows of floating point numbers.

Do you have a single program that you “live in” when you’re at a computer? Emacs users are known for “living” inside Emacs. This means more than just using the program for a large part of the day. It means using the program as the integration point for other programs, a sort of backplane for tying other things together.

Steve Yegge’s most recent blog post described his switch from Windows to Mac. He said the main reason for the switch was that he prefers the appearance of the fonts on a Mac. Changing operating systems was not a big deal for Yegge because he didn’t really live in Windows before, nor does he live in OS X now. He lives in Emacs. He concludes his essay by saying

So I’ll keep using my Macs. They’re all just plumbing for Emacs, anyway. And now my plumbing has nicer fonts.

Graphic artists may spend the majority of their work day using Photoshop, but they don’t send email from Photoshop, and they don’t keep their calendar in Photoshop. So I wouldn’t say they “live” in Photoshop. Microsoft developers spend a great deal of their time inside Visual Studio, though they don’t live inside Visual Studio to the same extent that Emacs users live inside Emacs. The Visual Studio experience is somewhere between Photoshop and Emacs on the “live in” scale. Unlike Emacs, Visual Studio has no ambition to become an operating system, probably because the company that makes Visual Studio already has an operating system.

I once knew someone who lived in Mathematica, doing his word processing etc. inside this mathematical package. Mathematica is a nice place to visit, but I wouldn’t want to live there.

A growing number of people now live inside their web browser, particularly if that browser is Firefox. There are Firefox plug-ins available to mow your lawn and take your children to the orthodontist. Maybe Firefox is becoming the Emacs of a new generation.

The choice of a program to live in is really a choice of how you want to tie applications together. To live in Emacs, you have to write Emacs Lisp, and that’s a deal-breaker for many. Interestingly, Microsoft has a project to create a highly configurable editor some have nick-named Emacs.NET. You can bet that the extension language will not be Emacs Lisp.

Some people live in their command shell and use shell scripts to tie everything together. While many Unix folks live that way, that hasn’t been practical on Windows until recently when PowerShell came out.

Suppose you’re doing something that has probability of success p and probability of failure q = 1-p. If you repeat what you’re doing m+n times, the probability of m successes and n failures is given by

(m+n)!/(m! n!) p^m q^n.

Now suppose m and n are moderately large. The terms (m+n)! and m! n! will be huge, but the terms p^m and q^n will be tiny. The huge terms might overflow, and the tiny terms might underflow, even though the final result may be a moderate-sized number. The numbers m and n don’t have to be that large for this to happen since factorials grow very quickly. On a typical system, overflow happens if m+n > 170. How do you get to the end result without having to deal with impossibly large or small values along the way?

Libraries don’t always include functions for evaluating factorial or log factorial. They’re more likely to include functions for Γ(x) and its log. For positive integers n, Γ(n+1) = n!. Now suppose you have a function lgamma that computes log Γ(x). You might write something like this.

The function lgamma is not part of the ANSI standard library for C or C++, but it is part of the POSIX standard. On Unix-like systems, lgamma is included in the standard library. However, Microsoft does not include lgamma as part of the Visual C++ standard library. On Windows you have to either implement your own lgamma function or grab an implementation from somewhere like Boost.

Here’s something to watch out for with POSIX math libraries. I believe the POSIX standard does not require a function called gamma for evaluating the gamma function Γ(x). Instead, the standard requires functions lgamma for the log of the gamma function and tgamma for the gamma function itself. (The mnemonic is “t” for “true,” meaning that you really want the gamma function.) I wrote a little code that called gamma and tested it on OS X and Red Hat Linux this evening. In both cases gcc compiled the code without warning, even with the -Wall and -pedantic warning flags. But on the Mac, gamma was interpreted as tgamma and on the Linux box it was interpreted as lgamma. This means that gamma(10.0) would equal 362880 on the former and 12.8018 on the latter.

Today I needed to use Fibonacci numbers to solve a problem at work. Fibonacci numbers are great fun, but I don’t recall needing them in an applied problem before.

I needed to compute a series of integrals of the form

f(x, y) = x^a (1-x)^b y^c (1-y)^d p(x, y)

over the unit square for a statistical application. The function p(x, y) is a little complicated but its specific form is not important here. If the constants a, b, c, and d are all positive, as they usually are in my application, the integrand can be extended to a continuous periodic function in the plane. Lattice rules are efficient for such integration problems, and the optimal lattice rules for two-variable integration are given by Fibonacci lattice rules.

If F_k is the kth Fibonacci number and {x} denotes the fractional part of x (x minus the greatest integer not exceeding x), the kth Fibonacci lattice rule approximates the integral of f by

(1/F_k) Σ_{j=0}^{F_k - 1} f({j/F_k}, {j F_{k-1}/F_k}).

This rule worked well. I compared the results to those using the two-variable trapezoid rule and the lattice rule always gave more accuracy for the same number of integration points. (Normally the trapezoid rule is not very efficient. However, it can be surprisingly efficient for periodic integrands. See point #2 here. But the Fibonacci lattice rule was even more efficient.)

To implement this integration rule, I first had to write a function to compute Fibonacci numbers. This is such a common academic exercise that it is almost a cliché. I was surprised to have a practical need for such a function. My implementation was

Here φ = (1 + √5)/2 is the golden ratio, and the closed form in question is Binet’s formula, F_k = (φ^k - (-1/φ)^k)/√5. The second term in the formula for F_k is smaller than 1 in absolute value, and F_k is an integer, so by rounding the first term to the nearest integer we get the exact result. Adding 0.5 before casting the result to an integer causes the result to be rounded to the nearest integer.

In a recent interview on the PowerScripting Podcast, Jeffrey Snover said that software versioning isn’t a problem, it’s a dilemma. The difference is that problems can be solved, but dilemmas can only be managed. No versioning system can do everything that everyone would like.

The same phenomenon exists in biostatistics. As I mentioned in Galen and clinical trials, biostatistics is filled with dilemmas. There are problems along the way that can be solved, but fundamentally biostatistics manages dilemmas.

There are two kinds of errors discussed in classical statistics, unimaginatively named Type I and Type II. Aside from having completely non-mnemonic names, they represent odd concepts.

Technically, a Type I error consists of rejecting the “null hypothesis” (roughly speaking, the assumption of no effect, the hypothesis you typically set out to disprove) in favor of the “alternative hypothesis” when in fact the null hypothesis is true. A Type II error consists of accepting the null hypothesis (technically, failing to reject the null hypothesis) when in fact the null hypothesis is false.

Suppose you’re comparing two treatments with probabilities of efficacy θj and θk. The typical null hypothesis is that θj = θk, i.e. that the treatments are identically effective. The corresponding alternative hypothesis is that θj ≠ θk, that the treatments are not identical. Andrew Gelman says in this presentation (page 87)

I’ve never made a Type 1 error in my life. Type 1 error is θj = θk, but I claim they’re different. I’ve never studied anything where θj = θk.

I’ve never made a Type 2 error in my life. Type 2 error is θj ≠ θk, but I claim they’re the same. I’ve never claimed that θj = θk.

But I make errors all the time!

(Emphasis added.) The point is that no two treatments are ever identical. People don’t usually mean to test whether two treatments are exactly equal but rather that they’re “nearly” equal, though they are often fuzzy about what “nearly” means.

My first assignment as a professional programmer was to build another person’s program. I learned right away not to assume a project will build just because the author says it will. I’ve seen the same pattern repeated everywhere I’ve worked. Despite version control systems and procedures, there’s usually some detail in the developer’s head that doesn’t get codified and only the original developer can build the project easily.

The first step in making software builds reproducible is documentation. There’s got to be a document explaining how to extract the project from version control and build it. Requiring screen shots helps since developers have to rehearse their own instructions in order to produce the shots.

The second step is verification. Documentation needs to be tested, just like software. Someone who hasn’t worked on the project needs to extract the code onto a clean machine and build the project using only written instructions — no conversation with the developer allowed. Everyone thinks their code is easy to build; experience says most people are wrong.

The verifiers need to rotate. If one person serves as build master very long, they develop the same implicit knowledge that the original programmers failed to codify.

The third step is automation. Automated instructions are explicit and testable. If automation also saves time, so much the better, but automation is worthwhile even if it does not save time. Clift Norris and I just wrote an article on CodeProject entitled Automated Extract and Build from Team System using PowerShell that helps with this third step if you’re using Visual Studio and VSTS.

The most recent edition of the .NET Rocks podcast interviewed Jeremy Miller and David Laribee. They made the observation that Microsoft’s development tools are implicitly designed to support database-first development rather than object model-first development. Because they are committed to agile and object model-first development, they’re feeling resistance from some of Microsoft’s products.

Miller and Laribee are part of an interesting group of developers who call themselves “ALT.NET” because they’re developing on Microsoft’s .NET framework but are looking for alternative tools. They embrace the bottom of the Microsoft technology stack — operating systems, languages, compilers — but not the top — tools for modeling, code generation, testing, etc.

Suppose you need to evaluate the function f(x) = log(1 + e^x). The most obvious way to code this in C is something like the following.

double f(double x) { return log(1 + exp(x)); }

This is so simple you don’t even test it. Then someone calls you to say they’re getting strange output, such as 1.#INF, depending on their environment. What’s going on?

For large values of x, e^x is much bigger than 1, so f(x) is approximately log(e^x), which equals x. So, for example, f(1000) should return something very close to 1000. But instead, you get gibberish. In most environments a floating point number can be as large as about 10^308, but exp(1000) is about 2×10^434 and so the result overflows. Then the code takes the log of “infinity.”

But f(1000) shouldn’t overflow; the result is about 1000. Our function result can be contained in a floating point number, but our intermediate result exp(1000) cannot be. We need some way of telling the computer “Hold on. I know this is going to be a huge number, but trust me for a minute. After I add 1 I’m going to take a log and bring the result back down to a medium sized number.” Unfortunately computers don’t work that way.

One solution would be to see whether x is so large that exp(x) will overflow, and in that case f(x) could just return x. So our second attempt might look like this:

double f(double x) { return (x > X_MAX) ? x : log(1 + exp(x)); }

I’ll come back later to what X_MAX should be. Suppose you’ve found a good value for this constant and release your code again. Then you get another call. They say your code is returning zero when it should not.

Someone is trying to compute f(-50). They know the answer is small, but it shouldn’t be zero. How can that be? Since you just learned about overflow, you suspect the problem might have to do with the opposite problem, underflow. But that’s not quite it. The result of exp(-50) does not underflow; it’s about 2×10^-22. But it is so small that machine precision cannot distinguish 1+exp(-50) from 1. So the code returns log(1), which equals zero. This may or may not be a problem. The correct answer is very small, so the absolute error is small but the relative error is 100%. Whether that is a problem depends on what you’re going to do with the result. If you’re going to add it to a moderate size number, no big deal. If you’re going to divide by it, it’s a very big deal.

Now what do you do? You think back to your calculus class and remember that for small x, log(1 + x) is approximately x with error on the order of x2. (To see this, expand log(1 + x) in a power series centered at 1.) If x is so small that 1 + x equals 1 inside the computer, x2 must be very small indeed. So f(x) could return exp(x) if exp(x) is sufficiently small. This gives our third attempt.

This is a big improvement. It can still underflow for large negative x, but in that case the function itself is underflowing, not just an intermediate result.

Now what do we make X_MAX and X_MIN? For X_MAX we don’t really have to worry about exp(x) overflowing. We can stop when x is so large that exp(x) + 1 equals exp(x) to machine precision. In C there is a header file float.h that contains a constant DBL_EPSILON which is the smallest number ε such that 1 + ε does not equal 1 in machine precision. So it turns out we can set X_MAX equal to -log(DBL_EPSILON). Similarly we could set X_MIN equal to log(DBL_EPSILON).

There’s still a small problem. When x is large and negative, but not so negative that 1 + exp(x) equals 1 in the computer, we could lose precision in computing log(1 + exp(x)).

I have just posted an article on CodeProject entitled Avoiding Overflow, Underflow, and Loss of Precision that has more examples where the most direct method for evaluating a function may not work. Example 2 in that paper explains why directly computing log(1 + y) can be a problem even if y is not so small that 1 + y equals 1 to machine precision. The article explains how to compute log(1 + y) in that case. Setting y = exp(x) explains how to finish the example here when x is moderately large and negative.