In applications you often want to take the maximum of two numbers. But the simple function

f(x, y) = max(x, y)

can be difficult to work with because it has sharp corners. Sometimes you want an alternative that sands down the sharp edges of the maximum function. One such alternative is the soft maximum. Here are some questions this post will answer.

What exactly is a soft maximum and why is it called that?

How is it related to the maximum function?

In what sense does the maximum have “sharp edges”?

How does the soft maximum round these edges?

What are the advantages to the soft maximum?

Can you control the “softness”?

I’ll call the original maximum the “hard” maximum to make it easier to compare to the soft maximum.

The soft maximum of two variables is the function

g(x, y) = log( exp(x) + exp(y) ).

This can be extended to more than two variables by taking

g(x1, x2, …, xn) = log( exp(x1) + exp(x2) + … + exp(xn) ).

The soft maximum approximates the hard maximum. Why? If x is a little bigger than y, exp(x) will be a lot bigger than exp(y). That is, exponentiation exaggerates the differences between x and y. If x is significantly bigger than y, exp(x) will be so much bigger than exp(y) that exp(x) + exp(y) will essentially equal exp(x) and the soft maximum will be approximately log( exp(x) ) = x, the hard maximum. Here’s an example. Suppose you have three numbers: -2, 3, 8. Obviously the maximum is 8. The soft maximum is 8.007.
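To check the example above numerically, here is a minimal Python sketch of the definition. The function name soft_maximum is my own choice, and subtracting the largest input before exponentiating is an added precaution against overflow; it does not change the value, since log(sum(exp(v))) = m + log(sum(exp(v - m))) for any m.

```python
import math

def soft_maximum(*values):
    """Soft maximum: log of the sum of the exponentials of the inputs."""
    # Factor out the largest input so exp() never receives a large
    # positive argument (an assumption added here for numerical safety).
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

print(round(soft_maximum(-2, 3, 8), 3))  # 8.007
```

With the shift in place, inputs such as soft_maximum(1000, 1000) evaluate without overflow, whereas a direct call to math.exp(1000) would raise an error.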

The soft maximum approximates the hard maximum but it also rounds off the corners. Let’s look at some graphs that show what these corners are and how the soft maximum softens them.

Here are 3-D plots of the hard maximum f(x, y) and the soft maximum g(x, y). First the hard maximum:

Now the soft maximum:

Next we look at a particular slice through the graph. Here’s the view as we walk along the line y = 1. First the hard maximum:

And now the soft maximum:

Finally, here are the contour plots. First the hard maximum:

And now the soft maximum:

The soft maximum approximates the hard maximum and is a convex function just like the hard maximum. But the soft maximum is smooth. It has no sudden changes in direction and can be differentiated as many times as you like. These properties make it easy for convex optimization algorithms to work with the soft maximum. In fact, the function may have been invented for optimization; that’s where I first heard of it.

Notice that the accuracy of the soft maximum approximation depends on scale. If you multiply x and y by a large constant, the soft maximum will be closer to the hard maximum. For example, g(1, 2) = 2.31, but g(10, 20) = 20.00004. This suggests you could control the “hardness” of the soft maximum by generalizing the soft maximum to depend on a parameter k.

g(x, y; k) = log( exp(kx) + exp(ky) ) / k

You can make the soft maximum as close to the hard maximum as you like by making k large enough. For every value of k the soft maximum is differentiable, but the size of its higher derivatives increases as k increases. The first partial derivatives always stay between 0 and 1, while the second derivatives along the line x = y grow in proportion to k. In the limit they become infinite as the soft maximum converges to the hard maximum.
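A short Python sketch of the parameterized version, assuming k > 0. The name soft_maximum_k and the overflow-avoiding shift by the hard maximum are illustrative choices of mine, not part of the formula above.

```python
import math

def soft_maximum_k(x, y, k=1.0):
    """Scaled soft maximum log(exp(k*x) + exp(k*y)) / k, for k > 0."""
    m = max(x, y)
    # Shift by the hard maximum so the exponentials cannot overflow.
    return m + math.log(math.exp(k * (x - m)) + math.exp(k * (y - m))) / k

# Larger k pulls the soft maximum of 1 and 2 toward the hard maximum, 2.
for k in (1, 10, 100):
    print(k, soft_maximum_k(1, 2, k))
```

The gap between the soft and hard maximum of two numbers is never negative and never more than log(2)/k, with the largest gap occurring when x = y. That gives a direct way to choose k for a desired accuracy.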

Roughly 35 percent of software defects emerge from missing logic paths, and another 40 percent are from the execution of a unique combination of logic paths.

If these figures are correct, three out of four software bugs are sins of omission, errors due to things left undone. These are bugs due to contingencies the developers did not think to handle. Three quarters seems like a large proportion, but it is plausible. I know I’ve written plenty of bugs that amounted to not considering enough possibilities, particularly in graphical user interface software. It’s hard to think of everything a user might do and all the ways a user might arrive at a particular place. (When I first wrote user interface applications, my reaction to a bug report would be “Why would anyone do that?!” If everyone would just use my software the way I do, everything would be OK.)

It matters whether bugs are sins of omission or sins of commission. Different kinds of bugs are caught by different means. Developers have come to appreciate the value of unit testing lately, but unit tests primarily catch sins of commission. If you didn’t think to program something in the first place, you’re not likely to think to write a test for it. Complete test coverage could find only 25% of a project’s bugs if you assume 75% of the bugs come from code that no one thought to write.

The best way to spot sins of omission is a fresh pair of eyes. As Glass says:

Rigorous reviews are more effective, and more cost effective, than any other error-removal strategy, including testing. But they cannot and should not replace testing.

One way to combine the benefits of unit testing and code reviews would be to have different people write the unit tests and the production code.

Better to have a simple system than a complex system with a simple abstraction on top.

Abstractions are never perfect. Every new layer creates failure points, interoperability hassles, and scalability problems. New tools can hide complexity, but they can’t justify it … The more complex the system, the more difficult it is to fix when something goes wrong.

Last week .NET Rocks mentioned a good idea in passing: start a screencast tool like Camtasia before you do a software install. Michael Learned told the story of a client that asked him to take screen shots of every step in the installation of Microsoft’s Team Foundation Server. Carl Franklin commented “What a great idea to throw Camtasia on there and record the whole process.”

It would be better if the installation process were scripted and not just recorded, but sometimes that’s not practical. Sometimes clicking a few buttons is absolutely necessary or at least far easier than writing a script. And even if you think your entire process is automated with a script, a screencast might be a good idea. It could record little steps you have to do in order to run your script, details that are easily forgotten.

Another way to use this idea would be to have one person do a practice install on a test server while recording the process. Then another person could document and script the process by studying the video. This would be helpful when the person who knows how to do the installation lacks either the verbal skills to explain the process or the scripting skills to automate it.

Most software tool and technique improvements account for about a 5- to 30-percent increase in productivity and quality. … Learning a new tool or technique actually lowers programmer productivity and product quality initially. You achieve the eventual benefit only after overcoming this learning curve.

If you’re always learning new tools, you may be less productive than if you stuck with your old tools a little longer, even if the new tools really are better. And especially if you’re a part-time developer, you may never reach the point where a new tool pays for itself before you throw it away and pick up a new one. Kathleen Dollard wrote an editorial to this effect in 2004 entitled Save The Hobbyist Programmer.

Miners know they have a significant problem when the canary they keep with them stops singing. Hobbyist/part-time programmers are our industry’s version of the canary, and they have stopped singing. People who program four to eight hours a week are being cut out of the picture because they can’t increase their skills as fast as technology changes. That’s a danger signal for the rest of us.

So what do you do? Learn quickly or change slowly. The first option is to commit to learning a new tool quickly, invest heavily in up-front training, and use the tool as much as you can before the next one comes along. This is the favored option for ambitious programmers who want to maximize their marketability by always using the latest tools.

The second option is to develop a leapfrog strategy, letting some new things pass you by. The less time you spend per week programming, the less often you should change tools. Change occasionally, yes, but wait for big improvements.

James Burke had a television series Connections in which he would create a connection between two very different things. For example, in one episode he starts with the discovery of the touchstone for testing precious metals and tells a winding tale of how the touchstone led centuries later to the development of nuclear weapons.

I had a Connections-like moment when a calendar led to some physics, which then led to Andrew Lloyd Webber’s musical Cats.

Immediately the lyrics “Up, up, up to the Heaviside layer …” started going through my head. These words come from the song “The Journey to the Heaviside Layer” from Cats. I had never thought about “Heaviside” in that song as being related to Mr. Heaviside. I’ve never seen the lyrics in print, so I thought the words were “heavy side” and didn’t stop to think what they meant.

Andrew Lloyd Webber based Cats on Old Possum’s Book of Practical Cats by T. S. Eliot. The song “The Journey to the Heaviside Layer” in particular is based on the poem Old Deuteronomy from Eliot’s book. Webber used the Heaviside layer as a symbol for heaven, based on an allusion in one of T. S. Eliot’s letters. The symbolism is obvious in the musical, but I hadn’t thought about “Heaviside layer” as meaning “the heavens” (i.e. the upper atmosphere) as well as heaven in the theological sense.

The Central Limit Theorem says that if you average enough independent copies of a random variable, the result has a nearly normal (Gaussian) distribution. Of course that’s a very rough statement of the theorem. What are the precise requirements of the theorem? That question took two centuries to resolve. You can see the final answer here.

The first version of the Central Limit Theorem appeared in 1733, but necessary and sufficient conditions weren’t known until 1935. I won’t recap the entire history here. I just want to comment briefly on how the Central Limit Theorem began and how different the historical order of events was from the typical order of presentation.

Abraham de Moivre discovered he could approximate binomial distribution probabilities using the integral of exp(-x²) and proved an early version of the Central Limit Theorem in 1733. At the time, there was no name given to his integral. Only later did anyone think of exp(-x²) as the density of a probability distribution. De Moivre certainly didn’t use the term “Gaussian” since Gauss was born 44 years after de Moivre’s initial discovery. De Moivre also didn’t call his result the “Central Limit Theorem.” George Pólya gave the theorem that name in 1920 as it was approaching its final form.
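To get a feel for de Moivre's approximation, here is a rough Python sketch comparing an exact binomial probability with the corresponding integral of the bell curve. This is the modern textbook formulation, not de Moivre's own calculation: the function names, the choice of n = 100 and p = 0.5, and the half-unit continuity correction are all my assumptions. The integral of exp(-x²) enters through math.erf.

```python
import math

def binomial_prob_between(n, p, lo, hi):
    """Exact P(lo <= X <= hi) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(lo, hi + 1))

def normal_approx_between(n, p, lo, hi):
    """Normal approximation to the same probability, with a
    continuity correction of 1/2 at each end."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))

    def cdf(t):
        # Normal CDF written in terms of erf, the integral of exp(-x^2).
        return 0.5 * (1 + math.erf((t - mu) / (sigma * math.sqrt(2))))

    return cdf(hi + 0.5) - cdf(lo - 0.5)

n, p = 100, 0.5
exact = binomial_prob_between(n, p, 45, 55)
approx = normal_approx_between(n, p, 45, 55)
print(exact, approx)  # both are about 0.73
```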

Regular expressions are fairly portable. There are two main flavors of regular expressions — POSIX and Perl — and more languages these days use the Perl flavor. There are some minor differences in what it means to be “like Perl” but for the most part languages that say they follow Perl’s lead specify regular expressions the same way. The differences lie in how you use regular expressions: how you form matches, how you replace strings, etc.

Mathematica uses Perl’s regular expression flavor. But how do you use regular expressions in Mathematica? I’ll give a few tips here and give more details in the notes Regular expressions in Mathematica.

First of all, unlike Perl, Mathematica specifies regular expressions with ordinary strings. This means that metacharacters have to be doubly escaped. For example, to represent the regular expression \d{4} you must use the string "\\d{4}".

The function StringCases returns a list of all matches of a regular expression in a string. If you simply want to know whether there was a match, you can use the function StringFreeQ. However, note that you probably want the opposite of the return value from StringFreeQ because it returns whether a string does not contain a match.

By default, the function StringReplace replaces all matches of a regular expression with a given replacement pattern. You can limit the number of replacements it makes by specifying an additional argument.
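For readers coming from another Perl-flavored language, the same three tasks can be sketched in Python for comparison. This is an analogy using Python's standard re module, not Mathematica code, and the sample text is made up. Python has the same double-escaping issue when a pattern is written as an ordinary string, though its raw strings avoid it.

```python
import re

text = "Founded in 1733, revised in 1935."

# In an ordinary string the backslash must be doubled; a raw string
# (r"...") lets the pattern be written as-is.
pattern_plain = "\\d{4}"
pattern_raw = r"\d{4}"
assert pattern_plain == pattern_raw

# List all matches (roughly the role StringCases plays).
all_matches = re.findall(pattern_raw, text)

# Test for a match directly, rather than negating a "free of" test.
has_match = re.search(pattern_raw, text) is not None

# Replace every match, or only the first by passing a count.
replace_all = re.sub(pattern_raw, "YYYY", text)
replace_first = re.sub(pattern_raw, "YYYY", text, count=1)

print(all_matches, has_match)
print(replace_all)
print(replace_first)
```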

The Gudermannian function gd(x) is another interesting relic of an early time. It is closely related to the Mercator projection and shows how to relate ordinary and hyperbolic trig functions without using complex numbers.
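Here is a small Python sketch of those relationships, using the standard closed form gd(x) = 2 arctan(tanh(x/2)). The function names and the test value 0.8 are my own choices. Every quantity involved is real, which is the point: the circular functions of gd(x) equal hyperbolic functions of x.

```python
import math

def gd(x):
    """Gudermannian function, gd(x) = 2 * atan(tanh(x / 2))."""
    return 2 * math.atan(math.tanh(x / 2))

def gd_inverse(phi):
    """Inverse Gudermannian, valid for -pi/2 < phi < pi/2."""
    return math.asinh(math.tan(phi))

x = 0.8
phi = gd(x)

# Circular functions of gd(x) are hyperbolic functions of x.
print(math.isclose(math.sin(phi), math.tanh(x)))      # sin(gd x) = tanh x
print(math.isclose(math.tan(phi), math.sinh(x)))      # tan(gd x) = sinh x
print(math.isclose(1 / math.cos(phi), math.cosh(x)))  # sec(gd x) = cosh x

# The inverse function recovers x from the angle.
print(math.isclose(gd_inverse(phi), x))
```

The connection to the Mercator projection is through the inverse: on a unit sphere, the vertical map coordinate of a point at latitude phi (in radians) is gd_inverse(phi).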