5 Answers

The answer you'll get from mathematically minded people is "because of the central limit theorem". This expresses the idea that when you take a bunch of random numbers from almost any distribution (provided it has finite variance) and add them together, you will get something approximately normally distributed. The more numbers you add together, the more normally distributed the result gets.

I can demonstrate this in Matlab/Octave. If I generate 1000 random numbers uniformly between 1 and 10 and plot a histogram, I get something like this:

If instead of generating a single random number, I generate 12 of them and add them together, and do this 1000 times and plot a histogram, I get something like this:

I've plotted a normal distribution with the same mean and variance over the top, so you can get an idea of how close the match is. You can see the code I used to generate these plots at this gist.
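The same experiment is easy to reproduce outside Matlab/Octave. Here is my own minimal NumPy sketch of it (not the author's gist code): draw 1000 uniform numbers, then draw 1000 sums of 12 uniform numbers, and compare the second sample's moments with the normal the CLT predicts.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1000 draws from a uniform distribution on [1, 10] -- the flat histogram
single = rng.uniform(1, 10, size=1000)

# 1000 draws, each the sum of 12 independent uniform variables -- the bell
summed = rng.uniform(1, 10, size=(1000, 12)).sum(axis=1)

# The CLT predicts a normal with matching moments:
# mean = 12 * 5.5 = 66, variance = 12 * (10 - 1)**2 / 12 = 81, so std = 9
print(summed.mean(), summed.std())
```

A histogram of `summed` (e.g. with `matplotlib.pyplot.hist`) shows the bell shape; the printed moments land close to 66 and 9.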

In a typical machine learning problem you will have errors from many different sources (e.g. measurement error, data entry error, classification error, data corruption...) and it's not completely unreasonable to think that the combined effect of all of these errors is approximately normal (although of course, you should always check!)

More pragmatic answers to the question include:

Because it makes the math simpler. The probability density function for the normal distribution is an exponential of a quadratic. Taking the logarithm (as you often do, because you want to maximize the log likelihood) gives you a quadratic. Differentiating this (to find the maximum) gives you a set of linear equations, which are easy to solve analytically.
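To make the "quadratic log likelihood" point concrete, here is a small sketch of my own (names and data are illustrative): because the Gaussian log likelihood is quadratic in the mean, its maximizer is exactly the sample mean, which a brute-force grid search over the likelihood confirms.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=2.0, size=10_000)

# Differentiating the quadratic log likelihood and setting it to zero
# gives closed-form estimates: the sample mean and (biased) variance.
mu_hat = data.mean()
sigma2_hat = ((data - mu_hat) ** 2).mean()

# Brute-force check: the log likelihood (sigma fixed) is quadratic in mu,
# so its grid maximum should sit at the sample mean.
mus = np.linspace(2.0, 4.0, 2001)
loglik = [-0.5 * np.sum((data - m) ** 2) for m in mus]
mu_grid = mus[np.argmax(loglik)]

print(mu_hat, mu_grid)
```

The two estimates agree to within the grid spacing, which is the "easy to solve analytically" part in action.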

It's simple - the entire distribution is described by two numbers, the mean and variance.

It's familiar to most people who will be reading your code/paper/report.

It's generally a good starting point. If you find that your distributional assumptions are giving you poor performance, then maybe you can try a different distribution. But you should probably look at other ways to improve the model's performance first.

Gaussian distributions are the most "natural" distributions. They show up everywhere. Here is a list of the properties that make me think that Gaussians are the most natural distributions:

The sum of several independent random variables (like dice rolls) tends to be Gaussian, as noted by nikie (the central limit theorem).

Two natural ideas appear in machine learning: the standard deviation and the maximum entropy principle. If you ask, "Among all distributions with mean 0 and standard deviation 1, which has the maximum entropy?", the answer is the Gaussian.
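The maximum entropy claim can be spot-checked numerically with the closed-form differential entropies of a few common unit-variance distributions (this comparison is my own illustration, not part of the original answer):

```python
import math

# Differential entropies (in nats) for zero-mean, unit-variance members
# of three families. The Gaussian should come out largest.
h_gauss = 0.5 * math.log(2 * math.pi * math.e)   # N(0, 1)
h_laplace = 1 + math.log(2 / math.sqrt(2))       # Laplace, scale b = 1/sqrt(2)
h_uniform = math.log(math.sqrt(12))              # Uniform, width sqrt(12)

print(h_gauss, h_laplace, h_uniform)
```

The Gaussian's entropy exceeds both, consistent with it being the maximum entropy distribution for a fixed variance.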

Randomly select a point inside a high-dimensional hypersphere. The distribution of any single coordinate is approximately Gaussian. The same is true for a random point on the surface of the hypersphere.

Take several samples from a Gaussian Distribution. Compute the Discrete Fourier Transform of the samples. The results have a Gaussian Distribution. I am pretty sure that the Gaussian is the only distribution with this property.
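The DFT claim is at least easy to observe empirically. Since the DFT is a linear map, the real and imaginary parts of each frequency bin of i.i.d. Gaussian noise are again Gaussian; this sketch of mine checks the moments of one bin.

```python
import numpy as np

rng = np.random.default_rng(3)

# 5000 independent length-256 vectors of standard Gaussian noise
n = 256
samples = rng.normal(size=(5000, n))
spectrum = np.fft.fft(samples, axis=1)

# For an interior bin k (k != 0, k != n/2), the real part is a sum of
# cosine-weighted Gaussians: mean 0, variance sum(cos^2) = n/2 = 128.
bin_re = spectrum[:, 10].real
print(bin_re.mean(), bin_re.var())
```

The sample mean hovers near 0 and the sample variance near 128, as the linearity argument predicts.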

The eigenfunctions of the Fourier transform are products of polynomials and Gaussians (the Hermite functions).

I think all solutions to stochastic differential equations involve Gaussians. -- Isn't that because SDEs are most often defined using a Brownian motion for the stochastic part? Since Brownian motion has Gaussian increments, it's not surprising that the solution typically involves a Gaussian!
– Chris Taylor, Sep 27 '12 at 11:11

The signal error is often a sum of many independent errors. For example, in a CCD camera you could have photon noise, transmission noise, and digitization noise (and maybe more) that are mostly independent, so the error will often be approximately normally distributed due to the central limit theorem.

Also, modeling the error as a normal distribution often makes calculations very simple.

Even non-normal distributions can often be treated as a normal distribution with a large standard deviation. Yes, it's a dirty hack.

The first point might look funny, but I did some research on problems where we had non-normal distributions, and the maths got horribly complicated. In practice, computer simulations are often carried out to "prove the theorems".