If random variable $X$ has a probability distribution of $f(x)$ and random variable $Y$ has a probability distribution $g(x)$ then $(f*g)(x)$, the convolution of $f$ and $g$, is the probability distribution of $X+Y$. This is the only intuition I have for what convolution means.

25 Answers

I remember as a graduate student that Ingrid Daubechies frequently referred to convolution by a bump function as "blurring" - its effect on images is similar to what a short-sighted person experiences when taking off his or her glasses (and, indeed, if one works through the geometric optics, convolution is not a bad first approximation for this effect). I found this to be very helpful, not just for understanding convolution per se, but as a lesson that one should try to use physical intuition to model mathematical concepts whenever one can.

More generally, if one thinks of functions as fuzzy versions of points, then convolution is the fuzzy version of addition (or sometimes multiplication, depending on the context). The probabilistic interpretation is one example of this (where the fuzz is a probability distribution), but one can also have signed, complex-valued, or vector-valued fuzz, of course.

What is the operator $C_f\colon g\mapsto f*g$? Consider the translation operator $T_y$ defined by $T_y(g)(x)=g(x-y)$, and look at $f*g(x)=\int_{\mathbb{R}}f(y)g(x-y)\,dy$. Rewriting this as an operator by taking out $g$, you end up with the operator equation $$C_f=\int_{\mathbb{R}}f(y)T_y\,dy.$$
This is only formally correct of course, but it roughly says that convolution with $f$ is a linear combination of translation operators, the integral being a sort of generalized sum.

Tying this in with Terry Tao's answer, which came in while I was writing the above, if $f$ is a bump function, say nonnegative, with integral equal to 1 and concentrated near the origin, then $f*g$ is a (generalized) linear combination of translates of $g$, each one translated just a short distance, hence the blurriness of the result.

Convolution of probability distributions which are supported on the integers is a special case of multiplying power series together; it corresponds to multiplication of the probability generating functions.
– Michael Lugo, Nov 18 '09 at 3:31
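Michael Lugo's comment is easy to check numerically; a small sketch (the two-dice example is my own, not from the comment): the distribution of a sum of two dice is the convolution of the two distributions, and that is literally the same computation as multiplying the probability generating polynomials.

```python
from fractions import Fraction

def convolve(p, q):
    """Distribution of X + Y for independent X ~ p, Y ~ q (index = value)."""
    r = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return r

def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists: the same computation."""
    return convolve(p, q)

# A fair die: P(X = k) = 1/6 for k = 1..6; index 0 holds P(X = 0) = 0.
die = [Fraction(0)] + [Fraction(1, 6)] * 6

two_dice = convolve(die, die)
assert two_dice[7] == Fraction(6, 36)        # P(X + Y = 7) = 6/36
assert sum(two_dice) == 1                    # still a probability distribution
assert convolve(die, die) == poly_mul(die, die)
```

The convolution of the coefficient lists and the coefficient list of the product of the generating functions are not merely equal; they are computed by the identical loop.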

Among things that it's good to know about convolution is that the identity element for convolution is the Dirac delta function $\delta$.

Another is that if you convolve a function $f$ with $\delta'$, the derivative of the delta function, you get $f'$. Since convolution is associative and commutative, it follows that $(f*g)' = f'*g = f*g'$.

Another is that often the convolution of two functions is as well-behaved as the better-behaved one of the two. If you convolve something with a smooth function, you get a smooth function; if you convolve something with a polynomial, you get a polynomial. In other words, many classes of "well-behaved" functions are ideals in a ring whose multiplication is convolution.

So if you convolve $f$ with a smooth approximation to Dirac's delta function, you get a smooth approximation to $f$. Thinking about why that works can probably shed a lot of intuitive light on the nature of convolution.
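Both facts lend themselves to a quick numerical sketch (a midpoint-rule approximation; the choice of $\sin$ and of the Gaussian bump is mine): convolving with a narrow normalized Gaussian nearly returns the function, and convolving with the Gaussian's derivative nearly returns its derivative.

```python
import math

def gauss(y, eps):
    """Normalized Gaussian bump of width eps: a smooth approximation to delta."""
    return math.exp(-y * y / (2 * eps * eps)) / (eps * math.sqrt(2 * math.pi))

def gauss_deriv(y, eps):
    """Its derivative: a smooth approximation to delta'."""
    return -y / (eps * eps) * gauss(y, eps)

def convolve_at(f, kernel, x, eps, n=4000):
    """Midpoint-rule value of (f * kernel)(x) = integral f(x - y) kernel(y) dy."""
    a, b = -5 * eps, 5 * eps          # the kernel is negligible outside [-5 eps, 5 eps]
    h = (b - a) / n
    return sum(f(x - (a + (k + 0.5) * h)) * kernel(a + (k + 0.5) * h, eps)
               for k in range(n)) * h

x, eps = 1.0, 1e-2
smooth = convolve_at(math.sin, gauss, x, eps)        # ~ sin(x): bump acts like delta
deriv = convolve_at(math.sin, gauss_deriv, x, eps)   # ~ cos(x): f * delta' = f'
assert abs(smooth - math.sin(x)) < 1e-3
assert abs(deriv - math.cos(x)) < 1e-3
```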

The two things that first come to mind when I think 'convolution' are:

It's the thing that corresponds to multiplication on the other side of the Fourier transform. (This was already mentioned by John D. Cook) It works both ways, of course, $\mathcal F (f*g)=\mathcal F f\cdot \mathcal F g$ and $\mathcal F (f\cdot g)=\mathcal F f* \mathcal F g$. This fact is useful when used in combination with other simple facts about the Fourier transform (such as the fact that a rectangular function corresponds to sinc and, in the limit, a Dirac impulse corresponds to a constant function).

Imagine a black box that receives one number $x_n$ every second and must output a number $y_n$ every second. (DSP people call it a 'filter' and it's used, for example, to process audio signals in a mobile phone in real-time.) The simplest thing the box could do is to output some function of the current input. The natural next step is to remember the last k inputs and output some function of those k values. One of the simplest functions is a linear combination $$y_n=\sum_i c_i x_{n-i}$$ where $c_i$ is non-zero only for $0\le i<k$. That's a convolution! To generalize, you make the filter remember all previous values and even be clairvoyant. That is, you extend the support of $[c_n]$. Then, if you want, you replace digital circuits with analog ones. That is, you go from summing to integration.

As an example of combining these two points, if the filter always outputs the average of the last k inputs then that's a convolution with the rectangular function in the time domain so it must be a multiplication with a sinc in the frequency domain. Therefore, averaging the last k values attenuates high frequencies. (Hardly surprising, but at least you see immediately that the frequency response is not monotonic and there are only a few frequencies that are completely filtered out.)
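The filter in this example is easy to simulate; a sketch (the particular frequencies are my choice): a 5-point moving average passes a slow sinusoid almost unchanged but completely annihilates one at 0.4 cycles per sample, one of the few frequencies that are filtered out entirely.

```python
import math

def moving_average(x, k):
    """Output the average of the last k inputs: convolution with the window [1/k]*k."""
    return [sum(x[max(0, n - k + 1): n + 1]) / k for n in range(len(x))]

N, k = 400, 5
slow = [math.sin(2 * math.pi * 0.01 * n) for n in range(N)]   # 0.01 cycles/sample
fast = [math.sin(2 * math.pi * 0.40 * n) for n in range(N)]   # 0.40 cycles/sample

# The frequency response of the k-point average is sin(pi*k*f) / (k*sin(pi*f)),
# roughly 0.998 at f = 0.01 but exactly 0 at f = 0.40: any five consecutive
# samples of the fast sinusoid sum to zero.
assert max(abs(v) for v in moving_average(fast, k)[k:]) < 1e-9
assert max(abs(v) for v in moving_average(slow, k)[k:]) > 0.99
```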

I want to expand on a special case of Terry's answer which I think is particularly intuitive.

Suppose there is a function $f$ that you want to understand, but perhaps it is not smooth. Convolution gives you a way to construct new, possibly nicer functions which approximate $f$.

If you let $g$ be a bump function centered at the origin, then the convolution $f*g$ is a new function whose value at $x$ is given by averaging the values of $f$ around $x$. What do we mean exactly by "averaging"? Well, you use $g$ as your measure; translate it over so that it is centered at $x$, and then the integral $$f * g(x) = \int_{\mathbb{R}^n} f(y)g(x - y) dy$$ in the convolution corresponds to the $g$-weighted average of the values of $f$ around $x$ (i.e. in the small ball where the translated bump doesn't vanish).

The convolution $f*g$ in this case has the advantage that it is much smoother than $f$. Intuitively, this should not be surprising, since the value of $f*g(x)$ is obtained by averaging the values of $f$ near $x$. Furthermore, you can approximate $f$ by smooth(er) things by considering a sequence of convolutions $f*g_n$, where $g_n$ is a sequence of bump functions more and more concentrated at the origin.

If you think of the second function $g$ in the convolution $f*g$ as a measure, then you can think of convolutions as $g$-weighted averages of $f$.
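A small numerical illustration of the weighted-average reading (the tent-shaped bump and the function $y^2$ are hypothetical choices of mine): blurring $y^2$ with a symmetric bump of total mass 1 adds exactly the variance of the bump.

```python
def weighted_average(f, g, x, half_width, n=2000):
    """Midpoint-rule value of the g-weighted average of f around x,
    i.e. the integral of f(y) g(x - y) dy, with g a probability density."""
    a, b = x - half_width, x + half_width
    h = (b - a) / n
    total = 0.0
    for k in range(n):
        y = a + (k + 0.5) * h
        total += f(y) * g(x - y) * h
    return total

tent = lambda u: max(0.0, 1.0 - abs(u))   # bump on [-1, 1]: integral 1, variance 1/6
square = lambda y: y * y

val = weighted_average(square, tent, x=2.0, half_width=1.0)
# Averaging y^2 around x against a symmetric bump gives x^2 + Var(bump) = 4 + 1/6.
assert abs(val - (4.0 + 1.0 / 6.0)) < 1e-4
```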

I like the answer you gave when you asked the question. More generally, the convolution of two measures $\mu$ and $\nu$ is the pushforward of $\mu \times \nu$ by addition. In probability, that means you independently draw from $\mu$ and $\nu$ and add the resulting random vector. It's something that you can visualize to a certain extent if you do think of measures as fuzzy versions of points (like Terry Tao said).

One point of view of measures is that they are linear combinations of points (or limits of things you can get from linear combinations of points). If you take this point of view, then convolution is simply the extension of the addition law by linearity to the case of measures.

Since you can translate functions as well as measures, you can convolve, say, a probability measure with a function by randomly translating the function, giving the averaged out function
$\int f(x-y) d\mu(y)$
which generally looks like a smoothed out version of your function $f$ -- $\mu$ tells you which translations you use and how to average. Again, you can view this as the extension of the operation of translating functions by linearity/continuity to the case of measures.

The Lebesgue measure allows you to identify functions with measures, $g \mapsto g(x) dx$, so you can also convolve functions with other functions, but you might think of this operation as a bit less basic.

Actually, the process of convolution extends by continuity to more than just measures but also to distributions. For example, you can approximate a tangent vector at $0$ (giving the distribution $u(x) = \sum_i c^i \partial_i \delta_0$) by differences of point masses, so convolution extends to distributions as well, but you can even get differential operators this way (in this example, $u \ast f$ is the derivative of $f$ in the $u$ direction). The technical difference here is that the approximation is only valid when integrated against $C^k$ functions (rather than $C^0$ functions in the case of measures). But the principle is the same -- it's the extension of the addition law by linearity and limits.
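For discrete measures the "extension of translation by linearity" above is an exact finite sum; a sketch with a made-up three-point measure:

```python
import math

# A discrete measure mu as a list of (weight, location) pairs, weights summing to 1.
mu = [(0.5, -1.0), (0.25, 0.0), (0.25, 2.0)]

def translate_average(f, mu):
    """Extend translation by linearity: x -> integral of f(x - y) dmu(y)."""
    return lambda x: sum(w * f(x - y) for w, y in mu)

f = math.exp
g = translate_average(f, mu)

# For a general mu, the result is the mu-weighted average of translates of f.
x = 1.0
expected = 0.5 * math.exp(x + 1) + 0.25 * math.exp(x) + 0.25 * math.exp(x - 2)
assert abs(g(x) - expected) < 1e-12

# Convolving with the point mass at 0 (the Dirac delta) returns f itself.
delta = [(1.0, 0.0)]
assert translate_average(f, delta)(x) == f(x)
```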

The fundamental theorem of calculus says that $\frac{d}{dx} \displaystyle \int_a^x f(t)dt = f(x)$. In other words, $f(x) \approx \displaystyle \int_{x-h}^{x+h} \frac{1}{2h} f(t)dt$. Letting $g_h(u) = \frac{1}{2h}$ on the interval $(-h,h)$ and 0 elsewhere, we see by pure algebraic manipulation that $f(x) \approx \displaystyle \int_{-\infty}^\infty g_h(x-t)f(t)dt$. So the fundamental theorem of calculus can very naturally be rephrased in terms of convolution with a bump function. Differentiation under the integral sign immediately gives the differentiation formula for convolutions, and thus that the convolution of two functions is at least as smooth as either factor. Thus finding good smooth approximations to the rectangular bump functions $g_h$ automatically gives us smooth approximations to any integrable function we like, just by convolving against these "smooth mollifiers". Pretty cool stuff. As mentioned in other answers, they really start to shine when you start thinking about Fourier analysis, but that is another story. If you google "low pass filter" you will find some pretty snazzy applications of the fact that the Fourier transform turns convolution into multiplication.

I prefer sound to Terry Tao's light. Listen to my voice through a wall. At each moment in time, you hear not just what I am saying now, but also some reverberation from what I said moments ago. So if I make a sound given by $f(t)$ (density of air), you hear a linear combination $h(0)f(t) + h(1)f(t-1) + h(2)f(t-2) + \dots$, or a continuous version of that, i.e. $h*f$. The function $h(\tau)$ is how much you hear from $\tau$ seconds before the current time. If $h(\tau)$ decays slowly, my voice is muffled by reverb.

Fourier theory shows that recovering my voice $f(t)$ is difficult when $\hat{h}$ is very small at some frequencies: the wall doesn't vibrate at those frequencies.

If $h(\tau) \ne 0$ for some negative $\tau$, you can hear me before I speak!
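A discrete sketch of the reverb picture (the decay rate and the signal are invented for illustration): a single clap is heard as the impulse response itself.

```python
# Hypothetical impulse response: you hear the current sound at full volume
# plus geometrically decaying reverberation from earlier moments.
h = [0.6 ** t for t in range(8)]   # h[tau] = how much you hear from tau steps ago

def heard(f, h):
    """What the listener hears: the discrete convolution (h*f)[t] = sum h[tau] f[t - tau]."""
    out = []
    for t in range(len(f) + len(h) - 1):
        out.append(sum(h[tau] * f[t - tau]
                       for tau in range(len(h)) if 0 <= t - tau < len(f)))
    return out

clap = [1.0] + [0.0] * 9           # a single sharp sound at t = 0
echo = heard(clap, h)
# The clap is heard as the reverb tail h itself: convolving delta with h gives h.
assert echo[:8] == h
```

Making `h` nonzero at negative indices would be the clairvoyant wall of the last remark: output that depends on future input.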

This is really following on from John's answer, but is a bit long for a comment, so I thought I'd write it out here for extra clarity.

Say you have a semigroup S, and you take the free (complex) vector space this generates, call it $C^S$. By construction/definition this has a distinguished basis, indexed by elements of the semigroup; and since we have a semigroup structure, that means we can multiply basis elements to get other basis elements. But that now gives us a way to multiply two vectors $a$ and $b$ together: write $a$ and $b$ as linear combinations of basis elements, and then define their product as the obvious (bi)linear extension of the multiplication on basis elements. If you do all this starting with $S={\bf Z}$, the group of integers, then what we've done is defined multiplication of trigonometric polynomials, or convolution of finitely supported sequences.

The thing I like about this point of view is that it immediately generalises to $l^1(S)$, and makes $l^1(S)$ into a Banach algebra. If $S$ is a topological group with a Haar measure on it -- such as the real line with Lebesgue measure -- then the same idea gives us the usual Banach algebra structure on $L^1(S)$, which in the case $S={\bf R}$ is precisely the convolution of integrable functions in the usual sense.
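The finitely-supported case is easy to realize concretely; a sketch representing elements of the semigroup algebra of $({\bf Z},+)$ as dictionaries (my own encoding, not standard library code):

```python
from collections import defaultdict

def conv(a, b):
    """Product in the semigroup algebra of (Z, +): finitely supported functions
    Z -> C, multiplied by bilinearly extending t^m * t^n = t^(m+n)."""
    c = defaultdict(complex)
    for m, x in a.items():
        for n, y in b.items():
            c[m + n] += x * y
    return {k: v for k, v in c.items() if v != 0}

# (1 + t)(1 - t) = 1 - t^2: convolution of the coefficient sequences.
a = {0: 1, 1: 1}
b = {0: 1, 1: -1}
assert conv(a, b) == {0: 1, 2: -1}

# The point mass at 0 is the identity, and the product is associative,
# as it must be for l^1(Z) to be a Banach algebra.
delta0 = {0: 1}
c = {-3: 2j, 5: 1}
assert conv(delta0, c) == c
assert conv(conv(a, b), c) == conv(a, conv(b, c))
```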

(At this point someone -- often me -- usually wants to mention forgetful functors from algebras to vector spaces and from semigroups to sets, but that's probably getting a bit OTT for the question at hand.)

The way you have described is the best way to think about convolution. More generally, if you have a group and a class of square-integrable functions (really I should say "half-densities") on it, then the convolution product precisely extends the group product.

If you convolve an image with a discrete matrix of values (that is, a function that is zero outside a few pixels), then you can create an almost unlimited number of filtering effects around each point. For example, you can do some kind of averaging or weighted integration, which looks like blurring, as Professor Tao mentions, if you use a matrix whose values drop off smoothly and radially from the centre: a bump. You can also compute directional derivatives, or look for edges, circles, blobs, steps: basically anything you like.

The interpretation in terms of multiplication of Fourier coefficients is interesting and makes real-world applications of the above fast, especially if the filter is fixed, because you can use the Fast Fourier Transform on both images but you only need to update one of them. However, I am not sure how intuitive it is!

I'm not sure I have added much additional information, but I hope this helps anyway.

There is a nice relationship between convolution of probability measures and random walks which is very clear on finite groups.

For a particularly concrete example, suppose you are shuffling a deck of cards. You can model this as picking elements of $S_{52}$ according to some probability measure $P$ on $S_{52}$. This generates a Markov chain with transition matrix whose $(s,t)$-th entry is given by $P(ts^{-1})$ (if I am permitted to abuse notation somewhat, the element $ts^{-1}$ is the shuffle that takes the deck from ordering $s$ to ordering $t$). If one wants to know the transition matrix for two shuffles, this corresponds to the square of the original matrix. One can then check that this new matrix corresponds to the transition matrix generated by $P*P$, that is, the matrix whose $(s,t)$-th entry is given by $(P*P)(ts^{-1})$, and in fact $k$ shuffles correspond to the $k$-fold convolution of $P$ with itself.

Convolving two different probability measures then corresponds to shuffling your deck according to one technique and then a different technique.
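This is small enough to check exhaustively on $S_3$; a sketch (the "half the time, swap the top two cards" shuffle is a made-up example): the square of the transition matrix equals the transition matrix built from $P*P$.

```python
from itertools import permutations

S3 = list(permutations(range(3)))

def compose(s, t):
    """Apply s first, then t: the net permutation t o s."""
    return tuple(t[j] for j in s)

def inverse(s):
    inv = [0] * len(s)
    for i, v in enumerate(s):
        inv[v] = i
    return tuple(inv)

def convolve(P, Q):
    """Shuffle by P, then by Q: mass of the pair (g, h) lands on the net shuffle h o g."""
    R = {g: 0.0 for g in S3}
    for g in S3:
        for h in S3:
            R[compose(g, h)] += P[g] * Q[h]
    return R

def transition_matrix(P):
    """M[s][t] = P(t s^{-1}), the shuffle taking ordering s to ordering t."""
    return {s: {t: P[compose(inverse(s), t)] for t in S3} for s in S3}

# Hypothetical shuffle: with prob 1/2 do nothing, with prob 1/2 swap the top two cards.
P = {g: 0.0 for g in S3}
P[(0, 1, 2)] = 0.5
P[(1, 0, 2)] = 0.5

M = transition_matrix(P)
M2 = {s: {t: sum(M[s][u] * M[u][t] for u in S3) for t in S3} for s in S3}
N = transition_matrix(convolve(P, P))
assert all(abs(M2[s][t] - N[s][t]) < 1e-12 for s in S3 for t in S3)
```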

It looks like I am a bit late to the party, but hopefully not too late. I can contribute one instance where the application of convolution is very enlightening. Look at this in the light of Harald Hanche-Olsen's answer.

Let us consider the operation of a signal processing system, or an electrical network, or a control system.

Let $i(t)$ be the response of the system to the unit impulse, i.e., the Dirac delta function $\delta(t)$. Now we give an arbitrary signal $f(t)$ as input to the system. Then the response of the system to $f(t)$ is

$(f * i) (t)$,

i.e. the convolution of $i$ and $f$.

Next, in the spirit of rgrig's answer, I add that convolution becomes multiplication in the frequency domain. In that domain, it is as if you multiply individually at each frequency component and then add everything up again (i.e., integrate).
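The frequency-domain description can be verified directly with a small discrete Fourier transform (the input signal and impulse response below are invented): zero-pad, transform, multiply pointwise, transform back, and compare with the time-domain convolution.

```python
import cmath

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def conv(f, i):
    """Time-domain response of the system: the convolution (f * i)."""
    out = [0.0] * (len(f) + len(i) - 1)
    for a, x in enumerate(f):
        for b, y in enumerate(i):
            out[a + b] += x * y
    return out

f = [1.0, 2.0, 0.0, -1.0]    # an arbitrary input signal
i = [0.5, 0.25, 0.125]       # a hypothetical impulse response

# Zero-pad so circular convolution agrees with linear convolution, then
# multiply pointwise at each frequency and transform back.
n = len(f) + len(i) - 1
F = dft(f + [0.0] * (n - len(f)))
I = dft(i + [0.0] * (n - len(i)))
via_freq = [z.real for z in idft([a * b for a, b in zip(F, I)])]

direct = conv(f, i)
assert all(abs(a - b) < 1e-9 for a, b in zip(direct, via_freq))
```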

Here is my take on this. $\newcommand{\bZ}{\mathbb{Z}}$ $\newcommand{\bR}{\mathbb{R}}$ $\newcommand{\ii}{\boldsymbol{i}}$ Discretize the real axis and think of it as the collection of points $\Lambda_\hbar:=\hbar \bZ$, where $\hbar>0$ is a small number. Then a function $f:\Lambda_\hbar\to \bR$ is determined by its generating function, i.e., the formal power series

$$G_f^\hbar(t)=\sum_{n\in\bZ} f(n\hbar)\, t^n. $$

Multiplying two generating functions convolves their coefficients:

$$G_f^\hbar(t)\, G_g^\hbar(t)=\sum_{n\in\bZ}\Big(\sum_{m\in\bZ} f(m\hbar)\, g\big((n-m)\hbar\big)\Big)\, t^n,\tag{1}$$

and $\hbar$ times the inner sum is a Riemann sum approximating the convolution integral $(f*g)(n\hbar)=\int_{\bR} f(y)\, g(n\hbar-y)\, dy$.

Now substitute $t=e^{-\ii\hbar\xi}$:

$$\hbar\, G_f^\hbar\big(e^{-\ii\hbar\xi}\big)=\sum_{n\in\bZ}\hbar\, f(n\hbar)\, e^{-\ii\xi (n\hbar)},\tag{2}$$

and the expression in the right-hand sum is a "Riemann sum" approximating

$$\int_{\bR} f(x)\, e^{-\ii\xi x}\, dx. $$

Above we recognize the Fourier transform of $f$. If we let $\hbar\to 0$ in (2) and use (1) we obtain the well-known fact that the Fourier transform maps the convolution to the usual pointwise product of functions. (The fact that this rather careless passing to the limit can be made rigorous is what the Poisson formula is all about.)

Setting $\xi=0$ in (2), the above argument shows that we can regard $\hbar G_f^\hbar(1)$ as an approximation of $\int_{\bR} f(x)\, dx$.

Denote by $\delta(x)$ the delta function concentrated at $0$. The delta function concentrated at $x_0$ is then $\delta(x-x_0)$. What could be the generating function of $\delta(x)$, $G_\delta^\hbar$? First, we know that $\delta(x)=0$, $\forall x\neq 0$, so that

$$G_\delta^\hbar(t) =ct^0=c. $$

The constant $c$ can be determined from the equality

$$ 1= \int_{\bR} \delta(x)\, dx=\hbar G_\delta^\hbar(1)=\hbar c.$$

Hence $G_\delta^\hbar(t)=\frac{1}{\hbar}$. Similarly,

$$ G^\hbar_{\delta(\cdot-n\hbar)}(t) =\frac{1}{\hbar}\, t^n. $$

Putting together all of the above, we obtain an equivalent description of the generating function of a function $f:\Lambda_\hbar\to\bR$. More precisely,

$$G_f^\hbar(t)=\sum_{n\in\bZ} f(n\hbar)\, t^n=\sum_{n\in\bZ}\hbar\, f(n\hbar)\cdot\frac{1}{\hbar}\, t^n=\sum_{n\in\bZ}\hbar\, f(n\hbar)\, G^\hbar_{\delta(\cdot-n\hbar)}(t).$$

The last equality suggests an interpretation of the generating function as an algebraic encoding of the fact that $f:\Lambda_\hbar\to\bR$ is a superposition of $\delta$ functions concentrated at the points of the lattice $\Lambda_\hbar$.
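The Riemann-sum approximation of the Fourier transform is easy to test numerically; a sketch using a Gaussian, whose transform is known in closed form (the choice of $\hbar$ and $\xi$ is arbitrary):

```python
import cmath
import math

hbar = 0.01
ns = range(-1000, 1001)              # lattice points n*hbar covering [-10, 10]
f = lambda x: math.exp(-x * x / 2)   # Gaussian with a known Fourier transform

def hG(xi):
    """hbar * G_f^hbar(e^{-i hbar xi}): the Riemann sum approximating the transform."""
    return hbar * sum(f(n * hbar) * cmath.exp(-1j * xi * n * hbar) for n in ns)

xi = 1.5
exact = math.sqrt(2 * math.pi) * math.exp(-xi * xi / 2)   # integral of f(x) e^{-i xi x} dx
assert abs(hG(xi) - exact) < 1e-6

# At t = 1 (xi = 0) the generating function recovers the total integral of f.
assert abs(hG(0) - math.sqrt(2 * math.pi)) < 1e-6
```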

As far as I understand, in simple words: in the simple moving-average algorithm, when you convolve $F$ with $G$, the function $G$ defines how you weight the values being averaged. $G$ can be seen as a component defining the weighting policy.

To complement/supplement other answers, it may be worthwhile to note that the question itself blurs two substantially different mechanisms. Namely, there is, first, for any group representation of a topological group $G$ on a topological vector space $V$, an action of compactly-supported continuous functions on $G$ on $V$, by (e.g., Gelfand-Pettis/weak) integrals $f\cdot v=\int_G f(g)\cdot gv\,dg$. It is of some moment to note that this does not depend on $v$ being in any sort of natural function-space. The second point is that $f\cdot (g\cdot v)=(f*g)\cdot v$, where $*$ denotes the convolution. That is, the notion of convolution is externally determined by being what it has to be for (for example) compactly-supported continuous functions to act (associatively) on any representation space.

Depending on one's outlook, this may reduce some element of seeming whimsy in "defining" convolution, since, in a larger context, *there is no choice*.

This definition of convolution as the pushforward of the product measure under addition is very "symmetric": we do not have to think about the minus sign, and, especially in the case of non-commutative groups, it is much more obvious than guessing the "right" formula for the convolution of functions. We directly recover the intuition of the sum of two random variables (and it is even more general, because we can have arbitrary distributions, for example $\delta$-distributions, for our random variables): "We account the 'probability' for the pair $(x,y)$ at the point $x+y$."
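For discrete measures, this "account the mass of the pair $(x,y)$ at $x+y$" recipe can be written down verbatim; a sketch (the coin and the point mass at 3 are made-up examples):

```python
from collections import defaultdict
from fractions import Fraction

def pushforward_by_addition(mu, nu):
    """Convolution of two discrete measures: push the product measure mu x nu
    forward under (x, y) -> x + y, accounting the mass of each pair at x + y."""
    out = defaultdict(Fraction)
    for x, p in mu.items():
        for y, q in nu.items():
            out[x + y] += p * q
    return dict(out)

coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}
delta3 = {3: Fraction(1)}   # a delta-distribution: a "random" variable always equal to 3

# Convolving with a point mass just translates the measure.
assert pushforward_by_addition(coin, delta3) == {3: Fraction(1, 2), 4: Fraction(1, 2)}

# Sum of two independent fair coins: the binomial(2, 1/2) distribution.
assert pushforward_by_addition(coin, coin) == {
    0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
```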

I'll give two answers. Put together I think they build a decent intuition.

The first is algorithmic. I think about the convolution of two lists $\vec{a} \ast \vec{b}$ as contrasted to the dot product $\vec{a} \cdot \vec{b}$. Both operations multiply pairs of entries. The dot product fixes a single alignment of the two lists and $\tt{reduce}$s dimensionality by summing $\displaystyle \sum_i a_i \cdot b_i$, so that $\cdot: K^N \times K^N \to K^1$. Convolution instead reverses one list, slides it along the other, and records a dot product at every offset, $\left(\vec{a} \ast \vec{b}\right)_n = \sum_i a_i \cdot b_{n-i}$, so that $\ast: K^N \times K^N \to K^{2N-1}$. In other words, convolution is a sliding dot product: you keep a separate sum for every way the two lists can overlap instead of collapsing everything to a single number.

The second is visual. Picture a choppy curve, such as a noisy daily time series. To smooth it, you would convolve the long data series against a much shorter list. To get the weekly rolling average you would convolve it against ${1 \over 5} \cdot \left[1 \ 1 \ 1 \ 1 \ 1 \right]$ (with implied zeroes to the left and right; this is convolving against a $\mathrm{Rect}$ function). To get the monthly rolling average, the kernel would be 25 ones, scaled by $1/25$. If you want to convolve against something smoother than $\mathrm{Rect}$ you could convolve against a Gaussian; in 2-D this gives the familiar Gaussian blur of image editors. I'm now getting out of my knowledge area, but I believe "box blur" is the image-processing word for convolving against a 2-D $\mathrm{Rect}$ function. With some computer knowledge you can code up a smoother with data of your choice.
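A sketch of the rolling-average smoothing (the trend-plus-alternating-chop series is a synthetic example): convolving with the $\mathrm{Rect}$ kernel shrinks the chop from amplitude 1 to 0.2 while leaving the trend in place.

```python
def smooth(series, kernel):
    """Convolve a data series with a short centered kernel (zero-padded at the edges)."""
    half = len(kernel) // 2
    out = []
    for n in range(len(series)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = n + j - half
            if 0 <= idx < len(series):
                acc += w * series[idx]
        out.append(acc)
    return out

# A linear trend plus deterministic "chop": a +1/-1 alternation.
series = [0.1 * n + (1 if n % 2 else -1) for n in range(100)]
rect5 = [1 / 5] * 5                      # the weekly rolling-average Rect kernel
smoothed = smooth(series, rect5)

# Away from the edges, 4 of the 5 alternating terms cancel in pairs, so the
# residual chop has amplitude 1/5 while the trend passes through unchanged.
assert all(abs(smoothed[n] - 0.1 * n) < 0.21 for n in range(2, 98))
```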

As a post-script to this answer, I think Gaussian vs rectangular convolution is actually a good example to explain to non-mathematicians how mathematicians think about "ugliness". There's something intuitively stupid, even to a non-mathematician, about integrating against something like $$\begin{bmatrix}0&0&0&0&0\\ 0&0 & 1 & 0&0 \\ 0&1 & 1 & 1&0 \\ 0&0 & 1 & 0&0 \\ 0&0&0&0&0 \end{bmatrix}$$ when a circle or ellipse would be a more logical shape. There's also something stupid or ugly or pointless or strange about integrating against $\left[ 0 \ \cdots\ 0 \ 0 \ 1 \ 1 \ 1 \ 0 \ 0 \ \cdots\ 0 \right]$, but what's the right way to "smoothly" or "logically" go down on both sides? Cue discussion on $\mathcal{F} \left( \exp(-x^2/2) \right) = \exp( -x^2/2)$....

In signal processing, the convolution $f(t)*g(t)$ can be seen as a weighted mean: $(f*g)(t)=\int{f(t-u)g(u)du}$ is the mean of $f$ through the window $g$, with each value $f(t-u)$ weighted by $g(u)$. In this way we obtain a moving mean of $f$ as the parameter $t$ varies.

This is also true in discrete calculus, where discrete convolution appears when multiplying power series; in particular, the $n$-fold repeated convolution of a sequence with itself appears in the binomial theorem, in the coefficients of the power of a binomial.
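The binomial-theorem remark can be made concrete in a few lines: repeated convolution of the coefficient list $[1, 1]$ with itself produces the rows of Pascal's triangle.

```python
def conv(a, b):
    """Discrete convolution of two coefficient lists (= polynomial multiplication)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

# Multiplying (1 + t) by itself n times is n-fold convolution of [1, 1]
# and yields the binomial coefficients.
row = [1]
for _ in range(4):
    row = conv(row, [1, 1])
assert row == [1, 4, 6, 4, 1]   # coefficients of (1 + t)^4
```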

Ignoring the extraneous first $x$, this formula is only correct if the functions are integrable. Convolution makes sense if both functions are square-integrable but not integrable, in which case this equation is incorrect (since Fubini does not apply). But in any case, what intuition do you glean from knowing, for a fixed $f$, that $T_f(g)=f*g$ has the property you stated? The operator $S_f(g)=g(x)\int_{-\infty}^{\infty}f(t)dt$ has the same property but is just a constant times the identity...
– Peter Luthy, Jun 16 '11 at 20:42