where K(•) is the kernel — a non-negative function that integrates to one and has mean zero — and h > 0 is a smoothing parameter called the bandwidth. A kernel with subscript h is called the scaled kernel and defined as Kh(x) = 1/h K(x/h). Intuitively one wants to choose h as small as the data allow, however there is always a trade-off between the bias of the estimator and its variance; more on the choice of bandwidth below.

A range of kernel functions are commonly used: uniform, triangular, biweight, triweight, Epanechnikov, normal, and others. The Epanechnikov kernel is optimal in a mean square error sense,[3] though the loss of efficiency is small for the kernels listed previously,[4] and due to its convenient mathematical properties, the normal kernel is often used K(x) = ϕ(x), where ϕ is the standard normal density function.

Kernel density estimates are closely related to histograms, but can be endowed with properties such as smoothness or continuity by using a suitable kernel. To see this, we compare the construction of histogram and kernel density estimators, using these 6 data points: x1 = −2.1, x2 = −1.3, x3 = −0.4, x4 = 1.9, x5 = 5.1, x6 = 6.2. For the histogram, first the horizontal axis is divided into sub-intervals or bins which cover the range of the data. In this case, we have 6 bins each of width 2. Whenever a data point falls inside this interval, we place a box of height 1/12. If more than one data point falls inside the same bin, we stack the boxes on top of each other.

For the kernel density estimate, we place a normal kernel with variance 2.25 (indicated by the red dashed lines) on each of the data points xi. The kernels are summed to make the kernel density estimate (solid blue curve). The smoothness of the kernel density estimate is evident compared to the discreteness of the histogram, as kernel density estimates converge faster to the true underlying density for continuous random variables.[5]

Comparison of the histogram (left) and kernel density estimate (right) constructed using the same data. The 6 individual kernels are the red dashed curves, the kernel density estimate the blue curves. The data points are the rug plot on the horizontal axis.

The bandwidth of the kernel is a free parameter which exhibits a strong influence on the resulting estimate. To illustrate its effect, we take a simulated random sample from the standard normal distribution (plotted at the blue spikes in the rug plot on the horizontal axis). The grey curve is the true density (a normal density with mean 0 and variance 1). In comparison, the red curve is undersmoothed since it contains too many spurious data artifacts arising from using a bandwidth h = 0.05, which is too small. The green curve is oversmoothed since using the bandwidth h = 2 obscures much of the underlying structure. The black curve with a bandwidth of h = 0.337 is considered to be optimally smoothed since its density estimate is close to the true density.

Under weak assumptions on ƒ and K,[1][2] MISE (h) = AMISE(h) + o(1/(nh) + h4) where o is the little o notation. The AMISE is the Asymptotic MISE which consists of the two leading terms

where for a function g, and ƒ'' is the second derivative of ƒ. The minimum of this AMISE is the solution to this differential equation

or

Neither the AMISE nor the hAMISE formulas are able to be used directly since they involve the unknown density function ƒ or its second derivative ƒ'', so a variety of automatic, data-based methods have been developed for selecting the bandwidth. Many review studies have been carried out to compare their efficacities,[6][7][8][9][10] with the general consensus that the plug-in selectors[11] and cross validation selectors[12][13][14] are the most useful over a wide range of data sets.

Substituting any bandwidth h which has the same asymptotic order n−1/5 as hAMISE into the AMISE gives that AMISE(h) = O(n−4/5), where O is the big o notation. It can be shown that, under weak assumptions, there cannot exist a non-parametric estimator that converges at a faster rate than the kernel estimator.[15] Note that the n−4/5 rate is slower than the typical n−1 convergence rate of parametric methods.

Knowing the characteristic function, it is possible to find the corresponding probability density function through the Fourier transform formula. One difficulty with applying this inversion formula is that it leads to a diverging integral, since the estimate is unreliable for large t’s. To circumvent this problem, the estimator is multiplied by a damping function ψh(t) = ψ(ht), which is equal to 1 at the origin and then falls to 0 at infinity. The “bandwidth parameter” h controls how fast we try to dampen the function . In particular when h is small, then ψh(t) will be approximately one for a large range of t’s, which means that remains practically unaltered in the most important region of t’s.

The most common choice for function ψ is either the uniform function ψ(t) = 1{−1 ≤ t ≤ 1}, which effectively means truncating the interval of integration in the inversion formula to [−1/h, 1/h], or the gaussian functionψ(t) = e−π t2. Once the function ψ has been chosen, the inversion formula may be applied, and the density estimator will be

where K is the Fourier transform of the damping function ψ. Thus the kernel density estimator coincides with the characteristic function density estimator.

In CrimeStat, kernel density estimation is implemented using five different kernel functions – normal, uniform, quartic, negative exponential, and triangular. Both single- and dual-kernel density estimate routines are available. Kernel density estimation is also used in interpolating a Head Bang routine, in estimating a two-dimensional Journey-to-crime density function, and in estimating a three-dimensional Bayesian Journey-to-crime estimate.

In ELKI, kernel density functions can be found in the package de.lmu.ifi.dbs.elki.math.statistics.kernelfunctions

In ESRI products, kernel density mapping is managed out of the Spatial Analyst toolbox and uses the Quartic(biweight) kernel.

In gnuplot, kernel density estimation is implemented by the smooth kdensity option, the datafile can contain a weight and bandwidth for each point, or the bandwidth can be set automatically[17] according to "Silverman's rule of thumb" (see above).

In MATLAB, kernel density estimation is implemented through the ksdensity function (Statistics Toolbox). This function does not provide an automatic data-driven bandwidth but uses a rule of thumb, which is optimal only when the target density is normal. A free MATLAB software package which implements an automatic bandwidth selection method[18] is available from the MATLAB Central File Exchange for 1-dimensional data and for 2-dimensional data.

In Mathematica, numeric kernel density estimation is implemented by the function SmoothKernelDistributionhere and symbolic estimation is implemented using the function KernelMixtureDistributionhere both of which provide data-driven bandwidths.

In R, it is implemented through the density and the bkde function in the KernSmooth library (both included in the base distribution), the kde function in the ks library, the dkden and dbckden functions in the evmix library (latter for boundary corrected kernel density estimation for bounded support), the npudens function in the np library (numeric and categorical data), the sm.density function in the sm library. For an implementation of the kde.R function, which does not require installing any packages or libraries, see kde.R.

In SAS, proc kde can be used to estimate univariate and bivariate kernel densities.

In Stata, it is implemented through kdensity; for example histogram x, kdensity. Alternatively a free Stata module KDENS is available from here allowing a user to estimate 1D or 2D density functions.

For this example, the data are a synthetic sample of 50 points drawn from the standard normal and 50 points from a normal distribution with mean 3.5 and variance 1. The automatic bandwidth selection and density estimation with normal kernels is carried out by kde.m. This function implements an automatic bandwidth selector that does not rely on the commonly used Gaussian plug-in rule of thumb heuristic.[18]

This example is based on the Old Faithful Geyser, a tourist attraction located in Yellowstone National Park. This famous dataset containing 272 records consists of two variables, eruption duration, and waiting time until next eruption, both in minutes, included in the base distribution of R. We analyse the waiting times, using the ks library since it has a wide range of visualisation options. The bandwidth function is hpi which in turn calls the dpik function in the KernSmooth library: these functions implement the plug-in selector.[11] The kernel density estimate using the normal kernel is computed using kde which calls bkde from KernSmooth. The plot function allows the addition of the data points as a rug plot on the horizontal axis. The bimodal structure in the density estimate of the waiting times is clearly seen, in contrast to the rug plot where this structure is not apparent.

To demonstrate how kernel density estimation is performed in Python, we simulate some data from a mixture of normals, where 50 observations are generated from a normal distribution with mean zero and standard deviation 3 and another 50 from a normal with mean 4 and standard deviation 1.

The gaussian_kde function from the SciPy package implements a kernel-density estimate using Gaussian kernels, and includes automatic determination of bandwidth. By default, gaussian_kde uses Scott's rule to select the appropriate bandwidth.[22]

Free Online Software (Calculator) computes the Kernel Density Estimation for any data series according to the following Kernels: Gaussian, Epanechnikov, Rectangular, Triangular, Biweight, Cosine, and Optcosine.