
This is treated extensively in the statistics literature, under the topic of regression. Two standard references here are Wasserman's book "All of Nonparametric Statistics" and Tsybakov's "Introduction to Nonparametric Estimation". I'll talk briefly about some of the standard material, and try to give pointers outside of statistics (this is a common topic, and different fields have different cultures: they prove different kinds of theorems and make different assumptions).

(1.) (Kernel regressors, sometimes called the Nadaraya-Watson estimator.) Here you write the function at any point as a weighted combination of nearby values. More concretely,
since this is in the statistics literature, you typically suppose you have some examples $((x_i,f(x_i)))_{i=1}^n$ drawn from some distribution,
and fix some kernel $K$ (you can think of this as a Gaussian; what matters most is that it is centered at zero), and write
$$
\hat f(x) := \sum_i f(x_i) \left(\frac{ K(c_n(x-x_i)) }{ \sum_j K(c_n(x-x_j))}\right),
$$
where $c_n\to\infty$ (you are more sensitive to small distances as $n$ increases).
The guarantee is that, as $n\to\infty$, some probabilistic criterion of distortion
(expectation of the sup-norm, a high-probability bound, whatever) goes to zero.
(It hardly matters what $K$ looks like---it matters more how you choose $c_n$.)
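To make this concrete, here is a minimal numpy sketch of the estimator above with a Gaussian kernel; the bandwidth parameter `c` and the toy target $\sin(x)$ are arbitrary choices for illustration.

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, c):
    """Kernel regression: each prediction is a weighted average of y_train,
    with Gaussian weights K(c * (x - x_i)).  Larger c means the estimator
    is more sensitive to small distances (the role of c_n above)."""
    d = c * (x_query[:, None] - x_train[None, :])  # pairwise scaled distances
    w = np.exp(-0.5 * d**2)                        # Gaussian kernel K
    w /= w.sum(axis=1, keepdims=True)              # normalize per query point
    return w @ y_train

# toy example: noisy samples of sin(x)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 2.0 * np.pi, 200)
y = np.sin(x) + 0.1 * rng.normal(size=200)
xq = np.linspace(0.5, 5.5, 50)      # query points away from the boundary
fhat = nadaraya_watson(x, y, xq, c=5.0)
print(np.max(np.abs(fhat - np.sin(xq))))  # small, and shrinks as n grows
```

In practice the whole game is choosing `c` (e.g., by cross-validation), echoing the remark above.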

(2.) (Basis methods.) A similar thing is to choose some family of "basis functions", things like wavelets or piecewise linear functions, but really anything that forms a (possibly overcomplete) basis for the vector space $L^2$, and determine a weighted linear combination of scaled and translated elements.
The techniques here differ drastically from (1.); rather than plopping down basis functions centered at data points, you carefully compute the weight and location of each in order to minimize some distortion criterion. (Typically, the number of basis elements is fixed a priori.) One approach is "matching pursuit" (a relative of "basis pursuit"), where you greedily add in new functions while trying to reduce some approximation error between $\hat f$ and $f$. To get a sense of the diversity of approaches here, a neat paper is Rahimi & Recht's "Uniform approximation of functions with random bases". Perhaps I should say that the grand-daddy of all of these is the Fourier expansion; there's a lot of good material on this in Mallat's book on wavelets.
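As one concrete instance (in the spirit of the Rahimi & Recht paper rather than of pursuit methods), here is a sketch that fixes a random basis of cosines and solves least squares for the combination weights; the feature count, frequency scale, and toy target are all arbitrary choices.

```python
import numpy as np

def random_basis_fit(x, y, n_features=200, scale=2.0, seed=0):
    """Fix a random family of cosine basis functions, then find the best
    linear combination by least squares.  Only the weights are fit; the
    basis itself is not adapted to the data."""
    rng = np.random.default_rng(seed)
    freq = rng.normal(scale=scale, size=n_features)    # random frequencies
    phase = rng.uniform(0.0, 2.0 * np.pi, n_features)  # random phases
    design = np.cos(np.outer(x, freq) + phase)         # basis evaluations
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return lambda xq: np.cos(np.outer(xq, freq) + phase) @ coef

x = np.linspace(-3.0, 3.0, 300)
fhat = random_basis_fit(x, np.tanh(x))
xq = np.linspace(-2.0, 2.0, 50)
print(np.max(np.abs(fhat(xq) - np.tanh(xq))))  # small for this smooth target
```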

(3.) (Tree methods.) Another way is to look at a function as a tree; at each level, you work with some partition of the domain, and return, for instance, the average value over each cell. (Each pruning of the tree also gives a partition.) In the limit, the partition becomes so fine that it no longer discretizes the function, and you have reconstructed it exactly. How best to choose this partition is a tough problem. (You can google this under "regression tree".)
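Here is a bare-bones 1-D sketch of the idea; real regression trees choose splits to minimize error, whereas this one just splits each interval at its midpoint and returns leaf averages, so everything here is purely illustrative.

```python
import numpy as np

def build_tree(x, y, depth):
    """Recursively partition the domain; each leaf stores the average of y
    over its cell.  A tuple (split, left, right) is an internal node."""
    if depth == 0 or len(x) <= 1:
        return float(np.mean(y))
    mid = (x.min() + x.max()) / 2.0     # illustrative midpoint split
    left = x <= mid
    if left.all() or not left.any():
        return float(np.mean(y))
    return (mid, build_tree(x[left], y[left], depth - 1),
                 build_tree(x[~left], y[~left], depth - 1))

def predict(tree, xq):
    while isinstance(tree, tuple):      # descend to the leaf containing xq
        mid, lo, hi = tree
        tree = lo if xq <= mid else hi
    return tree

x = np.linspace(0.0, 1.0, 256)
tree = build_tree(x, x**2, depth=6)     # 2^6 cells in the finest partition
print(abs(predict(tree, 0.3) - 0.09))   # shrinks as the depth (fineness) grows
```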

(4.) (Polynomial methods; see also splines and other interpolating techniques.) By Taylor's theorem, you know that you can get arbitrarily close to well-behaved functions. This may seem like a very basic approach (i.e., just use the Lagrange interpolating polynomial), but where things get interesting is in deciding which points to interpolate. This was investigated extensively in the context of numerical integration; you can find some amazing math under the topics of "Clenshaw-Curtis quadrature" and "Gaussian quadrature". I'm throwing this in here because the types of assumptions and guarantees are so drastically different from what appears above. I like this field, but these methods suffer badly from the curse of dimension; I suspect this is why they are less discussed than they used to be. (If you do numerical integration with Mathematica, I believe it uses quadrature for univariate domains, but sampling techniques for multivariate domains.)
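To illustrate why the choice of interpolation points matters, here is a sketch using Chebyshev points (the nodes behind Clenshaw-Curtis quadrature) on Runge's function $1/(1+25x^2)$, for which equally spaced interpolation famously diverges; the degree is an arbitrary choice.

```python
import numpy as np

def chebyshev_interpolant(f, n):
    """Interpolate f on [-1, 1] at the n+1 Chebyshev extreme points
    cos(pi * k / n), which avoid the Runge phenomenon that plagues
    equally spaced nodes."""
    nodes = np.cos(np.pi * np.arange(n + 1) / n)
    coefs = np.polynomial.chebyshev.chebfit(nodes, f(nodes), n)
    return lambda x: np.polynomial.chebyshev.chebval(x, coefs)

f = lambda x: 1.0 / (1.0 + 25.0 * x**2)   # Runge's function
p = chebyshev_interpolant(f, 40)
xq = np.linspace(-1.0, 1.0, 500)
print(np.max(np.abs(p(xq) - f(xq))))      # decays geometrically in the degree
```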

Considering various restrictions on your function class, you can instantiate the above to get all sorts of other widely used scenarios. For instance, with boolean-valued functions, thresholding (1.) will look a lot like a nearest-neighbor estimator, or an SVM with some local (e.g., Gaussian) kernel. A lot of the above suffers from the curse of dimension (bounds exhibit exponential dependence on the dimension). In machine learning you get around this either by explicitly constraining your class to some family (i.e., "parametric methods"), or by an implicit constraint, usually something relating the quality of the approximants to the target function's complexity (e.g., an analog of the weak learning assumption in boosting).

By the way, my favorite theorem related to neural net approximation is Kolmogorov's superposition theorem (from 1957!). It says that any continuous multivariate function $f:[0,1]^d \to \mathbb{R}$ has the form
$$
f(x) = \sum_{j=0}^{2d}h_j\left(\sum_{i=1}^d g_{j,i}(x_i)\right),
$$
where each $g_{j,i} : \mathbb{R}\to\mathbb{R}$ and each $h_j:\mathbb{R}\to\mathbb{R}$ is (univariate) continuous. Note that, unlike in neural nets, the $g$'s and $h$'s may all differ. But even so, given that there are only $\Theta(d^2)$ functions floating around, I find this totally amazing.

(You only asked about function classes, but I figured you'd be interested in methods as well... if not... oops.)