Fit the maxent model p whose feature expectations are given
by the vector K.

Model expectations are computed either exactly or using Monte
Carlo simulation, depending on the ‘func’ and ‘grad’ parameters
passed to this function.

For ‘model’ instances, expectations are computed exactly, by summing
over the given sample space. If the sample space is continuous or too
large to iterate over, use the ‘bigmodel’ class instead.
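For a small discrete sample space, computing the expectations exactly amounts to normalizing exp(params . f(x)) over the space and summing. A minimal numpy sketch (the sample space, feature matrix, and parameter values here are illustrative, not part of this API):

```python
import numpy as np

# Hypothetical setup: 3 points in the sample space, 2 features each.
F = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])        # F[x, i] = f_i(x)
params = np.array([0.5, -0.2])

# p(x) is proportional to exp(params . f(x)); normalize over the space,
# subtracting the max log value for numerical stability.
logp = F @ params
p = np.exp(logp - logp.max())
p /= p.sum()

# Exact feature expectations E_p[f_i] = sum_x p(x) f_i(x)
expectations = p @ F
```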

For ‘bigmodel’ instances, the model expectations are not computed
exactly (by summing or integrating over a sample space) but
approximately (by Monte Carlo simulation). Simulation is necessary
when the sample space is too large to sum or integrate over in
practice, like a continuous sample space in more than about 4
dimensions or a large discrete space like all possible sentences in a
natural language.

Approximating the expectations by sampling requires an instrumental
distribution that should be close to the model for fast convergence.
Its tails should be fatter than the model's. This instrumental
distribution is specified by calling setsampleFgen() with a
user-supplied generator function that yields a matrix of features of a
random sample and its log pdf values.
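The resulting Monte Carlo estimate is an importance-sampling ratio: E_p[f] is approximated by sum_j w_j f(x_j) / sum_j w_j with weights w_j proportional to p(x_j)/q(x_j), where q is the instrumental distribution. A hedged sketch of this estimator (the toy model and Gaussian instrumental distribution below are stand-ins, not the actual setsampleFgen() interface):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in model: unnormalized log p(x) = params . f(x), with f(x) = (x, x**2)
params = np.array([0.3, -0.5])

def features(x):
    return np.stack([x, x**2], axis=1)

# Instrumental distribution q: a normal with fatter tails than the model
sigma = 2.0
x = rng.normal(0.0, sigma, size=10000)
logq = -0.5 * (x / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

F = features(x)
logp_tilde = F @ params        # unnormalized model log density at the sample
logw = logp_tilde - logq       # log importance weights
w = np.exp(logw - logw.max())  # weights, stabilized against overflow

# Ratio estimator of the feature expectations under the model
expectations = (w @ F) / w.sum()
```

The max-subtraction before exponentiating is harmless because the estimator is a ratio: any constant factor in the weights cancels.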

The algorithm can be ‘CG’, ‘BFGS’, ‘LBFGSB’, ‘Powell’, or
‘Nelder-Mead’.

The CG (conjugate gradients) method is the default; it is quite fast
and requires only linear space in the number of parameters (not
quadratic, like Newton-based methods).

The BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm is a
variable metric Newton method. It is perhaps faster than the CG
method but requires O(N^2) instead of O(N) memory, so it is
infeasible for more than about 10^3 parameters. The LBFGSB
algorithm is a limited-memory variant of BFGS that needs only O(N)
memory, so it remains feasible for larger models.

The Powell algorithm doesn't require gradients. For small models
it is slow but robust. For big models (where func and grad are
simulated) whose function estimates have high variance, it may be
less robust than the gradient-based algorithms.
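The same trade-offs can be explored with the equivalent methods in scipy.optimize.minimize; a minimal illustration on a toy smooth objective (the quadratic here is only a stand-in for the maxent dual):

```python
import numpy as np
from scipy.optimize import minimize

# Toy smooth objective standing in for the maxent dual, with its gradient.
# The minimum is at theta = (1, 1, 1, 1, 1).
def func(theta):
    return 0.5 * np.sum((theta - 1.0) ** 2)

def grad(theta):
    return theta - 1.0

x0 = np.zeros(5)

# Gradient-based methods: CG needs O(N) memory, BFGS O(N^2)
res_cg = minimize(func, x0, jac=grad, method='CG')
res_bfgs = minimize(func, x0, jac=grad, method='BFGS')

# Derivative-free: Powell uses only function values, no gradient
res_powell = minimize(func, x0, method='Powell')
```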