Computes the Lagrangian dual L(theta) of the entropy of the
model, for the given vector theta=params. Minimizing this
function (without constraints) should fit the maximum entropy
model subject to the given constraints. These constraints are
specified as the desired (target) values self.K for the
expectations of the feature statistic.

This function is computed as:

L(theta) = log(Z) - theta^T . K
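For a small discrete sample space this dual can be computed directly. A minimal sketch (the feature matrix, target expectations, and parameter values here are illustrative, not part of the package):

```python
import numpy as np

# Hypothetical discrete example: F has one row per point x in the sample
# space and one column per feature statistic f_i(x).
F = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
K = np.array([0.6, 0.7])        # target (desired) feature expectations
theta = np.array([0.2, -0.1])   # current parameter vector

# Unnormalized log probabilities theta . f(x) for each x, then log Z
log_p_dot = F @ theta
log_Z = np.log(np.sum(np.exp(log_p_dot)))

# Entropy dual L(theta) = log(Z) - theta^T . K
L = log_Z - theta @ K
```

In practice a numerically stable log-sum-exp should replace the direct `np.log(np.sum(np.exp(...)))` when the feature values are large.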

For 'bigmodel' objects, it estimates the entropy dual without
actually computing p_theta. This is important when the sample
space is continuous or too large to enumerate in practice. We
approximate the normalization constant Z using importance
sampling, as in [Rosenfeld01whole]. This estimator is
deterministic for any given sample. The gradient of this
estimator equals the importance-sampling ratio estimator of the
gradient of the entropy dual [see my thesis], which justifies
using it together with grad() in optimization methods that
require both the function and its gradient. Note, however, that
convergence guarantees break down for most optimization
algorithms in the presence of stochastic error.
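The ratio estimator of the gradient mentioned above can be sketched as follows (the auxiliary density, feature statistic, and all names here are illustrative assumptions, not the package's API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical continuous example: auxiliary sampling density q = N(0, 1),
# feature statistic f(x) = (x, x**2), p_dot(x) = exp(theta . f(x)).
theta = np.array([0.1, -0.6])
K = np.array([0.0, 0.8])          # target feature expectations

x = rng.standard_normal(5000)     # draws x_j ~ q
F = np.column_stack([x, x**2])    # f(x_j) for each draw
log_q = -0.5 * x**2 - 0.5 * np.log(2 * np.pi)
log_w = F @ theta - log_q         # log importance weights p_dot(x_j)/q(x_j)

# Self-normalized (ratio) estimate of E_p[f]: a weighted average of f
w = np.exp(log_w - log_w.max())   # rescale weights for numerical stability
E_f = (w @ F) / w.sum()

# Estimated gradient of the entropy dual: E_p[f] - K
grad_est = E_f - K
```

Because the same weights appear in numerator and denominator, rescaling them by any constant (as done above for stability) leaves the ratio estimate unchanged.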

For 'bigmodel' objects, the dual estimate is given as:

L_est(theta) = log(Z_est) - theta^T . K

where Z_est is the importance sampling estimate of the
normalization constant:

Z_est = (1/n) sum_j p_dot(x_j) / q(x_j)

with p_dot(x) = exp(theta . f(x)), q the auxiliary sampling
distribution, and x_1, ..., x_n drawn from q.
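A sketch of this importance-sampling estimate of the dual, using an assumed auxiliary density q = N(0, 1) and an illustrative feature statistic f(x) = (x, x**2) (none of these names come from the package itself):

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.1, -0.6])
K = np.array([0.0, 0.8])          # target feature expectations

n = 10_000
x = rng.standard_normal(n)        # sample x_j ~ q, here q = N(0, 1)
F = np.column_stack([x, x**2])    # f(x_j) for each draw

log_p_dot = F @ theta             # log of unnormalized density p_dot(x_j)
log_q = -0.5 * x**2 - 0.5 * np.log(2 * np.pi)   # log N(0, 1) density

# Z_est = (1/n) sum_j p_dot(x_j) / q(x_j), computed in log space
log_w = log_p_dot - log_q
log_Z_est = np.log(np.mean(np.exp(log_w)))

# Entropy dual estimate: L_est(theta) = log(Z_est) - theta^T . K
L_est = log_Z_est - theta @ K
```

For a fixed sample of x_j the estimate is deterministic, as noted above; redrawing the sample gives a different (stochastic) value.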