Introduction

Chapter 7 of Machine Learning for Hackers is about numerical
optimization. The authors organize the chapter around two examples of
optimization. The first is a straightforward least-squares problem like
those we’ve already encountered doing linear regressions, and it is
amenable to standard iterative algorithms (e.g. gradient descent). The
second has a discrete search space and no clearly differentiable
objective, so it lends itself to a stochastic/heuristic optimization
technique (though we’ll see the optimization problem is basically
artificial). The first
problem gives us a chance to play around with Scipy’s optimization
routines. The second problem has us hand-coding a Metropolis algorithm;
this doesn’t show off much new Python, but it’s fun nonetheless.

The notebook for this chapter is in the GitHub repository here, or you
can view it online via nbviewer here.

Ridge regression by least-squares

In chapter 6 we estimated LASSO regressions, which add an L1
penalty on the parameters to the OLS loss function. Ridge regression
works the same way, but applies an L2 penalty to the parameters. Ridge
regression is a somewhat more straightforward optimization
problem, since the L2 penalty gives us a differentiable loss function.
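
Concretely, with penalty parameter $\lambda$, the ridge loss we’ll
minimize is

$$\mathrm{SSE}(\beta) = \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j} \beta_j^2.$$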

In this example, we’ll regress weight on height, similar to chapter
5. We can compute the ridge loss (the penalized sum of squared errors)
with the following Python function:

import numpy as np
import statsmodels.api as sm

y = heights_weights['Weight'].values
Xmat = sm.add_constant(heights_weights['Height'], prepend=True)

def ridge_error(params, y, Xmat, lam):
    '''
    Compute the SSE of the ridge regression. This is the normal
    regression SSE, plus the L2 cost of the parameters.
    '''
    predicted = np.dot(Xmat, params)
    sse = ((y - predicted) ** 2).sum()
    # Add the L2 penalty on the parameters.
    sse += lam * (params ** 2).sum()
    return sse

The authors use R’s optim function, which defaults to the Nelder-Mead
simplex algorithm. This algorithm doesn’t use any gradient or Hessian
information to optimize the function. We’ll want to try out some
gradient methods, though. Even though the functions for these methods
will compute numerical gradients and Hessians for us, for the ridge
problem these are easy enough to specify explicitly.
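
Differentiating the loss above with respect to $\beta$ gives

$$\nabla \mathrm{SSE}(\beta) = 2\,(X^\top X \beta - X^\top y + \lambda \beta), \qquad \nabla^2 \mathrm{SSE}(\beta) = 2\,(X^\top X + \lambda I),$$

which are exactly what the functions below compute.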

def ridge_grad(params, y, Xmat, lam):
    ''' The gradient of the ridge regression SSE. '''
    grad = np.dot(np.dot(Xmat.T, Xmat), params) - np.dot(Xmat.T, y)
    grad += lam * params
    grad *= 2
    return grad

def ridge_hess(params, y, Xmat, lam):
    ''' The Hessian of the ridge regression SSE. '''
    # Note the factor of 2, for consistency with the gradient above.
    return 2 * (np.dot(Xmat.T, Xmat) + np.eye(Xmat.shape[1]) * lam)

Like the LASSO regressions we worked with in chapter 6, the
ridge regression requires a penalty parameter to weight the L2 cost of
the coefficients (called lam in the functions above, since lambda is
a reserved keyword in Python). The authors assume we’ve already found an
appropriate value via cross-validation, and that value is 1.0.

We can now try to minimize the loss function with a few different
algorithms: first the Nelder-Mead simplex, which should correspond to
the authors’ use of optim in R, and then gradient-based methods that
can exploit ridge_grad and ridge_hess.
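
In outline, the calls look something like the following sketch (fmin,
fmin_bfgs, and fmin_ncg are Scipy’s Nelder-Mead, BFGS, and Newton-CG
routines; the zero starting values here are my own choice):

from scipy import optimize

# Start from zero coefficients; lam = 1.0 per the cross-validation
# result assumed above.
params0 = np.zeros(2)
args = (y, Xmat, 1.0)

# Nelder-Mead simplex (no derivatives), like R's default optim.
nm_coefs = optimize.fmin(ridge_error, params0, args=args)
# BFGS, using the analytical gradient.
bfgs_coefs = optimize.fmin_bfgs(ridge_error, params0, fprime=ridge_grad,
                                args=args)
# Newton-CG, using both the gradient and the Hessian.
ncg_coefs = optimize.fmin_ncg(ridge_error, params0, fprime=ridge_grad,
                              fhess=ridge_hess, args=args)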

Fortunately, we get the same results for all three methods. Supplying
the Hessian to the Newton method shaves some time off, but in this
simple application, it’s not really worth coding up a Hessian function
(except for fun).

For this simple problem, all of these methods work well. For more
complicated problems, there are considerations that would lead you to
prefer one over another, or perhaps to use them in combination. There
are also several more methods available, some of which let you solve
constrained optimization problems. Check out the very good
documentation. Also note that if you’re not into hand-coding
gradients, Scipy has a derivative function in its misc module that
will compute numerical derivatives. In many cases, the optimization
functions will do this automatically if you don’t supply a function to
their gradient arguments.
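
For example, to numerically differentiate f(x) = x² at x = 3:

from scipy.misc import derivative

# Central difference approximation; the result is ~6.0.
print(derivative(lambda x: x ** 2, 3.0, dx=1e-6))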

Optimizing on sentences with the Metropolis algorithm

The second example in this chapter is a “code-breaking” exercise. We
start with a message, “here is some sample text”, which we encrypt using
a Caesar cipher that shifts each letter in the message to the next
letter in the alphabet (with Z going to A). We can represent the cipher
(or any cipher) in Python with a dict that maps each letter to its
encrypted counterpart.
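
For instance, a sketch consistent with the names used below, assuming we
work with the lowercase alphabet:

import string

letters = list(string.ascii_lowercase)

# Shift each letter to the next one, wrapping 'z' around to 'a'.
ceasar_cipher = {a: b for (a, b) in zip(letters, letters[1:] + letters[:1])}
inverse_ceasar_cipher = {b: a for (a, b) in ceasar_cipher.items()}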

The inverse_ceasar_cipher dict reverses the cipher, so we can recover an
original message from one that’s been encrypted by the Caesar
cipher. Based on these structures, let’s write functions that
encrypt and decrypt text.

def cipher_text(text, cipher_dict=ceasar_cipher):
    # Split the string into a list of characters to apply
    # the decoder over.
    strlist = list(text)
    ciphered = ''.join([cipher_dict.get(x) or x for x in strlist])
    return ciphered

def decipher_text(text, cipher_dict=ceasar_cipher):
    # Split the string into a list of characters to apply
    # the decoder over.
    strlist = list(text)
    # Invert the cipher dictionary (k, v) -> (v, k).
    decipher_dict = {cipher_dict[k]: k for k in cipher_dict}
    deciphered = ''.join([decipher_dict.get(x) or x for x in strlist])
    return deciphered
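
For example:

print(cipher_text('here is some sample text'))    # ifsf jt tpnf tbnqmf ufyu
print(decipher_text('ifsf jt tpnf tbnqmf ufyu'))  # here is some sample text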

To decrypt our message, we’ll design a Metropolis algorithm that
randomly proposes ciphers, decrypts the message according to each
proposed cipher, and sees how probable the resulting text is based on a
lexical database of word frequencies in Wikipedia.

The following functions generate proposal ciphers for the
Metropolis algorithm. The idea is to randomly generate ciphers and see
what text they result in. If the text resulting from a proposed cipher
is more likely (according to the lexical database) than the current
cipher’s text, we accept the proposal. If it’s not, we accept it with a
probability that shrinks the less likely the resulting text is.

The method of generating new proposals is important. The authors use a
method that chooses a key (letter) at random from the current cipher
and swaps it with some other letter. For example, if we start with the
Caesar cipher, our proposal might randomly choose to re-map A to N
(instead of B). The proposal would then be the same as the Caesar cipher,
but with A → N and M → B (since A originally mapped to B and M
originally mapped to N). This proposal-generating mechanism is
encapsulated in propose_modified_cipher_from_cipher.

This is inefficient in a couple of ways. First, the letter chosen to
modify in the cipher may not even appear in the text, so the proposed
cipher won’t modify the text at all, and we end up wasting cycles
generating a lot of useless proposals. Second, we may pick a letter that
appears in a correctly deciphered, highly likely word; re-mapping it
turns that word into gibberish, making an inferior proposal more likely.

We’ll suggest another mechanism that, instead of selecting a letter from
the current cipher to re-map, chooses a letter from the gibberish words
in the current deciphered text. For example, if our current deciphered
text is “hello wqrld”, we will only select amongst {w, q, r, l, d} to
modify at random. This minimizes the chances that a modified cipher will
turn real words into gibberish and produce less likely text. The
function propose_modified_cipher_from_text implements this proposal mechanism.

One way to think of this is that it’s analogous to tuning the variance
of the proposal distribution in the typical Metropolis algorithm. If the
variance is too low, our algorithm won’t efficiently explore the target
distribution. If it’s too high, we’ll end up generating lots of lousy
proposals. Our cipher proposal rules can suffer from similar problems.

import random

def generate_random_cipher():
    '''
    Randomly generate a cipher dictionary (a one-to-one
    letter -> letter map). Used to generate the starting
    cipher of the algorithm.
    '''
    input = letters
    output = letters[:]
    random.shuffle(output)
    cipher_dict = {k: v for (k, v) in zip(input, output)}
    return cipher_dict

def modify_cipher(cipher_dict, input, new_output):
    '''
    Swap a single key in a cipher dictionary.
    Old: a -> b, ..., m -> n, ...
    New: a -> n, ..., m -> b, ...
    '''
    decipher_dict = {cipher_dict[k]: k for k in cipher_dict}
    old_output = cipher_dict[input]
    new_cipher = cipher_dict.copy()
    new_cipher[input] = new_output
    # Re-map the letter that used to produce new_output, so the
    # cipher stays one-to-one.
    new_cipher[decipher_dict[new_output]] = old_output
    return new_cipher

def propose_modified_cipher_from_cipher(text, cipher_dict,
                                        lexical_db=lexical_database):
    '''
    Generate a new cipher by choosing and swapping a key in the
    current cipher.
    '''
    _ = text  # Unused
    input = random.sample(cipher_dict.keys(), 1)[0]
    new_output = random.sample(letters, 1)[0]
    return modify_cipher(cipher_dict, input, new_output)

def propose_modified_cipher_from_text(text, cipher_dict,
                                      lexical_db=lexical_database):
    '''
    Generate a new cipher by choosing and swapping a key in the current
    cipher, but only choose keys among the letters that appear in the
    gibberish words of the current deciphered text.
    '''
    deciphered = decipher_text(text, cipher_dict).split()
    letters_to_sample = ''.join([t for t in deciphered
                                 if lexical_db.get(t) is None])
    # Fall back to all the text's letters if there are no gibberish words.
    letters_to_sample = letters_to_sample or ''.join(set(deciphered))
    input = random.sample(letters_to_sample, 1)[0]
    new_output = random.sample(letters, 1)[0]
    return modify_cipher(cipher_dict, input, new_output)

Next, we need to be able to compute a message’s likelihood from the
lexical database. The log-likelihood of a message is just the sum of
the log-likelihoods of each word (one-gram) in the message. If a word
is gibberish (i.e., doesn’t occur in the database), it gets a tiny
probability, set to floating-point machine epsilon.
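
The driver below calls this function text_logp; here’s a minimal sketch
of it, assuming lexical_database maps each word to its probability:

import numpy as np

def text_logp(text, cipher_dict, lexical_db=lexical_database):
    '''
    Log-likelihood of the deciphered text: the sum of the log
    probabilities of its words. Unknown (gibberish) words get
    machine epsilon as their probability.
    '''
    words = decipher_text(text, cipher_dict).split()
    eps = np.finfo(float).eps
    return sum(np.log(lexical_db.get(word, eps)) for word in words)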

We can now use these functions in our Metropolis algorithm. Each step of
the Metropolis algorithm proposes a cipher, deciphers the text according
to the proposal, and computes the log-likelihood of the deciphered message.
If the likelihood of the deciphered message is higher under the proposed
cipher than the current cipher, we always accept the proposal for
our next step. If not, we accept the proposal only with a probability
based on the likelihood of the proposal relative to the current cipher.
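
A minimal sketch of a single step following this acceptance rule (the
driver below calls it metropolis_step):

import math
import random

def metropolis_step(ciphered_text, cipher, proposal_rule):
    '''
    Propose a modified cipher. Accept it outright if it makes the
    deciphered text more likely; otherwise accept it with probability
    equal to the ratio of the proposal's likelihood to the current one's.
    '''
    proposed_cipher = proposal_rule(ciphered_text, cipher)
    logp_current = text_logp(ciphered_text, cipher)
    logp_proposed = text_logp(ciphered_text, proposed_cipher)
    if logp_proposed > logp_current:
        return proposed_cipher
    if random.random() < math.exp(logp_proposed - logp_current):
        return proposed_cipher
    return cipher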

I’ll define the driver function, metropolis_decipher, to take an
arbitrary proposal function via its proposal_rule argument. So far, this
can be one of the two propose_modified_cipher_from_* functions defined above.

To run the algorithm, just wrap the step function inside a loop. There’s
no stopping rule for the algorithm, so we have to choose a number of
iterations, and hope it’s enough to get us to the optimum. Let’s use 250,000.

from pandas import DataFrame

message = 'here is some sample text'
ciphered_text = cipher_text(message, ceasar_cipher)
niter = 250000

def metropolis_decipher(ciphered_text, proposal_rule, niter, seed=4):
    random.seed(seed)
    cipher = generate_random_cipher()

    deciphered_text_list = []
    logp_list = []

    for i in xrange(niter):
        logp = text_logp(ciphered_text, cipher)
        current_deciphered_text = decipher_text(ciphered_text, cipher)
        deciphered_text_list.append(current_deciphered_text)
        logp_list.append(logp)
        cipher = metropolis_step(ciphered_text, cipher, proposal_rule)

    results = DataFrame({'deciphered_text': deciphered_text_list,
                         'logp': logp_list})
    results.index = np.arange(1, niter + 1)
    return results
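
Running the algorithm with each proposal rule is then something like:

results_from_cipher = metropolis_decipher(
    ciphered_text, propose_modified_cipher_from_cipher, niter)
results_from_text = metropolis_decipher(
    ciphered_text, propose_modified_cipher_from_text, niter)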

First let’s look at the authors’ proposal rule. While they managed to get a reasonable decrypted message
in about 50,000 iterations, we’re still reading gibberish after 250,000.
As they say in the book, their results are an artefact of a lucky seed value.

Now, let’s try the alternative proposal rule, which only chooses letters
from gibberish words when it modifies the current cipher to propose a
new one. The algorithm doesn’t find the actual message, but it actually
finds a more likely message (according to the lexical database) within
20,000 iterations.

The graph below plots the likelihood paths of the algorithm for the two
proposal rules. The blue line is the log-likelihood of the original
message we’re trying to recover.

Direct calculation of the most likely message

The Metropolis algorithm is kind of pointless for this application. It’s
really just jumping around looking for the most likely phrase. But since
the likelihood of a message is just the sum of the log probabilities of
its component words, we can simply look up the most likely word of each
length appearing in the ciphered message.

If the message at some point is “fgk tp hpdt”, then, if run long enough,
the algorithm should just find the most likely three-letter word, the
most likely two-letter word, and the most likely four-letter word. But
we can look these up directly.

For example, the message we encrypted is ‘here is some sample text’,
which has word lengths 4, 2, 4, 6, 4. What’s the most likely message
with these word lengths?
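
A minimal sketch of the direct lookup, again assuming lexical_database
maps words to probabilities:

def most_likely_word(length, lexical_db=lexical_database):
    ''' Find the highest-probability word of a given length. '''
    candidates = [w for w in lexical_db if len(w) == length]
    return max(candidates, key=lexical_db.get)

# Replace each word of the original message with the most likely word
# of the same length.
print(' '.join(most_likely_word(len(w)) for w in message.split()))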

So, technically, we should have decoded our message as “with of with
united with” instead of “here is some sample text”. This is not a
shining endorsement of this methodology for decrypting messages.

Conclusion

While it was a fun exercise to code up the Metropolis decrypter in this
chapter, it didn’t show off any new Python functionality. The ridge
problem, while less interesting, showed off some of the optimization
algorithms in Scipy. There’s a lot of good stuff in Scipy’s optimize
module, and its documentation is worth checking out.