In our first post on learning theory, we’ll start by examining a nice question that started an interesting discussion and brought out some of the core issues learning theorists constantly have to deal with. It also serves as a good introduction to one of the classic areas of learning theory.

As we know, regular expressions describe regular languages, which are also equivalent to the class of Finite State Automata (DFA), a classic object in computability theory. Suppose we get examples of strings that are members of a regular language (positive examples) and strings that are not members of the language (negative examples), labeled accordingly. Laszlo wanted to know whether finding the smallest regular expression (or DFA) consistent with these examples is NP-hard, and moreover, what this has to do with learning regular languages.

Learning and Occam’s Razor

When we talk about learning, we need to first pick a model, and the first natural one to think about is Leslie Valiant’s classical PAC learning model (’84), partly for which he received this year’s Turing award. In PAC learning, the learner is trying to learn some concept (in this case, a specific regular language) and is given positive and negative examples (member strings and not) from some unknown distribution (say, uniform over all strings). The learner’s goal is to produce a hypothesis that will match the label of the hidden target (the membership in the target regular language) on most examples from the same distribution, with high probability. The learner has to succeed for every choice of target and every distribution over examples.

Say the learner knows the target class (in our case, that the target is some regular language). In that case, if the learner gets some polynomially many examples in the representation of the concept and can find a small hypothesis from the target class that is correct on all the examples, even if this hypothesis isn’t exactly equal to the target, it will generalize well and be suitable for PAC learning. This theorem is aptly named Occam’s Razor (Blumer et al. ’87).

Hardness of Learning

So, indeed, Laszlo was right — whether or not finding the smallest regular expression (or smallest DFA) consistent with a given set of examples is NP-hard has a lot to do with being able to learn regular languages efficiently. Note that if we’re not concerned with efficiency, exhaustive search through all automata would yield the smallest one. But unfortunately, it is indeed NP-hard to find the smallest DFA (Angluin ’78; Gold ’78), and moreover, it is NP-hard to approximate a target DFA of size OPT by a DFA of size OPT^k for any constant factor k (Pitt and Warmuth ’93).

So does this imply regular languages are hard to learn? Unfortunately not; efficiently finding a small, zero error, DFA would have given us an efficient learning algorithm, but just because we can’t find one doesn’t mean there isn’t some other way to learn regular languages. For example, perhaps, we could learn by finding a hypothesis that’s something else entirely, and not a DFA. (For those interested, this is the difference between proper and improper learning.) Just because we’ve eliminated one way of learning doesn’t mean we’ve eliminated other venues.

There is, however, a more serious impediment to learning regular languages. In an important paper, Kearns and Valiant (’94) proved that you can encode the RSA function into a DFA. So, even if the labeled examples come from the uniform distribution, being able to generalize to future examples (also even coming from the uniform distribution) would break the RSA cryptosystem (Rivest et al. ’78), which is quite unlikely. This type of result is known as a cryptographic hardness of learning result. Hence, there is good reason to believe that DFA aren’t efficiently PAC learnable after all.

A Small Puzzle

This question raised a lot of interesting discussion among the answers and in the comments, with the pointers to the crpytographic hardness results bringing up a puzzle. It is known that given an automaton, there exist algorithms to minimize it efficiently (Huffman ’54; Moore ’56), meaning it is possible to efficiently find the automaton with the smallest number of states that accepts exactly the same language as the given one. At first, it may seem that contradicts the the NP-hardness result, that one can find the smallest automaton consistent with a given set of examples using the following procedure: construct a large DFA so that for each positive example, we have a chain accepting only that example. Then, minimize that DFA. It seems like this should work, but why doesn’t it?

The answer turns out to be that when you construct a DFA using the procedure above, you commit to a certain language. Then, upon minimizing the DFA, you have to stay within the same language, whereas the smallest DFA consistent with all the examples might encode a different language altogether. One can easily see this by the following example: imagine the only positive examples are “000001” and “000011.” Building the DFA in the manner described above commits to the language {“000001”, “000011”}, whereas the smallest DFA consistent with all examples is just one accept state pointing to itself, and that accepts everything (because we have no negative examples, this is allowed). Puzzle resolved.

Other Models

The question didn’t specify what it means to learn, and one great thing about the cstheory site is that different answers were able to address learnability in different models. We’ve just discussed the PAC model and learning regular languages in it, but the picture changes when we consider other models of learning. One model addressed in the answers is Mark Gold’s learning in the limit (’67), the first learning model considered in the literature. Here, the learner is presented an infinite stream of data and must eventually converge to the correct hypothesis. It turns out that in this setting, regular languages are unconditionally not learnable in the limit (Gold ’67) from text (positive examples).

To give a positive result, regular language learning is the canonical problem for another model, learning with membership and equivalence queries. In this model, the learner gets to ask membership queries (“Is string [x] in the language?”) and equivalence queries (“Is my hypothesis correct, and if not, what is a string on which it disagrees with the target?”). In a seminal paper, Dana Angluin (’87), who introduced the model, showed that regular languages are exactly learnable with membership and equivalence queries, and this result helped illustrate the power and relevance of this query model.

Open Problems

Because this is a blog post and not a survey, many important results have been left out from this discussion. Nonetheless, it may seem that everything is already known about the learnability of automata. This is far from true. There are many interesting problems left to explore: What happens for specific subclasses of reglar languages? Can we learn when the target language isn’t worst-case, but comes from some distribution? What if we restrict to certain types of algorithms? etc.

But instead of listing many specific questions, we will instead turn to a fascinating (and difficult) question asked on this site which has implications for learning regular languages. Aryeh Kontorovich recently asked: How many DFAs accept two given strings? Or, rather, how hard is it to compute this quantity, as a function of n, the number of states in the DFA. At the time of this post, this question has no accepted answers, so if you think you have an answer, come to cstheory.stackexchange.com and let us know!