Détails sur le séminaire

Abstract:
Learning regular languages is a branch of machine learning, which has been proved useful in many areas, including artificial intelligence, neural networks, data mining, verification, etc. On the other hand, interest in languages defined over large and infinite alphabets has increased in recent years. Although many theories and properties generalize well from the finite case, learning such languages is not an easy task. As the existing methods for learning regular languages depends on the size of the alphabet, a straightforward generalization in this context is not possible.

In this thesis, we present a generic algorithmic scheme that can be used for learning languages defined over large or infinite alphabets, such as bounded subsets of N or R or Boolean vectors of high dimensions. We restrict ourselves to the class of languages accepted by deterministic symbolic automata that use predicates to label transitions, forming a finite partition of the alphabet for every state.

The proposed learning algorithm, an adaptation of Angluin’s L∗, combines standard automaton learning by state characterization, with the learning of the static predicates that define the alphabet partitions. We use the online learning scheme, where two types of queries provide the necessary information about the target language. The first type, membership queries, answer whether a given word belongs or not to the target. The second, equivalence queries, check whether a conjectured automaton accepts the target language, a counter-example is provided otherwise.

We study language learning over large or infinite alphabets within a general framework but our aim is to provide solutions for particular concrete instances. For this, we focus on the two main aspects of the problem. Initially, we assume that equivalence queries always provide a counter-example which is minimal in the length-lexicographic order when the conjecture automaton is incorrect. Then, we drop this “strong” equivalence oracle and replace it by a more realistic assumption, where equivalence is approximated by testing queries, which use sampling on the set of words. Such queries are not guaranteed to find counter-examples and certainly not minimal ones. In this case, we obtain the weaker notion of PAC (probably approximately correct) learnability and learn an approximation of the target language. All proposed algorithms have been implemented and their performance, as a function of automaton and alphabet size, has been empirically evaluated.