$$L = \{w\mid w\text{ contains a substring of form }yy\text{, where }y\text{ is any non-empty string}\}.$$ Is this language regular? We do not know what $y$ looks like in advance. And why is this language regular or not regular?

$\begingroup$What did you try? Where did you get stuck? We're happy to help you understand the concepts but jsut solving homework-style exercises for you is unlikely to really do that. Try to come up with a regular expression or automaton for the language. If that doesn't seem to work, you probably found some problem: maybe you can use that idea to prove that the language isn't regular. If that doesn't seem to work, you'll have more ideas that you can use to try to prove that it is regular, and so on.$\endgroup$
– David RicherbyOct 15 '18 at 17:47

1

$\begingroup$@Apass.Jack That used to be a difficult problem. There exist arbitrary long strings without repetitions over three letters, which is a big difference to the two letter case. For a solution see R.Ross and K.Winklmann, Repetitive strings are not context-free, RAIRO - Theoretical Informatics and Applications, 16 (1982) 191-199, numdam.org$\endgroup$
– Hendrik JanOct 15 '18 at 19:38

1 Answer
1

Slightly more formally, for an alphabet $\Sigma$, the language of repetitive strings over $\Sigma$ is defined as follows.
$$ RR = \{w: w=uyyv \text{ for some }u, y, v \in \Sigma^*, y \text{ is not empty}\}$$ The question is, is RR a regular language?

There are three cases.

$\Sigma$ has one symbol.
WLOG, let $\Sigma=\{a\}$. Then $RR = aaa^*$ is a regular language.

$\Sigma$ has two symbols.
WLOG, let $\Sigma=\{a, b\}$. Suppose $w\in RR$ does not contain $aa$ or $bb$. That is, if an $a$ in $w$ is followed by another symbol, that symbol must be $b$; if a $b$ in $w$ is followed by another symbol, that symbol must be $a$. That means, if $w$ starts with $a$, it must contain $abab$; otherwise, it must contain $baba$. So any string in $RR$ must have one of $aa, bb, abab$ and $baba$ as its substring. Obviously, any string that contain one of those four strings is a repetitive string. So we can write $RR = (a+b)^\ast(aa+bb+abab+baba)(a+b)^\ast$ as a regular expression. Hence, $RR$ is regular.

$\Sigma$ has 3 or more symbols. This is the most interesting case.
WLOG, let $\Sigma\supseteq\{a,b,c\}$. We know there is an infinite sequence $u$ of $a,b,c$'s that has no consecutive repeated substring. Let $u_k$ be the first $k$ symbols in $u$. For example, if $u=abcabacabcb\cdots$, then $u_1=a, u_2=ab, \cdots, u_6=abcaba, \cdots$.
Let $i<j$ be two positive integers. Then $u_j=u_iv$ for some non-empty string $v\in\Sigma^*$. Since $u_iv=u_j\notin RR$ and $u_jv=u_ivv\in RR$, $u_i$ and $u_j$ are in different Myhill–Nerode equivalent classes for $RR$. So we have infinitely many Myhill–Nerode equivalent classes that are represented by $u_1, u_2, \cdots$ respectively. According to the Myhill–Nerode theorem, $RR$ cannot be regular.

In the analysis for the last case, we have used one of Alex Thue's results on non-repetitive string over three symbols. He showed that there are many such strings of infinite length. One of them is given by the following procedure. Let $\Sigma = \{a,b,c\}$. Let $A_0 = a$ and $\phi : a \mapsto abcab, b\mapsto acabcb, c\mapsto acbcacb$. So
$$\begin{aligned}
A_0 = a& \\
\phi(A_0)= a&bcab\\
\phi^2(A_0)= a&bcab\ acabcb\ acbcacb\ abcab\ acabcb\\
\phi^3(A_0)= a&bcab\ acabcb\ acbcacb\ abcab\ acabcb \\
a&bcab\ acbcacb\ abcab\ acabcb\ acbcacb\ acabcb\\
a&bcab\ acbcacb\ acabcb\ acbcacb\ abcab\ acbcacb\ acabcb\\
a&bcab\ acabcb\ acbcacb\ abcab\ acabcb\\
a&bcab\ acbcacb\ abcab\ acabcb\ acbcacb\ acabcb\\
\cdots\end{aligned}
$$
Note that $A_0$ is a prefix of $\phi(A_0)$. By a very easy mathematical induction, we can show that $\phi^n(A_0)$ is a prefix of $\phi(\phi^{n}(A_0))=\phi^{n+1}(A_0)$. (Thinking about this recurrence relation, I can feel the logic beauty of self reference. That is why I cannot help but writing down this example.) So we can specify a unique infinite string, $\omega$ whose prefix can be $\phi^n(A_0)$ for any $n$. $\omega$ is the wanted string. For a proof that $\omega$ is a string without consecutive repeated substrings, reader can check Axel Thue, Uber die gegenseitige Lage gleicher Teile gewisser Zeichenreihen; Norske Vid. Skrifter
I Mat.-Nat. Kl.; Christiania; 1912 page 1–67.

In fact, as HendrikJan pointed out, the language of repetitive strings over three or more symbols is not context-free. This fact is proved in Repetitive strings are not context-free by R.Ross and K.Winklmann, RAIRO - Theoretical Informatics and Applications, 16 (1982) 191-199.

Since the complement of a regular language is also a regular language, the complement of the language of repetitive strings is also regular, too. In fact, if the alphabet has one or two symbols, they are finite languages. All non-repetitive strings over $\{a\}$ are $\{\epsilon, a\}$. All non-repetitive strings over $\{a,b\}$ are $\{\epsilon, a, b, ab, ba, aba, bab\}$.

Exercise. Show that an infinite language that does not contain a square is not context-free.

Question. Is the language of almost repetitive strings defined below regular? If no regular, is it context-free?
$$ RR = \{w: w=uyxyv \text{ for some }u, y, v \in \Sigma^*, y \text{ is not empty}, x\in\Sigma\}$$