artist has asked for the
wisdom of the Perl Monks concerning the following question:

Hi,
Does anyone have an easy idea for how to find patterns in a given string when only the number of patterns is known? One pattern can appear any number of times in the string. The string contains nothing but patterns. The number of patterns is always more than 1.

Yow. I'm not sure who's crazier - you for suggesting this might be something one would want to do, or me for trying to do it ;)

Because It's There, as the man said.

Having struggled with it a bit I realised one thing about the question itself, which is that we can't say there are only three patterns. In fact there are a lot more - "hell", "hel", and "he" to name but the most obvious additions. That's unless we want to match against a dictionary, in which case it's just a matter of processing power.

Assuming we are interested in patterns rather than specific words I think the following does it. I should say at the outset that the clever bit in this comes from japhy's regex book which is referred to in this node.

This throws up 31 patterns, with up to four occurrences each. (BTW, in case $window doesn't make sense, I assumed (A) there must be at least two occurrences of each pattern, otherwise it wouldn't really be a pattern; (B) each pattern must be at least 2 chars and (C) there must be at least 2 patterns.)
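
For anyone who wants to play with assumptions (A) and (B) without the regex trick, here is a minimal brute-force sketch that simply counts substrings with substr. It is not the code from this post: the helper name repeated_substrings, its parameters, and the test string are all made up for illustration, and overlapping occurrences are counted.

```perl
use strict;
use warnings;

# Hypothetical helper (not the original code): count every substring of at
# least $min_len characters and keep those seen at least $min_count times.
# Overlapping occurrences are counted separately.
sub repeated_substrings {
    my ($str, $min_len, $min_count) = @_;
    my %count;
    my $n = length $str;
    for my $start (0 .. $n - $min_len) {
        for my $len ($min_len .. $n - $start) {
            $count{ substr($str, $start, $len) }++;
        }
    }
    my %kept;
    for my $s (keys %count) {
        $kept{$s} = $count{$s} if $count{$s} >= $min_count;
    }
    return \%kept;
}

# Assumed test string: at least 2 chars per pattern, at least 2 occurrences.
my $found = repeated_substrings('hellohellohi', 2, 2);
print join(',', sort keys %$found), "\n";
```

This is quadratic in the string length, so it is only reasonable for short strings, and it reports every repeated fragment ("he", "hel", "hell", ...) rather than choosing among them.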

hello,hi,othello,brake,rake,raker,rash,ash,hash,ohio,
the,lob,bra,hell,era
# I get this using substr counts:
ak,ake,aker,akera,as,ash,el,ell,ello,er,era,
he,hel,hell,hello,ke,ker,kera,ll,llo,lo,
ra,rak,rake,raker,rakera,sh

I would expect any of the following:
hello,hi,othello,brake,rake,raker,rash,ash,hash,ohio,
the,lob,bra,hell,era

I don't see how you could expect some of those strings, because some only appear once in the string (e.g. "othello", "ohio"), so you really couldn't call them a "pattern" unless you're matching against a dictionary file.

My solution near the top of this thread sort of assumes that the string is a contiguous series of patterns (one of the original constraints was "String contains nothing but patterns"), so it only finds "hello" from your test string, but if you change this line:

# From this
if (/\G(.{2,})(?=.*?\1)/g) {
# To this
if (/\G.*?(.{2,})(?=.*?\1)/g) {
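
To see what the two regexes do differently, here is a small sketch that runs both in list context. The string is one I made up for illustration (it is not the test string from this thread), and the surrounding loop from my earlier post is left out.

```perl
use strict;
use warnings;

# Hypothetical string, chosen so that the second "hello" is followed by
# characters that do not start a repeated pattern at that position.
my $str = 'hellohelloworldworld';

# Anchored: each match must start exactly where the previous one ended.
my @anchored = $str =~ /\G(.{2,})(?=.*?\1)/g;

# Loosened: a lazy .*? may skip characters before the next capture.
my @loosened = $str =~ /\G.*?(.{2,})(?=.*?\1)/g;

print "anchored: @anchored\n";
print "loosened: @loosened\n";
```

On this string the anchored form should stop after "hello", because nothing starting at the second "hello" repeats later, while the loosened form skips ahead and should also pick up "world".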

Ok. Here's a better version. While I haven't
benchmarked it, my feeling is that it's a hog, but I
bullet-proofed several areas. It's less of a hog
than my earlier post. I'm posting the new version
for easier comparison.

I would say this is a daunting task at the very best, probably not at all suited for perl. Humans can pick up patterns like that with no problem (providing they know the language the patterns are written in), but how does a computer "know" of the word hello? what differentiates "hello" from "hell"? Why is "oh" not part of the pattern??
There would need to be a much stricter rulebase before this is even worth trying in perl, I would imagine: possibly a known dictionary, or a small set of specific patterns to look for.
Anyway, my 2 cents..
-Syn0

I'd like to show a failed attempt at this interesting but ill-posed problem. I say ill-posed because the expected output is not deducible from the input alone; it presumes patterns which form English words.

The LZx family of compression algorithms depends on finding and cataloging repeated substrings. I had the notion to use a modified LZW compression routine to find a list of candidate patterns. Only the dictionary is built, and I generate no compressed stream (take that, Unisys!). The dictionary keeps frequencies rather than unique numeric identifiers.

I shift in arguments or set them to a default, then clear out control characters. $depth provides for multiple scans of the string, which is part of why this approach is flawed. As a stream-oriented algorithm, LZW does not predict frequent substrings at the beginning, and is greedy about tacking extra characters onto a candidate. We'll see these effects later in the output listing.

Per LZW, we prime the dictionary with our alphabet. We set up a current working string, $j, then scan the input string one character ($_) at a time. If $j.$_ has been seen, we go on to the next character. If it has not, we increment $j's count, add $j.$_ to the dictionary, and reset $j to $_.

Print the collected substrings, filtering out the single characters. I make a crude attempt to sort by desirability of a pattern, accounting for both frequency and length. A Data::Dumper spill of the dictionary can be uncommented for closer study.
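
Since the listing itself is not reproduced here, the following is only a rough sketch of the steps described above, not the original code. The sub name lzw_frequencies, the starting count of 0 for new entries, the count-times-length ranking, and the sample string are all my assumptions.

```perl
use strict;
use warnings;

# Sketch of an LZW-style dictionary that keeps frequencies instead of codes.
# lzw_frequencies is a hypothetical name; new entries start at 0 (assumed).
sub lzw_frequencies {
    my ($str, $depth) = @_;
    $str =~ tr/\x00-\x1f//d;                      # clear out control characters
    my %dict = map { $_ => 0 } split //, $str;    # prime with the alphabet
    for my $pass (1 .. $depth) {                  # $depth scans of the string
        my $j = '';                               # current working string
        for (split //, $str) {
            if (exists $dict{ $j . $_ }) {
                $j .= $_;                         # seen before: keep extending
            }
            else {
                $dict{$j}++;                      # count the known prefix
                $dict{ $j . $_ } = 0;             # add the new candidate
                $j = $_;                          # restart from this character
            }
        }
    }
    return \%dict;
}

# Assumed sample input, scanned twice.
my $dict = lzw_frequencies('hellohellohellohelloworldworld', 2);

# Crude ranking (an assumption): frequency times length, ties by name.
for my $s (sort { $dict->{$b} * length($b) <=> $dict->{$a} * length($a)
                  or $a cmp $b }
           grep { length($_) > 1 } keys %$dict) {
    printf "%-12s %d\n", $s, $dict->{$s};
}
```

Note that the working string left over at the end of each pass is never counted, which is one more way a stream-oriented scan can undercount a pattern.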

Clearly, the limited view of the data taken by a stream-oriented algorithm is not good enough to recognize the two occurrences of 'world' as distinct from the four of 'hello'. The lookahead and capture facilities of perl's regex engine are superior for this task.

Props to jepri, who motivated me to look at the LZ clan a while back. I didn't use his code for this; these warts are all mine. I originally was going to try a similar trick in a cryptanalytic toolkit, but this exercise has shown me that I need to find a better idea. I think I'll find several in the other replies here.

Here is my effort. It finds all the patterns and also does a
quick dictionary lookup for real words. If you don't have a
dictionary text file you can select from a wide variety at the
National Puzzlers' League -- Word Lists

This little guy will do what you want.
The string can't have spaces, or you'll have to use a different separator rather than a space. If you wanted it to be bullet-proof, you would have to use arrays and do the regexp compares on every element.

WARNING: If there is a pattern that is never repeated, this will go into an infinite loop

1. I am not looking specifically for dictionary words.
2. One Pattern cannot be part of another pattern.
3. There could be multiple answers.
4. A Pattern may show up only once.
5. A Pattern may consist of a single character.
6. A Pattern may also contain spaces.

In case of multiple answers there could be 'techniques' we can apply to pick the best one,
such as: minimum sum of the lengths of the patterns.
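
As a rough illustration of how a candidate answer could be checked against these rules and scored by summed length, here is a minimal sketch. The sub name score_patterns and the sample strings are hypothetical, and it does not verify that every pattern is actually used in the segmentation.

```perl
use strict;
use warnings;

# Hypothetical checker: returns the summed length of @patterns if they
# satisfy rule 2 and the string consists of nothing but those patterns;
# returns undef (in scalar context) otherwise.
sub score_patterns {
    my ($str, @patterns) = @_;

    # Rule 2: one pattern cannot be part of another pattern.
    for my $p (@patterns) {
        for my $q (@patterns) {
            next if $p eq $q;
            return if index($q, $p) >= 0;
        }
    }

    # The whole string must be a run of the patterns (spaces are fine,
    # since quotemeta escapes them).
    my $alt = join '|', map { quotemeta($_) } @patterns;
    return unless $str =~ /\A(?:$alt)+\z/;

    # Score: sum of pattern lengths, smaller being better.
    my $total = 0;
    $total += length($_) for @patterns;
    return $total;
}

my $score = score_patterns('hellohihellohi', 'hello', 'hi');
print defined $score ? "score $score\n" : "rejected\n";
```

Comparing scores across candidate sets of the required size would then pick the one with the minimum sum; generating those candidates is the hard part and is not attempted here.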

Well, I came up with a solution, but for bizarre reasons decided to obfu it, so I posted it as Pattern Matching Obfu; you will have to change the string $S as appropriate to your requirements. Your clarification of the constraints on the problem led to some interesting angles, some that I suspect are unintended. Most especially that, irrespective of the short-pattern solution, there are likely to be very many long patterns, each of which _ONLY_ matches once.

My algorithm, in a rather humorous fashion, found the following solutions, amongst many others, that meet your criteria in a very short amount of time (the | is the separator between sub-patterns):

Your rules are not specific enough to formulate an answer.
How do you define what is part of a pattern? Is this "Igohellohellohellohi"
a string that contains 3 'hello' or 4 'h' or 6 'l'...?
Perhaps concluding this was the real task?

Here is something that I believe almost satisfies your requirements (it just needs a bit more work which I'm not ready to do at the moment, and it's not thoroughly tested). It doesn't do very well without a min and max length for each pattern, so maybe if this was wrapped in a sub which adjusted the min and max to various sizes, and evaluated the results on each pass by some heuristic, it could do fairly well with all of the requirements (and that'll have to wait until later):

Update: Greatly simplified. Wondering if I'm doing
someone's homework. Noticed that it's very similar to nardo's
approach, but cleaner, I think, and with slightly different behavior
due to the newest problem definition. Great minds think alike :)

The problem I posted is actually an exercise in the segmentation section of the OpenLab on http://www.a-i.com. (You will need to register.) I have extended it with some other criteria, such as 'spaces allowed', to cover more general problems.
I tried runrig's solution and it doesn't work when the number of patterns is 6, under the condition that one pattern cannot be part of another pattern.

I am also trying to solve this problem myself; what I am looking for is a good design to begin with.

Artist.
(My computer doesn't keep the login for more than one page,
Please let me know if you know the solution).