Let s be the input string.
Let n be the length of s.
Find and yield all substrings of s of length 1.
Find and yield all substrings of s of length 2.
...
Find and yield all substrings of s of length n.

Even without doing any formal analysis of the algorithm, we can intuitively see that it is correct, because it accounts for all possible cases (except for the empty string, but adding that is trivial).

[ BTW, what about uses for this sort of program? Although I just wrote it for fun, one possible use could be in word games like Scrabble. ]

The code for this new algorithm is in program all_substrings2.py below.

BTW, I added the empty string as the last item in the message_lines list (in the usage() function), as a small trick, to avoid having to explicitly add an extra newline after the joined string in the write() call.

Here are some runs of the program, with outputs, using Python 2.7 on Linux:

(pyo, in the commands below, is a shell alias I created for 'python -O', to disable debugging output. And a*2*y expands to all_substrings2.py, since there are no other filenames matching that wildcard pattern in my current directory. It's a common Unix shortcut to save typing. In fact, the bash shell expands that shortcut to the full filename when you type the pattern and then press Tab. But the expansion happens without pressing Tab too, if you just type that command and hit Enter. But you have to know for sure, up front, that the wildcard expands to only one filename (if you want that), or you can get wrong results, e.g. if such a wildcard expands to 3 filenames, and your program expects command-line arguments, the 2nd and 3rd filenames will be treated as command-line arguments for the program represented by the 1st filename. This will likely not be what you want, and may create problems.)

The count of substrings for each succeeding run (which has one more character in the input string than the preceding run has), is equal to the sum of the count for the preceding run and the length of the input string for the succeeding run; e.g. 10 + 5 = 15, 15 + 6 = 21, 21 + 7 = 28, etc. This is the same as the sum of the first n natural numbers.

There is a well-known formula for that sum: n * (n + 1) / 2.

There is a story (maybe apocryphal) that the famous mathematician Gauss was posed this problem - to find the sum of the numbers from 1 to 100 - by his teacher, after he misbehaved in class. To the surprise of the teacher, he gave the answer in seconds. From the Wikipedia article about Gauss:

We can also run the all_substrings2.py program multiple times with different inputs, using a for loop in the shell:

$ for s in a ab abc abcd
> do
> echo All substrings of $s:
> pyo al*2*py $s
> done
All substrings of a:
a
All substrings of ab:
a
b
ab
All substrings of abc:
a
b
c
ab
bc
abc
All substrings of abcd:
a
b
c
d
ab
bc
cd
abc
bcd
abcd

Some remarks on the program versions shown (i.e. all_substrings.py and all_substrings2.py, in Parts 1 and 2 respectively):

Both versions use a generator function, to lazily yield each substring on demand. Either version can easily be changed to use a list instead of a generator (and the basic algorithm used will not need to change, in either case.) To do that, we have to delete the yield statement, collect all the generated substrings in a new list, and at the end, return that list to the caller. The caller's code will not need to change, although we will now be iterating over the list returned from the function, not over the values yielded by the generator. Some of the pros and cons of the two approaches (generator vs. list) are:

- the list approach has to create and store all the substrings first, before it can return them. So it uses memory proportional to the sum of the sizes of all the substrings generated, with some overhead due to Python's dynamic nature (but that per-string overhead exists for the generator approach too). (See this post: Exploring sizes of data types in Python.) The list approach will need a bit of overhead for the list too. But the generator approach needs to handle only one substring at a time, before yielding it to the caller, and does not use a list. So it will potentially use much less memory, particularly for larger input strings. The generator approach may even be faster than the list version, since repeated memory (re)allocation for the list (as it expands) has some overhead. But that is speculation on my part as of now. To be sure of it, one would have to do some analysis and/or some speed measurements of relevant test programs.

- the list approach gives you the complete list of substrings (after the function that generates them returns). So, in the caller, if you want to do multiple processing passes over them, you can. But the generator approach gives you each substring immediately as it is generated, you have to process it, and then it is gone. So you can only do one processing pass over the substrings generated. In other words, the generator's output is sequential-access, forward-only, one-item-at-a-time-only, and single-pass-only. (Unless you store all the yielded substrings, but then that becomes the same as the list approach.)

Another enhancement that can be useful is to output only the unique substrings. As I showed in Part 1, if there are any repeated characters in the input string, there can be duplicate substrings in the output. There are two obvious ways of getting only unique substrings:

1) By doing it internal to the program, using a Python dict. All we have to do is add each substring (as a key, with the corresponding value being anything, say None), to a dict, as and when the substring is generated. Then the substrings in the dict are guaranteed to be unique. Then at the end, we just print the substrings from the dict instead of from the list. If we want to print the substrings in the same order they were generated, we can use an OrderedDict.

(Note: In Python 3.7, OrderedDict may no longer be needed, because dicts are defined as keeping insertion order.)

2) By piping the output of the program (which is all the generated substrings, one per line) to the Unix uniq command, whose purpose is to select only unique items from its input. But for that, we have to sort the list first, since uniq requires sorted input to work properly. We can do that with pipelines like the following:

Hit the ground running with my vi quickstart tutorial. I wrote it at the request of two Windows system administrator friends who were given additional charge of some Unix systems. They later told me that it helped them to quickly start using vi to edit text files on Unix.