Character universe introduction

The character universe is central to understanding how and why regldg
does what it does with character and meta-character classes. The character
universe is the group of characters that regldg is allowed to use to make the
words of the output dictionary.

Let's look at an example. Given the regular expression ".", what do
you expect the output of regldg to be? Well, it is clear that each word output
will be only one character long, but which characters will those be? Could they
be only uppercase letters? Could they include lowercase letters too? Numbers?
Other symbols on the keyboard? Or, maybe they could be all the possible characters
from the ASCII and extended ASCII character sets (0-255)? The set of possible
characters is called the character universe, and you can decide what it should
be for each run of regldg.

There are two command line options to set the character universe. Firstly, there
are a number of common, pre-defined character universe sets. These are:

To use any one of these pre-defined character universe sets, specify the -us NNN
or --universe-set=NNN option on the command line, where NNN is the number of
the universe set.

You've probably noticed that the number of each pre-defined character universe
set is a power of 2. This is so that you can combine universe sets simply by
adding their numbers. If you want the character universe to have letters (upper-
and lowercase), numbers, and punctuation, you can specify universe set number
23 (1 + 2 + 4 + 16).
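
The arithmetic behind combined set numbers can be sketched in Python. This is only an illustration of the power-of-2 scheme described above, not part of regldg; the function name split_universe_set is hypothetical.

```python
def split_universe_set(number):
    """Return the power-of-2 components whose sum is `number`."""
    parts = []
    bit = 1
    while bit <= number:
        if number & bit:      # this power of 2 is part of the sum
            parts.append(bit)
        bit <<= 1             # move on to the next power of 2
    return parts


print(split_universe_set(23))  # [1, 2, 4, 16]
print(split_universe_set(7))   # [1, 2, 4]
```

Because every pre-defined set uses a distinct power of 2, any sum decomposes back into exactly one combination of sets.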

The second way to specify the character universe is explicitly in a character class.
You can put whatever characters you'd like in it using the formats shown in
regldg's regular expression capabilities. You can specify the character class on
the command line with option -u [UNIVERSE] or --universe=[UNIVERSE]. Be sure to
start the character class with [ and end with ]. It will be parsed
exactly like a character class, so if you make it a negated character class, it
will be negated from the default character universe (universe set 7), or, if you
already specified a different universe on the command line with -u or -us,
it will be negated from that universe.

Here's an example: use regldg to generate all possible combinations of two-letter
words using only the characters A, B, and C.

> regldg "--universe=[ABC]" ".{2}"
AA
BA
CA
AB
BB
CB
AC
BC
CC

To show that regldg is not afraid to be complex, let's do the same thing
using the negated character class method. First, we set the character universe
to uppercase letters. Then, we take out D-Z (leaving A-C). Finally, using
the remaining characters, we output all possible two-letter words.
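
The set arithmetic behind these three steps can be modeled in Python. This is a rough sketch of the idea only: it does not show regldg's command line or its implementation, the helper name negate is hypothetical, and the order of the generated words may differ from regldg's.

```python
import string
from itertools import product


def negate(class_chars, universe):
    """Model a negated class: keep universe characters not in class_chars."""
    return [c for c in universe if c not in class_chars]


universe = list(string.ascii_uppercase)                   # step 1: A-Z
remaining = negate(set(string.ascii_uppercase[3:]), universe)  # step 2: drop D-Z
words = ["".join(p) for p in product(remaining, repeat=2)]     # step 3: 2-letter words

print(remaining)   # ['A', 'B', 'C']
print(len(words))  # 9
```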

While parsing a regular expression, there are two areas where the character universe
must be controlled. regldg allows the control in both of these areas to be strict (on)
or lax (off) for each run, determined by a single command line option.

The first area where controlling the character universe is important is in the explicit
entry of characters. If the character universe is set to the uppercase letters A-Z,
and a regular expression contains a space, should it result in an error (strict), or should it
be allowed in only that place (lax)? This can be controlled by using the command line option
-uc N or --universe-checking=N. Setting this value to 1 will
enable strict checking of explicitly entered characters.

The second area where controlling the character universe is important is character
and meta-character classes. If the character universe is set to the digits 0-4 only,
and a regular expression contains a \d meta-character, should the resulting
character class contain only the digits 0-4 (strict), or should it contain all the digits
according to the full specification of \d (lax)? This behavior can also
be controlled by using the command line option -uc N or --universe-checking=N.
Setting this value to 2 will enable strict checking of the contents of
character and meta-character classes.

To enable both types of strictness, 1 + 2 = 3, so set -uc 3 or
--universe-checking=3. To disable both types of strictness, use
-uc 0 or --universe-checking=0.
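
The two kinds of strictness can be pictured as two independent bit flags. The Python sketch below is a simplified model under that assumption, not regldg's actual code; the names STRICT_LITERALS, STRICT_CLASSES, check_literal, and resolve_class are hypothetical, and only the values 1 and 2 come from the text above.

```python
STRICT_LITERALS = 1  # models -uc 1: explicitly entered characters are checked
STRICT_CLASSES = 2   # models -uc 2: classes are thinned to the universe


def check_literal(ch, universe, uc):
    """Reject a literal outside the universe when literal checking is strict."""
    if uc & STRICT_LITERALS and ch not in universe:
        raise ValueError(f"character {ch!r} is not in the character universe")
    return ch


def resolve_class(class_chars, universe, uc):
    """Intersect a class with the universe when class checking is strict."""
    if uc & STRICT_CLASSES:
        return "".join(sorted(set(class_chars) & set(universe)))
    return "".join(sorted(set(class_chars)))


digits = "0123456789"                       # stands in for \d
print(resolve_class(digits, "01234", 2))    # 01234
print(resolve_class(digits, "01234", 0))    # 0123456789
print(check_literal(" ", "ABC", 0) == " ")  # True: lax mode lets the space through
```

With a value of 3, both tests apply, since 3 has both the 1 bit and the 2 bit set.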

Let's see these in action with an example. If you are using a character universe
of only the letters A-E, and generating all possible words, but you want each
word of output to start with Z, you'd like to be able to use
regldg -u "[ABCDE]" "Z.*". Here we go:

It did allow us to start words with Z, which wasn't in the character universe,
but why are we getting ASCII characters starting from 0? The problem is the .
metacharacter. As explained above, regldg allows you to make metacharacters retain
all their characters (lax checking), or have the classes they represent thinned
according to the current character universe (strict checking). (Technically: the
character or meta-character class can be intersected with the character universe.)
In the above example, the . metacharacter was allowed to represent all
ASCII values 0-255, so we didn't get only words such as ZA, ZB, and ZC. Since we
want . to represent only those characters in the current character universe,
we should turn on this type of strict character universe checking by adding a 2
to the --universe-checking option. So, using this information: