The Breaking of Cyber Patrol® 4

Abstract

Several attacks are presented on the "sophisticated anti-hacker
security" features of Cyber Patrol® 4, a
"censorware" product intended to prevent users from accessing
Internet content considered harmful. Motivations, tools, and methods are
discussed for reverse engineering in general and reverse engineering of
censorware in particular. The encryption of the configuration and data
files is reversed, as are the password hash functions. File formats are
documented, with commentary. Excerpts from the list of blocked sites are
presented and commented upon. A package of source code and binaries
implementing the attacks is included.

The market is full of "parental control" applications, and as has been
shown in earlier essays [DFR98,
DFR99], they are mostly of low
technical quality. When we now look at Cyber Patrol
(v4.03.005), we find that this trend continues. Some things are a little better than what
we're used to from products like CyberSitter
and NetNanny, but mostly, things are
the same.

We will begin by presenting our goals for the evening, with a small
digression on the "how" and "why" in a section we call
Metodus operandi, followed by a quick
overview of our target for the day. We start getting
technical as we do a quick presentation of a cipher
found in the target. After that little snack we move on to the main course,
something a little bit more complicated, the
cryptanalysis of a hash function. At this point we have
completed the first goal, and so we move on to take a closer look at how
CP manages its URL database. Putting the technicalities
behind us we make some observations about the product as
a whole. As we take our last bite out of the target, we treat ourselves to
a dessert: some source & binaries to drive our point
home. All good things must end, and so must this pleasant evening. We draw our
conclusions, say our goodbyes and part
with only a handful of references to follow up on.

People often ask "How can I learn to do these things?". The
answer is the same as for all other things you want to learn: you must study,
and you must practice. This, of course, was not the answer the questioner
was hoping for. There is a certain amount of "magic" in learning this crude
form of reverse engineering. There are no books to learn from, and most of
the material on the web is quite bad. There are some excellent essays from
some excellent practitioners out there, but while the knowledge possessed
by the authors is often great, their presentation is often sub-par, which
can make the material inaccessible to the beginner.

There's really no substitute for practice, but what you must
have is a firm ground to stand on. This means good knowledge of binary
arithmetic and of the platform you are to work with, both the processor and
the operating system. While these things can actually be learned by means
of books and classes, most people are probably self-taught in these matters.

It's been said that to be good at creating ciphers, you must be good
at breaking them. Similarly, we'd like to say that to be good at reverse
engineering, you must be a good programmer. So, if you haven't already,
learn a language or two.

Now, let's say you have all the background knowledge you need. Then
you need some tools. You will need a good debugger. There are a couple
out there, but for the Win32 platform most people are using the one
included in their development environment, such as the one in Microsoft's
Visual Studio, or one of NuMega's
offerings, of which the crown jewel is the systems debugger
SoftIce.
SoftIce has been the tool of choice of crackers and device-driver developers for over half
a decade now, and it is a very competent product. In addition to a couple
of programming languages and a good debugger, you will want a good hex-editor.
Here the choices are plenty, at least on the "WinDos" platform. One good hex-editor
which is often mentioned is HIEW. It's a favourite primarily because it
supports some disassembly functions and because many of us feel comfortable
in the Norton Commander-esque interface. Speaking of disassembly, you
will probably want a good disassembler, and here there is really only one
product worth mentioning by name, and that is IDA.
There are quite a few other disassemblers out there, but IDA - which
disassembles ELF executables too, by the way - is the most advanced and
most competent of them.

In addition to the big three, there are some other utilities that
you might want to add to your toolbox. We're thinking mainly of such
tools as regmon and
filemon by
SysInternals.

Add in a little paper, a good pencil and possibly a calculator, and you are
all set.

Now, this is not meant as an essay for teaching real beginners about
reverse engineering, cryptanalysis or any such matters, but feedback from
the essay on NetNanny indicates that the pieces on how the reversal was
done were really appreciated by you readers, and so we will try to
incorporate some of that into this essay too. Hopefully, those of you
not interested in these matters will find it easy to skip the
sections in question.

Let's start from the beginning. Before we even install a product we
must have some set of goals we want to achieve. For Cyber
Patrol the goal was to break the authentication scheme and to extract the
URL database, documenting the structures in the process, thus facilitating
interoperability. These constitute practical goals. You will also
find less pragmatic goals for the launching of an attack, such as the
inquisitive desire to learn the internals of someone else's product, the
thrill of doing something you are not supposed to be able to do, and the
recognition you might gain for being the first one to explore uncharted
territory. We can call these goals of personal gratification.
More interesting for the majority of people are probably the political
goals, to expose any hidden agenda that might be lurking behind the product
and to fuel the discussion around it, in this case the discussion around
censorware. For us, the primary motivation has been the possible political
implications.

Installation is straightforward. You will note, however, that you are not
asked to supply an installation path. This is a typical example of producers
taking the easy way out: rather than putting in that little extra bit of
effort, they force all their customers to install the software into
C:\PATROL no matter what.

Now, before we speak some more on how we can achieve our goals, let's
go on a short tour of the program. For reference, here's a screenshot of
the main interface. As can be seen, a
large part of the main interface is devoted to time management. For each
day in the week you can - with a 30 minute granularity - control the hours in
which a user is allowed to use the Internet. You can set the maximum amount
of time "online" allowed per day and calendar week.

To the upper right, you'll find a panel for controlling the filters in
Cyber Patrol. It's fairly straightforward, but let's run through the
alternatives anyway.

Lets you specify things that should never be transmitted over the
Internet, such as your address, phone number and the like. The clipboard
will be monitored too. The "Carlin-7" mentioned are shit, piss, fuck, cunt,
cocksucker, mother-fucker, and tits. See also [ACLU96].

Here you can specify up to sixteen 16-bit Windows applications
that should not be allowed to run. Not very useful if you're running a
32-bit operating system, though.

Also in the main interface, under "categories" the following
categories are available for filtering:

Violence / Profanity

Partial Nudity

Full Nudity

Sexual Acts / Text

Gross Depictions / Text

Intolerance

Satanic or Cult

Drugs / Drug Culture

Militant / Extremist

Sex Education

Questionable / Illegal & Gambling

Alcohol & Tobacco

Reserved 4

Reserved 3

Reserved 2

Reserved 1

It is, as we will show in the section on file formats and
the URL database, not a coincidence that there are sixteen categories in
this list. The last four entries, named "reserved", are greyed out
and cannot be toggled on or off by the administrator. In addition,
"Reserved 4" and "Reserved 3" are selected, and thus
cannot be unselected. We'll comment more on this later, but rest assured
that an opportunity for foul play lurks right there.

Now, let's review our goals. First, we want to break the
authentication, so let's talk about that. There are three levels of
access in Cyber Patrol. You could be the administrator, which gives
you full access to both the Internet, naturally, and to administering
CP. Below the administrator is a "deputy". A deputy is someone
who's given full access to the Internet, bypassing CP, but is not allowed
to do administration. Everyone else is simply a "user" and must
abide by whatever filtering and rules CP enforces.

After installing CP it will install its hooks and remain loaded in
the background. To disable it you must authenticate against
it. If you want more than one person to have administrative privileges,
then you must give them the administrator password; similarly for deputies.
As the administrator you can add users, which is simply a matter of
assigning a password to go with the username and then setting up the
restrictions you want to apply to him or her. A user can then authenticate
against CP by right-clicking on the CP icon in the taskbar and selecting
his or her name from a list of all users configured. After you've chosen
your username you'll be prompted for the password. It's generally
acknowledged that giving out account names like this is bad security since
it makes it easier to attack them, for example by guessing likely passwords.

Now, to break the authentication we must simply follow the flow from
where we enter our password, to where the valid passwords are retrieved and
compared against ours. Or, so goes the theory...

Real life isn't nearly as neat and perfect as our theory would have it.
In order to break the authentication scheme, i.e. extract the
passwords, we must first locate them. The obvious way, and so our theory
says, is to trace the path from entry down to comparison. In the case
where retrieval was done earlier and thus is not in the path, we can
actually backtrack once we've found out what our password is compared
to. We would do that by setting a breakpoint on the memory occupied by
the data our password was compared against, then - by restarting the
application - gaining access to the point where that data is loaded. This,
however, was not the path we took in this reversal. We were a little
bit less methodical, but maybe a little bit more intuitive.

We quite simply disassembled the target and browsed it for suspicious
code. Using the debugger to locate something like this can be very
effective in terms of time spent, but in the grander scheme of things you
will want to spend some time browsing the disassembly just to get a
feel for the program. The debugger is great for following a
specific path and for learning exactly what gets loaded into every register,
but it's not a very effective way of getting to know the program more
generally, because it's hard to track different code paths
with it.

So there we were, browsing our disassembly of CP.EXE for suspicious
code. So what constitutes suspicious code, anyway? This of course
depends on the context and the type of application, but as a general rule
you will want to slow down when you encounter xors used in any
other way than with the same register as both source and destination, as
in xor eax, eax, or against an immediate value not a power of
two. We're looking for clusters of bit-manipulation instructions; shifts,
ands, ors, xors, rotations and compact loops. These are all signs that you
might be dealing with some sort of encryption routine.

So, lazily browsing the code was how we ended up in a routine we call
cp4crypt(), which we will now examine in detail. In reality we're
speaking of two different routines, one doing the encryption and one
doing decryption. We'll use the term cp4crypt to mean these two
collectively, or the cipher algorithm they express.

The first thing one wonders when one finds a suspicious routine is
"For what is this routine used?". An important question, and
to answer it we use our next most powerful tool, the debugger (the most
powerful tool of course being your brain, and don't ever let anyone tell
you otherwise). By inserting the special opcode 0xCC, or INT3 as it is
also known, into the suspicious code, we can make the debugger active
when the routine is run. Typically you will want to enter this into the
entry-code of the routine.

To give you an idea of how "suspicious" code can look, here
is the piece of code we found:

The sequence of instructions that starts with the ror
(rotate right), followed by xor reg,mem (which is really a
memory fetch) and then a memory store in mov [ebx],al, triggers
our interest. We also notice the two tight loops. This is a schoolbook
example of Intel x86 encryption/decryption code, similar in complexity
to the things you might find in old DOS viruses.

For those of you unfamiliar with the architecture it should be noted
that this particular Intel instruction set is read with the
destination to the left
and the source to the right, so the xor operation fetches the
byte addressed by ("pointed to by") the register ebx
and xors that byte with the value in the register al, and then
stores the result back into al.

The section of code above is a decryption routine. When the whole thing
is translated into a higher level pseudo-code it reads something like this:

First the key is initialized to be the low eight bits of the ciphertext's size.

Then we make two passes over the buffer in which we:

Rotate (not shift) the key one bit to the right.

Exclusive-or the key with the current ciphertext byte to produce the corresponding plaintext byte.

Set the key to the newly decrypted byte.
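In C, the pseudo-code above can be sketched like this. This is our own reconstruction from the description, not Microsystems' source; the names ror8 and cp4decrypt are our own.

```c
#include <stddef.h>

static unsigned char ror8(unsigned char k)
{
    return (unsigned char)((k >> 1) | (k << 7)); /* ror k, 1 */
}

/* Decrypt buf in place: key starts as the low 8 bits of the size, then
 * two passes of rotate / xor / re-key from the decrypted byte. */
void cp4decrypt(unsigned char *buf, size_t size)
{
    unsigned char key = (unsigned char)(size & 0xFF);
    int pass;
    size_t i;

    for (pass = 0; pass < 2; pass++) {  /* two passes over the buffer */
        for (i = 0; i < size; i++) {
            key = ror8(key);            /* rotate key right one bit   */
            buf[i] ^= key;              /* recover the plaintext byte */
            key = buf[i];               /* key = the decrypted byte   */
        }
    }
}
```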

There are numerous issues with this algorithm which make it very weak. First
of all we have the key size, which - weighing in at only eight bits - is, shall we say,
very short of being impressive. Of course, since we have the code for
deriving the key (that ctsize & 0x000000FF operation) a longer key
wouldn't be much use anyway.

Obviously this was never intended to be strong, but even so, someone,
somewhere, spent time developing this scheme. There are certainly simpler
ways of doing weak encryption than this, but this is actually a perfect
example of the kind of homebrew encryption you are likely to find in
commercial applications.

To continue, the key update schedule is weak in that the key only
depends on the previous byte. This means that plaintext repetitions
longer than the number of rounds used will show through in the
ciphertext. The number of rounds used by CP, as described above, is two.
A competent cryptographer should be able to break this in mere hours
(and by breaking, we mean recovering the plaintext and with it the
algorithm), having access only to ciphertext. In real life we also have
some assumed plaintext in that the file 'cyberp.ini' is encrypted using
this scheme. Such a file probably starts with "[" or ";"
and ends with 0x0d,0x0a, to mention but a few hooks. The use
of two rounds strengthens the cipher somewhat from a ciphertext only
attack, but... this is academia, not really relevant to our real life
situation, in which we don't have to recover the algorithm by
means of cryptanalysis of ciphertext. We can simply locate and translate
the algorithm as it is implemented in the target. Much easier, much
faster, though maybe a little bit less fun and rewarding.

We introduced you to the decryption code because it's somewhat simpler
to explain compared to the encryption code. The reason for this is that we
must, in effect, encrypt the buffer backwards, and we must also treat it
as a circular buffer, and even then we need to do a fixup at the
end to key the encrypted data. Because of these complications we'll
present some code that is a little closer to an actual implementation
than pure pseudo-code:

Now, the buffer must be seen as circular due to the way the key (which
is set to the previously decrypted byte when decrypting) carries over from
one round to the next. It's possible to write this into the loop above,
but in our case that would only make the explanation more difficult to
follow. Anyway, after the loop we connect the two ends of the buffer:

key = Buf[BufferSize-1]
rotate key right
Buf[0] = Buf[0] ^ key

After that little tie-in we do another round of the backward encrypting
loop above, after which we must do one final fixup, keying the first byte
with the lower eight bits of the buffer's size:

key = BufferSize & 0x000000FF;
rotate key right
Buf[0] = Buf[0] ^ key

There, now Buf[] will be encrypted. See the section on
sources and binaries if you want to look at an actual
implementation of these algorithms.
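Putting the backward loop, the circular tie-in and the final fixup together, a C sketch of the encryption direction might look like this. Again, this is our own reconstruction (names ours): it simply runs the two decryption passes in reverse, so it assumes buffers of at least two bytes.

```c
#include <stddef.h>

static unsigned char ror8(unsigned char k)
{
    return (unsigned char)((k >> 1) | (k << 7)); /* ror k, 1 */
}

/* Encrypt buf in place (n >= 2 assumed). */
void cp4encrypt(unsigned char *buf, size_t n)
{
    size_t i;

    /* Undo the second decryption pass, walking the buffer backwards;
     * at i == 0 the key comes from the last byte - the circular tie-in. */
    for (i = n; i-- > 0; )
        buf[i] ^= ror8(i > 0 ? buf[i - 1] : buf[n - 1]);

    /* Undo the first pass; at i == 0 the key is the low eight bits of
     * the buffer size - the final fixup. */
    for (i = n; i-- > 0; )
        buf[i] ^= ror8(i > 0 ? buf[i - 1] : (unsigned char)(n & 0xFF));
}

/* The matching decryption, as described earlier. */
void cp4decrypt(unsigned char *buf, size_t n)
{
    unsigned char key = (unsigned char)(n & 0xFF);
    int pass;
    size_t i;

    for (pass = 0; pass < 2; pass++)
        for (i = 0; i < n; i++) {
            key = ror8(key);
            buf[i] ^= key;
            key = buf[i];
        }
}
```

A round trip through cp4encrypt() and cp4decrypt() returns the original buffer, which is the property the fixups exist to preserve.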

Now, for the question we posed: for what is this encryption used?
cp4crypt() is used to encrypt some of Cyber Patrol's data files, such
as the file cyberp.ini which contains configuration and user
information, and also the administrator and deputy passwords. This
cipher also protects the raw URL database, i.e. the files cyber.not
and cyber.yes. It is also used on the file user.lst which
contains some configuration information, most notably any additional URLs
the local administrator(s) may have added.

Let's talk about the passwords. The cyberp.ini contains a
main section, "Cyber Patrol", under which the two passwords are
stored in the keys "HQ PWD" and "DEPUTY PWD". The
data of both these keys is encoded as a hexadecimal string representing
eight bytes, or 64 bits if you prefer. It can look something like this:

HQ PWD=3AD6AF0CB33D8A87
DEPUTY PWD=9A3740C7019A5AA1

The deputy password is in fact the password encrypted using
cp4crypt(), so it is a simple matter of decoding the hex-string into
binary and then decrypting it using the key 0x08 - the maximum
length of a password - and voilà; instant unrestricted access to the
Internet. Are you impressed? We're impressed.
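The whole recovery can be sketched as hex decoding plus the decryption routine described above. The function names here are ours, and the decryption is our reconstruction, not CP's code.

```c
#include <stddef.h>

static unsigned char ror8(unsigned char k)
{
    return (unsigned char)((k >> 1) | (k << 7));
}

/* cp4crypt decryption as reconstructed earlier. */
static void cp4decrypt(unsigned char *buf, size_t n)
{
    unsigned char key = (unsigned char)(n & 0xFF);
    int pass;
    size_t i;

    for (pass = 0; pass < 2; pass++)
        for (i = 0; i < n; i++) {
            key = ror8(key);
            buf[i] ^= key;
            key = buf[i];
        }
}

static int hexval(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Decode the sixteen hex digits of a "DEPUTY PWD" value into eight
 * bytes, then decrypt them in place with the buffer size (8) keying the
 * cipher. Returns 0 on malformed input, 1 on success. */
int recover_deputy(const char *hex, unsigned char out[8])
{
    size_t i;

    for (i = 0; i < 8; i++) {
        int hi = hexval(hex[2 * i]);
        int lo = hexval(hex[2 * i + 1]);
        if (hi < 0 || lo < 0)
            return 0;
        out[i] = (unsigned char)((hi << 4) | lo);
    }
    cp4decrypt(out, 8);
    return 1;
}
```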

This file also contains any additional users that have
been configured. First the main section contains keys of the form
"UserNameNN=name" where NN is a number. Associated with these keys are
sections of the form "[UserNN]", each of which contains not only that user's
configuration but also his or her password (in the key "password"), again
encrypted using cp4crypt().

But here things become interesting, because the administrator password
is not encrypted in this way; the data for the administrator password is
in fact a 64-bit hash value. Analysis of the hash function follows.

Throughout this section we'll use vaguely C-like notation,
with some of the extensions common in cryptographic circles. In
particular:

= is assignment

== is test for equality

^ is bitwise XOR

| is bitwise OR

& is bitwise AND

^=, |=, and &= are assignment variations of the above, just like in C

<< and >> are unsigned shift left and right

<<< and >>> are bit rotate

** means exponentiation (2**4 is two to the fourth power, which is
sixteen)

All numbers are unsigned, and bits are numbered from 0 (least
significant). When we get to the mathematical part, we'd ideally like to use
some notation that doesn't map into ASCII. There we'll use reasonable
approximations:

a[1], a[2], etc. for subscripts on variables... not the same as array
indexing, but you can think of it as being like array indexing.

a = b (mod n) for "a is congruent to b modulo
n"; it'd be nicer to use the proper math "congruent" symbol, which is like
an equals sign with three bars.

Cyber Patrol uses a technique called "hashing" for its HQ
password. Instead of simply storing the password in its
configuration files, it processes it through a "hash function"
to produce a code called a "hash", and it stores the result.
The hash function is supposed to have these properties:

The same password will always produce the same hash.

Two different passwords will not produce the same hash.

Given the hash, it's impossible to determine the password that produced it.

These properties guarantee that the program can recognize the correct
password when it's typed, but people like ourselves can't determine the
password when we reverse engineer the program, even if we obtain the hash.
In practice, even a very strong hash function cannot guarantee these
properties perfectly - for instance, we could always try all possible
passwords until we found the one that gives the correct hash. If the hash
is shorter than the input password, then it's absolutely guaranteed that
some two inputs will have the same hash. But strong hash functions do
exist, which come close enough to the theoretical ideal for most purposes.
Cyber Patrol's hash functions are not like that. The deputy
password isn't even truly hashed; it is encrypted with the cp4crypt()
routine already discussed.

The hash function for the Cyber Patrol HQ password looks like this in
pseudocode:

The first thing we notice here is that every input character gets bit 5
set in line (4). That seems to be case conversion - it forces all
alphabetic input characters to lowercase. It has some other effects as
well (making certain punctuation characters equivalent to each other) but
if the passwords are expected to always be alphanumeric, that shouldn't
make much difference. One significant issue (which we'll revisit later) is
that it gives attackers a bit of known plaintext: we know that bit 5 of
every input character is 1. We can also guess that bit 7 of every input
character is 0, since it's tricky to type characters with that bit set
to 1 on an ordinary keyboard.
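To see concretely what setting bit 5 does to 7-bit ASCII, here is a trivial check of our own (nothing CP-specific about it):

```c
/* Setting bit 5 maps ASCII uppercase letters to lowercase, leaves
 * characters that already have bit 5 set unchanged, and collapses some
 * punctuation pairs onto each other ('[' and '{', for example). */
static int force_bit5(int c)
{
    return c | 0x20;
}
```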

Next let's look at line (5). This one may seem a little confusing:
we shift 8 bits out of hash1, look them up in a table (using the input
to screw with the index of the table), and XOR the result back into hash1.
The table, if you look at it, seems at first glance to be random garbage.
Here we just have to depend on our experience to enable us to recognize
the code: it happens that this code is a standard idiom for hash
calculation.

To be precise, it's a 32-bit linear feedback shift register, also known
as a CRC (cyclic redundancy check). This kind of algorithm is a traditional
way of checking data integrity in things like file transfers and compressed
archive files. Most serious programmers have written it, or something very
much like it, before. The definitive explanation of CRCs is
in [RNW93]. Most people
who have to write CRC code consult that file for instructions; it's a
reasonable bet that Microsystems based their code on it. Line (5) of our
pseudocode is pretty much verbatim from one of the examples in that document.
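For comparison, here is the standard table-driven idiom in C - a generic CRC32 with the reflected polynomial 0xEDB88320, as used in ZModem and PKZip, not Cyber Patrol's exact code with its nonstandard initialization:

```c
#include <stdint.h>
#include <stddef.h>

static uint32_t crc_table[256];

/* Build the 256-entry lookup table for the reflected CRC-32 polynomial. */
static void crc32_init(void)
{
    uint32_t i, r;
    int j;

    for (i = 0; i < 256; i++) {
        r = i;
        for (j = 0; j < 8; j++)
            r = (r & 1) ? (r >> 1) ^ 0xEDB88320u : r >> 1;
        crc_table[i] = r;
    }
}

/* The per-character step: shift eight bits out, fold them through the
 * table (index perturbed by the input byte), xor the entry back in. */
static uint32_t crc32_update(uint32_t crc, unsigned char c)
{
    return (crc >> 8) ^ crc_table[(crc ^ c) & 0xFF];
}

/* Conventional wrapper: all-ones initialization and final complement. */
uint32_t crc32(const unsigned char *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    size_t i;

    crc32_init();
    for (i = 0; i < n; i++)
        crc = crc32_update(crc, p[i]);
    return crc ^ 0xFFFFFFFFu;
}
```

The conventional check value, crc32("123456789") == 0xCBF43926, confirms this parameterization.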

The code in Cyber Patrol isn't even just any CRC - examination of the
table shows that except for a nonstandard initialization,
it is the de facto standard CRC32 algorithm, used
in ZModem, PKZip, Ethernet, and FDDI (among other places). CRC32 has proven
to be a very successful error detection measure in these systems. Because it
provides pretty good bit diffusion (i.e. changing one bit in the input
changes lots of bits in the output), CRC32 is also perfectly reasonable
simply as a hash function. If you were building a hash table data structure,
CRC32 wouldn't be a bad way to generate your indices. It is not, however, a
cryptographically strong hash function suitable for password hashing. CRC32
has fatal cryptographic holes - it not only provides no security, but also
gives us a lot of hints for breaking the rest of the hash. We'll show how to
exploit the holes below.

Before we do, let's look at hash2, the other half of the 64-bit hash.
This seems to be something homemade, created by throwing together several
different techniques in an effort to provide cryptographic strength. That
isn't a good way to build cryptographic systems, but in this particular
case, it seems to have resulted in something that is at least better than
the CRC half of the hash.

There are three basic parts to the hash2 calculation: rotate the previous
value, mix in the input character, and mix in the current value of hash1.
The rotation gives us a little bit of what cryptographers call avalanche
- bit changes get shifted around to eventually affect most of the word.
Using addition and subtraction instead of XOR is important because the
carry bits allow some information to flow from one bit position to the next.

The designers are treating hash1 as a random number, and hoping to confuse
things by mixing it in at every stage. It's not really very random at all,
but combined with the carry bits and the rotation, there is a little bit of
security here. In fact, in our attacks we haven't bothered to do much to
hash2, cryptographically speaking; the results on hash1 are strong enough
that we can get away with ignoring most of the holes that may exist in hash2.

We might well speculate about the reasons for designing the hash function
this way. It looks like Microsystems chose the CRC32 algorithm because they
knew it was a standard algorithm, and then they knew that 32 bits wasn't
enough for a cryptographically strong algorithm, so they added some stuff.
They knew that bit rotations are popular in cryptographic algorithms, and
they knew that addition and XOR are popular, so they put in some of
each.

If they knew more than a little about how CRC32 worked, they'd have used a
64-bit CRC instead of extending the 32-bit version with homebrew stuff. If
they understood the security implications of CRC32, they wouldn't have
used it at all. So the general pattern here is that the pieces are there,
but they are neither understood nor put together systematically. If you have
a copy of the Jargon File[JRG00], the entry for
"cargo cult programming"
might be educational.

Now, let's talk about cryptographic strength. Strength of cryptographic
algorithms is measured by the order of magnitude of the fastest or most
space-consuming attack. An algorithm is strong if the best attack takes too
much time or space to be practical; it's also preferable that the best
attack be a straightforward brute-force attack, because any other attack
indicates a security hole which might easily become worse as the theory of
cryptanalysis improves. An attack's difficulty is the amount of time or
space it requires, whichever is worse, typically measured as a power of two
(like "2**64").

This table is designed to give a very rough idea of the scale of the
numbers involved. The exact time to attack a system will depend very much on
the system, the optimizations, the hardware, etc. (what computer scientists
call "constant factors"); as a result, the stuff in this table could well be
off by a factor of a thousand or more in either direction. Where we're
going, that isn't going to matter.

Difficulty   Resources required

2**0         Pencil and paper
2**8         Pencil and paper and a lot of patience
2**16        Warez d00d with a Commodore 64
2**32        Amateur with an average PC
2**40        Smart amateur with a good PC
2**56        Network of workstations, custom-built hardware
2**64        Thousands of PCs working together for several years
2**80        NSA, Microsoft, the Illuminati
2**128       Cubic kilometers of sci-fi nanotech
2**160       Dyson spheres

Some real-life examples: the EFF's
"Deep Crack" machine uses custom VLSI to solve a problem of
difficulty 2**56 in a few days. The distributed.net
RC5 effort is attacking a problem of difficulty 2**64 and expects to finish
within a few years. The maximum keyspace of symmetric cryptography
exportable without a license from the United States, before they made the
rules more complicated recently, was 2**40. Standard practice in most crypto circles is to use symmetric encryption
keyspaces of 2**112 or 2**128. Those aren't directly comparable with the
cryptographic hash sizes we're talking about, but the point here is the general
size of the numbers involved.

There are two ways in which you could break a cryptographic hash
function: you could find an input that will produce a specified output, or
you could find two inputs that produce the same output. For a hash with n
bits of output, assuming it's cryptographically perfect, the difficulty of
finding an input for a specified output is 2**n time, negligible space, and
to find two different inputs with the same output, 2**(n/2) time and space.
So, the two basic attacks we could apply to the HQ password (assuming it was
perfect) are:

1. Brute-force search: try inputs until one gives the right
output, finds an input for a chosen output in 2**64 time and 2**0
space.

2. Birthday Paradox attack: try inputs, remembering their
hashes, until we find two with the same output. Finds such a pair in 2**32
time and space.

The prompting code in Cyber Patrol limits password inputs to 8 characters,
and we bash them to lower case and can assume the high bits are zero, so
there are in fact only 48 bits of unknown input to the hash. That reduces the
time for a brute-force search:

3. Brute-force search with assumptions about input: finds
password for a given hash, in 2**48 time, 2**0 space.

That's starting to get into the range where it's feasible to attack. In
fact, this attack lends itself to a time/space tradeoff because the hash is
unsalted. Let's talk about Unix and salt for a minute. The Unix operating
system uses a similar concept of password hashing. It doesn't store the
user's password in the password file - only a hash (based on the DES
algorithm) of the password. If it were a pure hash of the password, then a
determined cracker could compile a codebook of all possible passwords and their
hashes, and just look up each hash in the list. Also, the cracker could go
down a list of users and recognize who had the same password (because they'd
have the same hash) even without knowing what the password was.

To prevent such attacks, Unix has a concept of "salt". Before
hashing the password, the system chooses a random number called the salt. It
then hashes the salt along with the password, and stores the salt and the
resulting hash in the password file. Now a given password can hash to a lot
of different values, depending on the salt. The attacker can't compile a
codebook, or at least the codebook would have to be a lot bigger,
because its size is multiplied by the number of possible salt values. The
attacker can't identify pairs of users with the same password, because
they'll probably have different salts and so different hashes.
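The idea can be illustrated with a toy sketch of our own - using FNV-1a as a stand-in hash; a real system would use a deliberately slow, cryptographically strong password hash, and the names here are ours:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* FNV-1a over a byte range, seeded with h. */
static uint32_t fnv1a(const unsigned char *p, size_t n, uint32_t h)
{
    size_t i;

    for (i = 0; i < n; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Hash the salt along with the password. The system stores both the salt
 * and the resulting hash, so the same password yields a different stored
 * value for every salt, defeating precomputed codebooks. */
uint32_t salted_hash(uint16_t salt, const char *pw)
{
    unsigned char s[2];
    uint32_t h;

    s[0] = (unsigned char)(salt & 0xFF);
    s[1] = (unsigned char)(salt >> 8);
    h = fnv1a(s, 2, 2166136261u);            /* mix in the salt first  */
    return fnv1a((const unsigned char *)pw,  /* then the password      */
                 strlen(pw), h);
}
```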

Cyber Patrol doesn't use any salt. As a result, it's vulnerable to a
codebook attack, which as you can see is really just a time/space
tradeoff of the brute force search above:

4. Codebook attack: 2**48 time and space to compile the
codebook once and for all, 2**0 time to look up each password.

All those attacks apply to any hash function with this amount of input
and output, no matter how cryptographically strong it may be; to avoid
them, Microsystems would have to use a longer, stronger hash function.
We like SHA1; it's got a 160-bit output and no known attacks better than
brute force, so it takes 2**160 time to attack by brute force search,
and 2**80 time and space for the much weaker birthday paradox attack.
Of course, the prompting routines in Cyber Patrol would have to be
modified to accept longer keys; with 48-bit passwords, about all they
could do would be add salt, and pray.

The first cryptographic hole that's apparent in the Cyber Patrol hash is
that the per-character processing is bijective, for a given input character.
What that means is that if we know the input character and the output, we
can be absolutely sure of what the state was before that character. All we
have to do is reverse the steps; we won't dwell on how to do that here
because we have better things to do, but it's not hard to figure out.
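For the curious, the CRC32 half of a step can be run backwards like this. This is our own sketch against the standard CRC32 table rather than CP's nonstandard initialization; the index recovery relies on the 256 table entries having pairwise distinct top bytes.

```c
#include <stdint.h>

static uint32_t tbl[256];
static unsigned char top_to_index[256];

static void crctab_init(void)
{
    uint32_t i, r;
    int j;

    for (i = 0; i < 256; i++) {
        r = i;
        for (j = 0; j < 8; j++)
            r = (r & 1) ? (r >> 1) ^ 0xEDB88320u : r >> 1;
        tbl[i] = r;
        top_to_index[r >> 24] = (unsigned char)i; /* top bytes distinct */
    }
}

/* Forward per-character step. */
static uint32_t step(uint32_t crc, unsigned char c)
{
    return (crc >> 8) ^ tbl[(crc ^ c) & 0xFF];
}

/* Inverse step: (prev >> 8) has a zero top byte, so the top byte of the
 * new state equals the top byte of the table entry used; that identifies
 * the table index, and the previous state falls right out. */
static uint32_t unstep(uint32_t after, unsigned char c)
{
    unsigned char idx = top_to_index[after >> 24];
    uint32_t hi = (after ^ tbl[idx]) << 8;   /* bits 8..31 of prev */

    return hi | (unsigned char)(idx ^ c);    /* bits 0..7 of prev  */
}
```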

The bijective nature of the hash function makes it possible to do a
meet-in-the-middle attack. To find an input for a given output, we hash
about 2**32 guesses for the first half and store them. Then we hash
backwards from the known output, about 2**32 guesses for the second half
of the input. After doing that amount of work we expect to find about one
pair of a first half that goes to an intermediate value with a second half
that comes from the same intermediate value, and then we've broken the hash.
This is about the same amount of work as the classic birthday paradox attack,
but it allows us to find an input for a chosen output instead of just two
inputs for the same output. It's enough to indicate that the Cyber Patrol
hash is a lot weaker than a theoretical perfect cryptographic hash of the
same size. But as we'll soon see, we have much stronger attacks on it.

5. Meet-in-the-middle attack: input for a chosen output in 2**32
time and space.

Before we can talk about the difference between CRC32 and cryptographic
hash functions, we have to learn some math. Let's talk about modulo
arithmetic. Suppose you have an equation involving integers, addition,
subtraction, and multiplication, like this:

12 + (34 * 56) = 1916

In modulo arithmetic, we have a number called the modulus, and we say that
any two numbers are considered equal (or "congruent") if they have the same
remainder when you divide them by the modulus, or in other words, if they
differ by exactly a multiple of the modulus. It's an amazing fact that if
you take any equation like the one above and replace all the numbers with
their remainders when you divide by some modulus, you get a congruence
that works. For instance, let's take all the numbers modulo 9; we can do
that by adding up all the digits, and then if the result is two digits or
more, adding up those, etc. This is the old accountant's trick of checking
computations by "casting out nines":

3 + (7 * 2) = 17 = 8 (mod 9)

Note that 17 (or 8, which is the same thing) is what you get if you take
the digits in 1916 and add them up. All the numbers in the mod 9 congruence
come from adding up the digits in the equation above.

In modulo n arithmetic, there are really only n numbers: 0 up to n-1.
Anything else is congruent to one of those. To understand CRC32 we're going
to use the second most boring kind of modulo arithmetic: modulo 2. In our
mod 2 universe there are only two numbers, 0 and 1. Are you starting to see
how this could be relevant? Incidentally, the very most boring kind of
modulo arithmetic is mod 1; everything equals zero. Anyway, let's look at
the addition and multiplication tables for modulo 2:

  +  0  1          *  0  1
  0  0  1          0  0  0
  1  1  0          1  0  1

Notice anything? The addition table is the same thing as what we like to
call ^ (XOR) and the multiplication table is the same thing as what we
like to call & (AND). This connection is important because it means we can
take programs which do bit twiddling, and apply the mathematics of
addition and multiplication to figure out what's going on.

The CRC32 algorithm simulates a linear feedback shift register. We won't try
to draw the diagram here, but the concept is simple: you shift bits out of the
register one at a time, and you have a mask that's as long as the register,
and you XOR the stream of bits you shift out with the stream of input bits,
and every time the result is 1, you XOR the mask into the register. Why
this is useful is best explained in
[RNW93]; we'll just
take it for granted that in communications error checking, this is a
sensible thing to do. The 8-bit shift, and the magic table, are just a
clever way of doing a whole byte at once. The result is the same as if
you did the bits one at a time.

What's important cryptographically is that after you've performed one cycle,
every bit is either equal to the bit next to it (if its mask position was 0),
or it's equal to the XOR of the bit next to it, the bit that just got
shifted out, and the input bit (if its mask position was 1). Let's say we
have a short register, four bits long, with the mask value 1011. Let a[0]
to a[3] be the previous state, b[0] through b[3] be the next state, and x
be the input bit. Then we have:

Notice how we're using the + symbol to denote bit XOR; in modulo 2, it's
the same thing. The most important thing to notice here is that every
output bit is a XOR of some combination of the input bits. If you take
two output bits and XOR them together, you find that the result is still
simply a XOR of some of the input bits:

Note that we can get rid of all the terms that appear twice or any even
number of times, because each such term is effectively multiplied by an even
number, and all even numbers are equal to zero in modulo 2 arithmetic. OK, it
takes some getting used to, but it really is easy once you see it.

With the CRC32 algorithm, there are lots more variables than in that
example, but this same general property holds: the CRC32 bits are each the
modulo 2 sum of some combination of the input bits. Actually, in strictly
correct CRC32 it's a little more complicated because the register doesn't
start at 0, so you have to XOR in some extra bits that depend on the length
of the input. But fortunately for our discussion, Cyber Patrol uses a
nonstandard variant of CRC32 which initializes the register to 0.

We can figure out a congruence for each output bit by feeding inputs into
CRC32 which consist of only one bit set to 1 and all other bits set to 0. If
an output bit is set for that input, then the input bit that was 1 is
included in the congruence for that output bit. Once we've tested all the
input bits this way, we have the entire congruences.

To save space on the page, we will write out our congruences with all the
variables included, multiplied by 0 or 1, like this:

and then leave out everything except the coefficients (the 0 or 1 we
multiply by):

000111000
001000001
010011010
100011100

That block of bits has all the information from the complete set of
congruences. It's what mathematicians call a "matrix", but there's
no need to think too hard about what that means; just call it an abbreviated
set of congruences. Each row is equivalent to a congruence, and we can add
congruences to each other (the same way we did earlier) by XORing one row
into another. If we shuffle the rows around and XOR them enough, we can
solve for whatever variables we want - it's just like the simultaneous
equations we learned in high school. For example, if we desperately want to
know the value of a[1] in the example, we can say:

We wrote code to figure out the set of congruences for the CRC32 algorithm
as implemented in Cyber Patrol, using all 32 output bits and 96 input bits
(12 characters, which is overkill, but bits are cheap). Here's the resulting
matrix. Input bits come first: bits 0 to 7 of the last byte, then
bits 0 to 7 of the second-to-last byte, and so on; after that come the 32
output bits, again in order from 0 to 31. (This ordering of bits was easiest
to program; it doesn't really matter.)

Notice the diagonal stripe at the right-hand side of the matrix. That stripe
reflects the fact that we have one congruence for each output bit, in order.
Now, by flipping and XORing rows (the exact technique we used is called
Gauss-Jordan elimination, which is just a formalization of the "system of
equations" technique you do in high school) we shuffled that stripe over to the
left, to get a set of congruences that tells us the last bits of the input based
on the output and the other input bits:

The diagonal isn't perfect here for a couple of reasons. One, we didn't
solve for bits 5 and 7 of any input byte, because we don't have a choice for
the values of those bits. We can't allow the congruences to force impossible
values on them. Two, we found that some pairs of input bits are actually
equivalent to each other. There's no way to solve for both of such a pair -
you just have to choose a value for one or the other of them. But the point
is that from this "solved" or "inverted" matrix, we have
a set of congruences where we can choose all but 32 bits (including all the
output bits of the CRC) and then just by evaluating the congruences, we get
the values for the other input bits to make the whole thing work.

This is why CRC32 is not a cryptographic algorithm: anyone
who knows a little matrix algebra can generate inputs for a chosen output in
minimal time. We extracted the columns from our solved matrix and put them into
tables in cph1_rev.c. The reverse_crc() function
takes parameters specifying the input length, some choices for the
"free" input bits, and the desired output value. Then it fills in
the "forced" bits for the chosen input length, and goes through
its tables XORing together all the columns that correspond to 1 bits in the
partial input. Then the result contains all the values for the rest of the
input bits. It unpacks them and returns the completed input. The bit packing
order is also in tables. Some checking of the result may be necessary
because if you ask for a very short CRC input, there may well be no such
input for the output you requested.

Now we can attack the Cyber Patrol hash just as if it were
a strong hash with only 32 bits of output:

6. CRC-breaking Birthday Paradox attack: choose an output for
the CRC32, generate inputs with that CRC32 value, save their hash values,
until you have two the same. Finds such a pair in 2**16 time and
space.

7. CRC-breaking brute force: choose an output for the CRC32,
generate inputs with that CRC32 value until you find one with the hash you
want. Finds an input for a chosen output in 2**32 time and 2**0
space.

But because we know that the input password is only 8 ASCII characters,
bashed into lowercase by the OR with 0x20, we can improve the attack still
further. With 8 input characters there are 48 bits unknown, and of those,
we can derive 32 from the set of congruences. That leaves us only 16 bits
unknown. We can simply test all possible values for those 16 bits, by
brute force. This is the attack implemented by
cph1_rev.c.

8. CRC-breaking brute force, with assumptions about the input:
finds input for a chosen output (assuming it really came from an 8-character
ASCII password) in 2**16 time, 2**0 space. Shorter passwords take much less
time, so we can test all shorter passwords first, without increasing the
cost noticeably.

Our code for this attack isn't particularly optimized, and we haven't
attacked the rotate-and-add part of the hash at all, but even so we can
now reverse the function in a fraction of a second. At this point it's
appropriate to categorize the Cyber Patrol HQ password hash function as
blown wide open, thus fulfilling our first goal.

The primary function of Cyber Patrol is to do filtering based on a
database of URLs that are not allowed. CP can also be configured to work
with the inverse policy, denying everything but a small set of allowed URLs,
a very strict policy.

The source of the URLs that are denied is a file called cyber.not.
Conversely, the file cyber.yes is used for the "allowed" URLs.
Currently the not file is about 600Kb, with the yes file
only about 50Kb. Roughly, this means there are less than one tenth as many
allowed URLs as there are denied ones. These files are actually processed
into another format and file called cyber.bin for use by the
different Cyber Patrol modules. We will not discuss the format of this file
since the raw files are much more suited to our needs.

Before we broke the encryption on the cyber.not file, we did some short-range
correlation measurements on it, to check for polyalphabetic encryption (like the
classic "simple XOR" algorithm). The idea here was to measure, for various
values of k, how often the byte at offset x was identical to the byte at offset
x+k. We used a simple Perl script to compile these statistics.

We found that very often, after some byte value would occur, the same byte
was repeated again six bytes (or, somewhat less often, seven bytes) later. That
would point towards some kind of modified polyalphabetic cipher with a key
length that switched between six and seven bytes, or some very weak cipher that
didn't conceal repeats in the plaintext, and record sizes of six and seven bytes
in the plaintext. It turned out to be the latter, but we didn't find that out by
cryptanalysis. Reverse engineering yielded the answers more easily, as described
earlier.

However, those six- and seven-byte repeats seemed interesting, especially
once we had decrypted cyber.not, reversed its header format, and split it out
into its component tables. The first table we figured out was the last one in
the file, the table of blocked newsgroups.

When reversing a file format it makes sense to start at the easy end,
and with the cyber.not file that means the table used for newsgroup masks.
What made this the obvious starting location is that it's immediately
apparent - from a fast ocular inspection - what it is for. A section
might look something like this (some unprintable characters replaced by
question marks):

Some quick looking around would show that "ED" marks "End-of-Data" and
consequently "SD" ought to mark "Start-of-Data". The first thing to do is
try and see what constrains a record. Either there's a length count, a special
marker, or the records are fixed length. They are not fixed length, that much
is apparent, but they all end in an asterisk, so maybe that is a record
terminator? Could be, but the first string, the "Copyright..." ends with
0x0d. To make sure, we must see if we can find a length marker, a common way to
handle variable sized records, and also the preferred method in many ways. Now,
because there are three "non-character" bytes between SD and the first
string, we try this pattern on the other strings, and it checks out. Thus, we
can guess that a record starts with three bytes, followed by a variable number
of characters. A way to verify this result is to look at the end of the table,
to see what kind of data ends the last record.

Now, for the length, that would have to be one of the three bytes, so it
shouldn't be too hard to locate, if indeed it exists. The first record starts
'22 00 04' and then the string. The string is clearly longer than zero or
four characters, and definitely shorter than 1024 (0x400), which leaves 0x22.
By counting we see that the string is 31 characters, or 0x1F in hex. Very
close to 0x22 - and where could the difference come from? From the three
prefix bytes, of course.

This leaves one or two bytes per record unaccounted for. How many depends
on how large the length specifier is. Looking around we find examples of
records in which the second byte is something besides zero, which would make
the length way too long, and thus we can conclude that we are working with a
length byte and not a word. Now then, what about the two remaining
unaccounted-for bytes? As you may remember from the overview, we mentioned
that there are sixteen different categories available for filtering. The two
remaining bytes are actually a blocking mask - a word whose bits map to those
sixteen categories. We call these records TNotNewsEntries.

Having found the start and end, and thereby the size, of one table means
we have some usable information for reversing the next-easiest part of
the file, the header. Our header looks like this in binary:

Seeing how "SD" marks the start of data, we can calculate the size of
the header as 0x2C bytes, meaning the first table (of which there are at
least two, one being the table for newsgroup masks) ought to start at 0x2C,
the location of the "SD" marker. Experience tells us that headers often
contains pointers (offsets), lengths, and counts (of records). Let's see if
we can match any of those types into the header.

We begin by looking for pointers into the different tables. We know that
the first one starts at 0x2C (the "SD" marker, remember?) and it just so
happens that we have two viable candidates, the first at offset 0x02 and
the second at 0x1A. The first one is off by two, so it's a little less
likely to be the one we want. That's too little to go on, but we also know
where the newsgroup table starts: 0x65AAE. Wouldn't you know, there it is at
offset 0x24 of the header. Now, let's see if we can find any lengths in there.
We begin by measuring the size of the newsgroup table. It runs from 0x65AAE
to the end of the file at 0x68065. Subtract and you get 0x25B8, and there
it is in the header, following the offset. We can then verify our findings
by checking them against the record of the table starting at 0x2C.

Now we have a little more to work with, so we try to determine the
size and the location of the first of the header's table records. We know
that one of the table entries ends at 0x23 at the latest, because that is
where we have our offset to the newsgroups table. We can also be sure
that offsets 0x1E through 0x21 belong to the other table, because
those bytes contain the length of the first table (whose contents are so
far unknown). That leaves only two bytes between them. Implicit in this
is a limit to the size of a record, because we know that the newsgroup
entry ends at 0x2B at the latest, so it can run from 0x22 through 0x2B
(ten bytes) or 0x24 through 0x2B (eight bytes). And therein lies the answer
to the question of the record's length, because we can see that the newsgroup
record ends with the length field, and for that to hold the other record
should too, which would orphan the two bytes between them. In this way we
can come to know that a record is in fact ten bytes, consisting of two
bytes (meaning unknown), a longword (table offset), and a longword
(table length). Working backwards we can fill out the entries like this:

This clashes with the structure as we know it, and so we assume that
there are only three records, the data before them having some other
structure. Looking, again backwards, we notice that the word following the
first table entry is 0x0003, which could mean that it's a count of the
number of tables, right? By checking against another file with the same
structure, the hotlist.not, we could see that this assumption was correct.

The little that is left of the header is not as important as locating the
table entries and their count, but it seems that the 0x2A at offset 0x02
is the header size, assuming the header starts at 0x02 and that the two
bytes in front of it are unrelated to it. The "CH" seems to be a marker;
the hotlist.not contains "HH" instead. Without more files to compare to,
or time-consuming debugging of the executable, the few bytes left unaccounted
for will remain a "mystery".

We learned several important things from the newsgroups list. First,
Microsystems likes putting length bytes on things. Second, the blocking mask
0x000E (corresponding to "Partial Nudity", "Full Nudity", and "Sexual Acts /
Text") is the most popular one. It appears that that's the generic "porn" label
which they slap on everything that looks like it might be porn, whether it
technically applies or not. Both these facts were useful in attacking the other
two tables in cyber.not.

The first table mentioned in the header is the biggest one. At over half a
megabyte, it makes up most of the bulk of the cyber.not file. As our previous
measurements indicated, this table includes a lot of repeats at a distance of
six or seven bytes. Character frequency counts revealed that the top three
characters in table 1 are:

0x00 (106280 times)
0x0E (65483 times)
0x07 (25212 times)

We know that they like using blocking mask 0x000E, and the bytes making up
that number are the top two most frequent bytes in the table. We know they like
length bytes, we know there's some kind of structure in here with a size of
seven bytes, and 0x07 is the third most frequent byte value. This looks
promising. Let's look at a hex dump. This dump was generated with the Linux
od -Ax -txC command; offsets are from the start of table 1 as specified
in the cyber.not header.

This may appear quite formidable to someone unaccustomed to reading hex
dumps, but careful examination reveals some interesting things. First of all,
the sequence "0e 00" occurs quite frequently. It's reasonable to suppose that
that might be the blocking mask for a page or site. Another common one is
"07 0e 00". When that one occurs, there are often four more bytes
and then those three again. These patterns are easier to see when one
examines more of the dump than the short sample here.

It's reasonable to guess that the 07 is a length byte, just like in the
newsgroup list. But that doesn't explain why we get so many repeats at distance
six. The byte value 0x06 is only the 39th most common value in table 1, even
though there are far more repeats at distance six than seven. So not everything
can be tagged with a length byte, or there's something else we don't
understand.

Further skimming of the hex dump revealed inspirational passages like
this one:

Here we've obviously got our generic porn mask of 0x000E, alternating with
four unknown bytes, the last of which often seems to be incrementing - but not
always. Scanning across the table, we saw that when this kind of six-byte
structure occurred, the four mystery bytes seemed to more or less increment
smoothly from the start of the table to the end. But it was always the last byte
that incremented first, and then the second-to-last, and so on. In other words,
the field is being stored in "big endian" byte order, the exact opposite of the
"little endian" byte order conventional on PCs. Why would a PC software package
bother doing something in big endian when it's running on a CPU designed for
little endian?

At this point we had to depend on intuition. There is one thing that's 32 bits
long and big endian everywhere, even on a PC: that is an IP address. Some
computers like big endian and some like little endian, but it is standard for
all Internet protocols to use big endian regardless of what kind of system
they're running on - so that they'll all be able to talk to each other. An added
bit of evidence is that the actual values of this four-byte field
seem to be distributed the way one would expect IP addresses to be distributed.
Lots of them start with bytes like 0xCF, which puts them right in the popular
part of the Class C IP address space. So, let's write the decimal equivalents of
the supposed IP addresses next to the hex dump:

Notice that these are not in numerical order; 216 is not normally considered
to come between 21 and 22. However, considered as decimal representations, these
addresses are in strict alphabetical order. This list is the kind of thing you
might get if you took a text list of URLs and passed it through a sort utility
designed for text. A little examination reveals that these six-byte structures
in table 1 are strictly in this "text IP" order across the entire table. As a
final confirmation that these numbers are intended to represent IP addresses,
just point a Web browser to a few. Almost all are porn sites.

At this point we had figured out that there were a lot of blocking masks
interspersed with IP addresses in the table, and also a lot of seven-byte
structures starting with a length byte and a blocking mask. But the remaining
four bytes of those seven-byte structures were apparently not sorted, nor IP
addresses, and there were still some bytes that didn't fit into either kind of
structure. So we wrote a Perl program to dump out the known structures and label
the unknown parts.

The next step was simply to stare at the output and look for patterns. We saw
that the six-byte and seven-byte records often occurred in blocks of lots of the
same kind all together. The unknown part often seemed to consist of the byte
0x0B followed by a blocking mask and eight bytes of garbage. We guessed that
that might be a third record type, so we added it to the dumper program, and
noticed that the remaining unknown sequences often seemed to consist of 0x0F, a
blocking mask, and then twelve bytes of garbage. From this we inferred a general
pattern: a length byte (always 3 plus a multiple of 4), a blocking mask, and
then some amount of garbage, always a multiple of four bytes.

Between this and the six-byte IP/mask pattern, almost all the contents of
table 1 fit some kind of structure. But there were still a bunch of zero bytes
hanging around. A reasonable guess was that these signalled some kind of
"end of structure" condition. It only took a little more intuition to realise
that of the "length byte" records and the "IP address" records, one logically
went inside the other. Unfortunately, we guessed that the "IP address" records
went inside the "length byte" records, and that confused us for quite a while.
Here's part of the output from our dumping program at this stage:

In this dump, the four-digit numbers in parentheses are abbreviations for
"IP address" records, showing only the blocking mask part. We had
already figured out, although it's a break with the tradition set elsewhere
in the file, that in the six-byte IP address records, the blocking mask comes
at the end instead of the start. Not shown in this dump is the enormous
variability in the number of IP addresses apparently associated to each
"length byte record"; some had dozens, many had none at all.

Also, although it looks okay in this fragment, there's a critical problem
of how to recognize which records are which. The dumping program would guess
what looked like a plausible IP address, but it sometimes guessed wrong and
produced junk until it happened to randomly re-synchronize. It appeared that
IP records with a blocking mask of 0x0000 helped signal "OK, length byte
records coming now", and a length byte of 0x00 (not shown here) signalled
the start of a list of IP address records, but these things raised problems
because it appeared that in a list of IP addresses, there would always be one
more address than there were blocking masks. Where would the blocking mask
for the last IP address come from?

Late one night, under the influence of a couple bowls of MSG-saturated Korean
instant noodles ("kimchee" flavour), we realised what we should have
seen all along. The "IP address" records are actually the major
records, and the other records go inside them, as children of a parent IP
address. This makes more logical sense, given the purpose of the file; the
package blocks either an entire IP address, or one or more subsections of an
IP address. Then the rest of the structure fell out easily.

The basic record contains an IP address and a blocking mask. If the blocking
mask is nonzero, it applies to that entire IP address. If the blocking mask is
zero, then there are a number of subrecords, each consisting of a length byte, a
blocking mask, and one or more four-byte unknown fields. A length byte of 0x00
terminates the list of subrecords and signals a new IP address.

Now, what about those subrecords? Well, they obviously represent some kind of
subdivision of an IP address - like, for instance, a directory full of Web
pages. Here's an entry from table 1, decoded by a more sophisticated Perl
program that also incorporated reverse lookups of the IP addresses:

This particular entry stood out partly because bc1.com is an ISP local to
one of us. We have friends with pages on that system (although not, as far as
we could tell, at the particular URLs blocked by Cyber Patrol). It also stood
out because all the subrecords start with the same four-byte sequence. That's
a pattern that appears in lots of other entries, too; there will often be a
site where several subrecords start with the same four-byte sequence. Here's
a good example (it's long, so we've left out part):

Notice how the four-byte values seem to be grouped together in an
hierarchical structure. Just like directories... It seemed a reasonable guess
that in fact, that's what they were. If they wanted to block a URL like
http://www.foo.com/bar/baz/, maybe they'd do it by creating a record with the
IP address of www.foo.com, and a subrecord with some representation of the
strings "bar" and "baz".

We said "some representation of the strings". What, exactly, does
that mean? Well, it would be quite reasonable to suppose that these four-byte
fields are hashes, similar in nature to the password hashes. They could feed
each URL component into a hash function, store only the hashes, and then have
enhanced security as well as various efficiency advantages.

We figured out the exact nature of the hash function with the aid of the
bc1.com entry. As you can see above, every subrecord for that server starts
with the hash value 0xD2A152F4. If you look on the corresponding Web site,
you find that it's an ISP's server for user home pages, all of which are
stored in a "users" subdirectory. And it just so happens that in
the nonstandard CRC32 variant that was used as half of the HQ password hash,
the hash of the string "users" is 0xD2A152F4. Problem solved.
We've designated this structure TNotURLEntry.

Above we explain the cryptanalysis of CRC32 in considerable detail, and we
show how to construct, in negligible time, an input that will generate any
output of our choice. As with the passwords, Cyber Patrol doesn't use any salt
for its URL hashes, so we can recognize where there are duplicate directory
names even without reversing the hashes, and get extra value for each hash we
reverse because the same reversal will be valid for all other occurrences of
that hash.

Unfortunately, there is what might be called an "information
theoretic" problem with reversing these hashes. There are many possible
directory names that could generate the same CRC. We can never be absolutely
sure which of several equivalent (same CRC) URLs was actually meant to be
blocked. In the case of the HQ password, we could use the other half of the
hash output to recognize which one was correct, but here, that doesn't work.
In a perverse way, shortening the hash has actually increased its security.
But one good thing for us as attackers is that of the many possible strings,
only a few will be meaningful. Given the choice between "sex" and
"dkbgl~3.a7df", few would argue with our choice of "sex".
For the small number of hashes which are hashes of very short strings, we
can guess that the short strings are really correct - there are so few possible
strings of five or fewer characters, that they're almost certainly right.

But for most hash values, the CRC32 reversal isn't really very helpful.
For any given hash it generates a long list of possibilities, most of which
are garbage. Instead of sorting through them, we fell back on the old reliable
dictionary attack. We took a list of words and hashed them all, and then
started modifying them by tacking tildes onto the start (to make it look like
user home directories), adding letters to the start and end, adding
".htm" and ".html" to the end, and so on.

The source file "cndecode.c" implements this attack on the
cyber.not file, as well as incorporating decryption code, some prettier
output formatting, and (for systems where this works) reverse DNS lookups.
It uses a hash table, and remembers the reversal of each hash for use on
future occurrences of that hash, in an effort to be as efficient as is
reasonable, although the prime emphasis was on expediency in programming
over squeezing out the last CPU cycles.

As a last resort, if it can't find a hash in the dictionary, the cndecode
program goes through all the possible reverse-CRC values up to a configurable
limit, assigning scores to them based on how plausible they seem, and then
chooses the best. That takes a relatively long time (significant fraction of a
second) per hash, and it doesn't really work very well, but it does catch a few
that aren't caught by the dictionary attack. Here's a sample of the output:

As this shows, URLs tend to be sorted within a given IP address. The ones
that aren't in sorted order are probably ones for which the reverse-CRC didn't
guess the right reversal. A more sophisticated version might attempt to detect
the sorted order, and force the reverse-CRC to choose a reversal which would
fit into the sorted order, but the amount of work involved would probably be
more than it's worth.

This entry also shows something else we haven't talked about yet -
"alias" IP addresses, which are the apparent purpose of the
one remaining table in cyber.not. The structure can be seen in the
TNotIPEntry. These aliases are just that: each entry
consists of a root IP and one or more aliases to it. The root
IP corresponds to entries in the URL table, and any resource banned under
the root IP will also be banned under its aliases. The aliases may or may
not resolve to the same machine; the assumption is that these IPs are
serving the same pages.

Let's talk briefly about hash collisions. The chance that any two
randomly chosen URL components happen to have the same hash is one in
2**32, which is not very likely. This holds even with the uneven
distribution of URLs, because for all its cryptographic weakness, CRC32
is a reasonably good hash purely as a hash. So at first glance, it
doesn't seem like different URLs having the same hash will be a big
problem.

But the birthday paradox comes into play, too. With 2**32 possible
hash values, there starts to be a serious chance of collisions as soon as
the number of hashes gets past 2**16, which is 65536. It's certainly easy
to imagine that a large ISP could have more than that many user home pages
at the same location in their URL tree. Then two or more different
pages would have the same hash as far as Cyber Patrol is concerned, and any
block on one such page would hit the others. Given the current size of
the Net and the size of cyber.not, there probably aren't any real examples
of this kind of problem in the cyber.not file. But there is very little
safety margin. A 64-bit hash would remove any suggestion of collision
risks, at the cost of a considerable increase in filesize.

Of course, using a 64-bit hash would improve our ability to attack
the cyber.not file too, by reducing the number of possible URLs for each hash
value. Remember how having the second half of the HQ password
hash made it so much easier to unambiguously reverse the hash?
Information theory makes this tradeoff unavoidable: the fewer possible
collisions, the easier and more unambiguous dictionary attacks will
necessarily become. Given that bytes in cyber.not are somewhat expensive
(because the file has to be transferred to all the users in updates all
the time), the choice of a 32-bit hash is probably reasonable, even though
it has some small risk of creating false blocks.

A more practical security measure would be to salt the URL hash. In
the section on the HQ password we described how salting that hash would
make dictionary attacks on the password much harder. With the URL hashes
that becomes all the more significant, because here we aren't attacking
just one hash value but a few tens of thousands of hash values all at
once. So anywhere we can recognize that
two hashes are the same, that's a win, and any time we hash a dictionary
word, we can easily check it against all the hash values in cyber.not all
at once.

If every URL in cyber.not had been hashed with a different salt
value, then we would have to hash an entire dictionary for every URL instead
of just hashing one dictionary for the entire file. That would raise our
time for a dictionary attack from a few CPU minutes to a few CPU months -
we could still do it, possibly by recruiting a network of volunteers to
compute cooperatively, but not as easily as the present attacks.

They wouldn't even need to make cyber.not any bigger to get the
benefit of salted hashing - they could just use the offset of each URL in
the cyber.not file as its salt value. Salt doesn't have to be random or
secret, it just has to be different for each hash. They would also have
to upgrade the hash function to one that isn't linear like CRC32; with
CRC32, we could simply figure out the hash of the salt, XOR it out, and
then have an unsalted hash to attack normally. A much more secure
approach, which wouldn't make cyber.not any bigger, would be to take the
offset and the URL, hash them together with SHA1, and then take the bottom
32 bits of the result.

But even that wouldn't raise the difficulty of attack above the
level of competent amateurs, and indeed, there is no way to make this kind of
hashing scheme any more secure. There just aren't enough possible URLs on
the Web; it's too easy for attackers to guess all possible URLs and test
them to see which ones would be blocked. Unix sysadmins accept the fact
that attackers can test passwords offline, and attempt to educate their
users to choose hard-to-guess passwords, but censorware companies cannot
ask all objectionable Web sites to choose hard-to-guess URLs. So they
ultimately cannot defend themselves against this form of attack. With
salt in the hashes, though, they could make it a lot harder for us.

Next, the cyber.yes file contains "positive option" URLs; when
the software is configured to its strictest setting, only these URLs
will be permitted. There is also a list of newsgroups at the end that
seems to be in identical format to the one in cyber.not. A quick scan
of the decrypted file with a text lister showed that it's full of
fragments of ASCII text, like this (dump generated, amusingly enough,
by Richard E. Morris's good old DOS-based HEXEDIT program):

These look like URL fragments, but they also look sort of haphazard.
In fact we theorized at one point that they might be stray garbage from
memory allocation calls. However, they do have a purpose, and once we
had the format of the cyber.not file, the cyber.yes file became easy to
figure out.

The same correlation-counting program that we ran on cyber.not showed
similar results on cyber.yes, with strong correlation at a distance of six
characters, but unlike cyber.not, no sharp peak at seven characters. This
suggested that the format for the main table in cyber.yes would be very
similar to that of cyber.not. Examination of the hex dump showed
similar stretches of six-byte repeats, with a field incrementing in
big-endian order.

A little trial and error revealed that the format is essentially
identical: records with IP addresses and two-byte "mask-like"
fields. We say mask-like because it's not clear that they serve the same
function as the mask fields in cyber.not. When the mask-like field is zero,
there follows some number of variable-length URL records, terminated by a zero
byte. There are two significant differences in the subrecord format.
First, the URL is in plain text instead of being hashed. As a result, the
variable length can assume a less restricted set of values. Second, the
"mask" field appears to have a different significance. Here is a
sample record from cyber.yes:

The hexadecimal column is the field that in cyber.not would be the
blocking mask. Here, it's not clear what it is. It could be some kind of
anti-blocking mask, of categories NOT to block, but then it's surprising
that it would be in sorted order (a pattern that persists in other records
too), especially when the URLs are also in alphabetical sorted order.
Other possibilities for this field include some kind of time stamp, a
serial number, an index pointer, an authentication token or hash, or
random memory garbage. The "mask-like" fields on IP addresses
similarly show little apparent design, except that (just as in cyber.not) a zero
value indicates the presence of URL subrecords. The newsgroup list
has mask-like fields too, and there's no immediately obvious meaning to
the data in them.

At this point we should note the overall file structure of cyber.yes.
Unlike cyber.not which had an elaborate header, the header on cyber.yes
consists of just three bytes: one version number (or possibly encryption
key fixup), and two bytes giving the length of the URL table. We
discovered this by working backwards from the URL table until we found
that all the bytes in the file except the first three made sense as
part of the URL table. The newsgroup list follows immediately after the
URL table and continues until the end of the file, in the same format
as the cyber.not newsgroup list except with unknown data where the
blocking mask would go. Unlike the tables in cyber.not, both tables in
cyber.yes are just bare data, with no "SD" and "ED"
delimiters.

This file structure is interesting because it seems stripped down or
simplified from the structure of cyber.not. It would be reasonable to
guess that the cyber.yes format was a quick hack retrofitted onto the
product subsequent to the more carefully-designed cyber.not table. It's
also possible that the cyber.not format proved too complicated and
cyber.yes is an example of a "leaner and meaner" file format, still
keeping to the same design principles as cyber.not and likely re-using a
lot of code originally written for cyber.not.

Following are the relevant structure tables. This concludes the section
on reversing the file formats.

The problem here is the Table Type field, for which we have too little
data to fill it in with any certainty. From the files we have analysed so
far, we can build the following table of the types that have occurred and
the kind of data they pointed to.

With all these technical things resolved, let's look at the data itself. First
a table of statistics pulled from two different CyberNOT files:

Cyber Patrol URL Database Statistics

Bit  Category                            1999-04-29  2000-02-20  Change
0    Violence / Profanity                      1201        1407  +206 (17%)
1    Partial Nudity                           46538       72236  +25698 (55%)
2    Full Nudity                              45013       70248  +25235 (56%)
3    Sexual Acts / Text                       47769       74009  +26240 (54%)
4    Gross Depictions / Text                   1414        2273  +859 (61%)
5    Intolerance                                259         337  +78 (30%)
6    Satanic or Cult                            129         197  +68 (53%)
7    Drugs / Drug Culture                       197         306  +109 (55%)
8    Militant / Extremist                       187         204  +17 (9%)
9    Sex Education                              201         270  +69 (34%)
A    Questionable / Illegal & Gambling         1347        1928  +581 (43%)
B    Alcohol & Tobacco                          783        1155  +372 (48%)
C    Reserved 4                                  48           3  -45 (94%)
D    Reserved 3                                   0           0  0 (0%)
E    Reserved 2                                   0           0  0 (0%)
F    Reserved 1                                   0           0  0 (0%)
     Total URL masks                          52315       79899  +27584 (52%)

We can see that of the roughly 80,000 entries, about 90% fall into one or more of the
pornography categories. The Learning Company have a page on their site describing their
criteria for categorizing entries.
At the end it states: "Note: Web sites which post "Adult Only" warning
banners advising that minors are not allowed to access material on the site are automatically
added to the CyberNOT list in their appropriate category.". This may give the
impression that sites are automagically added as soon as they appear on the web, which certainly
isn't the case. They are most probably using a web spider to pick these up. These spidered sites
probably make up the bulk of the URLs flagged in categories 1, 2 and 3, by far the
dominant set of flags. By monitoring these statistics over a longer period one
could deduce how effective the spider is at finding new sites. The oldest cyber.not we have
available, dated 1999-04-29, contains only 52315 entries, but the ratio of "porn"
rated sites is the same, about 89%, with 46538, 45013 and 47769 entries flagged for categories
one, two and three respectively. Most of the other categories are up by between one hundred
and three hundred entries, but the porn categories, suspected to consist mostly of spidered
sites, are up by about 25000 entries each over the period (about 38 weeks).

There is a function in CP where a user can use a form to report new URLs for
consideration for inclusion in the CyberNOT list. It would be interesting to know how
many of the added URLs come in this way. It would be possible for users to team
up and exchange URLs on their own, bypassing The Learning Company, which charges
for these CyberNOT updates. By patching the CP executable, the report form could
be made to post to another server, which could also host updated CyberNOT lists.
It would take a little work to set up, but not too much. The most difficult aspect would
probably be reaching active Cyber Patrol users and convincing them that this would
be worthwhile, especially since it would require a certain amount of momentum to be
worthwhile at all. Given this threat, it's logical to assume that The Learning Company
and other censorware vendors will use even more security-through-obscurity in future
products, to deter the threat of having one of their sources of income bypassed.

Near the start of this essay we mentioned the "reserved"
blocking categories. Cyber Patrol, in addition to the twelve documented
blocking categories, has an additional four (labelled "Reserved 1"
through "Reserved 4") which are greyed out. Reserved 3 and
Reserved 4 are selected by default, and so cannot be disabled - even by
the administrator.

Any sites placed in one of those two categories will be blocked no
matter what. We found three examples in the current CyberNOT list, all of
them Japanese. Each was blocked under Reserved 4 and no other category;
we could not find any examples of blocks under the other reserved
categories.

Tsutomu Notani's home page,
which, based on the pictures, appears to include some content about horse
racing, and thus (presumably) gambling. No other blockable content is
immediately apparent.

There are a few entries in the CyberNOT list that are blocked under all
non-reserved categories. For instance, the anti-censorware site of
Peacefire is listed as containing
"Violence / Profanity, Partial Nudity, Full Nudity, Sexual Acts / Text,
Gross Depictions / Text, Intolerance, Satanic or Cult, Drugs / Drug Culture,
Militant / Extremist, Sex Education, Questionable / Illegal &
Gambling, Alcohol & Tobacco". That's not such a surprise; blocking
Peacefire has become traditional among censorware manufacturers.

The other sites blocked under all categories seem to be translation and
anonymizer services; any site where you can type in a URL and it will
present you a copy of that page. That's probably no big surprise either,
because such sites can be used to circumvent censorware. So it may be
reasonable that sites like anonymizer.com
should be blocked under all categories; potentially, they do make
available the entire range of human thought. Not all these blocks are
carefully applied, however; the
"STOP KITTY PORN"
page (which features a picture of a very bored-looking house cat) is blocked
under all categories apparently just for containing a link to anonymizer.com.
Here, as elsewhere, the blocking list doesn't seem to be updated very frequently.
The server at 207.55.200.2 (whose
reverse-DNS resolves to "www.live4u.com", although that doesn't resolve in the
forward direction) seems to be an ordinary portal site, with no obvious
translation service, but it's blocked for everything except sex education.

Of course, the most interesting things we could find on the blocking
list would be sites about political or social issues. Other censorware
packages have gotten in a lot of trouble, for instance, by blocking sites
like the National Organization of
Women, and a great many gay and lesbian sites. The CyberNOT list seems
relatively free of that kind of political agenda, which could be a good or
a bad thing depending on your point of view. If the software is to be
installed in public libraries, it's good that it won't block these
politically-important sites. Of course, it would be better if it didn't
block any sites at all. On the other hand, if you were a parent who
considered feminism or homosexuality to be unimaginably horrid subjects,
then you might feel ripped off by Cyber Patrol's not blocking the
high-profile sites.

Let's take a closer look at the Intolerance category. While they
do block smaller sites, such as this one on atheism,
which we feel is relatively benign, they also block such a high-profile site
as www.godhatesfags.com and part of the
American Family Association, whose views on
homosexuality can hardly be described as anything but intolerant. The AFA is one
of the organizations pushing for the installation of censorware in US libraries.
One can only assume they'd prefer one of Cyber Patrol's competitors.

Some other sites in this category:

Matthew R. Galloway's homepage.
Contains the word "Voodoo" in a reference to voodoo-cycles.com,
and a pretty famous joke file entitled Top 10 Reasons Why Beer Is Better Than Jesus.
#1 being "If you've devoted your life to Beer, there are groups to help you stop.", BTW.

Misha Verbitsky's old homepage.
Seems perfectly ordinary. Some papers, a couple of usenet archives. Note that this page
was frozen several years back, so whatever it was censored for is still there.

The Justice on Campus Project's mission is to preserve free expression and
due process rights at universities. Our online archive includes reports on
disciplinary charges, speech codes, and censorship on college campuses around
the country. The Project was one of 20 plaintiffs in the ACLU's successful
challenge of the Communications Decency Act.

This site contains nothing but the
text "Welcome!". If that's enough to be branded a "Satanist", we can expect a rapid
growth in bans. If nothing else, this is another example of how the bans grow
outdated as time goes by, but The Learning Company doesn't seem to care much.

webdevils.com - "Experiments with sound",
a site which has nothing to do with religion, or lack of it. Guess the hostname
was enough in this case.

There is one political issue the CyberNOT list doesn't shy away from:
that of nuclear disarmament. All sites relating in any way to war, bombs,
explosives, or fireworks, both for and against, seem to be eligible for
blocking as "Militant / Extremist". Most are also classed as
"Violence / Profanity" and "Questionable / Illegal &
Gambling", whether those categories seem to apply or not. For
instance:

Founded in 1981, the Nuclear Control Institute (NCI) is an
independent research and advocacy center specializing in problems of
nuclear proliferation. Non-partisan and non-profit, we monitor nuclear
activities worldwide and pursue strategies to halt the spread and
reverse the growth of nuclear arms. No Bomb! In particular, we focus
on the urgency of eliminating atom-bomb materials ---plutonium and
highly enriched uranium---from civilian nuclear power and research
programs.

Is that an extremist position?

A personal site
including a lot of different material, apparently blocked for something
called "The Nazism Exposed Project". From the blocked page:

Nazism, fascism and extreme nationalism are today at its highest
peak since the destruction of Hitler's dictatorship in 1945.
Today, all over the world, fascists and extreme nationalists win millions
of votes on their simple racist solutions to very complex problems
of the society. In the streets, Nazi boneheads are spreading fear
by using murderous violence and terror. These fascist groups blame
the cultural and ethnic minorities for the problems in our society.
These individuals, and their political leaders, are a threat to our
democracy, and to everything that is decent.

The former location of the
American Airpower Heritage Museum - an apparently-legitimate museum of US
combat aircraft. Blocked as "Violence / Profanity, Militant /
Extremist, Questionable / Illegal & Gambling".

Some sites that may be blockable under a few categories are also
blocked under a great many other categories. For instance:

Teen Babe of the Month; it's a porn
site, but it appears to be a perfectly ordinary porn site. Blocked under
all categories except sex education.

http://www.xs4all.net/~stones/, a
link (not the actual site itself) pointing at a warez search engine. That
would presumably qualify as "Questionable / Illegal", but it's
flagged for everything except sex education.

http://www.danland.engelholm.se/,
a personal home page. Some content relating to warez, but nothing else
blockworthy is immediately apparent. Blocked for everything except sex
education.

The Marston Family Home
Page, with the usual round of pictures of Mom, Dad, the kids, the
dog, etc. Entire directory blocked for "Militant / Extremist,
Questionable / Illegal & Gambling", apparently just because of
this paragraph in young Prescott's section:

In school they teach me about this thing called the Constitution but I
guess the teachers must have been lying because this new law the
Communications Decency Act totally defys [sic] all that the
Constitution
was. Fight the system, take the power back, WAKE UP!!!!!

You go, boy.

It is obvious on examining the list that many entries haven't been
updated or checked in a long time. Many sites that are blocked now give
404 not found errors, or redirects to new locations that are not blocked.
Changes to Web sites may also account for some of the inappropriate
category labelling. Here are some samples of sites that seem inadequately
reviewed:

an empty page blocked in all
categories except sex education, and a
404 not found page blocked in all categories including sex
education. There are many others like these.

Another student home page
at imsa.edu, blocked as "Violence / Profanity, Militant / Extremist,
Questionable / Illegal & Gambling". Consists solely of a link to the
author's resume, which is perfectly ordinary.

A personal home page at
world.std.com. The part about his wife is nauseatingly sweet, but
doesn't really fit most people's definitions of "Gross Depictions /
Text, Militant / Extremist, Questionable / Illegal & Gambling",
which is what it's blocked for.

These are just a few examples of sites that Cyber Patrol is banning, or was.
It is not unthinkable that they might lift a few after this is published. We've
only scratched the surface as far as checking on the sites that are banned.
Going through even a few hundred takes a lot of time, and with almost 80,000
bans in effect, the work required to check them all would be enormous. We
don't have time to do it, but since The Learning Company is making money
from the supposed correctness of the list, they ought to be able to find
resources to check the list from time to time.

We know they are banning 80,000 or so URLs, but most censorware packages also
have a database of words that are not allowed to appear in incoming pages, because
that's the only way to come close to being effective at blocking new pages
on the ever-evolving and growing Internet. Cyber Patrol doesn't do that,
and so its IP and URL bans are its only real line of defence. If
you can find a site that The Learning Company have not, then there's very
little stopping you from browsing it. There is a function that can filter
a site based on substrings in the URL itself, but that is all.

Cyber Patrol is actually fairly effective at blocking sites if you don't
know how to search well. If you simply search one of the major search engines,
you will probably draw a blank, because that is very likely the exact kind of
search The Learning Company uses to bait their web spiders.
However, finding a few pages with obscene banners and thumbnail pictures is no big
problem. We could locate this one
and this one in short order. One somewhat
effective method is to search for non-English pages, which the spider might
not be effective at locating and parsing for automatic inclusion in the
CyberNOT list. You could, for instance, look for a Swedish site and locate
www.smygis.com,
which is not - as this is written - blocked in any way. If you really want porn,
Cyber Patrol might slow you down a little, but it won't cut you off entirely.

Apart from checking for "unauthorized" modifications to cyberp.ini, CP's
"advanced anti-hacker security" consists of a new %windir%\system\system.drv
that checks for the existence of the modules PROGIC, PROGICS and TS. These
are represented by the files IC.EXE, ICFIRE.EXE and TS.DLL, all in
%windir%. The original system.drv is cleverly hidden away as
%windir%\system.386.

The modules are loaded in two ways: first, there is a load entry
in the win.ini file, and second, there's an entry in the registry at
HKCU\Software\Microsoft\Windows\CurrentVersion\Run called "FltProcess",
which loads %windir%\system\msinet.exe, which in turn loads the
Cyber Patrol modules. After replacing system.drv - the CP version will halt
the loading of Windows and ask you to call their support number if it can't
find its modules - you can safely do away with the registry entry,
the load key in win.ini and any of the numerous binaries. Because of the
many files CP installs on your system, we suggest you use the normal
uninstaller instead. Not that it does a very good job of removing its
system files, but there you go.

Optionally, if you come across an installation running unregistered, you
can use the backdoor password omed to uninstall, or simply to gain
administrator access.

We have developed a set of software for getting around Cyber Patrol. People
oppressed by Cyber Patrol will want to take a look at
CPHack, a Win32 binary which will decode
the userlist for you, and also let you browse the different banlists.

Also available is C source for two command-line programs
illustrating the cryptographic attacks on cyber.not
(cndecode.c) and the HQ password hash
(cph1_rev.c). These programs were written under Linux
and are not guaranteed to work anywhere else.

A complete package with this essay, the binaries, and various sources
and related files is available as cp4break.zip
(~360Kb).

This tool is not particularly hard to use, but some comments on its use
may be in order. First of all, the author would like to state that this is
a hack(1), which is reflected both in the state of the source and in the user
interface. The basic functionality is to let you load and browse the
information of a Cyber Patrol .not file and/or the user information contained
in a cyberp.ini file. Simply select which you want to load using the file menu.
Also in the file menu are functions for importing and exporting hosts.
Importing hosts reads a text file containing lines of IPs and
their corresponding hostnames into the treeviews. Export, of course, does
the opposite.

Continuing, we have the function "Export dictionary", which will traverse
the treeviews and write out all words that have been assigned to URL hashes.
"Export unresolved IPs" does just that; it could be used to distribute the
work of doing reverse lookups. The final export function is "Export URL
hashes", which will export every hash that has not been assigned a word -
the logical inverse of the "Export dictionary" function.

Maybe the most useful functions are the last ones: "Generate report", which
will output an HTML document reflecting the data you have loaded - be sure
to check out the "Configuration" tab before doing that, though - and the
somewhat mysterious "Cull dictionary by hash". The latter takes
the main dictionary (as defined in the configuration tab) and creates
a new dictionary containing only the words whose hashes occur in a
.not file you have loaded. A bit of explanation: the author initially thought
a lazy dictionary attack would be enough. This lazy
approach is what you get if you select one of the attacks available by
right-clicking a node. However, it proved quite slow when used with
large dictionaries (15Mb or so), as it only looks at one URL at a time.

The problem here is that CPHack will try - for each node - lots of words
from the dictionary whose hashes don't exist in the database at all.
As a quick hack on the hack, this function was implemented to
take all the hashes in the database and attack them all at once. The
downside is that no references are kept as to which exact nodes the found
hashes belong to, so you only get a new, optimized dictionary to use in the
lazy attack; you don't get an instant update to the treeview. While desirable,
that would take too much time and effort - at this point - to implement
correctly. A good implementation would traverse the nodes you have selected,
creating an ordered list of unique hashes, each with an attached list
of all associated nodes. When the hash of a word is found in this ordered
list of hashes, the correct chain of tree nodes could be quickly traversed
and the nodes updated to reflect the hit. Until this is fixed, you should cull the
dictionary first, then use the output with the lazy attack to "assign"
all words into the database.

The main interface contains the five
sections "Users", "Newsgroups", "URL database", "IP Aliasing" and "Configuration".
A quick rundown follows.

If you load a cyberp.ini, the "Users" tab will display the names and
passwords of the users therein, including the passwords of the built-in
administrator and deputy accounts.

After loading a CyberNOT file, the "Newsgroups" tab will display all filters
defined therein. To the right is a panel of checkboxes which you cannot
operate, but which reflects the masks applied to the newsgroup entry you
select in the listview.

Next we have the "URL database" tab, which contains a treeview where
you can browse the database. It should be noted that the relatively long
loading time of a CyberNOT file is due to the way the treeview works, with
insertion into a branch apparently being O(n) rather than O(1) in the
number of siblings of the new node. Anyway, you can browse the view
in the normal manner. There are three different types of nodes,
the first being what is internally called a "net node". This is simply a root
node containing all entries for IPs of an "A net". Below these are "IP nodes",
which are the IPs banned by the database. Some of these have children
of their own, "URL nodes", which contain the hashes of the specific
paths and resources being banned. You can right-click on any of these
three types of nodes for additional context-sensitive functionality, such
as "Open", "Lookup" and "Dictionary attack". As with the newsgroups
tab, there is a panel of checkboxes which reflects the masking status
of the IP or URL you select. At the bottom is a quick-search bar where
you can do case-sensitive string searches.

There's not much to say about the "IP Aliasing" tab, but here too you can
right-click for additional functionality.

Finally, we have the configuration tab, where you define the different
dictionaries you want to use, plus a number of other things which are
self-explanatory, except maybe for "Lock found URLs". This function,
if enabled, makes sure that once a word has been found to match a hash and
been attached to it in the treeview, it will never be replaced even
if another possible candidate is found.

This program is entirely self-contained. It will not write to the
registry, and it will not create files anywhere but in its own
path, unless you tell it to.

On the good side, we note that Cyber Patrol is - technically - somewhat
better than NetNanny and CyberSitter, the two other censorware packages we have
intimate knowledge of, but there is still far too much 16-bit code for it to be
really stable and earn a good grade.

We see no evidence of a clear political or religious agenda
behind Cyber Patrol, though as citizens of highly secularized countries we
may feel that many of the bans in the "Satanic or Cult" category
are unreasonable. Their criteria document
says "Satanic material is defined as: Pictures or text advocating devil worship,
an affinity for evil, or wickedness." and "A cult is defined as: A
closed society [...] Common elements may include: [...] influences that tend to compromise
the personal exercise of free will and critical thinking." LaVeyan
Satanism - for instance - isn't about any of the things in the full definition,
and atheism certainly isn't, but such sites are included in the CyberNOT list.

The evidence points to the CyberNOT list not being properly updated to remove
old and outdated entries. As many as 50% of the IPs in the list don't even
resolve! When evaluating a product with a ban list, you should not look at
the number of entries, but at the number of current entries. Simply
collecting new entries, and using the ever-growing (but outdated) list
of bans as a sales argument, is much easier than actually
putting in the work to keep the list up to date and accurate.
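A staleness estimate like the one above can be reproduced with a simple reverse-DNS sweep. This is a rough sketch under our own assumptions, not how the figure was originally obtained: a failed reverse lookup does not prove an IP is dead, only that it is a weak sign of list rot, and the function names here are ours.

```python
import socket

def stale_fraction(ips, resolve=None):
    """Return the fraction of IPs that fail to reverse-resolve.

    `resolve` defaults to a reverse-DNS lookup; it is a parameter so
    the logic can be exercised without network access.
    """
    if resolve is None:
        def resolve(ip):
            try:
                socket.gethostbyaddr(ip)
                return True
            except OSError:
                return False
    if not ips:
        return 0.0
    dead = sum(1 for ip in ips if not resolve(ip))
    return dead / len(ips)
```

Running something like this over the decrypted CyberNOT IP entries is all it takes to sanity-check a vendor's "number of blocked sites" claim.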

The classic tactic of entering critics into the ban list continues:
Peacefire is banned in almost every category available. When the producers
knowingly ban a site in clearly wrong categories,
what kind of trust can you put in them and their products? None. We must
continue to reverse-engineer these products so that consumer rights can be protected.
Will we ever find a censorware company that does not lie to us with
these false bans?

The absence of filtering based on content keywords is surprising, but welcome. The
technology does not exist to make content-based filtering really functional.
The problem of recognizing content and making choices based on context is a hard one,
suitable for research by the AI labs. But it is a double-edged sword. The price
of leaving this error-prone functionality out is that it makes Cyber Patrol
less effective at blocking pages not previously processed by The Learning
Company.

After all this, the feeling is that CP is just another censorware package. It tries
hard to come across as effective - the magical technical solution to a
non-technical problem - but when push comes to shove, it yields to the power
of the human mind. If you thought putting this between your children
and the Internet would protect them from "dangerous" ideas, then you'd
better think again.

We would like to thank all the fine men and women working for civil
liberties all over the world.

Matthew would like to thank: the goddess Pele for favours received,
and the Canadian government for supporting my cryptographic interests
in several ways. Greetings to all the people I hang out with in
sci.crypt, alt.kids-talk, talk.bizarre, and the VLUG and Voynich
mailing lists.

Eddy would like to thank: Robert Risberg, Kristoffer Andergrim,
Mattias Aspman, Gunnar Rettne, and all of my friends around the world.
Special regards to all the intelligent, knowledgeable and humorous folks
of R20 of the Fidonet - you know who you are.