Catty v3 - artificial futility
------------------------------
Copyright (C) 2004 by Michal Zalewski
http://lcamtuf.coredump.cx/catty.shtml
1) What is this?
----------------
Catty is a novelty AI bot that does not, in fact, make the slightest effort
to implement or simulate intelligent behavior. Whereas other bots attempt
to analyze language, react to keywords, or do other tricks, Catty does not.
Instead, the bot uses web pages around the world - as indexed by Google -
as a source of various text it replies with. The exact sentence to be used
is found by applying a trivial, language-blind word matching algorithm.
This is rather unheard of in the world of AI, but common in the world of
politics.
As a result, you talk to the Internet. Catty's responses are sometimes
incoherent or offensive - just the way the web is - but most of the time,
you will be surprised, amused, or feel you are having a meaningful
conversation. As a general rule, you can expect Catty to stand out in that
its responses are neither canned, repetitive, nor predictable.
Catty v3 works on Linux and *BSD systems (and should be fairly portable),
and its learning module requires a set of GNU tools (awk, sed, grep, tr,
bash), and a properly compiled Lynx browser.
2) What's new and why?
----------------------
Catty v1 (2001) was a trivial sentence-matching bot, originally written
for IRC and then ported to the WWW. It used fairly mediocre algorithms, and a
poor HTML parsing / text selection algorithm. Its knowledge database was,
due to performance concerns, limited to circa 200 000 sentences.
Catty v2 (2002) used a far more sophisticated set of algorithms, its own
memory allocator and string comparison routines, and a far better text
matcher that favored runs of matching words, thus responding more coherently.
Its database peaked at 2 000 000 sentences. It became a popular destination
on my website, and some folks even used it to write movie scripts ;-)
Catty v2+c (2003) used a slightly improved scoring algorithm, and, most
importantly, did not fully reset sentence scores in each iteration,
implementing something that resembled staying on topic.
Catty v3 (2004) uses a more structured database of sentences (grouping them
by "subjects" and trying to stay on topic). It also employs a much more
selective sentence collector. The algorithms are far from perfect at this
point, but it seems to work even better than v2.
3) How do I build and use it?
-----------------------------
Simply issue 'make'. When the code is compiled, you can execute
./catty3 and talk to the bot, prefixing every line of your input with ':'.
The reason for this requirement is that the bot is primarily intended
to be used on the web; if you include a unique keyword before the ':',
the bot's response will begin with that very keyword. This allows you
to easily feed queries and process responses for a number of concurrent
clients from a CGI script, while a single instance of Catty runs in the
background.
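As a sketch of that tagging scheme (the keyword 'client42' and the reply
text are invented for the example, and the assumption that a single space
follows the echoed keyword is mine; only the ':' convention comes from
Catty itself):

```shell
# Hypothetical illustration of the keyword convention: a CGI wrapper
# could tag each client's query with a unique keyword before the ':',
# then match responses by that same keyword.

KEY="client42"
QUERY="hello there"

# Line that would be written to catty3's stdin:
TAGGED="$KEY:$QUERY"

# A reply would start with the keyword; strip it to recover the text
# (the reply below is fabricated, not real catty3 output):
REPLY="$KEY pigs can most certainly fly"
TEXT="${REPLY#"$KEY" }"

echo "$TAGGED"
echo "$TEXT"
```

One background instance can thus serve many web clients, with each
client's CGI session filtering the shared output stream by its keyword.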
4) How do I teach it?
---------------------
Firstly, you need a fairly recent version of lynx in $PWD or in /usr/bin,
custom-compiled by modifying src/LYGlobalDefs.h so that MAX_COLS is defined
as 30000, rather than the default 999. IT IS STRONGLY RECOMMENDED TO
RECOMPILE LYNX WITH THIS TWEAK.
Then, assuming you have all the aforementioned GNU utils in place, you
should run ./cronman. The script goes through every entry in the cron/*
directory, attempts a Google search on each topic, and adds the results
to the knowledge base.
Correct output from the script should look like this:
[+] Attempting to learn about 'you think would be on': 25 hits.
+ http://www.colinthompson.com/page6.htm: 20
+ http://www.crescatsententia.org/archives/2004_03_22.html: 180
+ Entry count goal achieved, bailing out...
`- Total: 200 unique entries.
If you see errors instead, chances are, you are missing some tools.
Catty v3 adds certain strings to the cron/* directory as it chats with
the user, so it can later broaden its horizons. You can, however, force
the bot to learn about a specific topic by running:
touch cron/text+to+be+looked+up
Note that plus signs must be used in place of spaces. You are also advised
to use all-lowercase, text-only (no punctuation, etc.) keywords. The
number of words used must be less than MAXKEY (6 by default, see config.h).
Use only phrases that are likely to be found by a web search engine, and
ones that bear some relevance to the subject of webpages to be found.
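The naming rules above can be wrapped in a small helper; 'mktopic' is not
part of Catty, just a sketch that encodes this section's rules (lowercase,
punctuation stripped, '+' in place of spaces, fewer than 6 words):

```shell
# Hypothetical helper (not part of Catty) that turns a phrase into a
# cron/* entry name per the rules above: lowercase, punctuation
# stripped, spaces replaced with '+', fewer than MAXKEY (6) words.

mktopic() {
  local phrase words
  phrase=$(printf '%s' "$1" | tr 'A-Z' 'a-z' | tr -cd 'a-z0-9 ' | tr -s ' ')
  words=$(printf '%s\n' "$phrase" | wc -w)
  if [ "$words" -ge 6 ]; then
    printf 'too many words: %s\n' "$phrase" >&2
    return 1
  fi
  printf '%s\n' "$phrase" | tr ' ' '+'
}

mktopic "Text, to be looked UP"     # prints: text+to+be+looked+up
```

A new topic would then be queued with:

touch "cron/$(mktopic 'Text, to be looked UP')"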
WARNING: Catty will issue a single Google lookup for every phrase to be
searched for. Although this is usually not a lot of traffic, it is also
against Google's ToS to run automated lookups (since they lower the
advertisement click-through ratio). We should be using the Google API
instead - but I have yet to find a sane interface that could be used
from a shell script :-(
5) How do I start with a blank database?
----------------------------------------
The default knowledge database used by Catty v3 was built by learning
from Catty v2 transcripts. If you are uncomfortable with the
quality or maturity of Catty's responses, made any changes to the HTML
parsing engine, or just want to build a bot oriented at a specific topic
or language, you should start with a blank database.
To remove all database entries, do the following:
echo -n >data/knowledge >data/learned >data/visited
At this point, Catty v3 will refuse to start. You need to manually add
several 'seeds' for the database, creating cron/* entries manually (as
described in section 4). You should then run ./cronman and let the bot
learn.
The bot needs to index 1000-2000 topics (around 5000 web pages, 100 000
sentences) on average to be eloquent. With only a couple of phrases, it
will remain hopelessly clueless, and will resort to generic excuses most
of the time.
Because entering thousands of topics is rather impractical, one way to grow
the database is to manually feed Catty v3 around 20 topics, then run the
resulting database itself through the bot:
grep '^ ' data/knowledge | awk '{print ": " $0}' | ./catty3 >/dev/null
This command may be repeated a number of times, until the number of
subjects in cron/* reaches around 100. At this point, the entries should
be briefly reviewed if possible, and the next learning cycle (./cronman)
should be started.
When it finishes, the next run of "loopback feeding" should yield several
hundred phrases to search, and another learning cycle can be initiated -
until the desired number of topics is indexed.
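The text-munging half of that loopback command can be seen in isolation on
a toy file (the assumption here, based on the grep pattern above, is that
indexed sentence lines in data/knowledge start with a single space; the
sample lines are invented):

```shell
# Toy demonstration of the loopback pipeline minus ./catty3 itself.
# Per the grep '^ ' above, sentence lines in data/knowledge are assumed
# to start with a single space; other lines are treated as metadata.

cat > knowledge.demo <<'EOF'
 pigs cannot fly
some-metadata-line
 the web never lies
EOF

# Same grep/awk stage as the loopback command; each stored sentence
# becomes a ':'-prefixed input line for the bot:
FEED=$(grep '^ ' knowledge.demo | awk '{print ": " $0}')
echo "$FEED"
rm -f knowledge.demo
```

Piping that output into ./catty3 makes the bot "hear" its own database,
which is what queues fresh cron/* phrases for the next learning cycle.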
6) What can I tweak?
--------------------
There are several parameters you might want to adjust:
* Google URL line in ./cronman script - by adding extra parameters to
the www.google.com/search?q=... invocation (such as specifying a
language, changing the number of returned hits, etc), you can narrow
or widen page selection criteria. This is particularly useful if
you want the bot to speak only a single language.
* AIMFOR variable in ./cronman script - this variable controls the
optimal number of sentences the bot will attempt to collect per topic.
Keeping it low makes the bot more casual in its responses; keeping it
high usually provides it with more in-depth knowledge.
* POPWORDS variable in ./cronman script - this variable must be kept
lower than MAXPOP in config.h. See MAXPOP below.
* IGNORE variable in ./cronman script - this line lists generic
common words that should be ignored when creating a list of most
prominent words for every subject (this list is later used to find out
what topic the user is talking about).
* MAXTOPICS in config.h - the maximum number of topics we plan to have.
If the knowledge database contains more, it will be rejected. The default,
10000, is probably rather hard to exceed, but you might want to lower
it for memory-conservative applications.
* MAXPHRASES in config.h - the maximum number of sentences we allow per
topic. This should be kept above AIMFOR x 2 - if it isn't, the bot
may refuse to run.
* MAXWORDS in config.h - the maximum number of words per sentence we allow.
The default is very generous. If any sentence has more than MAXWORDS
words, it will be silently truncated.
* MAXWLEN in config.h - the maximum length of a single word we are
expecting to see. See MAXWORDS.
* MAXKEY in config.h - the maximum number of keywords allowed per topic.
All entries created in cron/* by Catty v3 will have fewer than MAXKEY
words; all manually created entries must also follow this rule.
* MAXPOP in config.h - the maximum number of popular keywords indexed per
topic. Popular keywords are used to better approximate what subject the
user is talking about. This must be greater than POPWORDS defined in
./cronman!
* LINEBUF in config.h - the maximum acceptable line length for most
operations (including reading sentences, popular words, and user input).
* REPEAT_LOG in config.h - the number of other sentences Catty v3 must
say before an already-used sentence can be repeated.
* KEEP_CTX in config.h - the number of sentences we want to keep to
maintain context of current conversation; used only to score individual
phrases (topics are scored in a different way, with a residual score
fall-through).
* KBASE, EXCUSES in config.h - pathnames to data files, obviously.
* MAXEXC in config.h - the maximum number of excuses allowed.
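The two cross-file constraints above (MAXPHRASES above AIMFOR x 2, MAXPOP
above POPWORDS) are easy to get wrong when tweaking, so a sanity check can
help. This is a hypothetical sketch, not part of Catty; the grep patterns
assume plain '#define NAME value' lines in config.h and 'NAME=value' lines
in ./cronman:

```shell
# Hypothetical sanity check (not part of Catty) for the cross-file
# constraints above.  Assumes '#define NAME value' in config.h and
# 'NAME=value' in cronman.

cfg() { grep "^#define $1 " config.h | awk '{print $3}'; }
var() { grep "^$1=" cronman | cut -d= -f2; }

check_limits() {
  local aimfor popwords maxphrases maxpop
  aimfor=$(var AIMFOR); popwords=$(var POPWORDS)
  maxphrases=$(cfg MAXPHRASES); maxpop=$(cfg MAXPOP)
  [ "$maxphrases" -gt $((aimfor * 2)) ] || echo "MAXPHRASES too low"
  [ "$maxpop" -gt "$popwords" ] || echo "MAXPOP too low"
}
```

Running check_limits after editing either file prints a warning for each
violated constraint, and prints nothing when both hold.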