(¯`·.¸ Adding engines to WebFerret ¸.·´¯)
The guts of a search engine's parser
by Laurent
(very slightly edited by fravia+)
published at searchlores in February 2001

An incredible deed. Webferret's own updating protocols reversed. I have the pleasure of having
met Laurent -a mighty PHP wizard- InRealLife, and indeed he 'has the reversing force',
that's for sure. Read this text, which is but
a part of an even more important project that
he is developing on his own with remarkable speed and competence: an automated bot
for searching (à-la-inference) various homepage providers. I note incidentally that
people at Webferret should thank Laurent for the ready-made google script :-)
And when you have finished reading this essay, re-read it. And then
work on your own: you'll love the possibilities that this approach will open to you, and to
your own searches...

Your comments and suggestions (and further reversing and disassembling) would be welcome.

(¯`·.¸ Adding engines to WebFerret ¸.·´¯)
by Laurent

Introduction

While gathering information related to an ongoing project, I started to study and slightly reverse WebFerret, which
seemed an interesting source of information and ideas for the above-mentioned project.
Although that project wasn't
aimed at improving WebFerret, I thought that the discoveries I made could be worth
an essay on their own.
The point that actually interests me is figuring out how WebFerret manages the query
building and the results parsing
for the different search engines it supports. Indeed, given that WebFerret is a piece of
software that runs locally on your
machine, and given the fact that search engines often do (or at least may) change their page
layout, there must be
a way for WebFerret to keep up with the latest specifications.
I can't imagine that I would have to download a new
version each time a slight change affected just one single search engine.
This made me think that the results-page parsing
algorithm cannot just be 'hardcoded' in WebFerret.

Investigations

A quick check of WebFerret's options shows that it has built-in support
for proxies. That's a very interesting idea.
Let's launch our favorite local proxy software (I had proxyplus
at hand), tell WebFerret to connect via localhost:4480,
run a simple WebFerret search session and ... oh oh, what's that? Here
is the proxyplus log file:

See those lines highlighted in red? An HTTP POST request to http://vorlon.ferretsoft.com/update. Could it be so easy?
Let's point our
browser to that page. Ahi! a "404 Not found" :-( It would have been too nice. Anyway,
the proxyplus log tells us that the
server's reply was a 200 OK, so there must be something there. The trick is that the '/update'
script will return a 404 to try
to hide itself when it doesn't receive a valid request (which is btw a good 'protection' idea, imho).
So what? Give up? Certainly not! We won't be stopped by that, will we?
I have somewhere in my little tools box
some HTTP client/server code that should help me. Ok, let's shape that server
code to today's purpose. Don't forget to
map vorlon.ferretsoft.com to 127.0.0.1 through our beloved HOSTS file and run WebFerret
again. BINGO!! here is the
actual POST request sent by WebFerret:
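For the curious, the fake-server trick can be sketched in a few lines of modern Python (a sketch under assumptions: this replaces my old client/server code, and the hexdump layout is just one convenient way to eyeball the binary payload; remember that the HOSTS entry must already point vorlon.ferretsoft.com at 127.0.0.1):

```python
# Minimal capture server: pretend to be vorlon.ferretsoft.com and dump
# whatever WebFerret POSTs to /update.
from http.server import BaseHTTPRequestHandler, HTTPServer

def hexdump(data, width=16):
    """Render raw bytes as a classic hex/ASCII dump, one row per line."""
    lines = []
    for i in range(0, len(data), width):
        chunk = data[i:i + width]
        hexpart = ' '.join(f'{b:02X}' for b in chunk)
        asciipart = ''.join(chr(b) if 32 <= b < 127 else '.' for b in chunk)
        lines.append(f'{i:08X}  {hexpart:<{width * 3}} {asciipart}')
    return '\n'.join(lines)

class CaptureHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        body = self.rfile.read(length)
        print(self.requestline)
        print(self.headers)
        print(hexdump(body))      # the interesting binary payload
        self.send_response(404)   # mimic the real server's 404 cover story
        self.end_headers()

def serve(port=80):
    # Port 80 needs privileges; pick a free port and adjust the proxy
    # setting accordingly if you can't bind it.
    HTTPServer(('127.0.0.1', port), CaptureHandler).serve_forever()
```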

Here comes the first discovery: WebFerret implements a
malicious 'phone home' feature (cfr the "malwares" lab). It sends back home your
name, country and company. I say malicious because this isn't needed at all !!
Ok, you have been warned. But the interesting things are elsewhere.
Between the 'SASF' and 'FerretSoft', some binary
data is also being sent.
Well, let's remember that and keep it for later. The complete request sent is available here.
Let's now shape our little client source code so it will send the exact same
request to the actual vorlon
server. Let's grab its answer and ... BINGO!! look at this vorlon reply:

I didn't paste the whole answer because it would make this essay
unreadable. For those interested (and you'd better be if you're gonna build your bots
on this :-)
the whole reply is available here. You'd better download that file and view it with a
good editor because your browser probably won't render it correctly.
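The replay step itself is trivial once you have the captured bytes. A minimal sketch, assuming a plain-socket connection; the exact header set is my guess (only the /update path, the Content-Length and the body really matter here):

```python
# Replay the captured POST byte-for-byte against the real update server.
# build_update_request() assembles the raw request; replay() ships it over
# a plain socket and collects whatever comes back.
import socket

def build_update_request(body, host='vorlon.ferretsoft.com'):
    """Assemble a raw HTTP/1.0 POST carrying the captured binary body."""
    headers = ('POST /update HTTP/1.0\r\n'
               f'Host: {host}\r\n'
               f'Content-Length: {len(body)}\r\n'
               'Content-Type: application/octet-stream\r\n'
               '\r\n')
    return headers.encode('ascii') + body

def replay(body, host='vorlon.ferretsoft.com', port=80):
    """Send the request and read the server's reply until it closes."""
    with socket.create_connection((host, port), timeout=10) as s:
        s.sendall(build_update_request(body, host))
        reply = b''
        while chunk := s.recv(4096):
            reply += chunk
        return reply
```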

Ok, you certainly guessed it by now: the whole bazaar is stored in the Windows
registry. A quick search for 'Excitegrammar'
in the registry confirms it.

So, what's left? Well, I spoke above about some binary data being sent
along with your private details to the vorlon server.
It quickly becomes apparent (especially when you compare that POST
request with one sent by an old version -3.0200- of WebFerret) that the version number,
revision and patch level are included,
respectively at offsets FE, FF/100 and 109 in those files. This allows the
/update script to send back only the
necessary updates to your current version of WebFerret. And this, as opposed to your Name,
Company and Country, isn't
malicious at all, quite the contrary.
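If you want to fish those fields out of a captured request programmatically, here is a minimal sketch (the offsets are the ones observed above by comparing two captured requests; the endianness of the two-byte revision field is my guess):

```python
# Pull the version markers out of the binary blob at the observed offsets:
# 0xFE (version), 0xFF-0x100 (revision, two bytes) and 0x109 (patch level).
def version_fields(blob):
    version = blob[0xFE]
    # Byte order of the two-byte revision field is assumed little-endian.
    revision = int.from_bytes(blob[0xFF:0x101], 'little')
    patch = blob[0x109]
    return version, revision, patch
```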

Reversing

Well, this is exactly what I was looking for. In the registry I can find all the information WebFerret uses to
build a query URL and to parse the results for each search engine it supports.
At first sight, it seems they use a mix of regular expressions with embedded scripts.
For example, take this : <a href=*.('http://[eh; tb; >url|*.]')*.\"> . It seems clear that what this does is
match the result page against <a href=*.('http://[*.]')*."> and then assign the content of [ ] to a url
variable (>url), after applying some unknown 'eh; tb;'

I'll skip my experiments (they were quite boring, much more than what
you are actually reading, which is already
passably boring) and deliver you my findings on a silver plate:

eh : a function that strips any HTML tags that may be included in the
matched string.

tb : a function that trims the matched string (removes leading & trailing spaces).
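To make the grammar rule concrete, here is how the whole thing could be approximated in Python (a sketch of the idea, not WebFerret's actual engine; I'm reading '*.' as a lazy wildcard and the [...] part as a capture group):

```python
# Approximate the grammar rule  <a href=*.('http://[eh; tb; >url|*.]')*.">
# in Python: '*.' becomes a lazy wildcard, the [...] part becomes a capture
# group, and eh/tb are applied to each captured url afterwards.
import re

URL_RULE = re.compile(r"""<a\s+href=.*?(['"])(http://.*?)\1.*?>""",
                      re.IGNORECASE)

def eh(s):
    """Strip any HTML tags embedded in the matched string."""
    return re.sub(r'<[^>]*>', '', s)

def tb(s):
    """Trim the matched string (remove leading & trailing spaces)."""
    return s.strip()

def extract_urls(page):
    """Run the rule over a results page and return the cleaned urls."""
    return [tb(eh(m.group(2))) for m in URL_RULE.finditer(page)]
```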

Besides, the scripts themselves use a sort of 'vsl' (very simple language) syntax.
$ represents the working value, + means "append", - means "prepend", < means
to get the value of a variable, and > means
to set the value of a variable. So, the line "$search=; <+urlquerytext;
$+&c=web&start=0&showSummary=true&perPage=50; >urlquery" is actually
the following script (with the corresponding
explanation under each statement):

$search=;
    value becomes 'search='

<+urlquerytext;
    append the value of the variable urlquerytext to the working value;
    if urlquerytext is 'fravia', value will be 'search=fravia'

$+&c=web&start=0&showSummary=true&perPage=50;
    append the given string to the working value;
    value will now be 'search=fravia&c=web&start=0&showSummary=true&perPage=50'

>urlquery
    assign the current value to a variable named 'urlquery'
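The whole 'vsl' line above can be emulated with a tiny interpreter. This is a sketch built only on the four operators just described, with the semicolon as statement separator; WebFerret itself surely knows more operators than these:

```python
# Tiny interpreter for 'vsl' lines. $ sets the working value ($+ appends a
# literal, $- is my guess at the prepend form), < reads a variable (<+
# appends it) and > stores the working value into a variable.
def run_vsl(script, variables):
    value = ''
    for stmt in (s.strip() for s in script.split(';') if s.strip()):
        if stmt.startswith('$+'):
            value += stmt[2:]
        elif stmt.startswith('$-'):
            value = stmt[2:] + value
        elif stmt.startswith('$'):
            value = stmt[1:]
        elif stmt.startswith('<+'):
            value += variables.get(stmt[2:], '')
        elif stmt.startswith('<'):
            value = variables.get(stmt[1:], '')
        elif stmt.startswith('>'):
            variables[stmt[1:]] = value
    return variables
```

Feeding it the registry line reproduces the walkthrough above step by step.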

Although I figured out the meaning of most of the functions/syntax, I'm convinced there
are many more juicy things to learn
inside WebFerret itself (like functions that are implemented but not yet
used by any search engine). Alas! My
reversing capabilities don't go that far and I'm lost in the
disassembly (especially when it comes to something
written in C++ with classes and so on, which is the case for WebFerret). So, if
any of you has already done that work or
is going to investigate this further, I would love to hear about it,
as this is actually what interests me the most (I
suppose you have already guessed what I'm trying to do :-)

Practical application

Ok, this is the second discovery and probably what some of you were looking for:
how to add more engines to
WebFerret. Well, it should be quite easy if you have followed me up to now: just
write a little registry patch file.
As an example, we'll add Google to the list of engines supported by WebFerret.
Here is what should be added to the registry:

First, we have to add our new engine to the list of installed ones
(drawback: see below). Next we define some new Google-
specific entries: its Name, URL, Home URL, Query type, request method, Query command,
Query operands and finally the
parsing grammar. I won't go into details; most of those values are
self-explanatory. However, some still have unknown
meanings to me. The QueryType, for example, can take values like lip, lpp, sa, sap...
But I have no clue what these
mean, so some experiments on this would be welcome.
The Method indicates
whether WebFerret must use a POST (00000001) or GET
(00000000) method.
The problem here is that we can't merge that directly into the registry.
Some types, like the strings or
numbers, first need
to be converted. Either you do this by hand or you write a quick script
to handle this task for
you. Anyway, once the conversion is done, you should end up with something like:

Save this registry patch under whatever name you fancy and it's ready to be
merged into the registry.
For your convenience, this
file is available here.
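The by-hand conversion mentioned above is easy to script. Here is a sketch emitting REGEDIT4-style hex: and dword: values; that the strings are stored as NUL-terminated ANSI bytes is my assumption, so check the output against values already present in your registry:

```python
# Turn values into the forms a REGEDIT4 .reg file expects: strings become
# comma-separated 'hex:' byte lists (assumed NUL-terminated ANSI here),
# numbers become 'dword:' entries, e.g. Method dword:00000001 for POST
# and dword:00000000 for GET.
def to_reg_binary(s):
    data = s.encode('latin-1') + b'\x00'
    return 'hex:' + ','.join(f'{b:02x}' for b in data)

def to_reg_dword(n):
    return f'dword:{n:08x}'
```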
Now the drawback. The problem is that the list of installed engines is put
into a single key value. That means that
whenever a new update is retrieved from WebFerret's home server, our modified
list will be overwritten and thus all
our new engines will be lost. One solution is to simply prevent
WebFerret from retrieving any update information by
adding the update server to your HOSTS file. This, however, will bring some
trouble whenever an engine requires an
updated
grammar or anything else. I'll leave this problem to you. There are certainly different possible
solutions. You could, for example, write
your own little proggie that checks from time to time whether any new update is available,
or you could re-apply your
patched new search engines whenever you notice that an automated vorlon
update has occurred. As my primary goal wasn't to use WebFerret as an
actual tool but rather as a source of inspiration, I didn't go any
further in this direction.
Note also that if you examine the registry you may find some
other things
that can help you fine-tune WebFerret to
your requirements :-)

Conclusion

First let me be clear: I'm not stating that you should use
WebFerret, nor that adding a new engine to
WebFerret is something really worth doing per se.
I personally never used WebFerret before, nor will I probably ever use it in the future.
The
purpose of this
essay was simply to show you, first, that
even without much software reversing knowledge you can tweak software to do
what you want it to do. Second, I
tried to show you that there is a lot to learn
by studying some interesting
targets. If I hadn't studied WebFerret I would probably still be trying to
figure out how to write a universal parsing
script. WebFerret gave me much inspiration on this topic.
I can now apply what I have learned in this context
to what was my original primary target: writing a sort of universal parser.
I now know
that some regular expressions + some 'very simple language' scripts could be very helpful.
If everything goes fine, I
could end up with something worth publishing again very soon. So stay tuned :-)
As always, but here more than ever, feedback, critiques and suggestions on this topic are really welcome. You can reach me
at phplab@2113.ch.
Thank you for reading this essay; I hope it was worth it.