Software Design Using C++

Web Search I (Simple Web Search in Linux)

Introduction

This is the first of a two-part case study on web search.
It is specific to Linux, and assumes that you have an account on a Linux web server,
or that you at least have an account on a Linux computer where you can set up a bunch of
files that imitate what might be present on a web server. You also need to have a
C++ compiler on your Linux system. Our case study assumes that the compiler is g++, though
others are likely to work as well.

Have you every wondered how the well-known Internet search engines work?
They allow you to find documents scattered across the Internet that match your search conditions, and they
generally do so very quickly. Creating a powerful search engine like this is beyond what we can
do here, but as a start we can do some simple searching of documents on one web server.
Perhaps you will be motivated to look further into search engines and someday work on the next
generation of these!

A Simple Search

Let's begin with a pretty simple web search. It will only handle HTML files on your own web server.
We will assume that we have (or can create) a list of the HTML files on this web server (or perhaps
just those under some particular directory within the webroot). This list of HTML files is in
a text file that has the complete pathname for one of these HTML files on each line. A really
short example of such a file, which we will name weblist, is shown here:

You can use this sample weblist file if you wish, though it would be more interesting to
use a weblist file containing real data for HTML files on your Linux web server.
One way to do this is to manually edit your weblist file, typing in line by line the pathnames
of various HTML files on your server. This would be too tedious for handling a lot of data, but
would be reasonable for a sample weblist file that is only a few dozen lines long. If you want to
process a fair amount of data, you might be able to use the following bash script:

#! /bin/sh
#
# Creates weblist, a text file containing the list of html files
# found anywhere in the directory tree rooted at the directory
# given as the first (and only) parameter to this script.
#
# Warning: only a few error checks are made. Other things can still go wrong.
if [ $# != 1 ]
then
echo "ERROR: One parameter needed, the name of the web folder to start at in finding html files"
exit 1
elif [ ! -d "$1" ]
then
echo "ERROR: $1 is not a directory"
exit 2
fi
find "$1" -name '*.html' > weblist
exit 0

To use this script, copy it into a new file on your Linux system. Name the file makelist and
set its permissions to 700, which you can do with the command:

chmod 700 makelist

To run makelist, cd to the directory containing the makelist script, then run it as in this
example:

./makelist /www/carlson/cs125

The command-line parameter can be anything reasonable on your system. The /www/carlson/cs125
is probably useless on your server. The idea is that makelist produces the list of complete
pathnames for all HTML files located anywhere under the folder given as the one parameter.

OK. We have our weblist file, as above. The actual search program will be a simple command-line
program that prompts the user for the target string for which to look. The program then should do a
sequential search through the weblist file and print all the lines where the target string was found
within the line. It would be nicer to get a URL instead of a pathname, but we will worry about
this in a later example.

Note that this will be a rather primitive search program in various ways. The user interface is
clumsy. The search simply looks for a certain substring in the pathnames for the HTML files, which
may or may not produce useful results. For example, in the sample weblist above, searching for cs125
would give a match on every line. However, searching for homework or for math would give fewer lines,
and a search for syll would give just one line. Searching for view would yield just a few lines,
and the results might not at all be what you want, since the filenames use the word review, which has
a rather different meaning than the substring view. Note well that our search program does not look at
the contents of the HTML files, though that could give a more helpful search program.
The output to our simple search is also limited in that all we get is a list of filenames.
There is no description of any of these files.

In spite of these shortcomings, let's continue. These limitations are deliberate as they
allow us to create our first search program pretty quickly.

Before doing any actual coding, let's be sure that we have the program's specifications and design pretty well
set. This is to be a command-line program (in Linux) that prompts the user for a target string. The program
then does a sequential search of the text file weblist (with the format shown above) for all lines that contain
the target string as a substring. All matching lines are printed as the output of the program. In addition,
the weblist file is assumed to be in the same directory as the search program, so that it is easily available.
Note that our program does just one search, though we could easily modify it to do repeated searches.
The rough algorithm to do the sequential search through the weblist file is shown in the following pseudocode:

open weblist file for reading;
get the target string for which to search;
read a line of the file;
while we read more than the empty string
{
if target string is a substring of this line
output this line;
read the next line of the file;
}
close weblist file;

Since we will need to do a number of string operations, both in this and in later search programs,
a collection of string functions is supplied for you in the following two files.
This collection of functions works with C-style strings of type StringType. Such a string is just an array
of StrMax characters. StrMax must be chosen to be large enough to handle the types of strings to be used
in our search program. (We could have chosen to use type string,
but for the current program let's use StringType.) We will not explain these
functions in detail, but do read through the comments on each function so that you know what it does and
how to use it. If you are interested in how these functions are implemented, read through the code inside
of the functions carefully.

The functions in this collection that appear to be of use for our current project are MyGetLine (which we
can use to read the user's target string from cin and to read from the text file stream a line of the weblist
file) and SubstringPresent (which we can use to see if the target string is a substring of a line from the
weblist). We could use the usual C++ getline function, but let's use the MyGetLine function given here, partly
because we can read its comments and code, and hence know exactly what it does (and does not) do. Note that
stringhelp.h uses namespace std and includes header files that we are likely to need (see below).
Thus we will not need to put these into the cpp file for our project; we will only need to do an
include of stringhelp.h.

The search function pretty much follows the pseudocode given above, except that the opening and closing of
the weblist file are handled in the main function. The code for the search function is shown below. Notice
how MyGetLine returns the number of characters it read. This can nicely be used to control the loop for the
sequential search. Similarly, MyGetLine's return value gives you the length of the target string,
so that there is no need to use the usual strlen function. If this length is not positive, we print
an error message and then wish to stop the program. Since we want to be sure to close the file before
ending the program, we do a return (not an exit) to get out of the search function. The main function then closes the
file and ends normally.

The suggested Linux command for compiling the program,
g++ simplesearch.cpp stringhelp.cpp -o simplesearch -s,
uses the g++ compiler to compile simplesearch.cpp and stringhelp.cpp into an executable ("output" file) named
simplesearch. The -s option strips off the symbol table, which is not needed unless you want to run a
debugger. Stripping the symbol table results in a smaller executable.

Before attempting to run the simplesearch program, be sure that the executable permission is turned on.
You can manually turn it on as follows:

chmod 700 simplesearch

Next, be sure that the weblist file is in the current directory, along with the simplesearch executable.
Then use ./simplesearch at the command line to run the program. Hopefully your program will prompt
you for a target string and then print the matching lines, as expected.

A Better Search

Let's modify our search program so that it is a little more useful.
In particular, let's have it handle repeated searches instead of just one. We will have the user
enter the empty string as the target when the user wishes to stop the program.
Let's make the output more helpful, too, by giving the URL corresponding to each matching pathname
instead of the pathname itself. For example, if your weblist file contains these 2 lines:

As you can see, the main function uses a while loop to allow the user to do repeated searches. The loop
is controlled by the length of the target string entered by the user. (MyGetLine nicely returns this length,
as we saw earlier.) The loop ends when the user's most recently entered target string has length zero.

There is a new function, PrintURL, which converts the pathname for a matching HTML file into a proper URL.
It does this by printing the standard start for every URL on your web server and then skipping over the /www
at the start of the pathname. Note that we use a FOR loop to print the pathname (without the initial /www) and
that we print the characters one at a time by using the put function. Other ways of implementing the same
functionality are, of course, possible. Read through the code to see the details.

Homework

How about a related project for you to try? Here is your challenge:

Change search.cpp so that it uses the keywordfile as on our Linux system.
(Saint Vincent students should contact Br. David for the location and other
information on this file. Other readers might manually construct a keywordfile that fits the format
shown below. (It would be best to have one that matches some of the actual data on your Linux web
server, but even fake data could be used). You might also be able to get your Linux system administrator
to adapt and use the getmeta script to create a keywordfile of the proper type.
This script requires a file called getmeta.subdirs whose contents must
be adapted to indicate the subdirectories of webroot that you want to index on your web server.
It also requires a compiled C utility. The source code is removespace.c
and the directions for compiling it are contained within the source code file.
Note that the getmeta script must be adjusted so that key variables are initialized to values
that fit your web server situation. Even the scripting itself might need to be adjusted
depending on the details of your version of Linux and other system-specific information.
The compiled removespace program and the getmeta.subdirs file should be located in the
same directory as the getmeta script.

In any case, this keywordfile contains on each line information about one html file.
The format of each line is as in this example:

A starting /www is assumed. Thus the pathname for this html file is /www/html/ed.html
The section between the | symbol and the first # is the description of this file.
In our example, the description is: K-12 Education Links
The strings between # symbols are the keywords (keyphrases) that can be used to look
up this html file.

Look for the target string in the keywords section only.
It is difficult to decide whether or not to allow a partial match. For example, in the
above, should a target string of "Web" result in the display of this web page or should
the complete keyphrase "Education Web Sites" be needed (or "Education Links", "K-12", or
"Education"? If you do not allow partial matches, then a target of "Education Web Site"
will not find this html page, as the final 's' is missing in the target. On the other
hand allowing partial matches would find this page with a target of "ink".

So, let's decide this question for you: do NOT allow a partial keyword (or keyphrase)
match. Insist on an exact match between the target string and a string between two # symbols. To do this,
create a new target string that consists of a # symbol,
the old target, and another #. Then use SubstringPresent to look for this new target.
Use GetSubstring to extract the pieces of a matching line of the
keywordfile. Thus, the path should be extracted (and converted to a URL), and
the description should be extracted so that it can be printed separately.

For example, if you search for "Education Links" the above html file is a match. You should
thus print something like the following. Notice that the description is printed first and then the URL.

K-12 Education Links http://cis.stvincent.edu/html/ed.html

After completing this project, if you want another challenge, try rewriting the same project using the
usual string type instead of the StringType that I created.