1. Introduction

This report describes the results of a short project carried out in the autumn
of 1996 at the bioinformatics group, Dept. of Informatics, University of Bergen,
Norway. The project was funded by a grant from the Norwegian Research Council
(grant 111032/410 Bioinformatics).

Pratt (Jonassen et al., 1995; Jonassen, 1996)
is a tool for discovering patterns in sequences.
The program takes as input a set of $n$ unaligned sequences and a set of
parameters (defining constraints on the class of patterns that can
be discovered), and outputs patterns matching at least some minimum number $k$
of the $n$ input sequences, where $k$ is chosen by the user. The patterns
are ranked according to their information content (a measure of the
strength of the patterns). Pratt is able to discover patterns of
the type used in the PROSITE database.
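
To make the notion of a pattern matching at least $k$ of the $n$ sequences
concrete, the following minimal sketch translates a PROSITE-style pattern into
a regular expression and counts the sequences it matches. It is written in
Python for illustration only and is not part of Pratt; the function names, and
the restriction to single residues, x wildcards, [..] groups and (n) or (n,m)
repeats, are assumptions of the sketch.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern such as C-x(2,4)-[DE] into a regex."""
    parts = []
    for element in pattern.split("-"):
        # Separate the symbol from an optional repeat suffix: (3) or (2,4).
        m = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError("unsupported pattern element: " + element)
        symbol, low, high = m.groups()
        if symbol == "x":
            core = "."                 # wildcard: any residue
        elif symbol.startswith("["):
            core = symbol              # ambiguous position, e.g. [DE]
        else:
            core = re.escape(symbol)   # a single residue
        if low is not None:
            core += "{%s}" % (low if high is None else low + "," + high)
        parts.append(core)
    return "".join(parts)


def support(pattern, sequences):
    """Number of sequences that contain at least one match of the pattern."""
    regex = re.compile(prosite_to_regex(pattern))
    return sum(1 for seq in sequences if regex.search(seq))
```

With such a helper, a pattern would qualify for output when
`support(pattern, sequences) >= k`.
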
Pratt was previously made available to the research community
as ANSI C source code via anonymous ftp.
The motivation for this project was twofold:

To make it possible for potential users to try out Pratt over the WWW,
and in this way lower the threshold for new users of the tool.

To develop a more user-friendly interface to the program.

We think that the project has been successful in that PrattWWW
(the WWW-based interface to the Pratt program developed here) seems to have
the required functionality.
PrattWWW lets the user supply sequences and parameters to Pratt through a
form-based WWW page. PrattWWW uses the parameters to start the Pratt program, and
when Pratt finishes it presents a new WWW page with the results. The presentation
of the results includes the plain-text output as provided by Pratt, and a graphical
presentation given by the Java applet PatSeq.
PatSeq is an interactive graphical tool for visualisation
of patterns in a set of sequences. It shows the patterns and the location of
each pattern in each of the sequences. One difficulty
is that different patterns may have overlapping matches in a sequence.
This has been handled by drawing overlapping patterns vertically offset from each other.
We believe that PatSeq makes it much easier to interpret the output from the Pratt program.
We plan to develop PatSeq further to make it a powerful and general
tool for visualisation of patterns in sequences. One current limitation of PatSeq
is that it cannot handle more than 20 patterns. We therefore
limit the number of patterns that can be handled by PrattWWW to 20.

The structure of this report is as follows. In Section 2
we describe the
logical structure of PrattWWW, and in Section 3
we describe the implementation.
The Appendix gives more low-level technical information about the actual programs.

2. Logical structure

The figure below shows the main components of the PrattWWW system.
The WWW pages are shown on the top, and the programs (Pratt, the CGI
script, and PatSeq) below, and the information flow is shown with
arrows.

2.1. Input (HTML page)

The user sets the attributes and parameters with a form. He/she should know
a bit about the biological background (Sequences, Patterns), and can choose
values for the following parameters:

In the extended version, the user can set all Pratt parameters
(Jonassen et al., 1995; Jonassen, 1996).
The form offers this possibility, but the processing part does not support
it yet. Some parameters require more than one value to be handed over; this
remains to be programmed.

The user also needs to say which source Pratt should use to get the
sequence data. There are 3 possibilities:

Pratt takes sequences from a file, stored at the server (this option is intended
for local users).

The user pastes or types sequences into a textarea field, and
Pratt uses this data as its sequence input file.

The user inputs (using method 1 or 2) a Pratt output file that he/she
has obtained earlier, and PrattWWW will visualise the results in this file
using the PatSeq applet.

We also intend to let the user use a file on his/her own computer directly
(by means of file uploading), but this has not been implemented yet.
File uploading is a new feature in HTML3; when it is well documented and
working, it will be added to PrattWWW.
When the user has chosen all the parameters, he/she can
press the "go" button, and the CGI script takes over control.

2.2. Processing

The script takes all parameter information from the Web page. If necessary,
it stores the pasted sequences in a file. If it is supposed to run Pratt,
it constructs a Pratt command line and saves it in a pratt_command file,
which it then executes. The next step, common to all three input alternatives,
is to extract the data from the Pratt output file, interpret it, and
write all information needed for the presentation into arrays.

2.3. Output

The output is split into two parts:
the textual display (generated by the CGI script programmed in Perl),
and the graphical presentation (generated by the Java applet PatSeq).

2.3.1. Textual Presentation

The first step of the output part is to create and print the head
and the title of the output WWW page. The script then prepares the structure
of the result page, which consists of

some general information and the graphical presentation of the result
(main output),

the textual presentation of the result (original Pratt output),

a query form (the possibility to send a new query),

and prints the result of the interpretation to the web browser. The general
information consists of a few sentences about the query. The original
Pratt output is stored in a file on the server; it is printed so that the
results can be checked. With the query form the user can modify his/her query,
or create a new query, and start Pratt again.

2.3.2. Graphical Presentation - PatSeq

The CGI script hands over all relevant data about the result to the
Java applet and calls it. Using this information, the applet creates an
image with three parts.

The Patterns List - it lists the patterns discovered by Pratt;
each pattern is printed using PROSITE pattern notation, and the
status, length, shape and colour of the graphical representation of the pattern
are given. The status of a pattern is either ON or OFF,
and a pattern is drawn in the sequence list only if it has status ON.

The Ruler - this helps to identify the sequence position of each pattern in the
sequences. A ruler is given above and below the list of sequences.

The Sequences - it prints the name and the length of all sequences.
For each sequence it draws a line proportional in length to the length
of the sequence, and for each pattern occurring in the sequence, it draws
a box or an ellipse on top of the line in the appropriate position. The colour
and shape of the pattern representation uniquely identify the pattern
in the patterns list.

There are some possibilities for interaction between the user and the
applet.

When the user

clicks on an icon in the patterns list, the status of this pattern
is toggled. If it was previously ON, the pattern status
becomes OFF and the pattern is no longer shown in the sequence
list, and vice versa. In this way the user can choose which patterns he/she
would like to see in the sequence list.

moves the mouse over a pattern in a sequence, he/she will see the pattern
identification number and the name of the sequence in the status line.

clicks on a pattern in a sequence, he/she will see the name of the sequence
and the PROSITE format of the pattern.
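
Both the mouse-over and the click interaction require finding the drawn
pattern under the mouse pointer. A minimal sketch of such a hit test follows.
It is written in Python for illustration (the applet itself is written in
Java), and the box tuple layout and the function name are assumptions, not
taken from the PatSeq source.

```python
def pattern_at(x, y, boxes):
    """Return (pattern_id, sequence_name) for the drawn box under (x, y).

    boxes is a list of (left, top, width, height, pattern_id, sequence_name)
    tuples in drawing order; boxes drawn later win if several contain the
    point. Returns None when the point hits no box.
    """
    for left, top, width, height, pattern_id, seq_name in reversed(boxes):
        if left <= x < left + width and top <= y < top + height:
            return pattern_id, seq_name
    return None
```

The returned pair is what would be written to the status line on a mouse-over.
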

3. Implementation

The tool uses three different languages: HTML, Perl and Java. These are
connected: Perl is used to generate the HTML code for the output page, and
Perl controls the Java applet.

3.1. HTML (input and output page)

The first type is an HTML page which is
stored on the server. This is the Input page (see Section 2.1).
It contains a form. With this form, the user builds the query for Pratt
as he/she wants by setting the parameters for the query. The form refers
to the CGI script.

The second type is more complex. After the
processing, the browser shows the result on an HTML page. This page
is created during the processing.
The basic structure of the result page is the same for all queries
(see Section 2.3):

General information.

Graphical representation - PatSeq.

Original result.

Form for a new query.

The content of the individual parts depends on the query.
The general information contains the running time of Pratt, the number of matched
sequences and the file where the sequences are stored.
The image with the result is presented by the PatSeq Java applet.
The original result is preformatted plain ASCII text as generated by Pratt.
The form is a copy of the Input page.
The user can also reach the individual parts via local links.

3.2. Perl

The Perl script has a modular structure (see
Appendix C1). The
Main module contains the variable declarations. The script reads the parameter
values from the Input page (see Section 2.1); the options are stored
in an array and the other parameters in individual variables.
This is done by the module 'ReadParse' (created by Steve E. Brenner).
The next steps are executed only if necessary. Depending on the
input information, that is, on where the sequences come from, the script creates
the sequence file and the pratt_command file. They contain the sequence
information and the Pratt command line, respectively.
Then the script executes the pratt_command file.

Afterwards it creates the structure of the output
HTML page (see Section 2.3).
Two of the individual parts (detailed result information, new form) are
produced by calls to submodules. To present the original result, it prints the
Pratt result file (output.dat) as preformatted text.

The first step of the processing part follows.
The script loads the Pratt output file into a variable.
The format of the data file is ASCII. The script splits the data into lines
and stores them in an array. Then it runs a 'while' loop over the
lines. It splits each line into fields, delimited by space characters,
and looks for keywords. Each keyword marks a place where the script finds
information. The script stores this information in a number of variables and
arrays (see Appendix C2).
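
The keyword-driven scan described above can be sketched as follows. The sketch
is in Python rather than Perl, it assumes that a keyword is the first field of
its line, and the keywords used in the example are invented for illustration;
they do not reproduce Pratt's real output format.

```python
def parse_output(text, keywords):
    """Collect, for each keyword, the fields that follow it on its line."""
    found = {key: [] for key in keywords}
    lines = text.split("\n")
    i = 0
    while i < len(lines):
        fields = lines[i].split()   # split on runs of whitespace
        if fields and fields[0] in found:
            found[fields[0]].append(fields[1:])
        i += 1
    return found
```

For example, `parse_output(text, ["Pattern:", "Sequence:"])` would return one
list of field lists per (hypothetical) keyword, ignoring all other lines.
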

When this part is finished, the individual output
of the query is created (see Section 2.3).
First the basic information about the
query is printed. Then the script takes the information about the
patterns and sequences and stores it in a set of new arrays
(see Appendix C3).
The information in these arrays is handed over to the
Java applet PatSeq.

Next, the script calculates some variables that are
important for the dimensioning of arrays in the Java applet, and
for the physical size of the individual image parts to be generated
by the applet (see Appendix C4).
Amongst other things, we need to calculate the overlaps of the patterns
in each sequence in order to find out how much vertical space is needed
for the graphical representation of each sequence.
The script calls the Java applet with all the necessary data;
the parameter transfer is realised with two loops, one over the sequences
and one over the patterns.
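
A common way for a CGI script to pass data to an applet is to write an
<applet> element with one <param> tag per value. The sketch below shows the
two loops in that style. It is an illustration in Python, not the actual Perl
code; the class file name PatSeq.class, all parameter names and the
start:end encoding of the matches are assumptions.

```python
from html import escape


def applet_html(patterns, sequences, width=600, height=400):
    """Build an <applet> element carrying one query result as <param> tags.

    patterns:  list of pattern strings.
    sequences: list of (name, length, matches) tuples, where matches is a
               list of (pattern_index, start, end) tuples.
    """
    def param(name, value):
        return '<param name="%s" value="%s">' % (name, escape(str(value)))

    lines = ['<applet code="PatSeq.class" width="%d" height="%d">'
             % (width, height)]
    # Counts first, so that the applet can size its arrays.
    lines.append(param("npatterns", len(patterns)))
    lines.append(param("nsequences", len(sequences)))
    for i, pattern in enumerate(patterns):                    # loop 1
        lines.append(param("pattern%d" % i, pattern))
    for j, (name, length, matches) in enumerate(sequences):   # loop 2
        lines.append(param("seqname%d" % j, name))
        lines.append(param("seqlen%d" % j, length))
        encoded = ";".join("%d:%d:%d" % match for match in matches)
        lines.append(param("matches%d" % j, encoded))
    lines.append("</applet>")
    return "\n".join(lines)
```

Sending the counts before the indexed values lets the receiving side read the
parameters with loops whose bounds depend on the query.
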

3.3. Java

In the Java applet there are three parts of the result to draw:

An overview of the different patterns

Detailed images of all sequences

Rulers to show the dimension of the sequences (top, bottom)

The code for the Java applet consists of three main parts:

the variable definitions,

the variable initialisation, and

the main program execution. This can be further divided into three subparts:

the module 'paint',

the modules for the event handling,

and help modules.

The variable definitions should be self-explanatory; all necessary variables
are defined there. They are organised by purpose (Pattern, Ruler, Sequence).

In the variable initialisation, the Java applet reads all parameters that
are handed over from the CGI script. The program executes loops with
running variables (see Appendix D1)
to get the data about the different patterns and sequences
(see Appendix D2).
It is necessary to work with flexible loops, because
the number of sequences and patterns can differ from query to
query. There are also some other variables needed by the applet
(see Appendix D3).
Initially the status of all patterns is set to ON, hence all
patterns will be shown in the sequences.
The colours (10 different) are initialised using a separate module.
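
The report does not state how colours and shapes are combined. One assignment
that is consistent with the figures given (10 colours, two shapes, at most 20
patterns, each with a unique colour and shape combination) is sketched below
in Python; the function name and the order of assignment are assumptions.

```python
SHAPES = ("box", "ellipse")
NUM_COLOURS = 10


def style_for(pattern_index):
    """Map a pattern index 0..19 to a unique (colour_index, shape) pair."""
    if not 0 <= pattern_index < NUM_COLOURS * len(SHAPES):
        raise ValueError("at most %d patterns are supported"
                         % (NUM_COLOURS * len(SHAPES)))
    # The first ten patterns are boxes, the next ten are ellipses.
    return pattern_index % NUM_COLOURS, SHAPES[pattern_index // NUM_COLOURS]
```
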

One event handler changes the status of individual patterns and
calls a helper module to recalculate the coordinates for the drawing
of the patterns in the sequences. When new patterns are switched on, new overlaps
might result, and consequently patterns may have to be 'pushed down'. Analogously,
when patterns are switched off, less vertical space may be needed for each sequence,
and hence the view becomes more compact.
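
The recalculation can be illustrated with a greedy row assignment: each
visible match is placed in the first row where it does not overlap an earlier
match, and the number of rows gives the vertical space needed for the
sequence. This is a sketch in Python of one standard technique, not
necessarily the method used in PatSeq; the data layout and names are
assumptions.

```python
def assign_rows(matches, status):
    """Place the visible matches of one sequence into non-overlapping rows.

    matches: list of (pattern_id, start, end) with inclusive positions.
    status:  dict pattern_id -> True (ON) or False (OFF); patterns missing
             from the dict count as ON.
    Returns (placement, row_count), where placement maps the index of each
    visible match to its row (0 is the top row).
    """
    visible = [(start, end, index)
               for index, (pattern_id, start, end) in enumerate(matches)
               if status.get(pattern_id, True)]
    row_ends = []    # end position of the last match placed in each row
    placement = {}
    for start, end, index in sorted(visible):
        for row, last_end in enumerate(row_ends):
            if start > last_end:        # fits behind the last match here
                row_ends[row] = end
                placement[index] = row
                break
        else:
            row_ends.append(end)        # overlaps everywhere: push down
            placement[index] = len(row_ends) - 1
    return placement, len(row_ends)
```

Calling the function again after a status change gives the new layout:
switching a pattern off can reduce the row count, and switching one on can
increase it.
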