This article shows some of the advantages of Perl programming
on Unix for extracting biological information from DNA, RNA
and protein sequence databases, so that it can be used in
comparative processes or analyses. The Human Genome Project and
DNA cloning techniques have accelerated scientific progress in
this area. The information generated every day in this field
often outgrows the capacity to process it from an evolutionary
viewpoint.

The rapid proliferation of biological information on different
genomes (the complete set of genes of an organism) is driving
bioinformatics to become a fundamental discipline for handling
and analyzing these data.

Bioinformatics

Bioinformatics was born when scientists began to store
biological sequences in digital format and the first programs
to compare them appeared. For a long time bioinformatics was
limited to sequence analysis. However, the importance of
determining the structure of molecules made computers an
important research tool in theoretical biochemistry. Every day
there is more information and more collections of data on the
3D conformation of molecules. Genes are no longer studied one
at a time but across the whole genome or a large part of it.
It is now easier to understand how genes interact with each
other and with proteins, and how they are organized into
metabolic pathways. We are increasingly aware of how important
it is to organize the data.

Each of the activities described is interesting from at least
two points of view. On the one hand, the biological interest
lies in knowing the relationships between the molecules of
life; on the other hand, putting it all together becomes an
interesting software design problem. Biological information
needs to be combined and integrated to obtain a global and
effective view of the underlying biological processes. We have
also come to realize that different areas of computer science
must be combined to reach an effective solution: database
management and data integration, efficient algorithms, and
powerful hardware (grids, multiprocessors, etc.).

Perl

Larry Wall began the development of Perl in 1986. Perl is an
interpreted programming language, ideal for manipulating text,
files and processes. It makes it possible to develop small
programs quickly. Perl could be described as an optimized
mixture of a high-level language (for example C) and a
scripting language (for example bash).

Perl programs can run on several operating systems and
platforms. However, Perl was born on UNIX, and that is where it
has spread most widely. Perl went far beyond its initial scope
thanks to the boost it received from its immediate adoption as
a language for web applications. Before Perl, awk, sed and grep
were the tools used to analyze files and to extract
information.

Perl brought together the capabilities of these UNIX tools in a
single program, extending and modernizing each one with more
functionality.
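
As a rough illustration of that point, the sketch below does in a
few lines of Perl what would typically be done with grep (select
lines), awk (split a line into fields) and sed (substitute text).
The sample lines and the variable names are invented for this
example; they do not come from a real database.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample lines standing in for the contents of a text file
my @lines = (
    "ID   SEQ_ONE   first entry",
    "DE   a description line",
    "ID   SEQ_TWO   second entry",
);

# Like grep: keep only the lines that begin with ID
my @id_lines = grep { /^ID/ } @lines;

# Like awk: split each line into whitespace-separated fields
# and keep the second field
my @ids = map { (split /\s+/)[1] } @id_lines;

# Like sed: substitute text, working on a copy of each value
my @renamed = map { my $copy = $_; $copy =~ s/^SEQ_/seq-/; $copy } @ids;

print "$_\n" for @renamed;
```

Each step works on an ordinary Perl list, so the three operations
can be chained or reordered freely inside one program.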

Perl is free software and can be run on any of the operating
systems generally found in biological research laboratories.
On UNIX and Mac OS X it comes pre-installed; on other systems
Perl has to be installed. It is enough to obtain it from the
site http://www.cpan.org for the system in use.

Under Linux, a Perl program is run by giving Perl the name of
the file that contains the instructions to execute: the
instructions are kept in a file and perl is invoked with the
name of that file as its argument (for example,
perl greetings.pl).

Another common method is to keep the Perl instructions in a
file but to run it without invoking perl explicitly with the
file as argument. For that we must do two things: (a) put a
special comment on the first line of the program:

#!/usr/bin/env perl
print "Hi\n";

and (b) save the file and give it UNIX execute permission:

% chmod +x greetings.pl

Once this is done, the program can be run just by calling it
by its file name.

Perl File Management:

When we have a database of molecular sequences in text format,
we can write a sequence search tool in Perl. In this example we
see how to search for a protein sequence in a database in
SWISS-PROT format (human_kinases_swissprot.txt), using its ID
code.

#!/usr/bin/perl
# Look up an amino acid sequence in a SWISS-PROT formatted
# database, given the ID code of the entry
use strict;
use warnings;

# Ask for the ID code and read it from standard input (STDIN)
print "Enter the ID to search: ";
my $id_query = <STDIN>;
chomp $id_query;

# Open the database file; if that is not possible the program ends
my $db_file = "human_kinases_swissprot.txt";
open(my $db, '<', $db_file)
    or die "problem opening the file $db_file: $!\n";

# Mark: 0 = entry not found yet, 1 = ID matched,
# 2 = inside the sequence of the chosen entry
my $signal_good = 0;

# Read the database line by line
while (<$db>) {
    chomp;
    if (/^ID/) {
        # We are in the ID field: split the line on whitespace,
        # the second field is the ID code
        my (undef, $id_db) = split /\s+/;
        # If the ID does not match, go on to the next line
        next if $id_db ne $id_query;
        # When it matches, set the mark
        $signal_good = 1;
    } elsif (/^SQ/ && $signal_good == 1) {
        # SQ field of the chosen entry: set the mark to 2
        # so that the following lines are collected
        $signal_good = 2;
    } elsif ($signal_good == 2) {
        # Print each line of the sequence until a line
        # begins with //, then leave the while loop
        last if m{^//};
        print "$_\n";
    }
}

# Close the file, which is still open
close($db);

# If the mark was never set, the chosen sequence was not found
if (!$signal_good) {
    print "ERROR: Sequence not found\n";
}
exit;
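
To try the same mark-based logic without having a database file at
hand, one possibility is to open a filehandle on a string held in
memory. The sketch below assumes this approach; the two records are
made up and heavily abbreviated (they are not real SWISS-PROT
entries), and the subroutine name find_sequence is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up, heavily abbreviated records in SWISS-PROT style
# (for illustration only, not real entries)
my $sample = <<'END';
ID   TEST1_HUMAN   STANDARD;   PRT;   10 AA.
DE   Hypothetical example protein 1.
SQ   SEQUENCE   10 AA;
     MKTAYIAKQR
//
ID   TEST2_HUMAN   STANDARD;   PRT;   12 AA.
DE   Hypothetical example protein 2.
SQ   SEQUENCE   12 AA;
     GSHMLEDPVD AA
//
END

# Return the sequence lines of the entry with the given ID,
# or an empty list if there is no such entry
sub find_sequence {
    my ($fh, $id_query) = @_;
    my $state = 0;    # 0 = searching, 1 = ID matched, 2 = in sequence
    my @seq_lines;
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^ID/) {
            my (undef, $id_db) = split /\s+/, $line;
            $state = 1 if $id_db eq $id_query;
        } elsif ($line =~ /^SQ/ && $state == 1) {
            $state = 2;
        } elsif ($state == 2) {
            last if $line =~ m{^//};
            push @seq_lines, $line;
        }
    }
    return @seq_lines;
}

# Open a read-only filehandle on the string instead of on a file
open(my $fh, '<', \$sample) or die "cannot open in-memory data: $!\n";
my @found = find_sequence($fh, "TEST2_HUMAN");
close($fh);

if (@found) {
    print "$_\n" for @found;
} else {
    print "ERROR: Sequence not found\n";
}
```

Because the subroutine only needs a filehandle, the same code works
unchanged with a handle opened on a real database file.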

Search for amino acid patterns

#!/usr/bin/perl
# Search for an amino acid pattern in a sequence
use strict;
use warnings;

# Ask the user for the pattern to search for
print "Please enter the pattern to search for in query_seq.txt: ";
my $patron = <STDIN>;
chomp $patron;

# Open the sequence file; if that is not possible the program ends
open(my $query, '<', "query_seq.txt")
    or die "problem opening the file query_seq.txt: $!\n";

my $signal_seq      = 0;     # set to 1 once the SQ field is reached
my $secuencia_total = '';    # the whole sequence joined in one string

# Read the SWISS-PROT entry line by line
while (<$query>) {
    chomp;
    if (/^SQ/) {
        # On reaching the SQ field, set the mark to 1
        $signal_seq = 1;
    } elsif (m{^//}) {
        # End of the sequence: leave the loop. This test must come
        # before the test of the mark, because the // line does not
        # belong to the amino acid sequence
        last;
    } elsif ($signal_seq == 1) {
        # Inside the sequence: remove the blanks from the line and
        # append it to the full sequence. The same concatenation
        # can be written as:
        # $secuencia_total = $secuencia_total . $_;
        s/ //g;
        $secuencia_total .= $_;
    }
}

# Close the file
close($query);

# Now search the sequence, collected in its entirety,
# for the given pattern
if ($secuencia_total =~ /$patron/) {
    print "The sequence in query_seq.txt contains the pattern $patron\n";
} else {
    print "The sequence in query_seq.txt does not contain the pattern $patron\n";
}
exit;

If we want to know the exact position where the pattern was
found, we must make use of the special variable `$&'. This
variable holds the text matched by the last successful regular
expression (it would have to be used just after the line
`if ($secuencia_total =~ /$patron/) {'). It can also be
combined with the variables $` and $', which store everything
to the left and to the right of the match. Modify the previous
program with these variables so that it gives the exact
position of the pattern. Note: you may also find the length
function useful; it gives the length of a string.

# Only the if statement where the pattern is searched needs to change.
# Search the sequence, collected in its entirety, for the given
# pattern and work out its position in the sequence
if ($secuencia_total =~ /$patron/) {
    # $` holds the text before the match, so its length plus 1
    # is the position (counting from 1) where the match starts
    my $posicion = length($`) + 1;
    print "The sequence in query_seq.txt contains the pattern $patron at position $posicion\n";
} else {
    print "The sequence in query_seq.txt does not contain the pattern $patron\n";
}
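
The special variables report only the first match. As a possible
extension, which is not part of the program above, the following
sketch lists every position at which a pattern occurs. It uses a
global match (the g modifier) in a while loop and the built-in
array @-, whose first element is the offset at which the last
successful match began. The subroutine name, the sequence and the
pattern are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the start positions (counting from 1) of all
# non-overlapping matches of a pattern in a sequence
sub pattern_positions {
    my ($sequence, $pattern) = @_;
    my @positions;
    while ($sequence =~ /$pattern/g) {
        # $-[0] is the 0-based offset where this match begins,
        # the same value as length($`)
        push @positions, $-[0] + 1;
    }
    return @positions;
}

# Made-up sequence and pattern for illustration
my @hits = pattern_positions("MKTAGKSTLLGKTA", "GK[ST]");
print "Pattern found at positions: @hits\n";
```

With this sequence and pattern the program reports positions 5 and
11. Reading the offset from @- also avoids the copying of the
sequence that $` can cause on long strings in older versions of
Perl.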

Calculation of amino acid frequencies:

The frequency of the different amino acids in proteins varies
as a result of their different functions or preferred
environments. In this example we will see how to calculate the
amino acid frequencies of a given amino acid sequence.
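
A minimal sketch of such a calculation is given here. It assumes
that the sequence is already held in a single string, as
$secuencia_total is in the pattern search program, and it counts
the residues with a hash (associative array) keyed by the
one-letter code. The subroutine name residue_counts and the
sequence are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how often each residue occurs in a sequence and
# return a hash: one-letter code => number of occurrences
sub residue_counts {
    my ($sequence) = @_;
    my %count;
    $count{$_}++ for split //, uc $sequence;
    return %count;
}

# Made-up sequence used only to demonstrate the idea
my $sequence = "MKTAYIAKQRAA";
my %count    = residue_counts($sequence);
my $total    = length $sequence;

# Print each residue with its count and its percentage of the
# total, in alphabetical order
for my $aa (sort keys %count) {
    printf "%s  %3d  %5.1f%%\n", $aa, $count{$aa}, 100 * $count{$aa} / $total;
}
```

Only residues that actually occur in the sequence appear in the
hash, so an amino acid that is absent is simply not listed.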

Next we follow the flow of information in the cell one step
beyond transcription. That step is translation, by which an
RNA sequence transcribed from a gene (DNA) is converted into a
protein, that is, an amino acid sequence. For that we must use
the genetic code, which specifies which RNA/DNA triplet
corresponds to which amino acid. We will extract the sequence
from an entry for an Escherichia coli gene in EMBL format, and
then check our translation against the one given in the entry.
For this example it will be necessary to introduce associative
arrays, also called hashes. In the program we should bear in
mind that only the coding region is needed, which is given in
the 'FT CDS' field.
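
As a sketch of how a hash can hold the genetic code, the program
below builds a table from codon to one-letter amino acid and uses
it to translate a coding sequence. It assumes the standard genetic
code and a sequence that starts at the first base of the coding
region; it does not read the EMBL entry or its FT CDS field. The
subroutine name translate and the short sequence are made up for
this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the standard genetic code as a hash: codon => amino acid.
# The 64 codons are generated with the bases in the order T, C, A, G
# at each position, which matches the order of the letters in
# $amino_acids ('*' marks a stop codon).
my @bases       = qw(T C A G);
my $amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %genetic_code;
my $i = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $genetic_code{"$b1$b2$b3"} = substr($amino_acids, $i++, 1);
        }
    }
}

# Translate a coding sequence (DNA or RNA, upper or lower case)
# into protein, reading three bases at a time from the first base
sub translate {
    my ($cds) = @_;
    $cds = uc $cds;
    $cds =~ tr/U/T/;    # treat RNA like DNA
    my $protein = '';
    for (my $pos = 0; $pos + 3 <= length($cds); $pos += 3) {
        my $codon = substr($cds, $pos, 3);
        # 'X' marks a triplet that is not in the table
        # (for example one that contains N)
        $protein .= exists($genetic_code{$codon}) ? $genetic_code{$codon} : 'X';
    }
    return $protein;
}

# Made-up coding sequence for illustration
print translate("atggccaaatgctaa"), "\n";
```

For the made-up sequence the program prints MAKC*, where the final
asterisk is the stop codon. Any bases left over after the last
complete triplet are ignored.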