[WSSA16] Analysing Protein Sequences and Letter Sequences of Words

Introduction

Proteins play crucial roles in the living cell and are involved in various vital processes. Protein sequences are not random, just as random sequences of letters and spaces rarely form words or sentences. Rather, protein sequences are "designed by evolution", i.e. polypeptide chains that did not fold into a functional state in a biologically relevant time, and thereby made the host organism less competitive, disappeared through natural selection. Natural protein sequences are the result of many trials and errors. It is important to note that folding into a specific structure in a biologically relevant time is a necessary but not a sufficient condition for an amino acid sequence to be a protein sequence. There is more information encoded in protein sequences that needs to be understood. Our aim is to find regularities in protein sequences that are absent from randomly generated sequences of amino acids.

Results and discussion

Data importation (transmembrane protein sequences)

Sequence length statistics

Sequence structure statistics

Random amino acid sequence generation

Random sequence structure statistics

Guessing whether a given sequence is a transmembrane protein sequence

The same code was also used to analyze sequences of letters in English, German, Italian, Russian, Arabic and Hebrew words.

We took the sequences of transmembrane proteins from the human protein data that is available in Mathematica 11.

Protein sequences contain evolutionarily conserved regions that are important for some function, so it is useful to determine which sub-sequences occur most frequently. This was done with the CharacterCounts function in Mathematica 11:

(* Find the most frequently occurring subsequences of length 1 to 20.
   These are probably evolutionarily conserved and hence important
   parts of the protein. The computation takes around 1 min. *)
tab = Table[
   Sort[Merge[CharacterCounts[Flatten[trseq], n], Total], Greater],
   {n, 1, 20, 1}];

The counts were then visualized as a table and a bar chart:

(* Ranked table of the most frequently occurring subsequences of a
   given length; the slider i sets the subsequence length between 1
   and 20. UpTo[20] avoids an error if fewer than 20 entries exist. *)
Manipulate[Dataset[Take[tab[[i]], UpTo[20]]], {i, 1, 20, 1}]

(* Bar chart of the same counts, with axis labels rotated by 90 degrees *)
Manipulate[
 BarChart[Take[tab[[i]], UpTo[20]],
  ChartLabels -> Placed[Automatic, Axis, Rotate[#, Pi/2] &]],
 {i, 1, 20, 1}]

From the above images it can be seen that the 5-residue sub-sequence KTGTL and the 6-residue sub-sequence DKTGTL each occur 51 times. This suggests that these sub-sequences are important for some function. In random sequences, the most frequent sub-sequences that are 6 amino acids long occur only 3-4 times (see the attached notebook).

Our next step was to find the similarities between sequences. This was quantified by calculating the Damerau–Levenshtein distance, which is the minimum number of operations required to turn one sequence into the other. Four types of operations are allowed: insertion, deletion, substitution, and transposition (exchange of positions) of neighboring symbols. For example, the Damerau–Levenshtein distance between "cat" and "bet" is 2, since two substitutions are required: "c" by "b" and "a" by "e".
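The example above can be checked directly with the built-in DamerauLevenshteinDistance function. The short sketch below also contrasts it with EditDistance (plain Levenshtein distance) to show the effect of the transposition operation; the strings "abc" and "acb" are chosen here purely for illustration.

```mathematica
(* Two substitutions: c -> b and a -> e *)
DamerauLevenshteinDistance["cat", "bet"]   (* 2 *)

(* Swapping the neighbouring symbols b and c counts as one operation *)
DamerauLevenshteinDistance["abc", "acb"]   (* 1 *)

(* Plain edit distance has no transposition, so it needs two substitutions *)
EditDistance["abc", "acb"]                 (* 2 *)
```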

In the rainbow color-encoded array plot, violet corresponds to a Damerau–Levenshtein distance of 0; as the color changes towards red, the Damerau–Levenshtein distance increases. To get a deeper understanding of the statistical data of the protein sequences displayed above, we then generated random amino acid sequences:
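A minimal sketch of one way to generate such sequences and plot their pairwise distances, under stated assumptions: each residue is drawn uniformly from the 20 standard amino acids, and the lengths are arbitrary placeholders. The names aminoAcids, randomSequence, randseq and dist are hypothetical and are not taken from the notebook.

```mathematica
(* One-letter codes of the 20 standard amino acids *)
aminoAcids = Characters["ACDEFGHIKLMNPQRSTVWY"];

(* Hypothetical helper: random sequence of a given length,
   each residue drawn uniformly (an assumption of this sketch) *)
randomSequence[len_Integer?Positive] :=
  StringJoin[RandomChoice[aminoAcids, len]];

SeedRandom[1];  (* only to make the sketch reproducible *)
lengths = RandomInteger[{50, 100}, 30];  (* placeholder lengths *)
randseq = randomSequence /@ lengths;

(* Matrix of pairwise Damerau-Levenshtein distances *)
dist = Outer[DamerauLevenshteinDistance, randseq, randseq];

(* Rainbow-encoded array plot: violet for small, red for large distances *)
ArrayPlot[dist, ColorFunction -> "Rainbow", PlotLegends -> Automatic]
```

For a closer comparison with the real data, the placeholder lengths could be replaced by the lengths of the actual sequences, for example StringLength /@ Flatten[trseq].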

Next, we performed the same operations on the random amino acid sequence data as on the transmembrane protein sequences. Below is the rainbow color-encoded array plot of the random amino acid sequence data, with some statistical characteristics.

Words in different languages are in some sense similar to protein sequences. Both are products of evolution, but these evolutions have different purposes and driving forces. Using the same code, we analyzed words in different languages. The word data was obtained as follows (for example for German words):
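A possible sketch of this step, assuming the built-in DictionaryLookup dictionaries as the word source (the data source actually used in the notebook may differ). The names germanWords and sample are hypothetical, and the non-English dictionaries may need to be downloaded on first use.

```mathematica
(* Assumption: DictionaryLookup's German dictionary is the word source;
   the pattern "*" matches every word *)
germanWords = DictionaryLookup[{"German", "*"}];

(* Keep at most 10000 words, chosen at random *)
SeedRandom[1];
sample = RandomSample[germanWords, Min[10000, Length[germanWords]]];

(* Distribution of word lengths *)
Histogram[StringLength /@ sample]
```

The resulting list of strings can be passed to the same counting and distance code as the protein sequences.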

It can be seen that words in the Semitic languages shown here (Arabic and Hebrew) are relatively short.

Next, we show the array plots of the Damerau–Levenshtein distances calculated between 10000 words for each language.

It shows that the Damerau–Levenshtein distances among Russian words and among Arabic words are large in comparison with those among English, German, Italian and Hebrew words. What would be your interpretation of the image above?