As in human language modeling, success in biological language modeling will be measured by the capacity for efficient (1) retrieval, (2) summarization, and (3) translation:

1. When we desire to enhance the performance of a specific human ability, we can retrieve all the relevant biological information required from the vast and complex data available.

2. We can summarize which proteins of a pathway are important for a particular task, or which part of a key protein is important for its folding into a functional 3-D structure. This will allow sequences to be modified with the purpose of enhancing the original function or adding a new one. Successful existing examples of this strategy include tagging proteins for purification or identification purposes.

3. Finally, we can translate protein sequences from the "language" of one organism into that of another organism. This has very important implications, both for basic sciences and for the biotechnology industry. Both extensively use other organisms, e.g., the bacterium E. coli, to produce human proteins. However, many proteins (especially the most interesting ones) cannot be produced successfully in E. coli: they misfold because the environment in bacterial cells differs from that in human cells (Figure F.9). Statistical analysis of the human and E. coli genomes can reveal the differences in the rules that must be observed if productive folding is to occur. Thus, it should be possible to alter a human protein sequence in such a way that it folds to its correct functional 3-D shape in E. coli. The validity of this hypothesis has been shown for some examples in which single point mutations allowed expression and purification of proteins from E. coli. In addition to the traditional use of E. coli (or other organisms) as protein production factories, this translation approach could also be used to add functionality to particular organisms.

Figure F.9. Misfolding of human proteins in bacterial expression systems is often a bottleneck.
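The statistical comparison of two organisms' sequence "languages" mentioned in item 3 can be illustrated with a minimal sketch. It assumes that the comparison is made on overlapping amino acid n-grams and that differences are scored as a smoothed log2 frequency ratio; both choices, the function names, and the toy sequences are assumptions made for illustration, not the method used in the work described here.

```python
from collections import Counter
from math import log2


def ngram_counts(sequences, n):
    """Count overlapping n-grams across a list of amino acid sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def log_ratio(counts_a, counts_b, pseudocount=1.0):
    """log2 of (relative frequency in a) / (relative frequency in b).

    Additive smoothing (an assumption of this sketch) keeps n-grams that
    are absent from one organism from producing a division by zero.
    """
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + pseudocount * len(vocab)
    total_b = sum(counts_b.values()) + pseudocount * len(vocab)
    return {
        g: log2(((counts_a[g] + pseudocount) / total_a)
                / ((counts_b[g] + pseudocount) / total_b))
        for g in vocab
    }


# Toy sequences invented for illustration; not real proteome data.
human_like = ["MKTLLVAGG", "MKTLLAAGG"]
ecoli_like = ["MSRRVAGGK", "MSRRVAGGT"]

ratios = log_ratio(ngram_counts(human_like, 2), ngram_counts(ecoli_like, 2))

# Positive scores: n-grams favoured in the first set; negative: in the second.
for gram, score in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(gram, round(score, 2))
```

In a real setting the two inputs would be whole proteomes, and n-grams with strongly negative scores in a human protein would be candidate positions to alter before expressing it in the other organism.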

Implications for Communication Interfaces

The ability to translate highlights one of the most fundamental aspects of language: a means for communication. Knowing the rules for the languages of different organisms at the cellular and molecular levels would also allow us to communicate at this level. This will fundamentally alter (1) human-human, (2) human-other organism, and (3) human-machine interfaces.

1. Human-human communication can be enhanced because the molecular biological language level is much more fundamental than speech, which may in the future be omitted in some cases as an intermediary between humans. For example, pictures of memory events could be transmitted directly, without verbal description, through their underlying molecular mechanisms.

2. The differences in language between humans and other organisms can be exploited to "speak" to a pathogen in the presence of its human host (Figure F.10). That this may be possible is indicated by the observation of organism-specific phrases described in "The Transforming Strategy." This has important implications for the fight against bioterrorism and against pathogens in general to preserve and restore human health. The genome signatures should dramatically accelerate vaccine development by targeting pathogen-specific phrases. The advantage over traditional methods is that multiple proteins, unrelated in function, can be targeted simultaneously.

[Figure F.10 legend: marked symbol = protein with an n-gram that is frequent in the pathogen and rare in human; unmarked symbol = other proteins.]

3. Finally, there are entirely novel opportunities to communicate between inherent and external abilities, that is, between humans (or other living organisms) and machines. Using nanoscale principles, new materials and interfaces can be designed that are modeled after biological machines or that can interact with biological machines. Of particular importance are molecular receptors and signal transduction systems.
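The idea in item 2 of targeting pathogen-specific phrases, that is, n-grams that are frequent in the pathogen and rare in human, can be sketched as follows. The thresholds `min_pathogen` and `max_human`, the use of per-protein occurrence as the frequency measure, and the toy protein names and sequences are all hypothetical choices for illustration; the source does not specify them.

```python
from collections import Counter


def ngrams(seq, n):
    """Set of distinct overlapping n-grams in one sequence."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def protein_fraction(proteins, n):
    """Fraction of proteins in which each n-gram occurs at least once."""
    seen = Counter()
    for seq in proteins.values():
        seen.update(ngrams(seq, n))
    return {g: c / len(proteins) for g, c in seen.items()}


def pathogen_specific(pathogen, human, n, min_pathogen=0.5, max_human=0.0):
    """n-grams frequent in the pathogen and rare (here: absent) in human."""
    p = protein_fraction(pathogen, n)
    h = protein_fraction(human, n)
    return {g for g, f in p.items()
            if f >= min_pathogen and h.get(g, 0.0) <= max_human}


def flag_proteins(pathogen, signature, n):
    """Map each pathogen protein to the signature n-grams it contains."""
    hits = {name: sorted(signature & ngrams(seq, n))
            for name, seq in pathogen.items()}
    return {name: grams for name, grams in hits.items() if grams}


# Toy proteomes invented for illustration; names and sequences are hypothetical.
pathogen = {"pA": "MWWCHKLA", "pB": "GGWWCHTT", "pC": "MKLAGGTT"}
human = {"h1": "MKLAGGSS", "h2": "TTGGMKLA"}

signature = pathogen_specific(pathogen, human, n=3)
flagged = flag_proteins(pathogen, signature, n=3)
print(signature)
print(flagged)
```

Because the selection is driven by sequence statistics alone, the flagged proteins need not share any function, which is the advantage over traditional methods noted in item 2.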

Implications to Rationalize Empirical Approaches

The greatest exploitation of the sequence → structure/function mapping by computational linguistics approaches will be to rationalize empirical observations. Here are two examples.

1. The first example concerns the effect of protein misfolding on human health. The correlation, described above, between the distribution of rare amino acid sequences in proteins and the location of nucleation sites for protein folding is important because misfolding is the cause of many diseases, including Alzheimer's, BSE, and others. Misfolding arises either from changes in the protein sequence or from alternative structures adopted by the same sequence. It can lead to amorphous aggregates or highly organized amyloid fibrils, both of which interfere with normal cell function. There are databases of mutations that list changes in amyloid formation propensity. Studying the linguistic properties of the sequences of amyloidogenic wild-type and mutant proteins may help rationalize the mechanisms of misfolding diseases, the first step towards the design of strategies to treat them.

2. The second example is in tissue engineering applications (Figure F.11). The sequence → structure/function mapping also provides the opportunity to engineer functionality by rationalized directed sequence evolution. Diseased or aged body parts, or organs whose performance we might like to enhance, all need integration of external materials into the human body. One typical application is bone tissue engineering. The current method for improving the growth of cells around artificial materials such as hydroxyapatite is to vary, by trial and error, the function of co-polymers and of added growth factors. Mapping sequence to function will allow us to rationally design growth factor sequences that code for altered function in regulating tissue growth.
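Returning to the first example, locating rare amino acid sequences along a protein can be sketched as a positional rarity profile. The sketch assumes that rarity is measured as the surprisal (negative log2 of a smoothed frequency) of each n-gram against a background set of sequences; the function name, the smoothing, and the toy data are assumptions for illustration, and nothing here reproduces the correlation with folding nucleation sites itself.

```python
from collections import Counter
from math import log2


def rarity_profile(sequence, background, n, pseudocount=1.0):
    """Surprisal of the n-gram starting at each position of `sequence`.

    Frequencies are estimated from `background` with additive smoothing
    (an assumption of this sketch), so unseen n-grams score highest.
    """
    counts = Counter()
    for seq in background:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    query = [sequence[i:i + n] for i in range(len(sequence) - n + 1)]
    vocab = set(counts) | set(query)
    total = sum(counts.values()) + pseudocount * len(vocab)
    return [-log2((counts[g] + pseudocount) / total) for g in query]


# Toy background and query invented for illustration; not real protein data.
background = ["AAAAGAAAA", "AAGAAAAGA"]
profile = rarity_profile("AAAWCAAA", background, n=2)

# Index of the first position whose n-gram is rarest in the background.
rarest = max(range(len(profile)), key=profile.__getitem__)
print([round(s, 2) for s in profile], rarest)
```

Peaks in such a profile would mark candidate regions to compare against known nucleation sites or against positions of mutations that change amyloid formation propensity.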