The Role of Converging Technologies: Computational Linguistics and Biological Chemistry

Complex biological systems are built from cells that have differentiated to perform specialized functions. This differentiation is achieved through a complicated network of interacting biological molecules. The main action is carried out by proteins, which are essentially nano-sized biological machines composed of strings of the 20 amino acid building blocks arranged in characteristic sequences. These sequences are encoded in their entirety in the genome. In principle, the linear string of amino acids contains all the information needed to fold a protein into a 3-D shape capable of carrying out its designated function. With the advent of whole-genome sequencing projects, we now have complete lists of the protein sequences that define the complex functions carried out by the sequenced organisms: hundreds to thousands in bacteria and tens of thousands in humans. Individual proteins and functions have been studied for decades at various levels, from atomic to macroscopic. Most recently, the new field of proteomics has emerged, which looks at all the proteins in a cell simultaneously. This multitude of data provides a tremendous new opportunity: statistical methods can now be applied to yield practical answers, expressed as the likelihood that a given biological phenomenon will occur.

The availability of enormous amounts of data has also transformed linguistics. In language, the counterpart of genome sequences is raw text stored in databases, websites, and libraries, and the counterpart of protein structure and function is the meaning of words, phrases, sentences, and paragraphs (Figure F.6). After decoding, we can extract knowledge about a topic from the raw text. In language, extraordinary success in this process has been demonstrated by the ability to retrieve, summarize, and translate text. Examples include powerful speech recognition systems, fast web document search engines, and computer-generated sentences that human evaluators prefer, for grammatical accuracy and elegance, over sentences that humans build naturally. This transformation through data availability has allowed linguistics to converge with computer science and information technology. Thus, even though a deep fundamental understanding of language is still missing (a gene for speech, for example, was discovered only a few months ago; Lai et al. 2002), data availability has allowed us to obtain practical answers that fundamentally affect our lives. In direct analogy, the transformation of biological chemistry by data availability opens the door to convergence with computer science and information technology. Furthermore, the deeper analogy between biology and language suggests that successful sequence → function mapping is fundamentally similar to the ability to retrieve, summarize, and translate in computational linguistics. Examples of biological equivalents of these abilities are described below under "The Estimated Implications."
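
As a rough illustration of the analogy, protein sequences can be treated like text and scored with a statistical language model. The Python sketch below is a minimal example under stated assumptions, not a method described in this report: the choice of a bigram model, the add-one smoothing, the function names, and the toy sequences are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def train_bigram_model(sequences):
    """Count residue bigrams and the contexts (first residues) they start from."""
    bigram_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        for first, second in zip(seq, seq[1:]):
            bigram_counts[first, second] += 1
            context_counts[first] += 1
    return bigram_counts, context_counts


def log_likelihood(seq, model):
    """Sum of log-probabilities of seq's residue transitions (add-one smoothing)."""
    bigram_counts, context_counts = model
    vocab = len(AMINO_ACIDS)
    total = 0.0
    for first, second in zip(seq, seq[1:]):
        p = (bigram_counts[first, second] + 1) / (context_counts[first] + vocab)
        total += math.log(p)
    return total


# Toy, made-up sequences for illustration only; these are not real proteins.
training = ["MKTLLVAGLL", "MKTLIVAGLA", "MKSLLVAGIL"]
model = train_bigram_model(training)

seen_like = log_likelihood("MKTLLVAG", model)   # resembles the training data
scrambled = log_likelihood("GALVKTML", model)   # same residues, shuffled order
print(seen_like > scrambled)
```

In this sketch, a sequence whose residue transitions resemble the training data receives a higher log-likelihood than a shuffled sequence of the same residues, which is the kind of "answer in terms of likelihood" that large sequence collections make possible. A realistic application would need far more data and a richer model than adjacent-residue counts.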