By mining in the gene dimension, we may find patterns shared by multiple genes, or cluster genes into groups. For example, we may find a group of genes that express themselves similarly, which is highly interesting in bioinformatics, such as in finding pathways.
■ When analyzing in the sample/condition dimension, we treat each sample/condition as an object and treat the genes as attributes. In this way, we may find patterns of samples/conditions, or cluster samples/conditions into groups. For example, we may find the differences in gene expression by comparing a group of tumor samples and nontumor samples.
Gene expression matrices are popular in bioinformatics research and development. For example, an important task is to classify a new gene using the expression data of the gene and that of other genes in known classes. Symmetrically, we may classify a new sample (e.g., a new patient) using the expression data of the sample and that of samples in known classes (e.g., tumor and nontumor).
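To make the two analysis directions concrete, here is a minimal sketch using a made-up matrix, with k-means purely as an illustrative choice of clustering method (the text does not prescribe a particular algorithm): clustering the rows treats genes as objects, clustering the transposed matrix treats samples/conditions as objects.

```python
# Minimal sketch (hypothetical data): mining a gene expression matrix in both
# dimensions. Rows are genes, columns are samples/conditions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
expr = rng.normal(size=(6, 4))  # 6 genes (rows) x 4 samples (columns)

# Gene dimension: each gene is an object, the samples are its attributes.
gene_clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(expr)

# Sample/condition dimension: transpose so each sample is an object and the
# genes are its attributes (e.g., separating tumor from nontumor samples).
sample_clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(expr.T)

print("gene clusters:  ", gene_clusters)
print("sample clusters:", sample_clusters)
```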

…

Every enterprise benefits from collecting and analyzing its data: Hospitals can spot trends and anomalies in their patient records, search engines can do better ranking and ad placement, and environmental and public health agencies can spot patterns and abnormalities in their data. The list continues, with cybersecurity and computer network intrusion detection; monitoring of the energy consumption of household appliances; pattern analysis in bioinformatics and pharmaceutical data; financial and business intelligence data; spotting trends in blogs, Twitter, and many more. Storage is inexpensive and getting even more so, as are data sensors. Thus, collecting and storing data is easier than ever before.
The problem then becomes how to analyze the data. This is exactly the focus of this Third Edition of the book. Jiawei, Micheline, and Jian give encyclopedic coverage of all the related methods, from the classic topics of clustering and classification, to database methods (e.g., association rules, data cubes) to more recent and advanced topics (e.g., SVD/PCA, wavelets, support vector machines).

…

Web mining can help us learn about the distribution of information on the WWW in general, characterize and classify web pages, and uncover web dynamics and the association and other relationships among different web pages, users, communities, and web-based activities.
It is important to keep in mind that, in many applications, multiple types of data are present. For example, in web mining, there often exist text data and multimedia data (e.g., pictures and videos) on web pages, graph data like web graphs, and map data on some web sites. In bioinformatics, genomic sequences, biological networks, and 3-D spatial structures of genomes may coexist for certain biological objects. Mining multiple data sources of complex data often leads to fruitful findings due to the mutual enhancement and consolidation of such multiple sources. On the other hand, it is also challenging because of the difficulties in data cleaning and data integration, as well as the complex interactions among the multiple sources of such data.

It was not a matter of her being warm and fuzzy, as you might expect from the usual characterizations of feminine thought—on the contrary, Anna’s scientific work (she still often coauthored papers in statistics, despite her bureaucratic load) often displayed a finicky perfectionism that made her a very meticulous scientist, a first-rate statistician—smart, quick, competent in a range of fields and really excellent in more than one. As good a scientist as one could find for the rather odd job of running the Bioinformatics Division at NSF, good almost to the point of exaggeration—too precise, too interrogatory—it kept her from pursuing a course of action with drive. Then again, at NSF maybe that was an advantage.
In any case she was so intense about it. A kind of Puritan of science, rational to an extreme. And yet of course at the same time that was all such a front, as with the early Puritans; the hyperrational coexisted in her with all the emotional openness, intensity, and variability that was the American female interactional paradigm and social role.

…

This was a major manifestation of the peer-review process, a process Frank thoroughly approved of—in principle. But a year of it was enough.
Anna had been watching him, and now she said, “I suppose it is a bit of a rat race.”
“Well, no more than anywhere else. In fact if I were home it’d probably be worse.”
They laughed.
“And you have your journal work too.”
“That’s right.” Frank waved at the piles of typescripts: three stacks for Review of Bioinformatics, two for The Journal of Sociobiology. “Always behind. Luckily the other editors are better at keeping up.”
Anna nodded. Editing a journal was a privilege and an honor, even though usually unpaid—indeed, one often had to continue to subscribe to a journal just to get copies of what one had edited. It was another of science’s many noncompensated activities, part of its extensive economy of social credit.

…

A key to any part of the mystery could be very valuable.
Frank scrolled down the pages of the application with practiced speed. Yann Pierzinski, Ph.D. in biomath, Caltech. Still doing postdoc work with his thesis advisor there, a man Frank had come to consider a bit of a credit hog, if not worse. It was interesting, then, that Pierzinski had gone down to Torrey Pines to work on a temporary contract, for a bioinformatics researcher whom Frank didn’t know. Perhaps that had been a bid to escape the advisor. But now he was back.
Frank dug into the substantive part of the proposal. The algorithm set was one Pierzinski had been working on even back in his dissertation. Chemical mechanics of protein creation as a sort of natural algorithm, in effect. Frank considered the idea, operation by operation. This was his real expertise; this was what had interested him from childhood, when the puzzles solved had been simple ciphers.

Reducing the cost of electricity in the management of data centers goes hand in hand with cutting the cost of storing data, an ever larger part of the data-management process. And the sheer volume of data is mushrooming faster than the capacity of hard drives to save it.
Researchers are just beginning to experiment with a new way of storing data that could eventually drop the marginal cost to near zero. In January 2013 scientists at the European Bioinformatics Institute in Cambridge, England, announced a revolutionary new method of storing massive electronic data by embedding it in synthetic DNA. Two researchers, Nick Goldman and Ewan Birney, took five computer files—which included an MP3 recording of Martin Luther King Jr.’s “I Have a Dream” speech, a paper by James Watson and Francis Crick describing the structure of DNA, and all of Shakespeare’s sonnets and plays—and converted the ones and zeros of digital information into the letters that make up the alphabet of the DNA code.
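As a toy illustration of the underlying idea—and emphatically not the actual Goldman–Birney scheme, which used a more elaborate code designed to avoid long runs of the same letter—one can map each pair of bits to one of the four DNA letters and back:

```python
# Toy illustration only: map every two bits of a byte stream to one DNA letter.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(data: bytes) -> str:
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(dna: str) -> bytes:
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

msg = b"I have a dream"
dna = encode(msg)
assert decode(dna) == msg   # round trip recovers the original bytes
print(dna)
```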

…

Harvard researcher George Church notes that the information currently stored in all the disk drives in the world could fit in a tiny bit of DNA the size of the palm of one’s hand. Researchers add that DNA information can be preserved for centuries, as long as it is kept in a dark, cool environment.65
At this early stage of development, the cost of reading the code is high and the time it takes to decode information is substantial. Researchers, however, are reasonably confident that an exponential rate of change in bioinformatics will drive the marginal cost to near zero over the next several decades.
A near zero marginal cost communication/energy infrastructure for the Collaborative Age is now within sight. The technology needed to make it happen is already being deployed. At present, it’s all about scaling up and building out. When we compare the increasing expenses of maintaining an old Second Industrial Revolution communication/energy matrix of centralized telecommunications and centralized fossil fuel energy generation, whose costs are rising with each passing day, with a Third Industrial Revolution communication/energy matrix whose costs are dramatically shrinking, it’s clear that the future lies with the latter.

…

Its network of thousands of scientists and plant breeders is continually searching for heirloom and wild seeds, growing them out to increase seed stock, and ferrying samples to the vault for long-term storage.32 In 2010, the trust launched a global program to locate, catalog, and preserve the wild relatives of the 22 major food crops humanity relies on for survival.
The intensification of genetic-Commons advocacy comes at a time when new IT and computing technology is speeding up genetic research. The new field of bioinformatics has fundamentally altered the nature of biological research just as IT, computing, and Internet technology did in the fields of renewable-energy generation and 3D printing. According to research compiled by the National Human Genome Research Institute, gene-sequencing costs are plummeting at a rate that exceeds the exponential curves of Moore’s Law in computing power.33 Dr. David Altshuler, deputy director of the Broad Institute of Harvard University and the Massachusetts Institute of Technology, observes that in just the past several years, the price of genetic sequencing has dropped a millionfold.34 Consider that the cost of reading one million base pairs of DNA—the human genome contains around three billion pairs—has plunged from $100,000 to just six cents.35 This suggests that the marginal cost of some genetic research will approach zero in the not-too-distant future, making valuable biological data available for free, just like information on the Internet.
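A quick back-of-the-envelope check of the cited figures (illustrative arithmetic only, using the numbers quoted above):

```python
# Cost of reading a ~3-billion-base-pair genome at the two quoted prices
# per million base pairs: $100,000 then, $0.06 now.
genome_bp = 3_000_000_000
millions_of_bp = genome_bp / 1_000_000

old_cost = millions_of_bp * 100_000   # about $300,000,000 at the old price
new_cost = millions_of_bp * 0.06      # roughly $180 at six cents per million bases

print(old_cost, new_cost, old_cost / new_cost)  # ~1.7-million-fold cheaper at these figures
```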

The machines and technology coming out of the digital and genetic revolutions may allow people to leverage their mental capacity a thousand …
A million …
Or a trillionfold.
Biology is now driven by applied math … statistics … computer science … robotics …
The world’s best programmers are increasingly gravitating toward biology …
You will be hearing a lot about two new fields in the coming months …
Bioinformatics and Biocomputing.
You rarely see bioinformaticians …
They are too valuable to companies and universities.
Things are moving too fast …
And they are too passionate about what they do …
To spend a lot of time giving speeches and interviews.
But if you go into the bowels of Harvard Medical School …
And are able to find the genetics department inside the Warren Alpert Building …
(A significant test of intelligence in and of itself … Start by finding the staircase inspired by the double helix … and go past the bathrooms marked XX and XY …)
There you can find a small den where George Church hangs out, surrounded by computers.

…

This is ground zero for a wonderful commune of engineers, physicists, molecular biologists, and physicians …3
And some of the world’s smartest graduate students …
Who are trying to make sense of the 100 terabytes of data that come out of gene labs yearly …
A task equivalent to trying to sort and use a million new encyclopedias … every year.4
You can’t build enough “wet” labs (labs full of beakers, cells, chemicals, refrigerators) to process and investigate all the opportunities this scale of data generates.
The only way for Church & Co. to succeed …
Is to force biology to divide …
Into theoretical and applied disciplines.
Which is why he is one of the founders of bioinformatics …
A new discipline that attempts to predict what biologists will find …
When they carry out wet-lab experiments in a few months, years, or decades.
In a sense, this mirrors Craig Venter’s efforts at The Institute for Genomic Research and Celera.
Celera and Church’s labs are information centers … not traditional labs …
And a few smart people are going to be able to do …
A lot of biology …
Very quickly.

…

THE RULES ARE DIFFERENT IN A
KNOWLEDGE ECONOMY …
IT’S A SCARY TIME FOR THE ESTABLISHMENT.
Countries, regions, governments, and companies
that assume they are …
And will remain …
Dominant …
Soon lose their competitive edge.
(Particularly those whose leadership ignores or disparages emerging technologies … Remember those old saws: The sun never sets on the British Empire … Vive La France! … All roads lead to Rome … China, the Middle Kingdom.)
Which is one of the reasons bioinformatics is so important …
And why you should pay attention.
What we are seeing is just the beginning of the digital-genomics convergence.
When you think of a DNA molecule and its ability to …
Carry our complete life code within each of our cells …
Accurately copy the code …
Billions of times per day …
Read and execute life’s functions …
Transmit this information across generations …
It becomes clear that …
The world’s most powerful and compact coding and information-processing system … is a genome.

Much of the disruption is fed by improved instrument and sensor technology; for instance, the Large Synoptic Survey Telescope has a 3.2-gigapixel camera and generates over 6 petabytes of image data per year. It is the platform of Big Data that is making such lofty goals attainable.
The validation of Big Data analytics can be illustrated by advances in science. The biomedical corporation Bioinformatics recently announced that it has reduced the time it takes to sequence a genome from years to days, and it has also reduced the cost, so it will be feasible to sequence an individual’s genome for $1,000, paving the way for improved diagnostics and personalized medicine.
The financial sector has seen how Big Data and its associated analytics can have a disruptive impact on business. Financial services firms are seeing larger volumes through smaller trading sizes, increased market volatility, and technological improvements in automated and algorithmic trading.

…

Big Data has transformed astronomy from a field in which taking pictures of the sky was a large part of the job to one in which the pictures are all in a database already and the astronomer’s task is to find interesting objects and phenomena in the database.
Transformation is taking place in the biological arena as well. There is now a well-established tradition of depositing scientific data into a public repository and of creating public databases for use by other scientists. In fact, there is an entire discipline of bioinformatics that is largely devoted to the maintenance and analysis of such data. As technology advances, particularly with the advent of next-generation sequencing, the size and number of available experimental data sets are increasing exponentially.
Big Data has the potential to revolutionize more than just research; the analytics process has started to transform education as well. A recent detailed quantitative comparison of different approaches taken by 35 charter schools in New York City has found that one of the top five policies correlated with measurable academic effectiveness was the use of data to guide instruction.

…

It may take a significant amount of work to achieve automated error-free difference resolution.
The data preparation challenge even extends to analysis that uses only a single data set. Here there is still the issue of suitable database design, further complicated by the many alternative ways in which to store the information. Particular database designs may have certain advantages over others for analytical purposes. A case in point is the variety in the structure of bioinformatics databases, in which information on substantially similar entities, such as genes, is inherently different but is represented with the same data elements.
Examples like these clearly indicate that database design is an artistic endeavor that has to be carefully executed in the enterprise context by professionals. When creating effective database designs, professionals such as data scientists must have the tools to assist them in the design process, and more important, they must develop techniques so that databases can be used effectively in the absence of intelligent database design.

This was what started my fascination with computer science.
In addition to computers, as a kid I was also really excited about bioinformatics. I was really interested in the fact that you could take all genetic data, actually fit it into computers, and solve many problems that looked unsolvable before that, and reach medical discoveries. You could potentially build a combination of a human and a computer together. I took part in the Technion External Studies program when I was 15 years old, which allowed me to start taking college-level classes while still in high school. And once I started studying at the Technion University, this was what I wanted to do—study bioinformatics. Before my studies started, at the age of 14, I went to a research camp. At the camp, each one of us selected research he or she wanted to lead—I chose to perform research on how natural compounds affect the proliferation of cancer cells, specifically prostate cancer cells.

…

So Konrad Kording, a scientist at Northwestern, and I started trying to build models to discover the structure and patterns in this connectomic state. We have a paper that was just sent out for review on exactly this idea of how—given this high-throughput, ambiguous, noisy, sometimes error-filled data—you actually extract out scientific meaning.
The analogy here to bioinformatics is really strong. It used to be that a biologist was a biologist. And then we had the rise of genomics as a field, and now you have computational genomics as a field. The entire field of bioinformatics is actually a field where people who are biologists just sit at a computer. They don’t actually touch a wet lab. It became a real independent field partially because of this transition toward the availability of high-quality, high-throughput data. I think neuroscience is going to see a similar transition.

…

In combination with Nordstrom.com, nordstromrack.com, and brick-and-mortar stores, these channels result in a rich data ecosystem that the Data Lab uses to inform business decisions and enhance the customer experience.
Shellman’s data science career began with an internship at the National Institutes of Health in the Division of Computational Biosciences. It was here that she initially learned and applied machine learning to uncover patterns in genomic evolution. Following her internship, she completed a Master of Science degree in biostatistics and a doctoral degree in bioinformatics, both from the University of Michigan in Ann Arbor. While at the University of Michigan, Shellman collaborated frequently and analyzed many types of heterogeneous biological data, including gene expression microarrays, metabolomics, network graphs, and clinical time series.
A frequent speaker and teacher, Shellman has presented at conferences such as Strata and the Big Data Congress II, and also speaks regularly at meet-ups and gatherings in the Seattle technology community.

This dual property (regulated flow) is central to Protocol’s analysis of the Internet as a political technology.
Isomorphic Biopolitics
As a final comment, it is worthwhile to note that the concept of “protocol” is related to a biopolitical production, a production of the possibility for experience in control societies. It is in this sense that Protocol is doubly materialist—in the sense of networked bodies inscribed by informatics, and in the sense of this bio-informatic network producing the conditions of experience.
The biopolitical dimension of protocol is one of the parts of this book that opens onto future challenges. As the biological and life sciences become more and more integrated with computer and networking technology, the familiar line between the body and technology, between biologies and machines, begins to undergo a set of transformations. “Populations” defined nationally or ethnically are also defined informatically.

…

(Witness the growing business of population genomics.) Individual subjects are not only civil subjects, but also medical subjects for a medicine increasingly influenced by genetic science. The ongoing research and clinical trials in gene therapy, regenerative medicine, and genetic diagnostics reiterate the notion of the biomedical subject as being in some way amenable to a database. In addition to this bio-informatic encapsulation of individual and collective bodies, the transactions and economies between bodies are also being affected. Research into stem cells has ushered in a new era of molecular bodies that not only are self-generating like a reservoir (a new type of tissue banking), but that also create a tissue economy of potential biologies (lab-grown tissues and organs). Such biotechnologies often seem more science fiction than science, and indeed health care systems are far from fully integrating such emerging research into routine medical practice.

…

If layering is dependent upon portability, then portability is in turn enabled by the existence of ontology standards. These are some of the sites that Protocol opens up concerning the possible relations between information and biological networks. While the concept of biopolitics is often used at its most general level, Protocol asks us to respecify biopolitics in the age of biotechnology and bioinformatics. Thus one site of future engagement is in the zones where info-tech and bio-tech intersect. The “wet” biological body has not simply been superseded by “dry” computer code, just as the wet body no longer accounts for the virtual body. Biotechnologies of all sorts demonstrate this to us—in vivo tissue engineering, ethnic genome projects, gene-finding software, unregulated genetically modified foods, portable DNA diagnostics kits, and distributed proteomic computing.

SU’s mission is practical: “to assemble, educate and inspire leaders who strive to understand and facilitate the development of exponentially advancing technologies in order to address humanity’s grand challenges.”20 The academic tracks are geared toward understanding how fast-moving technologies can work together, and more than half of them have a direct impact on the field of longevity research. These tracks include AI and robotics; nanotechnology, networks, and computing systems; biotechnology and bioinformatics; medicine and neuroscience; and futures studies and forecasting.21 SU is a place where mavens speak to those who are superfocused on changing the world for the better. It is no surprise, then, that it also functions as an institutional “connector”—the third component needed to successfully spread a game-changing meme.
CONNECT ME
Peter Diamandis always seems to be on the phone or leaving a meeting to get on a phone call.

…

Craig Venter, and the Human Genome Project, an international public consortium backed with around $3 billion U.S. tax dollars.54 Both President Bill Clinton and Prime Minister of Britain Tony Blair presided over the press conference announcing that humanity now possessed “the genetic blueprint for human beings.”55 President Clinton proudly told the world that the capacity to sequence human genomes “will revolutionize the diagnosis, prevention and treatment of most, if not all, human diseases.”56 This new ability to look at the “source code” of humans particularly resonated with computer experts in Silicon Valley and around the world who spend much of their time designing code for computers. If the source code of humans can be identified, then it is not that much of a leap to think about re-engineering it.
Suddenly, biology became a field that computer geeks could attempt to tackle, which not only resulted in smart biohackers forming do-it-yourself biology clubs, but also increased the pace of advances in biology. Bioinformatics is moving at the speed of Moore’s Law and sometimes faster. To the extent that wealthy technology moguls influence public opinion and hackers seem cool, the context for the longevity meme is sizzling hot.
In a Wired magazine interview in April 2010, Bill Gates, America’s richest man, told reporter Steven Levy that if he were a teenager today, “he’d be hacking biology.”57 Gates elaborated, saying, “Creating artificial life with DNA synthesis, that’s sort of the equivalent of machine-language programming.”

…

Policy makers, activists, journalists, educators, investors, philanthropists, analysts, entrepreneurs, and a whole host of others need to come together to fight for their lives. We now know that aging is plastic and that humanity’s time horizons are not set in stone. Larry Ellison, Bill Gates, Peter Thiel, Jeff Bezos, Larry Page, Sergey Brin, and Paul Allen have all recognized the wealth of opportunity in the bioinformatics revolution, but this is not enough. Other heroes must come forward—perhaps there is even one reading this sentence right now.
The goal is more healthy time, which, as we have seen throughout this book, will lead to greater wealth and prospects for happiness. A longer health span means more time to enjoy the wonders of life, including relationships with family and friends, career building, knowledge seeking, adventure, and exploration.

Beyond genomics lies the burgeoning field of proteomics, which seeks to understand how genes code for proteins and how the proteins themselves fold into the exquisitely complex shapes required by cells.2 And beyond proteomics there lies the unbelievably complex task of understanding how these molecules develop into tissues, organs, and complete human beings.
The Human Genome Project would not have been possible without parallel advances in the information technology required to record, catalog, search, and analyze the billions of bases making up human DNA. The merger of biology and information technology has led to the emergence of a new field, known as bioinformatics.3 What will be possible in the future will depend heavily on the ability of computers to interpret the mind-boggling amounts of data generated by genomics and proteomics and to build reliable models of phenomena such as protein folding.
The simple identification of genes in the genome does not mean that anyone knows what it is they do. A great deal of progress has been made in the past two decades in finding the genes connected to cystic fibrosis, sickle-cell anemia, Huntington’s chorea, Tay-Sachs disease, and the like.

They believed that the then-unprecedented amount of molecular information available for a wide range of model organisms would yield vivid new insights into intracellular molecular processes that could, if simulated in a computer, enable them to predict the dynamic behavior of living cells. Within a computer it would be possible to explore the functions of proteins, protein–protein interactions, protein–DNA interactions, regulation of gene expression, and other features of cellular metabolism. In other words, a virtual cell could provide a new perspective on both the software and hardware of life.
In the spring of 1996 Tomita and his students at the Laboratory for Bioinformatics at Keio started investigating the molecular biology of Mycoplasma genitalium (which we had sequenced in 1995) and by the end of that year had established the E-Cell Project. The Japanese team had constructed a model of a hypothetical cell with only 127 genes, which were sufficient for transcription, translation, and energy production. Most of the genes that they used were taken from Mycoplasma genitalium.

…

Currently Novartis and other vaccine companies rely on the World Health Organization to identify and distribute the seed viruses. To speed up the process we are using a method called “reverse vaccinology,” which was first applied to the development of a meningococcal vaccine by Rino Rappuoli, now at Novartis. The basic idea is that the entire pathogenic genome of an influenza virus can be screened using bioinformatic approaches to identify and analyze its genes. Next, particular genes are selected for attributes that would make good vaccine targets, such as outer-membrane proteins. Those proteins then undergo normal testing for immune responses.
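A highly simplified sketch of that screening step is shown below; the record fields, example entries, and the filtering rule are hypothetical placeholders rather than the actual criteria used in reverse-vaccinology pipelines.

```python
# Hypothetical sketch: screen predicted proteins from a pathogen genome and
# keep those whose predicted attributes make them plausible vaccine targets.
proteins = [
    {"id": "surface_1", "outer_membrane": True,  "signal_peptide": True,  "host_homolog": False},
    {"id": "polymerase", "outer_membrane": False, "signal_peptide": False, "host_homolog": False},
    {"id": "surface_2", "outer_membrane": True,  "signal_peptide": True,  "host_homolog": True},
]

def is_candidate(protein):
    # Surface-exposed, secreted, and not resembling a host protein.
    return (protein["outer_membrane"]
            and protein["signal_peptide"]
            and not protein["host_homolog"])

candidates = [p["id"] for p in proteins if is_candidate(p)]
print(candidates)  # ['surface_1'] -- these would then go on to immunological testing
```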
My team has sequenced genes representing the diversity of influenza viruses that have been encountered since 2005. We have sequenced the complete genomes of a large collection of human influenza isolates, as well as a select number of avian and other non-human influenza strains relevant to the evolution of viruses with pandemic potential, and made the information publicly available.

The main contribution of this work is to present the notion of bipolarity that captures the level of conflict between the contributors to a page. Thus the work is more directed at the problem of Wikipedia vandalism than the issue of authoritativeness that is the subject of this paper.
3 Extracting and Comparing Network Motif Profiles
The idea of characterizing networks in terms of network motif profiles is well established and has had a considerable impact in bioinformatics [10]. Our objective is to characterize Wikipedia pages in terms of network motif profiles and then examine whether or not different pages have characteristic network motif profiles. The datasets we considered were entries in the English language Wikipedia on famous sociologists and footballers in the English Premiership (see Table 1). The first step in the analysis is to identify a set of network motifs to use.
3.1 Wikipedia Network Motifs
Our Wikipedia network motifs comprise author and page nodes and author-page (AP) and page-page (PP) edges (see Figures 3 and 4).
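A minimal sketch of what computing such a profile might look like is given below; the two motifs counted here are illustrative stand-ins, not the specific motifs defined in Figures 3 and 4, and the edge data is made up.

```python
# Illustrative motif profile over author-page (AP) and page-page (PP) edges.
from itertools import combinations

ap_edges = {("alice", "P1"), ("bob", "P1"), ("alice", "P2"), ("carol", "P2")}
pp_edges = {("P1", "P2")}

def motif_profile(page):
    authors = {a for a, p in ap_edges if p == page}
    linked = {q for p, q in pp_edges if p == page} | {p for p, q in pp_edges if q == page}
    # Motif 1 (illustrative): two distinct authors both edit this page.
    co_author_pairs = sum(1 for _ in combinations(authors, 2))
    # Motif 2 (illustrative): an author of this page also edits a linked page.
    author_spans_link = sum(1 for a in authors for q in linked if (a, q) in ap_edges)
    return {"co_author_pairs": co_author_pairs, "author_spans_link": author_spans_link}

print(motif_profile("P1"))  # {'co_author_pairs': 1, 'author_spans_link': 1}
```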

I imagine in the near future, many people will have the same strange feeling I did, holding the blueprint of their bodies in their hands and reading the intimate secrets, including dangerous diseases, lurking in the genome and the ancient migration patterns of their ancestors.
But for scientists, this is opening an entirely new branch of science, called bioinformatics, or using computers to rapidly scan and analyze the genome of thousands of organisms. For example, by inserting the genomes of several hundred individuals suffering from a certain disease into a computer, one might be able to calculate the precise location of the damaged DNA. In fact, some of the world’s most powerful computers are involved in bioinformatics, analyzing millions of genes found in plants and animals for certain key genes.
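A toy version of that comparison, on made-up sequences, might look like the following: find positions where every affected individual differs from the reference while the unaffected individuals do not.

```python
# Toy illustration with hypothetical sequences: locate positions where the
# affected group consistently differs from the reference and the unaffected
# group does not.
reference  = "ACGTACGTAC"
affected   = ["ACGTTCGTAC", "ACGTTCGTAC", "ACGTTCGTAC"]
unaffected = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT"]

def candidate_positions(ref, cases, controls):
    hits = []
    for i, base in enumerate(ref):
        every_case_differs = all(seq[i] != base for seq in cases)
        controls_match     = all(seq[i] == base for seq in controls)
        if every_case_differs and controls_match:
            hits.append(i)
    return hits

print(candidate_positions(reference, affected, unaffected))  # [4]
```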
This could even revolutionize TV detective shows like CSI. Given tiny scraps of DNA (found in hair follicles, saliva, or bloodstains), one might be able to determine not just the person’s hair color, eye color, ethnicity, height, and medical history, but perhaps also his face.

He wanted to talk to everyone implicated in this: Yann Pierzinski—meaning Marta too, which would be hard, terrible in fact, but Marta had moved to Atlanta with Yann and they lived together there, so there would be no avoiding her. And then Francesca Taolini, who had arranged for Yann’s hire by a company she consulted for, in the same way Frank had hoped to. Did she suspect that Frank had been after Yann? Did she know how powerful Yann’s algorithm might be?
He googled her. Turned out, among many interesting things, that she was helping to chair a conference at MIT coming soon, on bioinformatics and the environment. Just the kind of event Frank might attend. NSF even had a group going already, he saw, to talk about the new federal institutes.
Meet with her first, then go to Atlanta to meet with Yann—would that make his stock in the virtual market rise, triggering more intense surveillance? An unpleasant thought; he grimaced.
He couldn’t evade most of this surveillance. He had to continue to behave as if it wasn’t happening.

…

What the hell was that, after all? And how would you measure it?
So at work Anna spent her time trying to concentrate, over a persistent underlying turmoil of worry about her younger son. Work was absorbing, as always, and there was more to do than there was time to do it in, as always. And so it provided its partial refuge.
But it was harder to dive in, harder to stay under the surface in the deep sea of bioinformatics. Even the content of the work reminded her, on some subliminal level, that health was a state of dynamic balance almost inconceivably complex, a matter of juggling a thousand balls while unicycling on a tightrope over the abyss—in a gale—at night—such that any life was an astonishing miracle, brief and tenuous. But enough of that kind of thinking! Bear down on the fact, on the moment and the problem of the moment!

…

Take a problem, break it down into parts (analyze), quantify whatever parts you could, see if what you learned suggested anything about causes and effects; then see if this suggested anything about long-term plans, and tangible things to do. She did not believe in revolution of any kind, and only trusted the mass application of the scientific method to get any real-world results. “One step at a time,” she would say to her team in bioinformatics, or Nick’s math group at school, or the National Science Board; and she hoped that as long as chaos did not erupt worldwide, one step at a time would eventually get them to some tolerable state.
Of course there were all the hysterical operatics of “history” to distract people from this method and its incremental successes. The wars and politicians, the police state regimes and terrorist insurgencies, the gross injustices and cruelties, the unnecessarily ongoing plagues and famines—in short, all the mass violence and rank intimidation that characterized most of what filled the history books; all that was real enough, indeed all too real, undeniable—and yet it was not the whole story.

Indeed, the state of California, which has the largest prenatal screening program in the world, with more than four hundred thousand expectant mothers assessed annually, already provides these tests to all pregnant women who have increased risk.26
Of course, we could also sequence the fetus’s entire genome instead of just doing the simpler screens. While that is not a commercially available test, and there are substantial bioinformatic challenges that lie ahead before it could be scalable, the anticipatory bioethical issues that this engenders are considerable.27 We are a long way off from determining what would constitute acceptable genomic criteria for early termination of pregnancy, since this not only relies on accurately determining a key genomic variant linked to a serious illness, but also on understanding whether this condition would actually manifest.

…

Now it is possible to use sequencing to unravel the molecular diagnosis of an unknown condition, and the chances for success are enhanced when there is DNA from the mother and father, or other relatives, to use for anchoring and comparative sequencing analysis. At several centers around the country, the success rate for making the diagnosis ranges between 25 percent and 50 percent. It requires considerable genome bioinformatic expertise, for a trio of individuals will generate around 750 billion data points (six billion letters per sequence, three people, each done forty times to assure accuracy). Of course, just making the diagnosis is not the same as coming up with an effective treatment or a cure. But there have been some striking anecdotal examples of children whose lives were saved or had dramatic improvement.

…

The most far-reaching component of the molecular stethoscope appears to be cell-free RNA, which can potentially be used to monitor any organ of the body.82 Previously that was unthinkable in a healthy person. How could one possibly conceive of doing a brain or liver biopsy in someone as part of a normal checkup? Using high-throughput sequencing of cell-free RNA in the blood, and sophisticated bioinformatic methods to analyze this data, Stephen Quake and his colleagues at Stanford were able to show it is possible to follow the gene expression from each of the body’s organs from a simple blood sample. And that is changing all the time in each of us. This is an ideal case for deep learning to determine what these dynamic genomic signatures mean, to determine what can be done to change the natural history of a disease in the making, and to develop the path for prevention.

At the same time, the computer metaphor frames our way of thinking, and how we communicate the fundamental ideas of our time. We speak of the brain as the ‘hardware’ and of the mind as the ‘software’. This dualistic software–hardware paradigm is applied across many fields, including life itself. Cells are the ‘computers’ that run a ‘program’ called the genetic code, or genome. The ‘code’ is written on the DNA. Cutting-edge research in biology does not take place in vitro in a wet lab, but in silico in a computer. Bioinformatics – the accumulation, tagging, storing, manipulation and mining of digital biological data – is the present, and future, of biology research.
The computer metaphor for life is reinforced by its apparently successful application to real problems. Many disruptive new technologies in molecular biology – for instance ‘DNA printing’ – function on the basis of digital information. This is how they do it: DNA is a molecule formed by two sets of base pairs: adenine-thymine (A-T) and guanine-cytosine (G-C).

…

Thanks to digital data and ever-accelerating computer power we are at the cusp of an era in which we can gain unprecedented insights into natural phenomena, the human body, markets, Earth’s climate, ecosystems, energy grids, and just about everything in between. Norbert Wiener’s cybernetic dream is slowly becoming a reality: the more information we have about systems, the more control we can exercise over them with the help of our computers. Big data are our newfound economic bounty.
The big data economy
In 2010, I took a contract as External Relations Officer at the European Bioinformatics Institute (EBI) at Hinxton, Cambridge. The Institute is part of the intergovernmental European Molecular Biology Laboratory, and its core mission is to provide an infrastructure for the storage and manipulation of biological data. This is the data that researchers in the life sciences produce every day, including information about the genes of humans and of other species, chemical molecules that might provide the basis for new therapies, proteins, and also about research findings in general.

…

At the time that I worked for them, EBI’s challenge was to increase the capacity of its infrastructure in order to accommodate this ‘data deluge’. As someone who facilitated communications between the Institute and potential government funders across Europe, I had first-hand experience of the importance that governments placed on biological data. Almost everyone understood the potential for driving innovation through this data, and was ready to support the expansion of Europe’s bioinformatics infrastructure, even as Europe was going through the Great Recession. The message was simple and clear: whoever owned the data owned the future.
Governments and scientists are not the only ones to have jumped on the bandwagon of big data. The advent of social media and Google Search has transformed the marketing operations of almost every business in the world, big and small. Tools have been developed to ‘mine’ the text written by billions of people on Facebook and Twitter, in order to measure sentiment and target consumers with, hopefully, the right products.

p 106: Mapping the brain is far too large a subject for me to give a comprehensive list of references. An overview of work on the Allen Brain Atlas may be found in Jonah Lehrer’s excellent article [120]. Most of the facts I relate are from that article. The paper announcing the atlas of gene expression in the mouse brain is [121]. Overviews of some of the progress and challenges in mapping the human connectome may be found in [119] and [125].
p 108: Bioinformatics and cheminformatics are now well-established fields, with a significant literature, and I won’t attempt to single out any particular reference for special mention. Astroinformatics has emerged more recently. See especially [24] for a manifesto on the need for astroinformatics.
p 113: A report on the 2005 Playchess.com freestyle chess tournament may be found at [37], with follow-up commentary on the winners at [39].

Final remarks
The origins of the maximum segment sum problem go back to about 1975, and its history is described in one of Bentley’s (1987) programming pearls. For a derivation using invariant assertions, see Gries (1990); for an algebraic approach, see Bird (1989). The problem refuses to go away, and variations are still an active topic for algorithm designers because of potential applications in data-mining and bioinformatics; see Mu (2008) for recent results.
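For readers who want to see the problem itself rather than its history, here is a standard linear-time solution (Kadane's algorithm) in imperative style; it is not the calculational derivation that the pearl develops.

```python
# Maximum segment sum: the largest sum of a contiguous (possibly empty) segment.
def max_segment_sum(xs):
    best = ending_here = 0                      # the empty segment gives a lower bound of 0
    for x in xs:
        ending_here = max(0, ending_here + x)   # best sum of a segment ending at x
        best = max(best, ending_here)
    return best

print(max_segment_sum([-1, 3, -2, 4, -5, 2]))   # 5, from the segment [3, -2, 4]
```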
The interest in the non-segment problem is what it tells us about any maximum marking problem in which the marking criterion can be formulated as a regular expression. For instance, it is immediate that there is an O(nk) algorithm for computing the maximum at-least-length-k segment problem because F*TⁿF* (n ≥ k) can be recognised by a k-state automaton.

…

In particular, the function sorttails that returns the unique permutation that sorts the tails of a list can be obtained from the final program for ranktails simply by replacing resort · concat · label in the first line of ranktails by concat. The function sorttails is needed as a preliminary step in the Burrows–Wheeler algorithm for data compression, a problem we will take up in the following pearl. The problem of sorting the suffixes of a string has been treated extensively in the literature because it has other applications in string matching and bioinformatics; a good source is Gusfield (1997).
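A naive rendering of sorttails—adequate only as an illustration of the specification, not of the efficient algorithm the pearl derives—simply sorts the indices of the tails by the tails themselves:

```python
# Naive suffix (tail) sorting: return the permutation that sorts the tails.
def sorttails(s):
    tails = [s[i:] for i in range(len(s))]
    return sorted(range(len(s)), key=lambda i: tails[i])

print(sorttails("banana"))  # [5, 3, 1, 0, 4, 2]: 'a' < 'ana' < 'anana' < 'banana' < 'na' < 'nana'
```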
This pearl was rewritten a number of times. Initially we started out with the idea of computing perm, a permutation that sorts a list. But perm is too specific in the way it treats duplicates: there is more than one permutation that sorts a list containing duplicate elements. One cannot get very far with perm unless one generalises to either rank or partition.

As more of the process of drug discovery of potential leads can be done by modeling and computational analysis, more can be organized for peer production. The relevant model here is open bioinformatics. Bioinformatics generally is the practice of pursuing solutions to biological questions using mathematics and information technology. Open bioinformatics is a movement within bioinformatics aimed at developing the tools in an open-source model, and in providing access to the tools and the outputs on a free and open basis. Projects like these include the Ensembl Genome Browser, operated by the European Bioinformatics Institute and the Sanger Centre, and the National Center for Biotechnology Information (NCBI), both of which use computer databases to provide access to data and to run various searches on combinations, patterns, and so forth, in the data.

Amy Brown (editorial): Amy has a bachelor's degree in Mathematics from the University of Waterloo, and worked in the software industry for ten years. She now writes and edits books, sometimes about software. She lives in Toronto and has two children and a very old cat.
C. Titus Brown (Continuous Integration): Titus has worked in evolutionary modeling, physical meteorology, developmental biology, genomics, and bioinformatics. He is now an Assistant Professor at Michigan State University, where he has expanded his interests into several new areas, including reproducibility and maintainability of scientific software. He is also a member of the Python Software Foundation, and blogs at http://ivory.idyll.org.
Roy Bryant (Snowflock): In 20 years as a software architect and CTO, Roy designed systems including Electronics Workbench (now National Instruments' Multisim) and the Linkwalker Data Pipeline, which won Microsoft's worldwide Winning Customer Award for High-Performance Computing in 2006.

…

He has since contributed to almost all areas of Asterisk development, from project management to core architectural design and development. He blogs at http://www.russellbryant.net.
Rosangela Canino-Koning (Continuous Integration): After 13 years of slogging in the software industry trenches, Rosangela returned to university to pursue a Ph.D. in Computer Science and Evolutionary Biology at Michigan State University. In her copious spare time, she likes to read, hike, travel, and hack on open source bioinformatics software. She blogs at http://www.voidptr.net.
Francesco Cesarini (Riak): Francesco Cesarini has used Erlang on a daily basis since 1995, having worked in various turnkey projects at Ericsson, including the OTP R1 release. He is the founder of Erlang Solutions and co-author of O'Reilly's Erlang Programming. He currently works as Technical Director at Erlang Solutions, but still finds the time to teach graduates and undergraduates alike at Oxford University in the UK and the IT University of Gothenburg in Sweden.

…

After graduate studies in distributed systems at Carnegie-Mellon University, he worked on compilers (Tartan Labs), printing and imaging systems (Adobe Systems), electronic commerce (Adobe Systems, Impresse), and storage area network management (SanNavigator, McDATA). Returning to distributed systems and HDFS, Rob found many familiar problems, but all of the numbers had two or three more zeros.
James Crook (Audacity): James is a contract software developer based in Dublin, Ireland. Currently he is working on tools for electronics design, though in a previous life he developed bioinformatics software. He has many audacious plans for Audacity, and he hopes some, at least, will see the light of day.
Chris Davis (Graphite): Chris is a software consultant and Google engineer who has been designing and building scalable monitoring and automation tools for over 12 years. Chris originally wrote Graphite in 2006 and has led the open source project ever since. When he's not writing code he enjoys cooking, making music, and doing research.

The largest is CRAN (Comprehensive R Archive Network; http://cran.r-project.org). CRAN is hosted by the R Foundation (the same organization that is developing R) and contains 3,646 packages as of this writing. CRAN is also mirrored in many sites worldwide.
Another public repository is Bioconductor (http://www.bioconductor.org), an open source project that provides tools for bioinformatics and is primarily R-based. While the packages in Bioconductor are focused on bioinformatics, it doesn’t mean that they can’t be used for other domains. As of this writing, there are 516 packages in Bioconductor.
Finally, there is R-Forge (http://r-forge.r-project.org), a collaborative software development application for R. It is based on FusionForge, a fork from GForge (on which RubyForge was based), which in turn was forked from the original software that was used to build SourceForge.

The most important food allergen families will be discussed in this chapter.
Food allergen protein families
Based on their shared amino acid sequences and conserved three-dimensional structures, proteins can be classified into families using various bioinformatics tools, which form the basis of several protein family databases, one of which is Pfam [8]. Over the past 10 years or so there has been an explosion in the numbers of well-characterized allergens, which have been sequenced and are being collected into a number of databases to facilitate bioinformatic analysis [9]. We have undertaken this analysis for both plant [1] and animal food allergens [10], along with pollen allergens [2]. They show similar distributions, with the majority of allergens in each group falling into just 3–12 families, with a tail of between 14 and 23 families comprising between 1 and 3 allergens each.

…

For example, the Codex Alimentarius (www.codexalimentarius.net/web/index_en.jsp) recommended a percentage identity score of at least 35% matched amino acid residues over at least 80 residues as the lowest identity criterion for proteins derived from biotechnology that could suggest IgE cross-reactivity with a known allergen. However, Aalberse [72] has noted that proteins sharing less than 50% identity across the full length of the protein sequence are unlikely to be cross-reactive, and immunological cross-reactivity may not occur unless the proteins share at least 70% identity. Recent published work has led to the harmonization of the methods used for bioinformatic searches and a better understanding of the data generated [73,74] from such studies.
An additional bioinformatics approach can be taken by searching for 100% identity matches along short sequences contained in the query sequence as they are compared to sequences in a database. These regions of short amino acid sequence homologies are intended to represent the smallest sequence that could function as an IgE-binding epitope [75]. If any exact matches between a known allergen and a transgenic sequence were found using this strategy, it could represent the most conservative approach to predicting the potential for a peptide fragment to act as an allergen.
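A minimal sketch of that short-sequence screening strategy follows; the eight-residue window and the stored allergen fragment are illustrative choices, not values taken from this chapter.

```python
# Illustrative short-sequence screen: slide a short window along the query
# protein and flag any window that occurs verbatim in a known allergen.
WINDOW = 8  # window length is a parameter; eight residues is an illustrative choice

def exact_short_matches(query, allergen_db):
    hits = []
    for i in range(len(query) - WINDOW + 1):
        peptide = query[i:i + WINDOW]
        for name, seq in allergen_db.items():
            if peptide in seq:
                hits.append((i, peptide, name))
    return hits

# Hypothetical allergen fragment used purely for demonstration.
allergens = {"allergen X (hypothetical fragment)": "MAKLTILVALALFLLAAHASARQQWEL"}
print(exact_short_matches("XXASARQQWELXX", allergens))  # two overlapping eight-residue hits
```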

Once any domain, discipline, technology or industry becomes information-enabled and powered by information flows, its price/performance begins doubling approximately annually.
Third, once that doubling pattern starts, it doesn’t stop. We use current computers to design faster computers, which then build faster computers, and so on.
Finally, several key technologies today are now information-enabled and following the same trajectory. Those technologies include artificial intelligence (AI), robotics, biotech and bioinformatics, medicine, neuroscience, data science, 3D printing, nanotechnology and even aspects of energy.
Never in human history have we seen so many technologies moving at such a pace. And now that we are information-enabling everything around us, the effects of Kurzweil’s Law of Accelerating Returns are sure to be profound.
What’s more, as these technologies intersect (e.g., using deep-learning AI algorithms to analyze cancer trials), the pace of innovation accelerates even further.

…

Of the 155 teams competing, three were awarded a total of $100,000 in prize money. What was particularly interesting was the fact that none of the winners had prior experience with natural language processing (NLP). Nonetheless, they beat the experts, many of them with decades of experience in NLP under their belts.
This can’t help but impact the current status quo. Raymond McCauley, Biotechnology & Bioinformatics Chair at Singularity University, has noticed that “When people want a biotech job in Silicon Valley, they hide their PhDs to avoid being seen as a narrow specialist.”
So, if experts are suspect, where should we turn instead? As we’ve already noted, everything is measurable. And the newest profession making those measurements is the data scientist. Andrew McAfee calls this new breed of data experts “geeks.”

They analyzed central banks and politicians and figured out the direction of currencies. In an era of relatively stable currencies, the modern-day investor has to dig, early and often and everywhere. I’d still rather dig than get whacked by a runaway yen-carry trade.
Another cycle is coming. The drivers of it are still unclear. Likely suspects are things like wireless data, on-command computing, nanotechnology, bioinformatics, genomic sorting—who the hell knows what it will be.
But this is what I do. Looking for the next barrier, the next piece of technology, the next waterfall and the next great, long-term investment. Sounds quaint.
I’ve come a long way from tripping across Homer Simpson dolls trying to raise money in Hong Kong. Or getting sweated on by desperate Koreans. Or driving around all day with Fred. Or getting thrown out of deals.

A clique in an undirected graph G is a subset S of vertices such that the graph has an edge between every pair of vertices in S. The size of a clique is the number of vertices it contains.
As you might imagine, cliques play a role in social network theory. Modeling each individual as a vertex and relationships between individuals as undirected edges, a clique represents a group of individuals all of whom have relationships with each other. Cliques also have applications in bioinformatics, engineering, and chemistry.
The clique problem takes two inputs, a graph G and a positive integer k, and asks whether G has a clique of size k. For example, the graph on the next page has a clique of size 4, shown with heavily shaded vertices, and no other clique of size 4 or greater.
Verifying a certificate is easy. The certificate is the k vertices claimed to form a clique, and we just have to check that each of the k vertices has an edge to the other k − 1.
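That verification step can be written down directly; the small graph below is a hypothetical example, not the figure referred to in the text.

```python
# Verify a clique certificate: every pair of the claimed vertices must be joined by an edge.
from itertools import combinations

def is_clique(graph, vertices):
    """graph: dict mapping each vertex to the set of its neighbours."""
    return all(v in graph[u] for u, v in combinations(vertices, 2))

g = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d", "e"},
    "d": {"a", "b", "c"},
    "e": {"c"},
}
print(is_clique(g, ["a", "b", "c", "d"]))  # True: {a, b, c, d} is a clique of size 4
print(is_clique(g, ["a", "b", "e"]))       # False: a and e are not adjacent
```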

…

Vertex cover
A vertex cover in an undirected graph G is a subset S of the vertices such that every edge in G is incident on at least one vertex in S. We say that each vertex in S “covers” its incident edges. The size of a vertex cover is the number of vertices it contains. As in the clique problem, the vertex-cover problem takes as input an undirected graph G and a positive integer m. It asks whether G has a vertex cover of size m. Like the clique problem, the vertex-cover problem has applications in bioinformatics. In another application, you have a building with hallways and cameras that can scan up to 360 degrees located at the intersections of hallways, and you want to know whether m cameras will allow you to see all the hallways. Here, edges model hallways and vertices model intersections. In yet another application, finding vertex covers helps in designing strategies to foil worm attacks on computer networks.
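The corresponding verification for vertex cover is just as short; the hallway layout below is a made-up example.

```python
# Verify a vertex cover: every edge must have at least one endpoint in the cover.
def is_vertex_cover(edges, cover):
    return all(u in cover or v in cover for u, v in edges)

# Hypothetical layout: edges are hallways, vertices are intersections with cameras.
hallways = [("a", "b"), ("b", "c"), ("c", "d"), ("b", "d")]
print(is_vertex_cover(hallways, {"b", "c"}))  # True: every hallway touches b or c
print(is_vertex_cover(hallways, {"a", "d"}))  # False: hallway (b, c) is unwatched
```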

Consider the following examples: medical information is information about medical facts (attributive use), not information that has curative properties; digital information is not information about something digital, but information that is in itself of digital nature (predicative use); and military information can be both information about something military (attributive) and of military nature in itself (predicative). When talking about biological or genetic information, the attributive sense is common and uncontroversial. In bioinformatics, for example, a database may contain medical records and genealogical or genetic data about a whole population. Nobody disagrees about the existence of this kind of biological or genetic information. It is the predicative sense that is more contentious. Are biological or genetic processes or elements intrinsically informational in themselves? If biological or genetic phenomena count as informational predicatively, is this just a matter of modelling—that is, of such phenomena merely lending themselves to being described as informational?

DARPA's Information Processing Technology Office's project in this vein is called LifeLog, http://www.darpa.mil/ipto/Programs/lifelog; see also Noah Shachtman, "A Spy Machine of DARPA's Dreams," Wired News, May 20, 2003, http://www.wired.com/news/business/0,1367,58909,00.html; Gordon Bell's project (for Microsoft) is MyLifeBits, http://research.microsoft.com/research/barc/MediaPresence/MyLifeBits.aspx; for the Long Now Foundation, see http://longnow.org.
44. Bergeron is assistant professor of anesthesiology at Harvard Medical School and the author of such books as Bioinformatics Computing, Biotech Industry: A Global, Economic, and Financing Overview, and The Wireless Web and Healthcare.
45. The Long Now Foundation is developing one possible solution: the Rosetta Disk, which will contain extensive archives of text in languages that may be lost in the far future. They plan to use a unique storage technology based on a two-inch nickel disk that can store up to 350,000 pages per disk, with an estimated life expectancy of 2,000 to 10,000 years.

., which received angel funding from the Y Combinator fund, and he relocated to San Francisco. WebMynd is one of the largest installations of Solr, indexing up to two million HTML documents per day, and making heavy use of Solr's multicore features to enable a partially active index.
Jerome Eteve holds a BSc in physics, maths and computing and an MSc in IT and bioinformatics from the University of Lille (France). After starting his career in the field of bioinformatics, where he worked as a biological data management and analysis consultant, he's now a senior web developer with interests ranging from database-level issues to user experience online. He's passionate about open source technologies, search engines, and web application architecture. He has been working for Careerjet Ltd, a worldwide job search engine, since 2006.

“FishBase is a relational database with fish information to cater to different professionals such as research scientists, fisheries managers, zoologists, and many more. FishBase on the Web contains practically all fish species known to science.”
Search Form URL: http://www.fishbase.org/search.cfm
GeneCards
http://bioinformatics.weizmann.ac.il
“GeneCards is a database of human genes, their products, and their involvement in diseases. It offers concise information about the functions of all human genes that have an approved symbol, as well as selected others [gene listing].”
Search Form URL: http://bioinformatics.weizmann.ac.il/cards/
Integrated Taxonomic Information System (Biological Names)
http://www.itis.usda.gov/plantproj/itis/index.html
“The Integrated Taxonomic Information System (ITIS) is a partnership of U.S., Canadian, and Mexican agencies, other organizations, and taxonomic specialists cooperating on the development of an online, scientifically credible list of biological names focusing on the biota of North America.”

We do not buy the argument that “Since X plays an important role in intelligence,
studying X contributes to the study of intelligence in general”, where X can be replaced
by reasoning, learning, planning, perceiving, acting, etc. On the contrary, we believe
that most current AI research makes little direct contribution to AGI, though this work has value for many other reasons. We have already mentioned
“machine learning” as an example. One of us (Goertzel) has published extensively
about applications of machine learning algorithms to bioinformatics. This is a valid and highly important sort of research, but it doesn’t have much to do with achieving general intelligence.
There is no reason to believe that “intelligence” is simply a toolbox containing mostly unconnected tools. Since the current AI “tools” have been built according to very different theoretical considerations, implementing them as modules in one big system will not necessarily make them work together correctly and efficiently.

…

Unlike most contemporary AI projects, it is specifically
oriented towards artificial general intelligence (AGI), rather than being restricted by
design to one narrow domain or range of cognitive functions. The NAIE integrates aspects of prior AI projects and approaches, including symbolic, neural-network, evolutionary programming and reinforcement learning. The existing codebase is being applied in bioinformatics, NLP and other domains.
To save space, some of the discussion in this paper will assume a basic familiarity
with NAIE structures such as Atoms, Nodes, Links, ImplicationLinks and so forth, all
of which are described in previous references and in other papers in this volume.
1.2. Cognitive Development in Simulated Androids
Jean Piaget, in his classic studies of developmental psychology [8] conceived of child
development as falling into four stages, each roughly identified with an age group: infantile, preoperational, concrete operational, and formal.

Phillips is now the Chief Scientist of the Alogus Research Corporation, which conducts research in the physical sciences and provides technology assessment for investors.
I am grateful to the users of my gnuplot web pages for their interest, questions, and suggestions over the years, and to my family for their patience and support.
About the Reviewers
Andreas Bernauer is a Software Engineer at Active Group in Germany. He graduated from Eberhard Karls Universität Tübingen, Germany, with a degree in Bioinformatics and received a Master of Science degree in Genetics from the University of Connecticut, USA. In 2011, he earned a doctorate in Computer Engineering from Eberhard Karls Universität Tübingen.
Andreas has more than 10 years of professional experience in software engineering. He implemented the server-side scripting engine in the Scheme-based SUnet web server and hosted the Learning Classifier System workshops in Tübingen.

Graphs, on the other hand, use index-free adjacency to ensure that traversing connected data is extremely rapid.
The social network example helps illustrate how different technologies deal with connected data, but is it a valid use case? Do we really need to find such remote “friends”? But substitute social networks for any other domain, and you’ll see we experience similar performance, modeling, and maintenance benefits. Whether music or data center management, bio-informatics or football statistics, network sensors or time-series of trades, graphs provide powerful insight into our data. Let’s look, then, at another contemporary application of graphs: recommending products based on a user’s purchase history and the histories of their friends, neighbours, and other people like them. With this example, we’ll bring together several independent facets of a user’s lifestyle to make accurate and profitable recommendations.
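As a rough illustration of what index-free adjacency means in practice, the toy sketch below contrasts looking relationships up in a global index with following references stored on the records themselves. It is a simplification under my own assumptions, not how any particular graph database lays out its records.

```python
# Toy contrast (illustrative assumptions, not a real storage engine):
# an index lookup per hop versus following adjacency stored on each node.

# Relational-style: every hop consults a global friendship index.
friend_index = {("alice", "bob"), ("bob", "carol"), ("carol", "dave")}

def friends_via_index(person, everyone):
    return [p for p in everyone if (person, p) in friend_index]

print(friends_via_index("alice", ["bob", "carol", "dave"]))  # ['bob']

# Graph-style: each node keeps direct references to its neighbours,
# so one traversal step is a pointer chase, independent of total graph size.
class Node:
    def __init__(self, name):
        self.name = name
        self.friends = []  # direct references to neighbouring nodes

alice, bob, carol = Node("alice"), Node("bob"), Node("carol")
alice.friends.append(bob)
bob.friends.append(carol)

# Two-hop traversal ("friends of friends") without consulting any index:
print([f.name for friend in alice.friends for f in friend.friends])  # ['carol']
```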

He does it selectively, but one speaking engagement in 2010 focused his interest and steered his career in a new direction. He had agreed to give a talk in Seattle at a conference hosted by Sage Bionetworks, a nonprofit organization dedicated to accelerating the sharing of data for biological research. Hammerbacher knew the two medical researchers who had founded the nonprofit, Stephen Friend and Eric Schadt. He had talked to them about how they might use big-data software to cope with the data explosion in bioinformatics and genomics. But the preparation for the speech forced him to really think about biology and technology, reading up and talking to people.
The more Hammerbacher looked into it, the more intriguing the subject looked. Biological research, he says, could go the way of finance with its closed, proprietary systems and data being hoarded rather than shared. Or, he says, it could “go the way of the Web”—that is, toward openness.

Netflix created
the Netflix Prize for the data science team that could optimize the company’s movie recommendations for customers and, as I noted in chapter 2, is now using big data to help in the creation of proprietary content.
The testing firm Kaplan uses its big data to begin advising customers
on effective learning and test-preparation strategies. Novartis focuses
on big data—the health-care industry calls it informatics—to develop
new drugs. Its CEO, Joe Jimenez, commented in an interview, “If you
think about the amounts of data that are now available, bioinformatics capability is becoming very important, as is the ability to mine that
data and really understand, for example, the specific mutations that are
leading to certain types of cancers.”7 These companies’ big data efforts
are directly focused on products, services, and customers.
This has important implications, of course, for the organizational
locus of big data and the processes and pace of new product development.

Wikis provide a shared space for group learning, discussion, and collaboration, while a Facebook-like social networking application helps connect researchers working on similar problems.
Meanwhile, over at the European Bioinformatics Institute, scientists are using Web services to revolutionize the way they extract and interpret data from different sources, and to create entirely new data services. Imagine, for example, you wanted to find out everything there is to know about a species, from its taxonomy and genetic sequence to its geographical distribution. Now imagine you had the power to weave together all the latest data on that species from all of the world’s biological databases with just one click. It’s not far-fetched. That power is here, today. Projects like these have inspired researchers in many fields to emulate the changes that are already sweeping disciplines such as bioinformatics and high-energy physics. Having said that, there will be some difficult adjustments and issues such as privacy and national security to confront along the way.

If we were to take 140 bytes per message, as used by Twitter, it would total more than 17 TB every month. Even before the transition to HBase, the existing system had to handle more than 25 TB a month.[12]
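As a rough sanity check of the figure above, the snippet below works out how many messages per month 17 TB implies at 140 bytes each; the numbers are back-of-envelope only.

```python
# Back-of-envelope check: at 140 bytes per message, ~17 TB per month implies
# on the order of 10^11 messages per month (roughly 4 billion per day).
bytes_per_message = 140
terabyte = 10**12                       # decimal TB, good enough for an estimate
messages_per_month = 17 * terabyte / bytes_per_message
print(f"{messages_per_month:.2e} messages per month")
print(f"{messages_per_month / 30:.2e} messages per day")
```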
In addition, less web-oriented companies from across all major industries are collecting an ever-increasing amount of data. For example:
Financial
Such as data generated by stock tickers
Bioinformatics
Such as the Global Biodiversity Information Facility (http://www.gbif.org/)
Smart grid
Such as the OpenPDC (http://openpdc.codeplex.com/) project
Sales
Such as the data generated by point-of-sale (POS) or stock/inventory systems
Genomics
Such as the Crossbow (http://bowtie-bio.sourceforge.net/crossbow/index.shtml) project
Cellular services, military, environmental
Which all collect a tremendous amount of data as well
Storing petabytes of data efficiently so that updates and retrieval are still performed well is no easy feat.

The invention of an algorithmic biology
Seth Bullock
BIOLOGY and computing might not seem the most comfortable of bedfellows. It is easy to imagine nature and technology clashing as the green-welly brigade rub up awkwardly against the back-room boffins. But collaboration between the two fields has exploded in recent years, driven primarily by massive investment in the emerging field of bioinformatics charged with mapping the human genome. New algorithms and computational infrastructures have enabled research groups to collaborate effectively on a worldwide scale in building huge, exponentially growing genomic databases, to ‘mine’ these mountains of data for useful information, and to construct and manipulate innovative computational models of the genes and proteins that have been identified.

? (question mark), Wildcards
calling functions and, Wildcards
character classes, Wildcards
expanding, Wildcards
misuse, Wildcards
pattern rules and, Rules
~ (tilde), Wildcards
Windows filesystem, Cygwin and, Filesystem
wordlist function, String Functions
words function, String Functions
X
XML, Ant, XML Preprocessing
build files, Ant
preprocessing book makefile, XML Preprocessing
About the Author
Robert Mecklenburg began using Unix as a student in 1977 and has been programming professionally for 23 years. His make experience started in 1982 at NASA with Unix version 7. Robert received his Ph.D. in Computer Science from the University of Utah in 1991. Since then, he has worked in many fields ranging from mechanical CAD to bioinformatics, and he brings his extensive experience in C++, Java, and Lisp to bear on the problems of project management with make.
Colophon
Our look is the result of reader comments, our own experimentation, and feedback from distribution channels. Distinctive covers complement our distinctive approach to technical topics, breathing personality and life into potentially dry subjects.
The animal on the cover of Managing Projects with GNU Make, Third Edition is a potto, a member of the loris family.

However, the NCI Thesaurus is not “just” a thesaurus; it uses OWL, is description-logic based, and organizes its concept hierarchy into trees. The terms were stored in an ISO/IEC 11179 registry, and the registry metadata was mapped to UML structures from the Class Diagram. The solution includes three main layers:
✦ Layer 1: Enterprise Vocabulary Services: DL (description logics) and
ontology, thesaurus
✦ Layer 2: CADSR: Metadata Registry, consisting of Common Data
Elements
✦ Layer 3: Cancer Bioinformatics Objects, using UML Domain Models
The NCI Thesaurus contains over 48,000 concepts. Although its emphasis is
on machine understandability, NCI has managed to translate description logic
somewhat into English. Linking concepts together is accomplished through
roles, which are also concepts themselves. Here’s an example:
Concept: Disease: ALK Positive Anaplastic Large Cell Lymphoma
Role: Disease_Has_Molecular_Abnormality
Concept: Molecular Abnormality: Rearrangement of 2p23 (Warzel, 2006, p.18)
NCI’s toolkit is called caCORE, and it includes objects that developers can use
in their applications.
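To make the role example concrete, here is a small sketch (illustrative only, not the caCORE or Enterprise Vocabulary Services API) that treats a thesaurus entry as a subject-role-object triple, with the role itself being a named concept.

```python
# Illustrative sketch only -- not the caCORE API. A role links two concepts
# and is itself a named concept, so the example above can be modelled as a
# (subject concept, role, object concept) triple.

triples = [
    ("Disease: ALK Positive Anaplastic Large Cell Lymphoma",
     "Disease_Has_Molecular_Abnormality",
     "Molecular Abnormality: Rearrangement of 2p23"),
]

def roles_for(concept, triples):
    """All (role, target concept) pairs asserted for a given concept."""
    return [(role, obj) for subj, role, obj in triples if subj == concept]

print(roles_for("Disease: ALK Positive Anaplastic Large Cell Lymphoma", triples))
```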

When I asked if he had known David, he told me that O'Connor had been intent on shutting down David's enterprise. With their deep pockets, he said, "they had guys spending all their time running diff RMSs files and the O'Connor code" (Diff is one of the great suite of UNIX tools that make a programmer's life easier. It compares two different files of text and finds any common strings of words in them, a simpler version of current bio-informatics programs that search for common strings of DNA in the mouse and human genome.) I have no idea whether there were in fact commonalities, but even independent people coding the same well-known algorithm might end up writing vaguely similar chunks of code.
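For readers who have never used it, the snippet below gives a feel for the kind of comparison diff performs, using Python's difflib to pull out runs shared by two near-identical lines of code. It is only an analogy for the line-oriented UNIX tool and for the sequence-alignment programs mentioned above, which use more specialised algorithms.

```python
# A feel for what a diff-style comparison does: find the runs two sequences
# share. Uses Python's difflib; the real UNIX diff works line by line.
from difflib import SequenceMatcher

code_a = "for i in range(n): total += price[i] * qty[i]"
code_b = "for j in range(n): total += price[j] * qty[j]"

matcher = SequenceMatcher(None, code_a, code_b)
for block in matcher.get_matching_blocks():
    if block.size:                       # skip the zero-length sentinel block
        print(repr(code_a[block.a:block.a + block.size]))
```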
O'Connor eventually disappeared, too, absorbed into Swiss Bank, which itself subsequently merged with UBS. Starting in 1990 David disappeared into some alternate nonfinancial New York; none of his old friends saw him anymore.

Yet even more traditional sectors will feel the pull of the pebbles in time, not least because the consumers and workforce of the near future will have grown up using the social web to search for and share ideas with one another. They will bring with them the web’s culture of lateral, semi-structured free association.
This new organisational landscape is taking shape all around us. Scientific research is becoming ever more a question of organising a vast number of pebbles. Young scientists, especially in emerging fields like bioinformatics, draw on hundreds of data banks; use electronic lab notebooks to record and then share their results daily, often through blogs and wikis; work in multi-disciplinary teams threaded around the world and organised by social networks; and publish their results, including open source versions of the software used in their experiments and their raw data, in open access online journals. Schools and universities are boulders that are increasingly dealing with students who want to be in the pebble business, drawing information from a variety of sources, sharing with their peers, and learning from one another.

(Although, of course, there is potential to miss the true culprit if it lies outside the exome.) When geneticists began exome sequencing in earnest, they encountered an unexpected complication. It turns out that each human individual carries a surprisingly high number of potentially deleterious mutations, typically more than one hundred. These are mutations that alter or disturb protein sequences in a way that is predicted to have a damaging effect on protein function, based on bioinformatic (computer-based) analyses. Each mutation might be extremely rare in the population, or even unique to the person or family in which it is found. How do we sift out the true causal mutations, the ones that are functionally implicated in the disorder or trait we are studying, against a broader background of irrelevant genomic change? Sometimes we can rely on a lucky convergence of findings, for example, where distinct mutations in the same gene pop up in multiple different affected families or cases.
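The "convergence" heuristic in the last sentence can be sketched as a simple filter-and-count: keep only rare, predicted-damaging variants, then flag genes hit independently in several unrelated affected families. The variant records and thresholds below are invented purely for illustration.

```python
# Illustrative sketch of the convergence heuristic: count, per gene, how many
# unrelated affected families carry a rare, predicted-damaging variant.
# All records and cutoffs here are invented.
from collections import defaultdict

variants = [
    # (family_id, gene, population_frequency, predicted_damaging)
    ("fam1", "GENE_A", 0.0001, True),
    ("fam2", "GENE_A", 0.0000, True),
    ("fam3", "GENE_A", 0.0002, True),
    ("fam1", "GENE_B", 0.0400, True),    # too common to be the rare cause
    ("fam2", "GENE_C", 0.0001, False),   # not predicted to damage the protein
]

families_hit = defaultdict(set)
for family, gene, freq, damaging in variants:
    if damaging and freq < 0.001:        # keep rare, damaging variants only
        families_hit[gene].add(family)

# Genes hit independently in two or more families are the strongest candidates.
candidates = {g: fams for g, fams in families_hit.items() if len(fams) >= 2}
print(candidates)                        # {'GENE_A': {'fam1', 'fam2', 'fam3'}}
```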

The Chinese had the best chance of sequencing the virus; the threat of SARS was most significant in Asia, and especially in China, which had most of the world’s confirmed cases, and China is home to brilliant biologists, with significant expertise in distributed computing. Despite these resources and incentives, however, the solution didn’t come from China.
On April 12, Genome Sciences Centre (GSC), a small Canadian lab specializing in the genetics of pathogens, published the genetic sequence of SARS. On the way, they had participated in not just one open network, but several. Almost the entire computational installation of GSC is open source; bioinformatics tools with names like BLAST, Phrap, Phred, and Consed, all running on Linux. GSC checked their work against Genbank, a public database of genetic sequences. They published their findings on their own site (run, naturally, using open source tools) and published the finished sequence to Genbank, for everyone to see. The story is shot through with involvement in various participatory networks.

While Seltzer makes the case that virtually every bit of our personal information is now available to those who want it, I do think there are parts of our lives that remain private and that we must fight to keep private. And I think the best way to do that is by focusing on defining rules for data retention and proper use. Most of our health information remains private, and the need for privacy will grow with the rise of genomics. John Quackenbush, a professor of computational biology and bioinformatics at Harvard, explained that “as soon as you touch genomic data, that information is fundamentally identifiable. I can erase your address and Social Security number and every other identifier, but I can’t anonymize your genome without wiping out the information that I need to analyze.”
The danger of genomic information being widely available is difficult to overstate. All of the most intimate details of who and what we are genetically could be used by governments or corporations for reasons going beyond trying to develop precision medicines.

Mimer is also taking active part in the standardization of SQL as a member of the ISO SQL-standardization committee ISO/IEC JTC1/SC32, WorkGroup 3, Database Languages. You can download free development versions of Mimer SQL from http://www.mimer.com.
Troels Arvin lives with his wife and son in Copenhagen, Denmark. He went half-way through medical school before realizing that computer science was the thing to do. He has since worked in the web, bioinformatics, and telecommunications businesses. Troels is keen on database technology and maintains a slowly growing web page on how databases implement the SQL standard: http://troels.arvin.dk/db/rdbms.
Acknowledgments
We would like to thank our editor, Brian Jepson, for his hard work and exceptional skill; his ability to separate the wheat from the chaff was invaluable. We are grateful to Alan Beaulieu, author of Learning SQL and Mastering Oracle SQL (both from O'Reilly), for his time, energy, and technical insight.

Back in the Cold Spring Harbor bar in 2000, a young British geneticist called Ewan Birney was pratting around, but inadvertently doing something quite profound at the same time. Nowadays it has become a tiresome cliché to say that a person’s passion or quintessential characteristic is ‘in their DNA’. The satirical magazine Private Eye has a whole column dedicated to this phrase flopping out of journalists’ and celebrities’ mouths. Well, Ewan Birney is a man with DNA in his DNA. These days he heads the European Bioinformatics Institute in Hinxton, just outside Cambridge, one of the great global genome powerhouses. While our contemporaries went off to Koh Samui or Goa to find themselves on their year off before going up to university, Ewan had won a place in the lab of James Watson, at Cold Spring Harbor, just at the birth of genomics, the biological science that would come to dominate all others.
Maybe it was his familiarity with that bar – or maybe it was just the beer – that led him to do something quite silly, fairly trivial, but something in fact that is one of the great comments on the nature of science.

Measured by growth, it was Google’s best year, with revenues soaring 60 percent to $16.6 billion, with international revenues contributing nearly half the total, and with profits climbing to $4.2 billion. Google ended the year with 16,805 full-time employees, offices in twenty countries, and the search engine available in 117 languages. And the year had been a personally happy one for Page and Brin. Page married Lucy Southworth, a former model who earned her Ph.D. in bioinformatics in January 2009 from Stanford; they married seven months after Brin wed Anne Wojcicki.
But Sheryl Sandberg was worried. She had held a ranking job in the Clinton administration before joining Google in 2001, where she supervised all online sales for AdWords and AdSense and was regularly hailed by Fortune magazine as one of the fifty most powerful female executives in America. Sandberg came to believe Google’s vice was the flip side of its virtue.

Lastly I want to thank all the adopters of Solr and Lucene! Without you, I wouldn't have this wonderful open source project to be so incredibly proud to be a part of! I look forward to meeting more of you at the next LuceneRevolution or Euro Lucene conference.
About the Reviewers
Jerome Eteve holds an MSc in IT and Sciences from the University of Lille (France). After starting his career in the field of bioinformatics, where he worked as a Biological Data Management and Analysis Consultant, he's now a Senior Application Developer with interests ranging from architecture to delivering a great user experience online. He's passionate about open source technologies, search engines, and web application architecture.
He now works for WCN Plc, a leading provider of recruitment software solutions.
He worked on Packt's Enterprise Solr, published in 2009.

Hosted at Los Alamos by Christopher Langton, then a postdoctoral researcher at the laboratory, the conference brought together 160 biologists, physicists, anthropologists, and computer scientists. Like the scientists
and technicians of the Rad Lab and Los Alamos in World War II, the contributors to the first Artificial Life Conference quickly established an intellectual trading zone. Specialists in robotics presented papers on questions of
cultural evolution; computer scientists used new algorithms to model seemingly biological patterns of growth; bioinformatics specialists applied what
they believed to be principles of natural ecologies to the development of
social structures. For these scientists, as formerly for members of the Rad
Lab and the cold war research institutes that followed it, systems theory
served as a contact language and computers served as key supports for a systems orientation toward interdisciplinary work. Furthermore, computers
granted participants in the workshop a familiar God’s-eye point of view.

Ron Howard, who had become interested in Bayes while at Harvard, was working on Bayesian networks in Stanford’s economic engineering department. A medical student, David E. Heckerman, became interested too and for his Ph.D. dissertation wrote a program to help pathologists diagnose lymph node diseases. Computerized diagnostics had been tried but abandoned decades earlier. Heckerman’s Ph.D. in bioinformatics concerned medicine, but his software won a prestigious national award in 1990 from the Association for Computing Machinery, the professional organization for computing. Two years later, Heckerman went to Microsoft to work on Bayesian networks.
The Food and Drug Administration (FDA) allows the manufacturers of medical devices to use Bayes in their final applications for FDA approval. Devices include almost any medical item that is not a drug or biological product, items such as latex gloves, intraocular lenses, breast implants, thermometers, home AIDS kits, and artificial hips and hearts.

Dougherty: Oftentimes, inventors who are prosecuting their application pro se are unaware that they may ask the examiner for assistance in drafting allowable claims if there is allowable subject matter in the written disclosure. The examiner’s function is to allow valid patents. So, they will help the inventor come to an allowable subject matter if it exists in the application.
Stern: Which technologies or fields exhibit high-growth trends in terms of patents?
Calvert: One area that is going to be big is bioinformatics, which is biology and computer software working together.
Dougherty: Medical device art is a high-growth area, too. People are living longer and they’re seeking to reduce costs for an enhanced life. Devices are getting smaller. Nanotechnology is already enabling medical devices, for example, that can travel through your bloodstream, collecting and reporting medical data in real time.
Calvert: Another area that’s booming is electronic games and betting devices in the gambling industry.

For example, in the ISR (intelligence, surveillance and reconnaissance) domain, we produce sensors that generate the bits, transfer those bits through networks, wireless or wired, convert the bits into data, into knowledge, and into decisions through the processing, exploitation, and dissemination chain. With a teammate we developed a brand-new type of biological sensor that we called “TIGER” (Threat ID through Genetic Evaluation of Risk). That technology won The Wall Street Journal “gold” Technology Innovation Award in 2009 for the best invention of the year. It relies on a combination of advanced biotech hardware with groundbreaking bio-informatics techniques that were based on our radar signal processing expertise. Information from a sensor like that can feed into our epidemiology and disease tracking work. That's an example of a sensor at the front end through information flow at the back end. In the cyber security domain, our subsidiary, CloudShield, has a very special piece of hardware that enables real-time, deep packet inspection of network traffic at network line speeds, and that allows you to find cyber threats embedded in the traffic.

For a project to be listed here, first of all I had to be aware of it. Then, the project had to be
■ Free and open source
■ Available for the Linux platform
■ Active and mature
■ Available as a standalone product and allowing interactive use (this requirement eliminates libraries and graphics command languages)
■ Reasonably general purpose (this eliminates specialized tools for molecular modeling, bio-informatics, high-energy physics, and so on)
■ Comparable to or going beyond gnuplot in at least some respects
C.3.1 Math and statistics programming environments
R
The R language and environment (www.r-project.org) are in many ways the de facto standard for statistical computing and graphics using open source tools. R shares with gnuplot an emphasis on iterative work in an interactive environment. It’s extensible, and many user-contributed packages are available from the R website and its mirrors.

They also provide important insight into the concept of causality.28
One advantage of relating learning problems from specific domains to the general problem of Bayesian inference is that new algorithms that make Bayesian inference more efficient will then yield immediate improvements across many different areas. Advances in Monte Carlo approximation techniques, for example, are directly applied in computer vision, robotics, and computational genetics. Another advantage is that it lets researchers from different disciplines more easily pool their findings. Graphical models and Bayesian statistics have become a shared focus of research in many fields, including machine learning, statistical physics, bioinformatics, combinatorial optimization, and communication theory.35 A fair amount of the recent progress in machine learning has resulted from incorporating formal results originally derived in other academic fields. (Machine learning applications have also benefitted enormously from faster computers and greater availability of large data sets.)
* * *
Box 1 An optimal Bayesian agent
An ideal Bayesian agent starts out with a “prior probability distribution,” a function that assigns probabilities to each “possible world” (i.e. to each maximally specific way the world could turn out to be).29 This prior incorporates an inductive bias such that simpler possible worlds are assigned higher probabilities.
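A minimal numeric sketch of this description, under made-up complexity scores and likelihoods: assign each possible world a prior that decays with its complexity, then condition on evidence with Bayes' rule.

```python
# Minimal sketch of Box 1 (complexities and likelihoods are invented):
# a simplicity-weighted prior over "possible worlds", updated by Bayes' rule.

worlds = {"w_simple": 2, "w_medium": 5, "w_complex": 9}   # world -> complexity
prior = {w: 2.0 ** -k for w, k in worlds.items()}
total = sum(prior.values())
prior = {w: p / total for w, p in prior.items()}          # normalise the prior

likelihood = {                                            # P(evidence | world), assumed
    "w_simple": 0.10,
    "w_medium": 0.60,
    "w_complex": 0.70,
}

unnormalised = {w: prior[w] * likelihood[w] for w in worlds}
evidence = sum(unnormalised.values())
posterior = {w: p / evidence for w, p in unnormalised.items()}
print(posterior)   # probability mass shifts toward worlds that explain the evidence
```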

Seated in his office at the company’s Mountain View headquarters, he read a message that warned him an alien attack was under way. Immediately after he read the message, two large men burst into his office and instructed him that it was essential he immediately accompany them to an undisclosed location in Woodside, the elite community populated by Silicon Valley’s technology executives and venture capitalists.
This was Page’s surprise fortieth birthday party, orchestrated by his wife, Lucy Southworth, a Stanford bioinformatics Ph.D. A crowd of 150 people in appropriate alien-themed costumes had gathered, including Google cofounder Sergey Brin, who wore a dress. In the basement of the sprawling mansion where the party was held, a robot arm grabbed small boxes one at a time and gaily tossed the souvenirs to an appreciative crowd. The robot itself consisted of a standard Japanese-made industrial robot arm outfitted with a suction gripper hand driven by a noisy air compressor.

Statistical Language Learning,* by Eugene Charniak (MIT Press, 1996), explains how hidden Markov models work. Statistical Methods for Speech Recognition,* by Fred Jelinek (MIT Press, 1997), describes their application to speech recognition. The story of HMM-style inference in communication is told in “The Viterbi algorithm: A personal history,” by David Forney (unpublished; online at arxiv.org/pdf/cs/0504020v2.pdf). Bioinformatics: The Machine Learning Approach,* by Pierre Baldi and Søren Brunak (2nd ed., MIT Press, 2001), is an introduction to the use of machine learning in biology, including HMMs. “Engineers look to Kalman filtering for guidance,” by Barry Cipra (SIAM News, 1993), is a brief introduction to Kalman filters, their history, and their applications.
Judea Pearl’s pioneering work on Bayesian networks appears in his book Probabilistic Reasoning in Intelligent Systems* (Morgan Kaufmann, 1988).

I stare at him and almost forget to stand up and say the words of the Nicene Creed, which is what comes next.
I believe in God the Father, maker of heaven and earth and of all things seen and unseen. I believe God is important and does not make mistakes. My mother used to joke about God making mistakes, but I do not think if He is God He makes mistakes. So it is not a silly question.
Do I want to be healed? And of what?
The only self I know is this self, the person I am now, the autistic bioinformatics specialist fencer lover of Marjory.
And I believe in his only begotten son, Jesus Christ, who actually in the flesh asked that question of the man by the pool. The man who perhaps—the story does not say—had gone there because people were tired of him being sick and disabled, who perhaps had been content to lie down all day, but he got in the way.
What would Jesus have done if the man had said, “No, I don’t want to be healed; I am quite content as I am”?

He understood that the boy was leaving Thailand in a few days. Yes, he was in Bangkok. He was occupied at the moment, but would come by in a few hours.
Niran hung up the phone, smiled to himself. It would be wonderful to see Thanom again.
35
ROOTS
"I wasn't born Samantha Cataranes. I was born Sarita Catalan. I grew up in southern California, in a little town near San Diego. My parents were Roberto and Anita. They both worked in bioinformatics, had met on the job. I had a sister, Ana." Sorrow welled up from her. Tears began to flow again, silently running down the side of her face. Kade felt troubled, concerned, empathic. He stroked her hair, sent kindness.
"My parents were hippies. The kind of hippies who worked in tech but went camping with the family, had singalongs with friends. There were always a lot of friends around the first few years.

But we have to be willing to try and take advantage of that, but also take advantage of the integration of systems and the fact that data's coming from everywhere. It's no longer encapsulated with the program, the code. We're seeing now, I think, vast amounts of data, which is accessible. And it's numeric data as well as the informational kinds of data, and will be stored all over the globe, especially if you're working in some of the bioinformatics kind of stuff. And we have to be able to create a platform, probably composed of a lot of parts, which is going to enable those things to come together—computational capability that is probably quite different than we have now. And we also need to, sooner or later, address usability and integrity of these systems.
Seibel: Usability from the point of the programmer, or usability for the end users of these systems?

This would involve sampling page view logs (because the total page view data for a popular website is huge), grouping it by time and then finding the number of new users at different time points via a custom reduce script. This is a good example where both SQL and MapReduce are required for solving the end user problem and something that is possible to achieve easily with Hive.
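A custom reduce script of the sort mentioned above might look like the sketch below (for example, invoked through Hive's TRANSFORM clause). It assumes tab-separated rows of (time_bucket, user_id, is_first_visit) arriving on standard input already clustered by time bucket; that column layout is an assumption for illustration, not something specified in the text.

```python
#!/usr/bin/env python
# Sketch of a streaming reduce script: count new users per time bucket.
# Assumes tab-separated (time_bucket, user_id, is_first_visit) rows on stdin,
# already grouped/sorted by time_bucket; the layout is illustrative only.
import sys

current_bucket, new_users = None, 0

def emit(bucket, count):
    if bucket is not None:
        print(f"{bucket}\t{count}")

for line in sys.stdin:
    bucket, user_id, is_first_visit = line.rstrip("\n").split("\t")
    if bucket != current_bucket:
        emit(current_bucket, new_users)      # flush the previous time bucket
        current_bucket, new_users = bucket, 0
    if is_first_visit == "true":
        new_users += 1

emit(current_bucket, new_users)              # flush the final bucket
```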
Data analysis
Hive and Hadoop can be easily used for training and scoring for data analysis applications. These data analysis applications can span multiple domains such as popular websites, bioinformatics companies, and oil exploration companies. A typical example of such an application in the online ad network industry would be the prediction of what features of an ad makes it more likely to be noticed by the user. The training phase typically would involve identifying the response metric and the predictive features. In this case, a good metric to measure the effectiveness of an ad could be its click-through rate.
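As a toy version of the metric-preparation step described above, the sketch below computes click-through rate broken down by a single ad feature from impression records; the field names and records are invented.

```python
# Toy sketch: click-through rate per ad feature from impression logs.
# Field names and records are invented for illustration.
from collections import defaultdict

impressions = [
    # (ad_feature, clicked)
    ("has_image", True), ("has_image", False), ("has_image", True),
    ("text_only", False), ("text_only", False), ("text_only", True),
]

shown, clicked = defaultdict(int), defaultdict(int)
for feature, was_clicked in impressions:
    shown[feature] += 1
    clicked[feature] += int(was_clicked)

for feature in shown:
    print(feature, clicked[feature] / shown[feature])    # CTR per feature
```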

The assertion is that genetic enhancement necessarily implies experimentation without consent and this violates bedrock bioethical principles requiring the protection of human subjects. Consequently, there is an unbridgeable gap which would-be enhancers cannot ethically cross.
This view rests on a rather static picture of what it will be possible for future genetic enhancers to know and test beforehand. Any genetic enhancement techniques will first be extensively tested and perfected in animal models. Second, a vastly expanded bioinformatics enterprise will become crucial to understanding the ramifications of proposed genetic interventions (National Resource Center for Cell Analysis). As scientific understanding improves, the risk versus benefit calculations of various prospective genetic enhancements of embryos will shift. The arc of scientific discovery and technological progress strongly suggests that it will happen in the next few decades.

This also relates to what Heidegger once called our “confrontation with planetary technology” (an encounter that he never managed to actually make and which most Heideggerians manage to endlessly defer, or “differ”).15 That encounter should be motivated by an invested interest in several “planetary technologies” working at various scales of matter, and based on, in many respects, what cheap supercomputing, broadband networking, and isomorphic data management methodologies make possible for research and application. These include—but are by no means limited to—geology (e.g., geochemistry, geophysics, oceanography, glaciology), earth sciences (e.g., focusing on the atmosphere, lithosphere, biosphere, hydrosphere), as well as the various programs of biotechnology (e.g., bioinformatics, synthetic biology, cell therapy), of nanotechnology (e.g., materials, machines, medicines), of economics (e.g., modeling price, output cycles, disincentivized externalities), of neuroscience (e.g., behavioral, cognitive, clinical), and of astronomy (e.g., astrobiology, extragalactic imaging, cosmology). In that all of these are methodologically and even epistemologically informed by computer science (e.g., algorithmic modeling, macrosensors and microsensors, data structure optimization, information theory, data visualization, cryptography, networked collaboration), then all of these planetary technologies are also planetary computational technologies.

.… But now the damn thing is everywhere.”) Like any good meme, it spawned mutations. The “jumping the shark” entry in Wikipedia advised in 2009, “See also: jumping the couch; nuking the fridge.”
Is this science? In his 1983 column, Hofstadter proposed the obvious memetic label for such a discipline: memetics. The study of memes has attracted researchers from fields as far apart as computer science and microbiology. In bioinformatics, chain letters are an object of study. They are memes; they have evolutionary histories. The very purpose of a chain letter is replication; whatever else a chain letter may say, it embodies one message: Copy me. One student of chain-letter evolution, Daniel W. VanArsdale, listed many variants, in chain letters and even earlier texts: “Make seven copies of it exactly as it is written” [1902]; “Copy this in full and send to nine friends” [1923]; “And if any man shall take away from the words of the book of this prophecy, God shall take away his part out of the book of life” [Revelation 22:19].♦ Chain letters flourished with the help of a new nineteenth-century technology: “carbonic paper,” sandwiched between sheets of writing paper in stacks.

MCV faculty also helped undermine public health advocacy: in 1990 James Kilpatrick from biostatistics, working also as a consultant for the Tobacco Institute, wrote to the editor of the New York Times criticizing Stanton Glantz and William Parmley’s demonstration of thirty-five thousand U.S. cardiovascular deaths per annum from exposure to secondhand smoke.49 Glantz by this time was commonly ridiculed by the industry, which even organized skits (to practice courtroom scenarios) in which health advocates were given thinly disguised names: Glantz was “Ata Glance” or “Stanton Glass, professional anti-smoker”; Alan Blum was “Alan Glum” representing “Doctors Ought to Kvetch” or “Doctors Opposed to People Exhaling Smoke” (DOPES); Richard Daynard was “Richard Blowhard” from the “Product Liability Education Alliance,” and so forth.50 VCU continues even today to have close research relationships with Philip Morris, covering topics as diverse as pharmacogenomics, bioinformatics, and behavioral genetics.51
SYMBIOSIS
It would be a mistake to characterize this interpenetration of tobacco and academia as merely a “conflict of interest”; the relationship has been far more symbiotic. We are really talking about a confluence of interests, and sometimes even a virtual identity of interests. The Medical College of Virginia was “sold American” by the early 1940s and remained one of the tobacco industry’s staunchest allies for seven decades.