How would you like to share?

Tim Clark began by laying out recommendations about the information infrastructure required if many groups want to be able to do collective experimentation, to share data, and to exploit automated pattern recognition in that shared data. One example where this is indispensable is data mining to elucidate complex pathways, Clark said.

Clark stated that researchers' frequent inclination to "just do screens and then mine the data" is wrong because, without a world model about the data one measures, that data does not later yield meaning. By world model he meant ample annotation about the system under study. Clark added that even rudimentary data needs a world model as support. Until a few years ago, people thought that having the proper technical architecture to support database management was sufficient, and that using relational databases would enable data mining across all biology. This proved to be wrong.

Blake concurred, saying one needs to work out an information architecture prior to running array experiments. Databases set up to allow data sharing and integration are necessary but not sufficient for effective data mining. She urged development early on of a controlled vocabulary, a hierarchy of terms. This can be a simple classification system or become very sophisticated, as in artificial intelligence methods that enable full-fledged world knowledge representation and allow automated inferencing and other functions.

Moreover, in any complex bioinformatics project one needs to separate two fundamentally different kinds of information: the technical architecture and the information architecture, Clark said. The former is code (e.g., a data management system); it is hard to lay down and requires a lot of programming. The latter consists of concepts guided by world knowledge about the biology under study; it must be lightweight and easily modified as concepts of biology change. For example, Millennium has terabyte systems that do sequence analysis and others that do expression analysis. If tissue type on one says brain and on the other says hypothalamus, the system integrates the data only if it knows the anatomical relationship between the two. That information should go into the information architecture. In AD, a defined "ontology" is necessary if one wants a pathways database to talk with an expression database, for example.
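The brain/hypothalamus example can be made concrete with a small sketch. It assumes a hypothetical "part of" table kept separate from any database code, in the spirit of a lightweight information architecture. The table entries and the function names (`ancestors`, `compatible`, `common_structure`) are illustrative assumptions, not a description of any system discussed at the workshop.

```python
from typing import Dict, List, Optional

# Hypothetical "is part of" relationships. A real project would load a
# curated anatomy vocabulary instead of hard-coding a handful of terms.
PART_OF: Dict[str, str] = {
    "hypothalamus": "diencephalon",
    "diencephalon": "forebrain",
    "hippocampus": "forebrain",
    "forebrain": "brain",
}


def ancestors(term: str) -> List[str]:
    """Return the term followed by every structure that contains it."""
    chain = [term]
    while chain[-1] in PART_OF:
        chain.append(PART_OF[chain[-1]])
    return chain


def compatible(a: str, b: str) -> bool:
    """True if the two labels are equal or one structure contains the other."""
    return a in ancestors(b) or b in ancestors(a)


def common_structure(a: str, b: str) -> Optional[str]:
    """Return the most specific structure containing both terms, or None."""
    b_chain = set(ancestors(b))
    for term in ancestors(a):
        if term in b_chain:
            return term
    return None


if __name__ == "__main__":
    # A record labeled "brain" and one labeled "hypothalamus" can be integrated
    # because the vocabulary knows the hypothalamus is part of the brain.
    print(compatible("brain", "hypothalamus"))
    print(common_structure("hippocampus", "hypothalamus"))
```

Because the relationships live in a plain table rather than in the data management code, they can be revised as anatomical concepts change without reprogramming the underlying systems.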

Clark and Blake urged researchers to ask: Would it make sense to construct a controlled vocabulary specifically geared to their area? If one wants to make interoperable the data from different labs running all kinds of high-throughput experiments, the answer should be yes. If the effort involves individual labs doing their own experiments and storing their results in their own databases, then the answer can be no. Certain types of analysis, however, cannot be done without data sharing and constructing an information architecture.

This triggered disagreement from many who said this approach would create too much prior information, restricting the system to what is already known. Clark and Blake said too little prior information is equally problematic. They clarified that annotation does not mean building assumptions, hypotheses, and bias from the literature into the system. It means developing a controlled nomenclature plus taking accepted knowledge from the literature. Blake said that in mouse genetics, simple things pose big problems. For example, researchers do not publish which strains they used in their experiments, or they do not use standard terms to describe strains. In brain research, naming cell types could cause similar problems.
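The strain-naming problem suggests how a controlled nomenclature might be applied in practice. The sketch below is a minimal illustration under stated assumptions: the two controlled terms are sample entries only, and `normalize_strain` is a hypothetical helper. It maps free-text spelling variants onto a controlled term and returns nothing for labels it does not recognize, so those can be sent to a curator rather than guessed at.

```python
import re
from typing import Dict, List, Optional

# Illustrative controlled terms only; a real list would come from the
# community's official strain nomenclature.
CONTROLLED_TERMS: List[str] = ["C57BL/6J", "BALB/c"]


def _key(label: str) -> str:
    """Collapse case, whitespace, and punctuation so spelling variants match."""
    return re.sub(r"[^a-z0-9]", "", label.lower())


# Lookup from collapsed spelling to the controlled term.
LOOKUP: Dict[str, str] = {_key(term): term for term in CONTROLLED_TERMS}


def normalize_strain(label: str) -> Optional[str]:
    """Return the controlled term for a free-text label, or None if unknown."""
    return LOOKUP.get(_key(label))


if __name__ == "__main__":
    for raw in ["c57bl/6 j", "Balb-C", "black mouse"]:
        print(raw, "->", normalize_strain(raw))
```

Stripping punctuation is a deliberate simplification; in a real nomenclature it could conflate names that differ only in punctuation, so the matching rule itself would need review by someone who knows the vocabulary.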

BLAST searches were cited as an example of a successful common information architecture. A database to share microarray data is being built by industry. Baughman said he would like to see specific recommendations for microarray database sharing coming out of future AD informatics discussions to be incorporated into data-sharing standards that need to be developed for NIH microarray centers currently being built.

DiStefano said that when his group created their own chips, they tried to ascribe function to the microarray spots based on BLAST analysis. Without annotation, the first data analysis iteration took months. With Clark, they then entered everything they knew empirically about what they put on the microarray. Since then they have doubled the arrays to 15,000 points each and still conduct much faster analysis. The resulting clustering and organizing maps are better and have yielded interesting clusters that they would not have been able to conceive from the literature and their primary array data. Forbes Dewey concurred, saying that without a world model in mind one cannot design the right experiments (i.e., put the right things on a chip).

Clark went on to say that the information architecture must be transferable to other systems, such as mouse databases and mass-spectrometry expression databases, and that the annotation must be continually updated to mature over time. Dewey agreed, saying biologists are too afraid of complex systems. He cited the cardiac myocyte as a successful example, saying his group can predict how it will react to extracellular changes in, for example, calcium and confirm the computer model in the lab. This was worked out with 250 coupled equations, but the system originated with 10 equations and grew over time. Another example is computational protein folding at the cell surface, which has matured to now predict what proteins get folded from first-principles physics. He said that the capability to deal with very complex, maturing systems is highly underutilized in biology.

Heywood said that annotation is a management process that is difficult to put in place in academia. Biology analysis at the level discussed here has outgrown the classic academic lab and requires technical staff and a tight management structure. Lo said the same applies to data reading in his system.

Coleman brought the discussion back from information handling to measurement, asking "what should we measure?" The basic cellular pathology of AD is unknown. There is heterogeneity of cell types and of individual cell reactions in the brain. To understand this heterogeneity, Coleman urged that two existing technologies be brought together to look at the brain in AD and in normal aging. One is imaging that gives information at the level of individual cells or even within a cell. This must be developed for live humans with AD. The other is laser capture microdissection, which lets one pick single cells and determine their molecular fingerprint. Coleman said that he wants this kind of information to be part of the data going into the informatics system. Others agreed.

Lansbury suggested an experiment involving cluster analysis of serum samples to identify gene clusters that can later be correlated with postmortem diagnosis of AD versus Lewy body disease versus other dementias or even Parkinson's, in which clinical diagnosis is only 75 percent correct. What annotation would be required? The expected diagnosis? Blake answered that no expectations should be included, just other factual information available on study participants, such as symptoms.

Wang suggested a systematic study of AD versus normal aging, planning now what data to collect and what information infrastructure to put it into, and starting to collect this data even though it is not fully annotated in the beginning. Lansbury asked how best to organize such a study. Should one start small with a group of well-known data that one can heavily annotate and add to, or should one start big because of the huge time savings from putting in all that annotation?

Basilion cautioned that transcriptional profiling misses aspects of AD and PD, where patients accumulate certain proteins while the message is disappearing. True mechanistic understanding is impossible without proteins. Coleman concurred with the example of a class-1 assembly protein, whose message stays stable in AD but protein levels drop dramatically. His group found that normally, this protein is protected from degradation by glycosylation but in AD, glycosylation fails and the protein is degraded.

Balaban agreed, saying a proteomics evaluation of AD samples will lead to therapies faster than mere clustering of transcriptional profiles. Proteomics incorporates gene-gene interactions and environmental influences but requires more starting material than transcriptional profiling, making analysis of individual cells difficult. Proteomics only takes the problem one step further because activity and phosphorylation status are not captured. One needs a whole range of databases that can talk to each other.

Wang reminded the audience that mass spectrometry can help here. One example is PS1 function, where mass spectrometry can identify all components of the complex and their interacting partners. On the question of ApoE4 function, mass spectrometry can identify its interacting partners faster and with fewer false positives than yeast two-hybrid screens. One then pulls up annotation and literature and compares the results with other databases.

These Technological Opportunities Drew Some Consensus

Tissue Blocks
To make better use of human samples, Balaban suggested establishing tissue blocks of clinical trial material and of blood/serum collected in epidemiological studies. Blood or brain samples can be embedded in tissue blocks, frozen, and then sliced when a new hypothesis or biomarker has come up that needs testing. This can be done in an array-based setup to support high-throughput tests or quantitative proteomics (measuring concentrations of many proteins simultaneously). Blocks also can support laser capture microdissection.

Imaging
Balaban suggested looking into imaging early physiological consequences of the pathologic mechanism in AD with high-field-strength PET and fMRI methods currently in use for heart disease in some ERs. These methods have sufficient sensitivity and resolution to perhaps detect an altered metabolic response to memory tasks, or even reduced background activity, as an indicator of early neuronal dysfunction and a possible early diagnostic. Mayeux objected that everyone in the field agrees neurons change prior to clinical symptoms, but how early is unclear. Epidemiology from the Framingham cohort suggests 20 years; a Scottish study suggests 40. One needs to scan people longitudinally. Balaban agreed, saying one would first try to find a neuronal change that imaging can pick up in people with diagnosed early AD (this is doable now in 8-tesla magnets), then image that in presymptomatic people at high risk, then do prospective longitudinal scans of groups.

Balaban said inflammation is an early physiological consequence of the pathologic mechanism that could be useful for imaging, because in inflammation one cell affects a large area. Macrophage recruitment with concomitant change in epithelial cell fenestration (easily imaged with contrast agents in heart disease) does not occur in AD, but other inflammatory-type characteristics in AD must be amenable to imaging.

Furthermore, Balaban suggested diffusion imaging as a promising technology, for example to image how tau inclusions disturb normal diffusion patterns of water inside neurons. This has been demonstrated in stroke and offers resolution of about 4 microns. All agreed that imaging methods are available but early markers are needed.