Comments 0

Document transcript

The University of Texas MD Anderson Cancer CenterFrom the SelectedWorks of Jeffrey S. MorrisMarch 2006An Introduction to High-ThroughputBioinformatics DataContactAuthorStart Your OwnSelectedWorksNotify Meof New WorkAvailable at:http://works.bepress.com/jeffrey_s_morris/151An Introduction to High-ThroughputBioinformatics DataKeith A.Baggerly,Kevin R.Coombes,Jeﬀrey S.Morris,Department of Biostatistics and Applied MathematicsUniversity of Texas M.D.Anderson Cancer Center1515 Holcombe Blvd,Unit 447,Houston,TX 77030Email:{kabagg,kcoombes,jefmorris}@mdanderson.orgAbstractHigh throughput biological assays supply thousands of measurementsper sample,and the sheer amount of related data increases the need forbetter models to enhance inference.Such models,however,are moreeﬀective if they take into account the idiosyncracies associated with thespeciﬁc methods of measurement:where the numbers come from.Weillustrate this point by describing three diﬀerent measurement platforms:microarrays,serial analysis of gene expression (SAGE),and proteomicmass spectrometry.1.1 IntroductionIn our view,high-throughput biological experiments involve three phases:experimental design,measurement and preprocessing,and postprocess-ing.These phases are otherwise known as:deciding what you want tomeasure,getting the right numbers and assembling them in a matrix,and mining the matrix for information.Of these,it is primarily themiddle step that is unique to the particular measurement technologyemployed,and it is there that we shall focus our attention.This is notmeant to imply that the other steps are less important!It is still a truismthat the best analysis may not be able to save you if your experimentaldesign is poor.We simply wish to emphasize that each type of data has its own quirks12 Baggerly,Coombes,and Morrisassociated with the methods of measurement,and understanding thesequirks allows us to craft ever more sophisticated probability models toimprove our analyses.These probability models should ideally also letus exploit information across measurements made in parallel,and acrosssamples.Crafting these models leads to the development of brand-newstatistical methods,many of which are discussed in this volume.In this chapter,we address the importance of measurement-speciﬁcmethodology by discussing several approaches in detail.We cannot beall-inclusive,so we shall focus on three.First,we discuss microarrays,which are perhaps the most common high throughput assay in use today.The common variants of Aﬀymetrix gene chips and spotted cDNA ar-rays are discussed separately.Second,we discuss serial analysis of geneexpression (SAGE).As with microarrays,SAGE makes measurementsat the mRNA level,and thus provides a picture of the expression proﬁleof a set of cells,but the mechanics are diﬀerent and the data may giveus a diﬀerent way of looking at the biology.Third,we discuss the useof mass spectrometry for proﬁling the proteomic complement of a set ofcells.Our goal in this chapter is not to provide detailed analysis methods,but rather to place the numbers we work with in context.1.2 MicroarraysMicroarrays let us measure expression levels for thousands of genes in asingle sample all at once.Such high throughput assays allow us to asknovel biological questions,and require new methods for data analysis.In thinking about the biological context of a microarray,we startwith our underlying genomic structure [4].Your genome consists ofpairs of DNA molecules (chromosomes) held together by complementarynucleotide base pairs (in total,about 3×109base pairs).The structure ofDNA provides an explanation for heredity,by copying individual strandsand maintaining complementarity.All of your cells contain the same genetic information,but your skincells are diﬀerent from liver cells or kidney cells or brain cells.Thesediﬀerences come about because diﬀerent genes are expressed at highlevels in diﬀerent tissues.So,how are genes “expressed”?The “central dogma” of molecular biology asserts that “DNA makesRNA makes protein”.In order to direct actions within the cell,parts ofthe DNAwill uncoil and partially decouple to expose the piece of the sin-gle strand of DNA on which a given gene resides.Within the nucleus,aIntroduction 3complementary copy of the gene sequence (not the entire chromosome) isassembled out of RNA.This process of RNAsynthesis is called transcrip-tion:copying the message.The initial DNA sequence containing a genemay also contain bits of sequence that will not be used – one feature ofgene structure is that genes can have both “coding” regions (exons) and“noncoding” regions (introns).After the initial RNA copy of the gene ismade,processing within the nucleus removes the introns and “splices”the remaining pieces together into the ﬁnal messenger RNA (mRNA)that will be sent out to the rest of the cell.Once the mRNA leavesthe nucleus,the external machinery (ribosomes) will read the code andassemble proteins out of corresponding sequences of amino acids.Thisprocess of assembling proteins from mRNA is called translation:map-ping from one type of sequence (nucleotides) to another (amino acids).The proteins then fold into 3d-conﬁgurations that in large part drivetheir ﬁnal function.If diﬀerent genes are copied into RNA (expressed)in diﬀerent cells,diﬀerent proteins will be produced and diﬀerent typesof cells will emerge.Microarrays measure mRNA expression.In thinking about the informational content of these various stages forunderstanding cellular function,we need to know diﬀerent things.ForDNA,we need to knowsequence.For mRNA,we need both sequence andabundance;many copies can be made of a single gene.Gene expressiontypically refers to the number of mRNA copies of that gene.For protein,we need sequence,abundance,and shape (the 3d-conﬁguration).If we could count the number of mRNA molecules from each gene in asingle cell at a particular time,we could assemble a barchart linking eachgene with its expression level.But how do we make these measurements?As suggested,we exploit complementarity:sequences of DNA or RNAcontaining complementary base pairs have a natural tendency to bindtogether:...AAAAAGCTAGTCGATGCTAG......TTTTTCGATCAGCTACGATC...If we know the mRNA sequence (which we typically do these days,sincewe can look it up in a database),we can build a probe for it using thecomplementary sequence.By printing the probe at a speciﬁc spot on thearray,the probe location tells us the identity of the gene being measured.There are two common variants of microarrays:• Oligonucleotide (oligo) arrays,where short subsequences of the gene4 Baggerly,Coombes,and Morrisare deposited on a silicon wafer using photolithography (primarilyAﬀymetrix).• Full length (entire gene) arrays,where probes are spotted onto a glassslide using a robotic arrayer.These generally involve two samples runat the same time with diﬀerent labels.1.2.1 Aﬀymetrix GeneChipsIn looking at the structure of Aﬀymetrix data,there are several in depthresources [2,3,39] which serve as major sources for what follows,includ-ing the company’s web site,www.affymetrix.com.In general,genes will be hundreds or thousands of bases in length,and the probes are shorter by an order of magnitude.This is drivenin part by the manufacturing process,as the cost of synthesis increaseswith the number of bases deposited.Thus,choosing probes to printrequires ﬁnding sequences that will be unique to the gene of interest(for speciﬁc binding) while still being short enough to be aﬀordable.The ﬁnal length decided on was 25 bases,and all Aﬀymetrix probes arethis length.It is important to note that diﬀerent probes for the samegene have diﬀerent binding aﬃnities,and these aﬃnities are unknowna priori.Thus,it’s diﬃcult to tell whether “gene A beats gene B”,asopposed to “there’s more gene A here than there”.Microarrays onlyproduce relative measurements of gene expression.Given that the aﬃnities are unknown,we can guard against problemswith any speciﬁc probe by using several diﬀerent probes for each gene.The optimal number of probes is not clear.Subsequent generations ofAﬀymetrix chips have used 20 (eg HuGeneFL,aka Hu6800),16 (U95series),and 11 (U133 series) probes.There are some further diﬃcultieswith choosing probes:• some genes are short,so multiple subsequences will overlap.• genes have an orientation,and RNA degradation begins preferentiallyat one end (3’ bias).• the gene may not be what we think it is,as our databases are stillevolving.• probes can “cross-hybridize”,binding the wrong targets.Overlapping,we can live with.Orientation can be addressed by choosingthe probes to be more tightly concentrated at one end.Database evolu-tion we simply can’t do anything about.Cross-hybridization,however,we may be able to address more explicitly.Introduction 5Aﬀymetrix tries to control for cross-hybridization by pairing probesthat should work with probes that shouldn’t.These are known as thePerfect Match (PM) and Mismatch (MM) probes,and constitute “probepairs”.The PM probe is perfectly complementary to the sequence ofinterest.The MM probe is the same as the PM probe for all basesexcept the middle one (position 13),where the PM base is replaced byits Watson-Crick complement.PM:GCTAGTCGATGCTAGCTTACTAGTCMM:GCTAGTCGATGCAAGCTTACTAGTCIdeally,the MM value can be used as a rough assessment of the amountof cross-hybridization associated with a given PM probe.Aﬀymetrix groups probe pairs associated with a given gene into “probe-sets”;a given gene would be represented on a U133A chip by a probesetcontaining 11 probe pairs,or 22 probes with distinct sequences.Theprobes within a probeset are ordered according to the position of thespeciﬁc PMsequence within the gene itself.We have described the idealcase above,but in practice the correspondence between genes and probe-sets is not 1-to-1,so some genes are represented by several probesets.Having printed the probes,we now need to attach the target mRNAinsuch a way that we can measure the amounts bound.When we extractmRNA from a sample of cells,we do not measure this mRNA directly.Rather,we make copies.Copies are produced of the complementarysequence out of RNA (cRNA).Some of the nucleotides used to assem-ble these copies have been modiﬁed to incorporate a small moleculecalled biotin.Biotin has a strong aﬃnity for another molecule calledstreptavadin;their binding aﬃnity is the strongest known noncovalentbiological interaction.After the biotin-labeled cRNA molecules are hy-bridized to the array,they are stained with a conjugate of streptavadinand phycoerythrin;phycoerythrin is one of the brightest available ﬂu-orescent dyes.The ﬁnal complex of printed probe,biotinylated target,and streptavadin-phycoerythrin indirect label is then scanned,produc-ing an image ﬁle.For our purposes,this image constitutes bedrock:Theimage is the data.All Aﬀymetrix GeneChips are scanned in an Aﬀymetrix scanner,andthe initial quantiﬁcation of features is performed using Aﬀymetrix soft-ware.The software involves numerous ﬁles.The ﬁle types are:EXP Contains basic information about the experiment.DAT Contains the raw image.6 Baggerly,Coombes,and MorrisFig.1.1.An Aﬀymetrix image (.DAT) ﬁle.(A) The entire image,4733 pixelson a side,containing 409,600 features.(B) A zoom on the upper left corner ofthe image.Controls are used in a checkerboard pattern to indicate the printregion border,and to designate the chip type.This is a U95Av2 chip;on v2chips the “A” is ﬁlled in.CEL Contains feature quantiﬁcations.CDF Maps between features,probes,probesets,and genes.CHP Contains gene expression levels,as assessed by the Aﬀy software.Most frequently,we start with a DAT ﬁle,derive a CEL ﬁle,and thenmake extensive use of the CEL and CDF ﬁles.We make no further useof the EXP and CHP ﬁles here.To illustrate the procedure,we begin by looking at the contents of aDAT ﬁle from a U95Av2 chip (the raw image),shown in Figure 1.1 A.The array has 409,600 probes (features) arranged in a 640 × 640 grid.There is actually some structure that can be seen by eye,as we cansee if we zoom in on the upper left corner:Figure 1.1 B.The pixelatedfeatures have been combined with positive controls to spell out the chiptype – this helps ensure that the image is correctly oriented.We notethe border lattice of alternating dark and bright QC probes,makingimage alignment and feature detection easier.If we zoom in further on a single PM/MM pair or feature,shown inFigure 1.2 A and B,we can see that features are square.The horizontaland vertical alignment with the edges of the image is pretty good,butfeature boundaries can be rather blurry.Each feature on this chip is approximately 20 microns on a side.Thescanner used for this scan had a resolution of 3 microns/pixel,so thefeature is about 7 pixels on a side (more recent scanners have higherresolution).In general,Aﬀymetrix features are far smaller than theround spots in the images of other types of microarrays.Introduction 7Fig.1.2.Sets of Aﬀymetrix image features.(A) A PM/MM pair.Note thatthe PM pixel readings are higher than the MM readings.(B) A zoom on thePM feature.(C) The PM feature after trimming the outer boundary.Onlythe remaining pixels are used in deriving a summary quantiﬁcation (the 75thpercentile).The DAT ﬁle structure consists of a 512 byte header followed by theraw image data.The image shown above involved a 4733 by 4733 grid ofpixels,so the total ﬁle size is 2∗47332+512 = 44803090 bytes (45M).Thisis big.File size is a nontrivial issue with Aﬀy data;earlier versions of thesoftware could only work with a limited number of chips (say 30).Giventhis size,our ﬁrst processing step is to produce a single quantiﬁcationfor each feature,keeping in mind that the edges are blurry and that thefeatures may not be perfectly uniform in intensity.The CEL ﬁle contains the feature quantiﬁcations,achieved as follows.First,the four corners of the entire feature grid (here 640 × 640) arelocated within the DAT ﬁle,and a bilinear mapping is used to determinethe pixel boundaries for individual features.Given the pixels for a singlefeature,the outermost boundary pixels are trimmed oﬀ,as shown inFigure 1.2 C.Finally,the 75th percentile of the remaining pixel valuesis stored as the feature summary.Trimming is understandable,as thisaccounts for blurred edges in a moderately robust way.Similarly,usinga quantile makes sense,but the choice of the 75th percentile as opposedto the median is arbitrary.When Aﬀymetrix data is posted to the web,CEL ﬁles are far moreoften supplied than DAT ﬁles.Over time,there have been various ver-sions of the CEL format.Through version 3 of the CEL ﬁle format,this was a plain text ﬁle.In version 4,the format changed to binaryto permit more compact storage of the data.Aﬀymetrix provides a freetool to convert between the ﬁle formats.In the plain text version,sections are demarcated by headers in brack-8 Baggerly,Coombes,and Morrisets,as in the example below.The header tells us which DAT ﬁle it camefrom,the feature geometry (e.g.,640 ×640),the pixel locations of thegrid corners in the DAT ﬁle,and the quantiﬁcation algorithmused.Thisis followed by the actual measurements,consisting of the X and Y fea-ture locations (integers from 0 to 639 here),the mean (actually the 75thpercentile) and standard deviation (this,conversely,is the standard de-viation),and the number of pixels in the feature used for quantiﬁcationafter trimming the border.An example CEL ﬁle header is given below.[CEL]Version=3[HEADER]Cols=640Rows=640TotalX=640TotalY=640OffsetX=0OffsetY=0GridCornerUL=219 235GridCornerUR=4484 253GridCornerLR=4469 4518GridCornerLL=205 4501Axis-invertX=0AxisInvertY=0swapXY=0DatHeader=[0..19412] U95Av2_CDDO_12_14_01:CLS=4733RWS=4733 XIN=3 YIN=3 VE=17 2.0 12/14/01 12:23:30HG_U95Av2.1sq 6 Algorithm=PercentileAlgorithmParameters=Percentile:75;CellMargin:2;OutlierHigh:1.500;OutlierLow:1.004[INTENSITY]NumberCells=409600CellHeader=X Y MEAN STDV NPIXELS0 0 133.0 16.6 251 0 8150.0 1301.3 20A version 3 CEL ﬁle reduces the space required to about 12M from45M for a DAT ﬁle,but we could do better.The X and Y ﬁelds are notnecessary,as these can be inferred from position within the CEL ﬁle.Keeping 1 decimal place of accuracy for the “mean” and standard devi-ation doubles the storage space required (moving from a 16-bit integerto a ﬂoat in each case) and supplies only marginally more information.Finally,most people do not use the STDV and NPIXELS ﬁelds.Keep-ing only the mean values and storing them as 16-bit integers,storagecan be reduced to 2 ∗ 6402= 819200 bytes.This type of compression isbecoming more important as the image ﬁles get even bigger.The above description covered Aﬀymetrix version 3.0 ﬁles.In versionIntroduction 94.0,in binary format,each row is stored as a MEAN-STDV-NPIXEL orﬂoat-ﬂoat-short triplet,which cuts space,but not enough.Most recently,Aﬀymetrix has introduced a CCEL (compact CEL) format,which juststores the integer mean values as discussed above.The above problem,going fromthe image to the feature quantiﬁcation,is a major part of the discussion for quantiﬁcation of other types of arraysbecause there,we get only one spot per gene.For Aﬀymetrix data,thecompany’s quantiﬁcation has become the de facto standard.It may notbe perfect,but it is reasonable.The real challenge with Aﬀymetrix datalies in reducing the many measurements of a probeset to a single number.In summarizing a probeset,we ﬁrst need to know where its componentprobes are physically located on the chip.With any set of microarrayexperiments,one of the major challenges is keeping track of how the fea-ture quantiﬁcations map back to information about genes,probes,andprobe sets.The CDF ﬁle speciﬁes what probes are in each probeset,andwhere the probes are.There is one CDF ﬁle for each type of GeneChip.The header is partially informative,as shown in the example below.[CDF]Version=GC3.0[Chip]Name=HG_U95Av2Rows=640Cols=640NumberOfUnits=12625MaxUnit=102119NumQCUnits=13ChipReference=[Unit250_Block1]Name=31457_atBlockNumber=1NumAtoms=16NumCells=32StartPosition=0StopPosition=15CellHeader=X Y PROBE FEAT QUAL EXPOSPOS CBASE PBASE TBASE ATOM INDEXCODONIND CODON REGIONTYPE REGIONCell1=517 568 N control 31457_at 013 A A A 0 364037-1 -1 99Cell2=517 567 N control 31457_at 013 A T A 0 363397-1 -1 99Cell3=78 343 N control 31457_at 113 T A T 1 21959810 Baggerly,Coombes,and MorrisFor this probeset,31457at,there are 16 “atoms” corresponding toprobe pairs (this is the standard number for this vintage chip),and 32“cells” corresponding to individual probes or features.The ﬁrst probepair (index 0),with the PM sequence closest to one end of the gene,islocated on the chip in the 518th column (the X oﬀest is 517) and in the568th and 569th rows.The index values for these probes are (567 * 640)+ 517 = 363397 and 364037.The feature in Cell 2 is the PM probe,as(a) it has a smaller Y index value,and (b) the probe base (PBASE) inthe central base position (POS) 13 is a T,which is complementary tothe corresponding target base (TBASE).The remaining values in a givenrow are less important.The CDF ﬁles do not contain the actual probesequences,but all CDF ﬁles and probe sequences are now downloadablefrom www.affymetrix.com.On early Aﬀymetrix chips,all probes in a probeset were plotted nextto each other.This was soon realized to be imperfect,as any artifacton a chip could corrupt the measurements for an entire gene.On morerecent chips,probes within a probeset are spatially scattered,thoughPM/MM pairs are always together (the PM probe is always closer tothe edge on which the chip id is spelled out).Given quantiﬁcations for individual chips,we turn next to quantifyinga dataset,relating probeset values across chips.Before we quantify individual probesets,however,we need to addressthe problem of normalization:is the image data roughly comparablein intensity across chips?Adding twice as much sample may make theresultant image brighter,but it doesn’t tell us anything new about theunderlying biology.In most microarray experiments,we are comparingsamples of a single tissue type (eg diseased brain to normal brain),andin such cases we assume that “most genes don’t change”.Typically,weenforce this by matching quantiles of the feature intensity distributions.Given that the chips have been normalized,we still need to ﬁnd a wayof summarizing the intensities in a probeset.The PM and MM featuresfor an example probeset are shown in Figure 1.3 A and B.The earliest widely applied method was supplied by Aﬀymetrix inversion 4 of their Microarray Analysis Suite package,and is commonlyreferred to as MAS 4.0 (“Mass 4”) or AvDiﬀ [2].AvDiﬀ works with theset of PM−MM diﬀerences in a probeset one array at a time.Thesediﬀerences are sorted in magnitude,the minimum and maximum valuesare excluded,and the mean and standard deviation of the remainingdiﬀerences are computed.Using this mean and standard deviation,an“acceptance band” for the diﬀerences is deﬁned as ±3 s.d.about theIntroduction 11Fig.1.3.A single probeset from a Hu6800 chip,containing 20 PM/MMpairs.(A) A heatmap of the feature intensities extracted from the CEL ﬁle.(B)Plots of the PM (solid) and MM (dashed) values shown in A.Feature valuesare not uniform across the probeset,and MM values occasionally exceed PM.(C) A plot of the PM−MM diﬀerences,showing the computation of AvDiﬀ.The extreme values (circled) are initially excluded,and the mean and ±3 s.d.bounds (dotted) are imposed.All points within this band are then averagedto produce AvDiﬀ (dashed).mean.All of the diﬀerences falling within this band are then averagedto produce the ﬁnal AvDiﬀ value.This is illustrated in Figure 1.3 C.Inthe case illustrated here,the minimum value was excluded at the ﬁrststep,but fell into the acceptance band and was thus included in the ﬁnalaverage,moving the value down slightly.AvDiﬀ does have some nice features.It combines measurements acrossprobes,trying to exploit redundancy,and it attempts to insert some ro-bustness.However,there are some questionable aspects.AvDiﬀ weightsthe contributions from all probes equally,even though some may notbind well.It works on the PM−MM diﬀerences in an additive fashion,but some of the eﬀects may be multiplicative in nature.It can give neg-ative values,which are hard to interpret.In some cases,where all of thesignal for a probeset is concentrated in a very small number of probes,these may be omitted altogether if they fall outside the band.All ofthese drawbacks,in our view,can be tied to the fact that AvDiﬀ worksone chip at a time,and does not “learn” with the addition of more chips.12 Baggerly,Coombes,and MorrisFig.1.4.Plots of PMand MMintensities for the same probeset on 10 diﬀerentchips.The overall proﬁle shapes are fairly consistent across chips,with changesin gene expression linked to amplitude.Modelling the shapes can improveinferences about expression levels.Learning from multiple chips requires an underlying model with pa-rameters that can be estimated.In 2001,Cheng Li and Wing Wong in-troduced a new method of summarizing probeset intensities as “model-based expression indices”,or MBEI [35,36,59].At the crux of theirargument was a very simple observation – the relative expression valuesof probes within a probeset were very stable across multiple arrays.Looking at the PM and MM proﬁles for the same probeset in 10chips from a single experiment,as shown in Figure 1.4,we can see thatthe overall shape of the proﬁle is fairly consistent.It is the amplitude ofthis proﬁle which changes,and which contains the summary informationabout the level of gene expression.In order to exploit this stability,Li and Wong ﬁt a model for eachprobeset:for sample i,and probe pair j,they posit thatPMij= νj+θiαj+θiφj+,MMij= νj+θiαj+,where νiand θiαjare intended to capture nonspeciﬁc binding,and is Gaussian noise.Focusing on the PM−MM diﬀerences,this modelcondenses to one with two sets of unknowns:θiand φj.The φjtermscorrespond to the individual probe aﬃnities,and give the shape of theproﬁle.The θivalues give the amplitudes.The MBEI approach caught on fairly quickly,in part because the nu-merical approach made sense,but also due to the fact that it was imbed-Introduction 13ded in the freeware “DNA Chip Analyzer” (dChip) package,availableat www.dchip.org.This package has a very friendly user interface,andaddressed many of the most common questions (which genes are diﬀer-ent?how should I cluster them?) in a straightforward fashion.Further,by encoding the contents of CEL and CDF ﬁles in a binary format usinganalogs of the data structures outlined above,the program could handlelots of chips at once,and it could handle them quickly.Using a model has several beneﬁts.By using multiple chips,it cankeep all of the probes;there is no tossing of the most informative ones.By checking the residuals from the model,it is possible to identify out-liers due to artifacts.Using the hypothesized error model,conﬁdencebands for the fold change can be computed.Probe proﬁles can be com-puted in one experiment and used in another.The downside of most models is that they require several chips inorder to estimate the underlying model parameters.It is not a goodidea to trust the ﬁts too much if they are based on just one or two chips;10 or more is better.However,we’re not convinced that it is a bad thingto require a larger minimum number of chips for drawing inferences.The dChip model captures eﬀects that are multiplicative,and inheritsthe other good features of a model.However,the probability model is toosimplistic,as larger intensity probes typically also have larger variances.In the wake of dChip,several other quantiﬁcation methods have beensuggested,with many (but not all) using model-based approaches.Apartial list includes MAS 5.0,RMA,and PDNN.The next algorithmfromAﬀymetrix,MAS 5.0 [3],still produces quan-tiﬁcations one chip at a time,but replaces the MM values with a ratherintricate change threshold (CT) to avoid negative values.The diﬀerencesare then combined using a robust measure:Tukey Biweight(log(PMj−CTj)).The robust multichip analysis (RMA) method of Irizarry et al.[27,28]also uses a model for ﬁtting the data,but the model diﬀers fromdChip’sin some key ways.First,the authors elected to ignore the MM values,contending that any gains in accuracy were more than oﬀset by lossesin precision,in a classic bias-variance tradeoﬀ.Second,since the MMvalues were not on hand,“background” levels were estimated from thedistribution of PM probe intensities and subtracted oﬀ in such a wayas to avoid negative values [13].Third,the model introduced stochasticerrors on the log scale as opposed to the raw intensity scale.The ﬁnal14 Baggerly,Coombes,and Morrismodel is of the formlog(PMij−BG) = µi+αj+ij.The above approaches use the probe intensities,but there is addtionalbiological structure that can be exploited.In particular,Aﬀymetrix nowmakes the actual probe sequences available,though it did not when itﬁrst started selling chips.Using the sequences,it is possible to buildmodels describing the default binding eﬃciencies for individual probes,and to decouple this from binding due to gene abundance.This ap-proach was ﬁrst exploited in the Position-Dependent Nearest Neighbor(PDNN) approach introduced by Zhang et al.[76].The RMA methodhas since been extended to incorporate sequence information in its mod-eling,giving GCRMA [73].Given the proliferation of models,we need some means of decidingwhich ones are “better”.In order to make such assessments,we needto have some data sets for which “truth” can be known a priori,andsome set of deﬁned metrics that measure proximity to truth.The mostwidely used truth-known data set is a Latin Square experiment suppliedby Aﬀymetrix,in which 14 genes were spiked into a common mixtureaccording to a twofold dilution series,which was then cyclicly permutedso that each gene was assessed at each dilution level.In this case,onlythe spiked in genes should be changing in expression,and the amount ofchange is potentially known.In order to quantify truth,Cope et al.[20]introduced a suite of metrics for putting each method through its paceson the canonical datasets.The results for many diﬀerent methods havebeen assembled and posted athttp://affycomp.biostat.jhsph.edu/,and new submissions are welcome.In addition to dChip,there are now several software packages availablefor analyzing Aﬀymetrix data,but the most widely used in the statis-tical community are probably those implemented in R and freely avail-able from Bioconductor.R packages exist for implementing all of theapproaches discussed here,and most methods are suﬃciently modularthat diﬀerent background correction,normalization,and quantiﬁcationmethods can be juggled to suit.The book by Gentleman et al.[25] pro-vides an excellent introduction to this resource.Not all of the methodsavailable are equally fast,however,so for the analysis of large datasetsdChip and “justRMA” or “justGCRMA” in R are the ones that wewould suggest.The models for Aﬀymetrix data are now reasonably good,but dozensIntroduction 15of questions remain.Combining results of Aﬀymetrix experiments acrossdiﬀerent labs and diﬀerent chip types is still diﬃcult,and integratingthese results with those from glass arrays is still harder.Eventual com-bination of results at the RNA level with those from the DNA andprotein levels is tantalizing.1.2.2 Spotted cDNA ArraysWe now shift from Aﬀymetrix oligonucleotide arrays to spotted cDNAarrays.Here,a good set of overview ariticles (from 1999) is available asa special supplement to Nature Genetics,“The Chipping Forecast” [47];see also [61,60].While the biological questions of interest are similar,theprobes used are quite diﬀerent.On most cDNA arrays,the probes usedcorrespond to full-length copies of the gene of interest (sans introns),though there has been recent interest in long-oligo arrays that use probesthat are 60 or 70 bases in length (60-mers or 70-mers).Typically,eachgene will be represented by one probe,not a set.The other majordistinction is that two samples,not one,are typically hybridized toeach array.The samples are prepared using diﬀerent incorporated dyes,mixed,and the mixture is then hybridized to the array.The method of dye incorporation is diﬀerent for spotted arrays thanfor Aﬀymetrix gene chips.On a gene chip (as noted),the ﬂuorescentdye is applied after hybridization has taken place (indirect labeling),but this strategy does not work if multiple samples need to be labeledwith diﬀerent dyes.Rather,when copies of mRNA are made for spottedarrays,they are made of cDNA,and some of the bases used in theassembly of these copies have had molecules of ﬂuorescent dye attached.Thus,the dye is incorporated into the copies before hybridization (directlabeling).These labeled copies are then hybridized to the array,bindingmolecules of dye in speciﬁc positions.The most commonly reported gene summary is the log ratio of twointensity measurements,corresponding to the two dyes with which thetwo types of cells being compared have been respectively tagged.Themost commonly used dyes are Cy5,red,and Cy3,green.Thus,the singlenumber quoted is derived from the two intensity values.The intensityvalues are also derived quantities;they are derived from images.Again,for our purposes these images represent bedrock.Images are our rawdata.These images are scans of slides with lots of dots on them,each dotcorresponding to the location of a DNA probe to which labeled cDNA16 Baggerly,Coombes,and MorrisFig.1.5.Cy3 and Cy5 image scans froma spotted cDNA microarray.(A) Thefull Cy3 image.(B) The full Cy5 image.In both A and B,the patch structure(one per print-tip on the arrayer) is apparent.(C) A zoom on a Cy3 patch.(D) A zoom on the corresponding Cy5 patch.The top half of each patch isreplicated in the bottom half,and this structure is visible.Imperfections inboth the spotting and the image can also be seen,most clearly in the zooms.derived from the cells of interest has been bound.In some early exper-iments from MD Anderson,there were approximately 4800 dots on aslide,arranged in a 4 by 12 grid of patches,with each patch containinga 10 by 10 grid of dots.When the images of the slide were produced,we got 3248 by 1248 arrays of grayscale pixel values.The scans fromone such slide are shown in Figure 1.5 A and B.The patch structureis quite apparent.This structure is linked to the method of depositingthe probes.In printing,a robotic arrayer takes an array of print tips(similar to needles),dips them in wells of the DNA to be printed,movesthe coated print tips over to the slide,taps lightly to transfer probes,and takes the print tips over to a wash solution before repeating theprocess.The arrayer we used had a 4 by 12 array of print tips;eachvisible patch has been applied by a single print tip.Returning to consideration of the images,each pixel is a 16-bit inten-sity measurement,so values range from 0 to 65535.There is no colorinherently associated with these images,which is why we have presentedthem in grayscale;other colormaps are externally applied to enhancecontrast.Each image is about 8M in size,which is large enough tomake manipulation and transmission somewhat unwieldy at times.Asmore genes are spotted on the arrays,and the scanner resolutions areimproved so that smaller objects can be seen,these images will increaseIntroduction 17in size.It should be noted that the 16-bit nature of the images can makethings diﬃcult to work with in ways not having to do with ﬁle size.Someimage viewing software assumes that the values are 8-bit,ranging from0 to 255,and consequently either fails to show the large image or showsit as full white (all values set to 255).The values can be converted to8-bit fairly simply,as 8-bit = ﬂoor(16-bit/256),but we lose gradationinformation.As the dynamic range of these images is quite large,thisloss can be be damaging for the purposes of analysis.To make things more concrete in getting down to the actual spot level,we focus on a single 10 by 10 patch,marked in the bottom left of thelarge images.The corresponding regions from the two image ﬁles areshown in Figure 1.5 C and D.These arrays were printed with replicatespottings of the same genes:within each patch,the top half of the patchis replicated in the bottom half.This replicate structure is visible —the brightest Cy3 spots are in rows 4 and 9 of column 7 of the patch,areplicate pair — giving us some conﬁdence in the assay.Afewother things are immediately apparent.First,the “dots” are notreally “dot-like” in most cases.Rather,there are rings of high intensityabout lower-level centers.This is true across both channels,indicatingthat the ring pattern matches the amount of cDNA on the slide.Themost likely explanation is that surface tension on the drop as it driesmay cause clumping at the edges.In any event,how does morphologyaﬀect our measurements?Second,the dots are not of equal size.Thismay make it diﬃcult for an automatic procedure to ﬁnd the appropriateplacement of a dot-shaped target ring.Third,there is some mottling inthe lower left corner (most visible in the Green channel).How does thisaﬀect our assessment of how intense the dots in that region are?Before considering these questions further,let’s take a closer look ata single spot,highlighted in Figure 1.5 C.An expanded view of thisspot is shown in Figure 1.6.The ring shape is visible,indicating un-even hybridization.Further,the side view shows that readings outsidethe spot are not at zero intensity,indicating the need for some type ofbackground subtraction so that we have moderately good estimates ofwhere zero should be.All of these issues point out the need for good image quantiﬁcationalgorithms for summarizing the spots.Some more detailed descriptionsof algorithms for image segmentation,background estimation,and spotsummaries are given in Yang et al.[74].There are several software pack-ages (mostly commercial) now available for quantifying array images.Given the metrics,however,a more basic question is why two sam-18 Baggerly,Coombes,and MorrisFig.1.6.Zoom on a single Cy3 spot.The ring shape is visible,indicatinguneven hybridization.Further,the side view shows that readings outside thespot are not at zero intensity.ples are used per array as opposed to one.The main reason is to guardagainst artifacts.Some spots are bigger than others,and thus bind morematerial.The slide can be tilted while hybridization is proceeding,re-sulting in more binding at one edge than another.Ideally,such artifactswill aﬀect both channels similarly,and taking ratios will cancel themout.If there are replicate spots printed on the arrays,the importanceof ratios can be checked by plotting the variance of the replicate log in-tensities as a function of the mean,ﬁrst for each individual channel andthen for the ratios.The variability of the ratios is typically less (oftenmuch so).While the use of two samples does protect against some large-scalebiases,it can also introduce new ones.The dyes used have diﬀerentphysical shapes,and thus can have diﬀerent binding eﬃciencies for givengenes.In recognition of this fact,many studies use one of two approachesfor comparing two groups of samples.The ﬁrst approach involves directcomparison of a sample of type A with a sample of type B on the samearray.In this case,“dye swaps” are used so that the A samples arelabeled with Cy3 on some arrays and with Cy5 on others,so that dyebiases can be factored out.The second approach is to use the same dyeto label the samples from both groups of interest,and to contrast thesewith some common reference material labeled with the other dye.Someof the design issues raised by this natural paired blocking strucure arediscussed in [33,63].Even with these balancing features,normalization remains an issue,both within and across arrays.Again,most methods make the simpli-fying assumption that most genes don’t change.Given this assumption,Introduction 19a common means of correction is to plot the diﬀerence in channel logintensities as a function of the average log intensity,and to ﬁt a loesscurve to the dot cloud.These plots were introduced by Bland and Alt-man [12],but are more commonly referred to as “MA” plots in themicroarray context [22].Subtracting the loess curve ideally normalizesexpression values within the array.A further extension of this approachis to apply a separate loess ﬁt for the spots associated with each printtip.This makes stronger assumptions about which groups of genes arenot expected to change,but smooths things more evenly.While we haveseen cases where print-tip loess has produced more stable values (andbetter agreement between replicate spots),in many of these cases weare correcting for spatial trends that are visible on the array images,asopposed to discrepancies that are ascribable to the pins.Print-tip loessworks in part because it is a surrogate for spatial position.Once theindividual arrays have been normalized,quantile normalization can beused to match log ratio values across arrays [75].Given the spot quantiﬁcations,and knowledge of what samples arebound on which arrays,there are freeware tools available for most basicanalyses.Again,the book by Gentleman et al.[25] provides a nice surveyof the suite of R tools available with Bioconductor.One last concern with glass arrays relative to Aﬀymetrix chips is sim-ply that the number of diﬀerent array conﬁgurations and gene spottingpatterns is legion.This means that annotation and gene informationmust be checked carefully keeping the gene to spot mappings clear.Italso means that comparisons across diﬀerent array platforms may yielddiﬀerent measures of the “same” gene if diﬀerent cDNAs are used.1.3 SAGEMicroarrays work by exploiting hybridization to assess amounts of dyeaggregating to speciﬁc probes printed on the arrays.There are,however,some potential downsides to microarrays.First,a microarray is a closedsystem,in the sense that you will only be able to measure an mRNAif you have printed a probe for it.Unexpected transcripts will not beseen.Second,the quantitative nature of the data is somewhat question-able,as dye response is a nonlinear phenomenon.Third,diﬀerences inprotocols or preparations have made comparison of array results acrosslabs diﬃcult.We would like to have some mechanism for more directly counting allof the mRNA transcripts of a given type.Failing that,if we could take20 Baggerly,Coombes,and Morrisa random sample of all of the mRNA transcripts available and countthose,then this would still provide an unbiased and quantitative proﬁleof mRNA expression.This idea of sampling and counting underlies theserial analysis of gene expression (SAGE) technique.Some case studiesare given in [68,69,67,77,51,50,52,64,57,56].As before,we still need to know both sequence (identity) and abun-dance to characterize the expression proﬁle.With microarrays,the un-known sequence of the transcript is inferred from the known sequenceof the printed probe.With SAGE,a part of the transcript itself issequenced.Restricting attention to only a part of the transcript is de-liberate.While sequencing the entire transcript would identify it un-ambiguously,sequencing is time-consuming and costly enough that theexpense would be prohibitive.We want to sequence just enough of thetranscript to identify it,and then move on.The question now becomesone of how to biologically extract an identifying subsequence.An identifying subsequence need not be long.Current estimates ofthe number of genes in the human genome are around 25,000 to 30,000.While alternative splicing of the exons within the gene may allow thesame gene to produce several distinct transcripts,the total number ofdistinct transcripts is unlikely to be more than a few hundred thou-sand.Considering the 4 letter DNA alphabet,there are 410= 1,048,576distinct 10-letter “words”,suggesting that a 10 base pair (bp) subse-quence may be enough for unique identiﬁcation.This rough calculationimplicitly assumes that the 10 bp are in a speciﬁc location;it is consid-erably harder to ﬁnd unique subsequences if these are allowed to occuranywhere within the gene.We are going to ﬁrst specify position,andthen extract sequence.This process is rather intricate.The steps areillustrated in Figure 1.7,and discussed in detail below.We begin by harvesting the mRNA from a biological sample.ThemRNA is single-stranded and has a poly-A tail (Figure 1.7 A).ThemRNA is diﬃcult to work with,as it is prone to degradation,but DNAis more stable.We would thus like to map the mRNA to cDNA.Toget to DNA,we introduce a biotin-labeled dT primer (Figure 1.7 B)and use reverse transcriptase to synthesize more stable double-strandedcomplementary DNA (cDNA;Figure 1.7 C).Like the initial mRNA,there is something special about one end (the biotin label),and we canuse this to “anchor” the cDNAs.We anchor the cDNAs by binding the biotin to streptavidin-coatedbeads.To focus on speciﬁc sites within the sequences,we introduce arestriction enzyme,known in the SAGE context as the “anchoring en-Introduction 21Fig.1.7.Steps in the preparation of a SAGE library.(A) Extract mRNA.(B) Add a biotin-labeled primer.(C) Synthesize cDNA.(D) Cleave with ananchoring enzyme (AE).(E) Discard loose segments.(F) Split cDNA intotwo pools,and introduce a linker for each.(G) Ligate linker to bound cDNAfragments.(H) Cleave the product with a tagging enzyme,and discard thebound parts.In addition to the linker,the piece remaining contains a 10-base“tag” that can be used to identify the initial mRNA.(I) Ligate the fragments,and use PCR starting from the primers attached to the linkers to amplify.(J) Cleave with the AE again,and discard the pieces bound to linker.Theremaining fragments contain pairs of tags,or “ditags”,bracketed by the motifrecognized by the AE.(K) Ligate the ditags and sequence the product.zyme” (AE),which will cut the cDNA whenever a speciﬁc DNA “motif”occurs.We will only measure genes that contain at least one occurrenceof the motif,so we want the motif to be fairly common;this in turnimplies that the motif should be fairly short.Conversely,we don’t wantthe motif to be too short,or it will reduce the number of distinct subse-quences available afterwards.The most commonly used such enzyme isNlaIII,which searches for the motif “CATG”.When this enzyme cleaves22 Baggerly,Coombes,and Morristhe cDNA,it produces an “overhang” (an unmatched single strand) atthe cleavage site (Figure 1.7 D).Cleaving produces a number of sub-strands,most of which are “loose” — unconnected to the streptavidinbead (Figure 1.7 E).These loose fragments are washed away before thenext step.At this point,we have zoomed in on a particular site on eachcDNA:the occurrence of the AE motif closest to the bead (the mRNApoly-A tail).As noted,cleaving typically produces an “overhang”.We can use thisoverhang to bind new “linker sequences” at the end.As it turns out,we’re going to bind two distinct linkers (Figure 1.7 F).The two distinctlinkers will be exploited in a PCR ampliﬁcation step described below.So,we divide the material into two pools,and add the two linkers.Thelinkers are diﬀerent only at one end;at the other they have an overhang(to match the bound sequence) and another short motif,which will guideyet another enzyme.Within each pool,the linker sequences will bind tothe bound cDNAs due to base pairing – and the sequences are ligated(Figure 1.7 G).Next,we introduce a “type IIS” restriction enzyme (called the “tag-ging enzyme”;TE) which looks for the motif we introduced with thelinker sequence.Type IIS restriction endonucleases cleave not at themotif itself,but rather a speciﬁc number of base pairs (say 20) awayfrom it.Unlike the motif for the anchoring enzyme,the motif for thetagging enzyme is asymmetric,so there is a direction for placing the cutsite.This cut is “blunt”,producing no overhang (Figure 1.7 H).At this point,the loose double strands in a pool have,in order:linker,the TE motif,the AE motif,and the 10 bp fromthe cDNA next to the an-choring enzyme motif closest to the poly-A tail.This 10bp subsequenceis the “tag” that we shall use to identify the parent gene.To focus on the tags,we now remove the beaded ends,leaving justthe loose double strands.We then combine the two resultant pools,sothat we have loose strands with two diﬀerent linkers.We then induceligation amongst the strands (Figure 1.7 I).The sequence geometry is nowLinker A – TE AE (motifs) – ditag – AE TE – Linker Bwhere the central region,or “ditag” contains the identifying informationfor two distinct transcripts.Ideally,this ditag is bounded by linker Aon one side,and linker B on the other.However,since the ligation isnot targeted,it is possible to get linker A (or B) on both sides.We now have a pool of DNA,but not necessarily a large amount.Introduction 23Tag Count Tag Count Tag CountCCCATCGTCC 1286 CCTGTAATCC 448 TGATTTCACT 358CCTCCAGCTA 715 TTCATACACC 400 ACCCTTGGCC 344CTAAGACTTC 559 ACATTGGGTG 377 ATTTGAGAAG 320GCCCAGGTCA 519 GTGAAACCCC 359 GTGACCACGG 294CACCTAATTG 469 CCACTGCACT 359......Table 1.1.Part of a SAGE library.Since it’s easier to work with large amounts of DNA,we amplify whatwe have using PCR.PCR requires primers at both ends of the targetampliﬁcation sequence,and we can choose primers to match the twodistinct linkers (this is why we divided things into two pools).Thus,the resultant products will be overwhelmingly of the form shown above,with linker A at one end and linker B at the other.An ampliﬁed ditag with linkers has a fairly well-deﬁned mass,so ﬁl-tering of unwanted ampliﬁcation products can be achieved using a gel.At this point,the information containing part of the data has been com-pressed (the length of the linker is less than the length of the gene,onaverage),but the linkers and enzyme motifs are still extraneous and wewould prefer not to sequence them.Fortunately,if we reintroduce theAE,the linkers and the TE motifs will be cleaved oﬀ from the ditags.To isolate the ditags (Figure 1.7 J),we use another gel to select for theappropriate target mass.After the above selection and cleaving,the information content of ashort piece of DNA (ditag plus overhangs) is quite high,but short readsare ineﬃcient with respect to sequencing.Thus,we ligate the ditagstogether (Figure 1.7 K).We then sequence the concatenated product.Atypical sequencing read involves 500bp,or about 20 ditags and motifs.The AE motif actually provides a useful bit of “punctuation” for qualitycontrol purposes.Within the read,we locate a bracketing pair of motifs and extract theditag (this can be between 20 and 26 bp).The 10 bp closest to the leftend give one tag,and the 10 bp closest to the right end are reversedand complemented to give the other.The tabulated results from a setof reads comprise a SAGE “library”.Part of a typical SAGE library isshown in Table 1.1.Given the data,what questions can we ask?The most common goalis (as with microarray experiments) to ﬁnd genes that show expression24 Baggerly,Coombes,and Morrislevels that vary with phenotype.There are,however,complexities asso-ciated with the methods of measurement.The ﬁrst question is whether we see all the data.There are somesequences that we should not see.If we see the AE motif within a tag,we know that that is an artifact and should be excluded.In many cases,sequences corresponding to mitochondrial DNA will also be excluded.If there are multiple occurrences of a given ditag,typically only one isrecorded,to preclude biases associated with PCR ampliﬁcation.If thereare genes that do not contain an occurrence of the cleavage site,thesewill not be seen.Similarly,if a cleavage site is too close to the poly-Atail,the true identity may be obscured.Conversely,if the RNAis of poorquality,sequence degradation can remove the cleavage site altogether.There are other issues related to whether the tags we do see are “cor-rect”.Mappings of tags to genes are not always unique;the math sug-gesting that 10 bp should be “enough” relies on independence assump-tions that likely do not hold.At present,our genomic information isstill a draft,so annotations are not ﬁxed.At the processing level,thereare sequencing errors.Published rates are about 0.7% per bp,so to aﬁrst approximation 7% of the tags will be so aﬀected [66].This canproduce small “shadow” counts for tags that are “similar” to abundanttags.This renders estimation with rare counts diﬃcult,and somewhatlimits the dynamic range.Interim ﬁxes have been suggested for some of the above problems,butthere is still room for improvement.“Long SAGE” [58],where the tagsare 14 bp or more in length,has been introduced to address the issueof identiﬁcation ambiguity.Many of the issues with identiﬁcation couldalso potentially be resolved by using multiple restriction enzymes to pro-duce “coupled libraries”,but in practical terms this rarely happens (see,however,[72]).Errors in sequencing can be addressed by deconvolution,pulling shadows back to their source,given deﬁnitions of a local neigh-borhood in the sequencing space [18].Alternatively,information aboutthe tag quality could be acquired at the time of sequencing and usedto suggest the most likely ﬁxes;sequences can produce quality “phred”scores associated with each base read [23,11].Once the table of counts has been “ﬁnalized”,there is still the questionof choosing a good test statistic for assessing diﬀerential expression.Many statistics have been proposed,most focusing on comparing onelibrary with another and dealing primarily with the Poisson samplingvariability associated with extracting a count [77,40,5,30,17,34,44,42,55].Some papers have looked at more than two groups [77,52,57,46],Introduction 25and some analogs of ANOVA have been suggested [26,65].However,each library supplies a vector of proportions for an individual.Evenunder ideal conditions,estimates of the true level of a proportion ina group of individuals are subject to two sources of error:binomialvariation associated with the count nature of the data,and variation inproportions between individuals within a group [6,7].Better methodsfor combining these proportions to estimate contrasts are still underdevelopment.At present,SAGE is not as widely used as microarrays,due primarilyto the higher costs of assembling libraries.However,these costs are alsolinked to the costs of sequencing,and the approach may become moreviable as sequencing gets easier.The sequencing and counting approach,however,still has many open questions associated with it.Given esti-mated rates of sequencing errors,what is the realistic dynamic range ofthis approach?Given this dynamic range,how big does a library needto be to catch the measurable changes stably?Given the relative sizes ofthe between and within library variance components,should we assemblemore small libraries or a small number of big ones?Massively parallelsignature sequencing (MPSS) [16] enables the assembly of huge libraries,but the costs are still high.If we compare SAGE and microarray results,how should we measure agreement?There are some software packages available for analyzing SAGE data,and some large repositories of SAGE data.We recommend SAGE Genie[14] as a source of data for further exploration.1.4 Mass SpectrometryMicroarrays and SAGE let us measure the relative abundance levelsof thousands of mRNA transcripts all at once,giving us some pictureof the dynamic activity within the cell.However,much of the action ishappening at the protein level,and we’d really like to have the equivalentof a microarray for proteins as well.Some progress has been made onthis front,but there are several limitations here.• the number of distinct proteins is larger than the number of genes.• many proteins undergo post-translational modiﬁcations (eg,phospho-rylation),and it is the amount of modiﬁcation that can aﬀect things.Thus,it can be hard to get abundance and identity at the same time.However,we can make substantial progress if we relax one of these con-straints,getting only partial identiﬁcation.One tool for getting such26 Baggerly,Coombes,and Morrisinformation,letting us measure hundreds of proteins at once,is massspectrometry.(More extensive descriptions are given in [38,62].)Mass spectrometry works by taking a sample and sequentially addinga charge to the substances to be measured (ionizing proteins,proteinfragments,or peptides),using electromagnetic manipulation to separatethe ionized peptides on the basis of their mass to charge (m/z) ratios,and using a detector to count the abundance of ions with a given m/zratio.Plotting abundance as a function of m/z gives a mass spectrum.There are many variants of mass spectrometry,corresponding to diﬀer-ent modular conﬁgurations of ionization,separation,and detection tools(not all combinations are possible),with much greater emphasis on themethods of ionization and separation than detection.Mass spectrometry has been around for a long time;it was ﬁrst in-troduced by J.J.Thomson around 1900,but it is only in recent decadesthat it has generated great excitement as a tool for exploring the pro-teome.This delay was due to limitations of the ﬁrst few ionization meth-ods available;charges were attached or broken oﬀ with suﬃcient forcethat larger molecules (including proteins) were torn apart into muchsmaller chunks.The late 1980s saw the introduction of two “soft” ioniza-tion methods,matrix-assisted laser desorption and ionization (MALDI)[31,32] and electrospray ionization (ESI) [24] that allowed measure-ments to extend to the tens and hundreds of kiloDaltons (kDa,1 Da =the mass of a hydrogen atom).Recently,mass spectra have begun to be explored for their potentialdiagnostic utility —can peaks in the spectra serve as biomarkers of theearly stages of diseases such as cancer?See,for example,[1,37,48,49,53,71,78].While similar questions have been asked with respectto microarrays,a key diﬀerence has been that many explorations withmass spectra have focused on spectra obtainable from readily availablebiological ﬂuids such as blood,urine,or saliva.In this context,the mostcommon mass spectrometry methods used have been variants of MALDIcoupled with a time-of-ﬂight (TOF) ion separator (MALDI-TOF).Thisis the only method that we discuss in detail here.In MALDI-TOF,the sample of interest (e.g.,serum) is combined withone of several matrix compounds,and this mixture is applied to a stain-less steel plate.As the mixture dries,the matrix forms a crystal struc-ture holding the proteins in place.Many samples are typically spottedon the same plate;one MALDI plate we have (square,and about 7cmon a side) has 100 deposition sites indicated.After the samples havebeen spotted,the plate is inserted into a receiving chamber connectedIntroduction 27to the main measurement instrument.The chamber is then pumped outto near vacuum conditions.A robotic arm is used to position the plateso that the spot of interest is in a desired target area,and a laser isthen ﬁred at the spot.Most of the laser energy goes into breaking thecrystal structure of the matrix apart,and less to shaking the peptidesapart.The physics of exactly how this works is not well understood.As a result of the matrix fragmentation,many peptides break free intothe gas phase.Most matrix compounds are slightly acidic,and thusare willing to donate spare protons to nearby molecules during fragmen-tation — the peptides going into gas phase are ionized by capturing asmall number of protons (typically 1–3 in the data we have seen).Inthe receiving chamber,a strong electric ﬁeld propels the ions towards aﬂight tube.This electric ﬁeld is typically set up by raising the potentialof the plate itself (to V) before the laser is ﬁred;the ﬂight tube entranceis at zero.The ﬂight tube itself is ﬁeld-free,so the ions drift with thevelocity imparted by the electric ﬁeld until they reach a detector at thefar end of the tube.The detector attempts to record the number of ionshitting it as a function of time of ﬂight,assembling an initial form ofthe spectrum.Typically,several (∼100) laser shots are made and theresulting spectra are summed to produce the ﬁnal spectrum examined.To ﬁrst order,the ions all cover the same potential diﬀerence and thusthe kinetic energy imparted is proportional to the number of unbalancedcharges,z (spare protons),the ion is carrying.The ﬂight tube itself istypically much longer than the region over which the potential diﬀer-ence exists,and so the time spent in the acceleration region is typicallydiscounted and the ion is treated as moving at a ﬁxed velocity down theﬂight tube.Equating expressions for kinetic energy,we get12mv2= z ∗ V.As the velocity is ﬁxed in the drift tube,v = L/t where L is the lengthof the tube and t is the time of ﬂight.Substituting and rearranging theabove equation,we getm/z = t2∗ (2V/L2) = kt2,showing how the m/z ratio can be inferred from the time of ﬂight.MALDI spectra are commonly supplied as comma-separated value(csv) ﬁles with two columns,containing the m/z value and spectrumintensity for each digitizer sample.Ignoring the m/z values,the rowsgive intensities that are equally spaced in time.An example MALDI28 Baggerly,Coombes,and MorrisFig.1.8.Two views of the same MALDI-TOF spectrum.(A) Intensity plottedas a function of m/z,which is the standard display option.(B) Intensityplotted against time-of-ﬂight,which is directly recorded by the instrument;m/z is a derived quantity.There are two natural scales on which to look atthis data,as the time to m/z mapping is not linear.spectrum is shown in Figure 1.8.The same spectrum is plotted againstm/z in the panel A (the most common display option),and against ﬁlerow in the panel B.This dual presentation is to emphasize that thereis more than one natural scale on which to examine this type of data.This spectrum was derived from a serum sample,and peaks at 66 kDaand 150 kDa correspond to albumin and immunoglobulin,known serumproteins.Most of the interest in the biomarker papers published to datehas been focused at somewhat lower m/z values;the identities of manyof the peaks seen here are not known,and we want to ﬁnd some thatare present in patients with disease and not present in those without,orvice-versa.It is important to realize that not all of the peptides present in thesample will be seen in a spectrum.Diﬀerent types of matrix can causediﬀerent groups of peptides to ionize more readily,so choosing a speciﬁcmatrix amounts to choosing a subset of the peptides to be examined.Itis common to further subset the peptides by “fractionating” the samplesin a variety of ways;some separation axes include pH (acidity) and hy-drophobicity (greasiness).Fractionation yields two clear beneﬁts.First,it can allow for more precise identiﬁcation of a peptide of interest.Sev-eral peptides may share a common (or very similar) m/z value,and thusbe “aliased” if the entire sample is used.Fractionation introduces asecond axis of separation for dealiasing.Second,it can remove (or splitoﬀ) some of the most abundant peptides.This is an issue because ourIntroduction 29present instruments have a limited dynamic range,so if an abundantpeptide is present at a level of 100,a trace peptide present at a level lessthan 1 will simply not be seen.The dynamic range of protein expressionis thought to cover 9 or so orders of magnitude,which means that trulyscarce peptides will be diﬃcult to detect even with extensive fraction-ation [21].The downside of fractionation is that it requires more time,eﬀort,and amount of starting material.One variant of MALDI,knownas surface-enhanced laser desorption and ionization (SELDI),works bydepositing the sample/matrix mixture on a chemically precoated sur-face,where diﬀerent surface coatings allow us to bind diﬀerent subsetsof peptides with high eﬃciency.SELDI has been commercialized by thecompany Ciphergen,which sells chips with diﬀerent coatings preapplied,so some fractionation is done for you.Ciphergen also sells their own in-struments and software,but there has been some experimentation withreading Ciphergen chips with other instruments.Having introduced the structure of the data,we now turn to process-ing issues:Given a set of spectra,what do we have to do to it beforeanalyzing an expression matrix?A partial list of important steps in-cludes• Spectral calibration,• Correcting for matrix noise,• Spectral denoising,• Baseline estimation and subtraction,• Peak detection and quantiﬁcation,• Normalization,• Looking for common patterns and modiﬁcations (harmonics),and we will address each in turn.Earlier,we derived the relationship m/z = kt2.In theory,physicalparameters such as the potential diﬀerence,tube length,and digitizerrate of the detector are known and a value for k can be derived.Inpractice,the same peak may drift slightly over time due to changes inthe instrument.One common way of addressing this problem is to run a“calibration sample” consisting of only a small number of proteins whoseidentities are known a priori,producing a spectrumwith a small numberof clearly deﬁned peaks,as illustrated in Figure 1.9.The masses of thepeptides are known,the ﬂight times are empirically observed,and a setof (mass,time) pairs is used to ﬁt a quadratic model of the formm/z = at2+bt +c30 Baggerly,Coombes,and MorrisFig.1.9.A SELDI calibration spectrum.The sample was comprised of a smallnumber of known peptides,and the associated peaks are clearly seen.Theknown masses and the observed times-of-ﬂight are then used to ﬁt a quadraticcalibration equation.by least squares.The model parameters found are then assumed tohold for several samples.These parameters can change over time,so itis often useful to check that some of the biggest peaks seen “line up”across samples [29].Matrix noise is a problemunique to MALDI.When a sample is blastedwith a laser,many things break free,not just the peptides of interest.This other,unwanted stuﬀ is colloquially referred to as “matrix noise”,and it is predominantly present at the very low m/z end of the spec-trum.Matrix noise can often saturate the detector,and detectors donot immediately recover after saturation.This eﬀect is quite unstable[41].Empirically,this has largely been addressed by excluding valuesbelow some chosen m/z cutoﬀ.Exactly where this cutoﬀ should go is notclear,and it can be aﬀected by other machine settings such as the laserintensity.Higher intensity settings can blast loose heavier ions,allowinghigher m/z regions to be explored,but these same settings kick up morenoise and distort a larger low m/z region with noise.Conversely,lowm/z regions can be probed with lower laser settings.Mathematically,we tend to think of spectra as being comprised of 3pieces – the signals we want to extract,which are present as peaks,asmooth underlying baseline,and some high frequency noise.In short,Yi(t) = kiSi(t) +Bi(t) +it,where Yi(j) is the intensity of spectrum i at time index t,kiis a nor-malization factor,Siis the protein signal of interest (a set of peaks),Biis baseline,and ∼ N(0,σ2(t)).We would like to remove the noise,subtract the baseline,estimate the peaks,and scale the spectra.Thereis a natural order to these steps,and performing them out of sequence(or omitting some) can make the downstream analysis more diﬃcult.Introduction 31Fig.1.10.Two raw MALDI spectra,with the peaks and intensities automati-cally ﬂagged by software superimposed.There are diﬀerences in baseline andscaling visible in the raw spectra.These diﬀerences should be corrected for,but this was not done for the peaks found.Baseline cannot be estimated fromthe peaks alone.Many mass spectrometry instruments are sold with associated soft-ware that will perform peak detection and quantiﬁcation automatically,but these may not address all of the steps.For one dataset we examined,we were supplied with both raw spectra and associated lists of peak lo-cations and intensities.Two spectra from this set are plotted as curvesin Figure 1.10,with the peaks supplied plotted as asterisks.The twospectra obviously have diﬀerent baseline levels,still have additive whitenoise present,and may involve diﬀerent normalization factors.However,the peak lists supplied use the intensities fromthe peaks before adjustingfor baseline or normalization,and baseline cannot be reliably estimatedfrom the peaks alone.We also note that one of the larger peaks,nearm/z 36000,is missed in one spectrum because it wasn’t “sharp enough”.Matrix noise is present at the very lowest m/z values,where the spectrajump out of view [10].One problem with both denoising and peak detection is simply thatpeaks can have diﬀerent shapes in diﬀerent parts of the m/z range;higherm/z peaks are broader.Some factors that can contribute to this broad-ening are:uncertainty in the initial velocity of the peptide,isotopicspread,and the nonlinearity of the clock tick to m/z mapping.The lastof these was mentioned earlier,so we expand only on the former two.When peptides are blasted loose from the matrix crystal,all peptides ofthe same type do not break out with the same initial velocity.Rather,there is a velocity distribution,causing the peak to be spread out.Thisspread becomes more pronouced the longer the peptide drifts down thetube,and is thus bigger at higher m/z values.For higher m/z peptides,the deﬁnition of “mass” can actually be somewhat ambiguous.Carbon,for example,exists 99% as12C,and 1% as13C.If a peptide contains 100carbon atoms,the mass contribution from these atoms will be roughly32 Baggerly,Coombes,and Morris1200 plus a small integer;this integer will have a Poisson distributionwith mean 100*1% = 1.Similar eﬀects are associated with other ele-ments.The overall isotopic spread widens as mass increases,so that itis common to refer to both the monoisotopic mass (assuming all carbonshave mass 12) and the average mass (incorporating the isotopic eﬀects).It is possible to devise an approximate isotopic spread for a peptide giveneither mass estimate,using the general abundances of carbon and otherelements in the population of amino acids.This can be used to sharpenthe peaks through deconvolution.There are a number of denoising ﬁlters that exist for spectra (eg,Savitzky-Golay),but we admit a preference for wavelet-based methodswhich adapt naturally to the multi-scale nature of the data.Here,wemap to the wavelet domain,zero out the small coeﬃcients (hard thresh-olding),and map back before looking for peaks [19].Once the spectra have been smoothed,we attempt to estimate base-line.At present,we do not use very sophisticated algorithms for thispurpose,generally sticking with a local minimum ﬁt so that negative in-tensities will not be produced by subtraction.Again,the “local” neigh-borhood used needs to be altered as m/z increases.Even with basicalgorithms,the eﬀects can be rather dramatic.In Figure 1.11,we showspectra derived from20 pH fractions for a single patient both before andafter denoising and baseline subtraction (panels A and B,respectively).In this case,baseline subtraction causes the more dramatic eﬀect,givingall of the base levels the same hue.As an aside,we note that this displayalso points out that fractionation is an imperfect procedure,and thatsignal from the same peptide can be found in several adjacent fractions.After subtracting baseline from smoothed spectra,we still need toidentify peaks and get summary values for them.A ﬁrst pass approachcan use a simple maximum ﬁnder.We could attempt to use peak areasinstead,but we do not pursue this here.We note,however,that locatingthe peaks can be aided by considering a set of spectra rather than asingle spectrum.Assuming the spectra have been roughly aligned,wehave found it useful to average spectra within a group and performpeak detection on the average spectrum [45].Averaging may even beuseful before doing wavelet denoising,as small peaks can be reinforcedas the noise level drops,and they can be retained.Values for individualspectra can be extracted as local maxima in small windows about thecentral peak location.The width of this window can be linked to thenominal precision of the instrument.For a low-resolution instrument,the uncertainty can be on the order of 0.1% of the nominal m/z;higher-Introduction 33Fig.1.11.Spectra derived from 20 pH fractions of the serum from a singlepatient.(A) Raw spectra.There are clear diﬀerences in baseline,seen asdiﬀerent shadings for the rows.There is also some unwanted noise,visibleas periodic ripples in the spectra.(B) The same spectra after correcting forbaseline and denoising.Peaks stand out more clearly against a ﬂat “surface”.In both cases,peaks can extend across neighboring fractions,as the separationprocess is imperfect.resolution instruments will attain mass accuracies expressed in parts permillion (ppm).Before comparing peak intensities across spectra,we need to normalizethe spectra to make them comparable.One common method is to usethe total ion current,or summed intensities for the entire spectrum.Thisis done after excluding the matrix noise region and subtracting baseline.This step is where we feel there is the most room for improvement,asthere may be local scaling factors that are more appropriate than asingle factor throughout.Even if a single scaling is to be used,it maybe better to identify a small number of key peaks that appear to berelatively stable and to target the median log ratio for the set of peaks.Having identiﬁed some peaks as being of potential interest,it alsomakes sense to look at other peaks that may be related (as assessed bycorrelation) or that should be related.The idea of “should be related”is diﬀerent for mass spectrometry data than for microarray data in thatthere is a natural ordering to the peaks in a spectrum.In Figure 1.12,weshow zooms on two distinct regions of averaged spectra from a higher-resolution (Qstar) instrument.The patterns of peaks look the same,though the m/z range in the bottom panel is half that of the top panel,34 Baggerly,Coombes,and MorrisFig.1.12.Two regions derived from the average of several high-resolutionQstar spectra.(A) The m/z range from 7600 to 8400.(B) The m/z rangefrom 3800 to 4200;values exactly half those in A.The peak patterns in thetwo panels are perfectly aligned,as we are seeing the same peptides.In A,the peptides are singly charged (z = 1),and in B they are doubly charged(z = 2).Other regularities (oﬀsets of 189 in A) are due to further identiﬁablephenomena (matrix adducts).and the intensities are dramatically reduced.In this case,the parallelstructure is due to the fact that the two panels are showing singly anddoubly charged versions of the same peptides;ﬁnding the appropriateharmonic patterns on the m/z scale can tell us both the charge stateof the peptide (and thus its mass) and provide some reassurance thatwe have identiﬁed it correctly.(With higher resolution data,the chargestate can also be inferred fromthe spacing between isotopic peaks,whichshould be 1Da apart.) Looking at the top panel,we can also see thatthere are groups of peaks oﬀset from each other by 189 Da.This oﬀsetmass matches that of a single molecule of the the matrix used here:α-cyano-4-hydroxycinnamic acid.These peaks are referred to as matrixadducts.Similarly,there are smaller peaks close to the biggest one,withthe largest ones 18 Da below the main peak and 22 Da above.Thesecorrespond to loss of a water molecule or replacing an ionized hydrogenatomwith one of sodium,respectively.Viewing the ensemble,we can seethat almost all of the peaks visible here are diﬀerently modiﬁed formsof the same major peptide.Graphically,we have found it useful to construct heat maps of thespectral regions surrounding peaks identiﬁed as potentially useful mark-ers in a few diﬀerent ways.First,in a very localized region (say 20 Da oneither side) simply to check that the peak is reasonably clear.Second,Introduction 35in a larger window going out to either side by 250 Da or so,which iswide enough to capture most matrix adducts and common modiﬁcationssuch as phosphorlyation (a mass oﬀset of 80 Da).Third,by checkingheatmaps at half and twice the nominal m/z value to check the chargestate.Finally,a note of caution.The use of mass spectrometry data forbiomarker discovery is more recent than the use of microarrays,and thereare a number of external factors that can introduce unwanted biases.Some of these are discussed in Baggerly et al.[8,9] and Villanueva etal.[70].These tools are incredibly sensitive,which they need to be ifthey are to pick up new biomarkers.This very sensitivity,however,means that they will also pick up changes in experimental conditionsquite well.In terms of keeping track of and reporting on your data,werecommend Ransohoﬀ [54] for a discussion of some of the issues,andMcShane et al.[43] for a more speciﬁc set of guidelines.1.5 Finding DataSimply discussing the features of various types of data is no substitutefor diving in and working with raw data.If possible,we recommendvisiting labs as the data is being collected or trying to collect someyourself.(Our colleagues have been willing to work with us on testcases.) Even without that,raw data of the types discussed are readilyavailable on the web.Lots of microarray data has been on the web for a while,and muchmore has been posted since the advent of the mimimum informationabout a microarray experiment (MIAME) standards [15].Several majorjournals now require that the raw data be made available at the timeof publication.For Aﬀymetrix data,the ﬁrst place to go is simply thecompany’s web site,www.affymetrix.com.Sample data sets for severaldiﬀerent chip types are available,as are all of the CDF ﬁles,probesequences,and the latest annotation for what the probes on the chipsactually correspond to.Registration is required,but free.For cDNAmicroarray data (and Aﬀymetrix data),we also recommend the GeneExpression Omnibus (GEO) maintained by the NCBI,athttp://www.ncbi.nlm.nih.gov/geo.For SAGE data,we recommend SAGE Genie [14],maintained as partof the Cancer Genome Anatomy Project (CGAP) athttp://cgap.nci.nih.gov/SAGE.The data repositories for mass spectrometry data are not yet as ex-36 Baggerly,Coombes,and Morristensive,but several proteomics journals are getting set to require rawdata in a fashion akin to MIAME,so we hope this will change shortly.In the meantime,there are a few sites that have data of various types.The best known is probably the Clinical Proteomics programjointly runby the NCI and FDA [48].The databank is currently located athttp://home.ccr.cancer.gov/ncifdaproteomics/and has various SELDI and Qstar datasets.Questions have been raisedabout the quality of some of this data,and we strongly recommend read-ing Baggerly et al.[8] for a more detailed discussion of some of the issuesinvolved.There is some SELDI data available from MD Anderson,athttp://bioinformatics.mdanderson.orgtogether with Matlab scripts for processing and analysis.AcknowledgementsThis work was partially supported by NCI grant CA-107304.Bibliography[1] B.-L.Adam,Y.Qu,J.W.Davis,et al.,Serum protein ﬁngerprintingcoupled with a pattern-matching algorithm distinguishes prostate cancerfrom benign prostate hyperplasia and healthy men,Cancer Res,62 (2002),pp.3609–3614.[2] Affymetrix,Aﬀymetrix Microarray Suite Users Guide,Version 4.0,Aﬀymetrix,1999.[3],Aﬀymetrix Microarray Suite Users Guide,Version 5.0,Aﬀymetrix,2001.[4] B.Alberts,A.Johnson,J.Lewis,et al.,Molecular Biology of theCell (4e),Garland Publishing,2002.[5] S.Audic and J.-M.Claverie,The signiﬁcance of digital gene expres-sion proﬁles,Genome Res,7 (1997),pp.986–995.[6] K.A.Baggerly,L.Deng,J.S.Morris,et al.,Diﬀerential expres-sion in SAGE:Accounting for normal between-library variation,Bioin-formatics,19(12) (2003),pp.1477–1483.[7],Overdispersed logistic regression for SAGE:Modelling multiplegroups and covariates,BMC Bioinformatics,5 (2004),p.144.[8] K.A.Baggerly,J.S.Morris,and K.R.Coombes,Reproducibil-ity of SELDI-TOF protein patterns in serum:Comparing datasets fromdiﬀerent experiments,Bioinformatics,20(5) (2004),pp.777–785.[9] K.A.Baggerly,J.S.Morris,S.R.Edmonson,et al.,Signal innoise:Evaluating reported reproducibility of serum proteomic tests forovarian cancer,J Natl Cancer Inst,97(4) (2005),pp.307–309.[10] K.A.Baggerly,J.S.Morris,J.Wang,et al.,A comprehensiveapproach to the analysis of MALDI-TOF proteomics spectra from serumsamples,Proteomics,3(9) (2003),pp.1667–1672.Introduction 37[11] T.Beißbarth,L.Hyde,G.K.Smyth,et al.,Statistical modeling ofsequencing errors in SAGE libraries,Bioinformatics,20 (2004),pp.i31–i39.[12] J.M.Bland and D.G.Altman,Statistical method for assessing agree-ment between two methods of clinical measurement,The Lancet,i (1986),pp.307–310.[13] B.M.Bolstad,R.A.Irizarry,M.˚Astrand,et al.,A comparison ofnormalization methods for high density oligonucleotide array data basedon variance and bias,Bioinformatics,19 (2003),pp.185–193.[14] K.Boon,E.C.Osorio,S.F.Greenhut,et al.,An anatomy ofnormal and malignant gene expression,Proc Natl Acad Sci USA,99(17)(2002),pp.11287–11292.[15] A.Brazma,P.Hingamp,J.Quackenbush,et al.,Minimum infor-mation about a microarray experiment (MIAME):Toward standards formicroarray data,Nature Genetics,29 (2001),pp.365–371.[16] S.Brenner,M.Johnson,J.Bridgham,et al.,Gene expression anal-ysis by massively parallel signature sequencing (MPSS) on microbead ar-rays,Nature Biotechnology,18(6) (2000),pp.630–634.[17] H.Chen,M.Centola,S.F.Altschul,et al.,Characterization ofgene expression in resting and activated mast cells,J Exp Med,188(9)(1998),pp.1657–1668.[18] J.Colinge and G.Feger,Detecting the impact of sequencing errorson SAGE data,Bioinformatics,17(9) (2001),pp.840–842.[19] K.R.Coombes,S.Tsavachidis,J.S.Morris,et al.,Improvedpeak detection and quantiﬁcation of mass spectrometry data acquiredfrom surface-enhanced laser desorption and ionization by denoising spec-tra with the undecimated discrete wavelet transform,Proteomics,5(16)(2005),pp.4107–4117.[20] L.M.Cope,R.A.Irizarry,H.A.Jaffee,et al.,A benchmarkfor Aﬀymetrix genechip expression measures,Bioinformatics,20 (2004),pp.323–331.[21] E.P.Diamandis,Analysis of serum proteomic patterns for early cancerdiagnosis:Drawing attention to potential problems,J Natl Cancer Inst,96 (2004),pp.353–356.[22] S.Dudoit,Y.H.Yang,M.J.Callow,et al.,Statistical methods foridentifying diﬀerentially expressed genes in replicated cDNA microarrayexperiments,Statistica Sinica,12(1) (2002),pp.111–139.[23] B.Ewing,L.Hillier,M.C.Wendl,et al.,Base-calling of automatedsequencer traces using Phred.I.Accuracy assessment,Genome Res,8(1998),pp.175–185.[24] J.B.Fenn,M.Mann,C.K.Meng,et al.,Electrospray ionization formass spectrometry of large biomolecules,Science,246 (1989),pp.64–71.[25] R.Gentleman,V.J.Carey,W.Huber,et al.,eds.,Bioinfor-matics and Computational Biology Solutions Using R and Bioconductor,Springer-Verlag,2005.[26] L.D.Greller and F.L.Tobin,Detecting selective expression of genesand proteins,Genome Res,9 (1999),pp.282–296.[27] R.A.Irizarry,B.M.Bolstad,F.Collin,et al.,Summaries ofAﬀymetrix genechip probe level data,Nucleic Acids Res,31 (2003),p.e15.[28] R.A.Irizarry,B.Hobbs,F.Collin,et al.,Exploration,normal-ization,and summaries of high density oligonucleotide array probe level38 Baggerly,Coombes,and Morrisdata,Biostatistics,4 (2003),pp.249–264.[29] N.Jeffries,Algorithms for alignment of mass spectrometry proteomicdata,Bioinformatics,21 (2005),pp.3066–3073.[30] A.J.Kal,A.J.van Zonneveld,V.Benes,et al.,Dynamics of geneexpression revealed by comparison of serial analysis of gene expressiontranscript proﬁles from yeast grown on two diﬀerent carbon sources,MolBiol Cell,10 (1999),pp.1859–1872.[31] M.Karas,D.Bachmann,U.Bahr,et al.,Matrix-assisted ultravioletlaser desorption of non-volatile compounds,Intl J Mass Spectrometry IonProcesses,78 (1987),pp.53–68.[32] M.Karas and F.Hillenkamp,Laser desorption ionization of pro-teins with molecular masses exceeding 10,000 Daltons,Anal Chem,60(20)(1988),pp.2299–2301.[33] M.K.Kerr,M.Martin,and G.A.Churchill,Analysis of variancefor gene expression microarray data,J Comp Biol,7 (2000),pp.819–837.[34] A.Lal,A.E.Lash,S.F.Altschul,et al.,A public database for geneexpression in human cancers,Cancer Res,59 (1999),pp.5403–5407.[35] C.Li and W.H.Wong,Model-based analysis of oligonucleotide arrays:Expression index computation and outlier detection,Proc Natl Acad SciUSA,98 (2001),pp.31–36.[36],Model-based analysis of oligonucleotide arrays:Model validation,design issues and standard error application,Genome Biol,2(8) (2001),p.RESEARCH0032.[37] J.Li,Z.Zhang,J.Rosenzweig,et al.,Proteomics and bioinformaticsapproaches for identiﬁcation of serum biomarkers to detect breast cancer,Clin Chem,48(8) (2002),pp.1296–1304.[38] D.Liebler,Introduction to Proteomics:Tools for the New Biology,Hu-mana Press,2001.[39] R.J.Lipshutz,S.P.Fodor,T.R.Gingeras,et al.,High densitysynthetic oligonucleotide arrays,Nature Genetics,21 (1999),pp.S20–S24.[40] S.L.Madden,E.A.Galella,J.Zhu,et al.,SAGE transcript proﬁlesfor p53-dependent growth regulation,Oncogene,15 (1997),pp.1079–1085.[41] D.I.Malyarenko,W.E.Cooke,B.-L.Adam,et al.,Enhance-ment of sensitivity and resolution of surface-enhanced laser desorp-tion/ionization time-of-ﬂight mass spectrometric records for serum pep-tides using time-series analysis techniques,Clin Chem,51 (2005),pp.65–74.[42] M.Z.Man,X.Wang,and Y.Wang,POWERSAGE:comparingstatistical tests for SAGE experiments,Bioinformatics,16(11) (2000),pp.953–959.[43] L.M.McShane,D.G.Altman,W.Sauerbrei,et al.,Reporting rec-ommendations for tumor marker prognostic studies (REMARK),J NatlCancer Inst,97 (2005),pp.1180–1184.[44] E.M.C.Michiels,E.Oussoren,M.van Groenigen,et al.,Genesdiﬀerentially expressed in medulloblastoma and fetal brain,Physiol Ge-nomics,1 (1999),pp.83–91.[45] J.S.Morris,K.R.Coombes,J.Koomen,et al.,Feature extractionand quantiﬁcation for mass spectrometry data in biomedical applicationsusing the mean spectrum,Bioinformatics,21(9) (2005),pp.1764–1775.[46] M.Nacht,T.Dracheva,Y.Gao,et al.,Molecular characteristicsof non-small cell lung cancer,Proc Natl Acad Sci USA,98(26) (2001),Introduction 39pp.15203–15208.[47] Nature,The chipping forecast,Nature Genetics Supplement,21 (1999).[48] E.F.Petricoin,III,A.M.Ardekani,B.A.Hitt,et al.,Use ofproteomic patterns in serum to identify ovarian cancer,The Lancet,359(2002),pp.572–577.[49] E.F.Petricoin,III,D.K.Ornstein,C.P.Paweletz,et al.,Serumproteomic patterns for detection of prostate cancer,J Natl Cancer Inst,94(20) (2002),pp.1576–1578.[50] K.Polyak and G.J.Riggins,Gene discovery using the serial analysisof gene expression technique:Implications for cancer research,J ClinOncology,19(11) (2001),pp.2948–2958.[51] K.Polyak,Y.Xia,J.L.Zweler,et al.,A model for p53-inducedapoptosis,Nature,389 (1997),pp.300–305.[52] D.A.Porter,I.E.Krop,S.Nasser,et al.,A SAGE (serial analy-sis of gene expression) view of breast tumor progression,Cancer Res,61(2001),pp.5697–5702.[53] A.J.Rai,Z.Zhang,J.Rosenzweig,et al.,Proteomic approaches totumor marker discovery:Identiﬁcation of biomarkers for ovarian cancer,Arch Path Lab Med,126 (2002),pp.1518–1526.[54] D.F.Ransohoff,Bias as a threat to the validity of cancer molecular-marker research,Nature Reviews Cancer,5(2) (2005),pp.142–149.[55] J.M.Ruijter,A.H.C.van Kampen,and F.Baas,Statistical eval-uation of SAGE libraries:Consequences for experimental design,PhysiolGenomics,11 (2002),pp.37–44.[56] B.Ryu,J.Jones,N.J.Blades,et al.,Relationships and diﬀerentiallyexpressed genes among pancreatic cancers examined by large-scale serialanalysis of gene expression,Cancer Res,62 (2002),pp.819–826.[57] B.Ryu,J.Jones,M.A.Hollingsworth,et al.,Invasion-speciﬁcgenes in malignancy:Serial analysis of gene expression comparisons ofprimary and passaged cancers,Cancer Res,61 (2001),pp.1833–1838.[58] S.Saha,A.B.Sparks,C.Rago,et al.,Using the transcriptome toannotate the genome,Nature Biotechnology,20 (2002),pp.508–512.[59] E.E.Schadt,C.Li,B.Ellis,et al.,Feature extraction and normal-ization algorithms for high-density oligonucleotide gene expression arraydata,J Cell Bioc,37 (2001),pp.S120–S125.[60] M.Schena,ed.,Microarray Biochip Technology,BioTechniques Books,2000.[61] M.Schena,D.Shalon,R.W.Davis,et al.,Quantitative monitor-ing of gene expression patterns with a complementary DNA microarray,Science,270 (1995),pp.467–470.[62] G.Siuzdak,The Expanding Role of Mass Specrometry in Biotechnology,MCC Press,2003.[63] T.Speed,ed.,Statistical Analysis of Gene Expression Microarray Data,Chapman and Hall/CRC Press,2003.[64] B.St.Croix,C.Rago,V.Velculescu,et al.,Genes expressed inhuman tumor epithelium,Science,289 (2000),pp.1197–1202.[65] D.J.Stekel,Y.Git,and F.Falciani,The comparison of gene ex-pression from multiple cDNA libraries,Genome Res,10 (2000),pp.2055–2061.[66] J.Stollberg,J.Urschitz,Z.Urban,et al.,A quantitative evalua-tion of SAGE,Genome Res,10 (2000),pp.1241–1248.40 Baggerly,Coombes,and Morris[67] V.E.Velculescu,S.L.Madden,L.Zhang,et al.,Analysis ofhuman transcriptomes,Nature Genetics,23 (1999),pp.387–388.[68] V.E.Velculescu,L.Zhang,B.Vogelstein,et al.,Serial analysisof gene expression,Science,270 (1995),pp.484–487.[69] V.E.Velculescu,L.Zhang,W.Zhou,et al.,Characterization ofthe yeast transcriptome,Cell,88 (1997),pp.243–251.[70] J.Villanueva,J.Philip,C.A.Chaparro,et al.,Correcting com-mon errors in identifying cancer-speciﬁc serum peptide signatures,J ProtRes,4 (2005),pp.1060–1072.[71] A.Vlahou,P.F.Schellhammer,S.Mendrinos,et al.,Develop-ment of a novel proteomic approach for the detection of transitional cellcarcinoma of the bladder in urine,AmJ Path,154 (2001),pp.1491–1502.[72] C.-L.Wei,P.Ng,K.P.Chiu,et al.,5’ long serial analysis of geneexpression (longSAGE) and 3’ longSAGE for transcriptome characteri-zation and genome annotation,Proc Natl Acad Sci USA,101(32) (2004),pp.11701–11706.[73] Z.Wu,R.A.Irizarry,R.Gentleman,et al.,A model based back-ground adjustment for oligonucleotide expression arrays,J Am StatistAssoc,99 (2004),pp.909–917.[74] Y.H.Yang,M.J.Buckley,S.Dudoit,et al.,Comparison of meth-ods for image analysis on cDNA microarray data,J Comp Graph Statist,11 (2002),pp.108–136.[75] Y.H.Yang,S.Dudoit,P.Luu,et al.,Normalization for cDNA mi-croarray data:A robust composite method addressing single and multipleslide systematic variation,Nucleic Acids Res,30 (2002),p.e15.[76] L.Zhang,M.F.Miles,and K.D.Aldape,A model for interactionson short oligonucleotide microarrays:Implications for probe design anddata analysis,Nature Biotechnology,21(7) (2004),pp.818–821.[77] L.Zhang,W.Zhou,V.E.Velculescu,et al.,Gene expression pro-ﬁles in normal and cancer cells,Science,276 (1997),pp.1268–1272.[78] Z.Zhang,R.C.Bast,Jr.,Y.Yu,et al.,Three biomarkers identiﬁedfrom serum proteomic analysis for the detection of early stage ovariancancer,Cancer Res,64 (2004),pp.5882–5890.