Integration of gene expression in the design and analysis of traditional F2 intercrossstudies allows high confidence prediction of causal genes and identification of involvedpathways and networks.

NG-A23866

Revision 2

3

The discovery of novel genes which contribute to complex human disorders remains a challengefor geneticists. The conventional methodology of determining whether a particular locus isinvolved in agivendisease involves testing for inheritance of specific genomic regions insuccessive generations of affected individuals.Thistypically

leads

to multiple loci (known asquantitative trait loci, or QTLs), each of which contributesmodestly

to the overall phenotype.Each locusmay

contain hundreds of genes, making the elucidation of the underlying gene orgenes labor intensive and time consuming. Additionally, differentiatinggenes that are causal forthe disease from those that are reactive to thebiological alterations resulting from thedisease hasbeendifficult.

The advent of microarray technology has enabled scientists to simultaneously examinealterations in the mRNA levels of thousands of transcripts in a sample. Since microarrays yieldquantitative estimates of gene expression changes, the loci that control their expression can bemapped. These loci are known as expression QTLs (or eQTLs). eQTLs that map near the geneand are likely to regulate gene expression incis

are termedcis-eQTLs

1. Genes withcis-eQTLsthat are coincident with a clinical disease-related

trait QTL (or cQTL) have an increasedlikelihood of contributing causally to the particular disorder, especially if expression of the geneis correlated withthe severity of the disease trait

2,3.

However, correlation between an expression trait and a clinical phenotype does not imply acausal/reactive relationship due to linked causal mutations, and particular alleles may alsoinfluence RNA levels and phenotypes independently, further confounding the analysis. There isunambiguous biological directionality in that DNA changes influence alterations in transcript

NG-A23866

Revision 2

4

abundances and clinical phenotypes, so the number of possible relationships among correlatedtraits can be greatly reduced. For example, among two traits which are correlated and controlledby a unique DNA locus, only three likely relationship models exist, namely causal, reactive, andindependent

4,5. Therefore, after constructing a network, one can simultaneously integrate allpossible DNA variants and their underlying changes in transcript levels, and each relationshipcan be supported as being causal, reactive, or independent in relation to a particular phenotypesuch as obesity. This is referred to as thelikelihood-basedcausalitymodel selection (LCMS)procedure

4.

Using our LCMS procedure, we have predicted ~100 causal genes for abdominal obesity usingan F2 intercross between the C57BL/6J and DBA/2J strains of mice (the BXD cross)4.In ordertovalidate the predictive power ofLCMS, wecarried out phenotypic characterization oftransgenic or knockout

mouse modelsfor

nine of thetop candidate genes,andreportherethatintotal eight

out ofthe nine

genes

under characterization were found to influence

obesity-relatedtraits.We analyzed liver gene expression signatures of the transgenic and knockout mousemodels and demonstrated that all nine genes affect common pathways and subnetworks

F2DBA/J and C57BL/6Jintercross13. As depicted in Figure 2, two or moreover-represented metabolic pathways identified for each strain of mice in the current studyoverlappedwith these previously described pathways. Therefore, the causal genes identified

bythe LCMSprocedureand tested in this study likely affect adiposity by modifying similar obesity-NG-A23866

Revision 2

10

related pathways.Moreover, as shown inTable 2, the signature genes fromeach of the mousemodelsoverlap significantly with the signature genes identified from one or more of the othermouse models.

Overlap between liver signature genes from

mouse models

and a macrophage-enrichedmetabolic network (MEMN)

Based on the co-expression networks constructed from liver and adipose tissues collected from amouse cross betweenC57BL/6J (B6) and C3H/HeJ on an apolipoprotein E null background(BXH/apoE), wepreviously

identified a macrophage-enriched metabolic network (MEMN) thatis highly associated with metabolic traits and appears to be of macrophage-derived origin9. Fiveof the nine genes under validation, namely,Zfp90,Lactb,Lpl,C3ar1, andTgfbr2, are within thisMEMN subnetwork. In addition, the livergene signatures derived from five

out of the ninevalidation mouse models,Zfp90

tg,Gpx3

tg,Lpl

ko,C3ar1

ko, andTgfbr2

ko, significantlyoverlapped with MEMN genes (Table 2). Recall that these candidate genes were identified ascausal genes for obesity from a C57BL/6 x DBA/2J cross. Now, in a different mouse crosssetting (BxH/apoE), many of these candidate genes and their downstream genes are confirmed tobe within or highly overlap with a coexpression subnetwork that is relevant to obesity, diabetes,and atherosclerosis traits. Furthermore, we recently uncovered a human MEMN which isassociated with obesity-related traits and contains extensive overlap with the mouse MEMN wedescribe here

14. Therefore, it is not surprising to findC3AR1,LACTB, andLPL

as well as thepreviously characterizedHSD11B1

15

among the human MEMN network genes, furtherhighlighting the shared networks between mice and humans and suggesting a commonmechanism leading to the development of obesity.

NG-A23866

Revision 2

11

Bayesian network analysis

of liver signatures in mouse models

As the sample sizes for defining perturbation signatures were small, the quality of the signaturesderived could be noisy,

with lowassociation p values to BMIamong the genes composing the full core sub-network set. However, when the102

mouseMEMN geneswithin

the core

sub-network were

considered,we did observe a significantenrichment

for SNPs with low association p values to BMI.

NG-A23866

Revision 2

13

Specifically, 12.5% (267 out of 2132) ofcis-SNPs selected from the 102 mouse MEMN genes inthe core subnetwork reached an BMI association cutoff of p<0.1, as compared to an average of10.4% [95% confidence interval (CI): 7.9% to 12.6%] in thecis-SNPs derived from 105

sets ofrandomly selected 102 genes by permutation. The enrichment p value was 0.031, defined as theprobability of obtaining the observed result or even more extreme under random sampling. Inaddition, the average log of association p values (logP) for thecis-SNPs of the 102 mouseMEMN core subnetwork genes is-1.11, as compared to an average of-1.00 [95% CI:-1.10 to-0.91] in the 105

random sets (p=0.010). The comparison of the percentage of low association pvalues and the average logP between the mouse MEMN core subnetwork genes and the 105

random gene sets are shown in Supplementary Figures

4a and 4b, respectively.

DISCUSSION

As summarized inTable 5,genetic perturbation

of eightout of thenine

(~90%) candidategenes

which were predicted to be causal for obesity in mice usingourLCMSprocedurecausedsignificant alterations

The liver gene expression signaturegenes from the validation mouse models highly overlapped with one another and with themetabolic trait-associated MEMN module genes. All of these causal candidate genes were foundto impact a common liver transcriptional subnetwork that is enriched for GO metabolic pathwaysNG-A23866

Revision 2

14

and the MEMN module.These multiple lines of evidence

suggest that the perturbation of thesepredicted causalgenes influences obesity via a common functional mechanism.

which isinvolved in the production of androgens33.Down-regulation of both genes in

the ko mousemodels may alter the impact of sex hormones

and lead to sex-specific phenotypes.

Among the newly validated genes,

Gas7

was originally identified as a gene that was expressed inserum-starved NIH3T3 cells, and its protein structure resemblesOct2 and synapsins, which areinvolved in neuronal development and neurotransmitter release, respectively34,35. It isselectively expressed in mature cerebral cortical, hippocampal, and cerebellar neurons35.Ourstudies now indicaterelevance of this gene tofat metabolism

and other previously unknownpathways such asthe insulin signaling pathway.

Gpx3

is involved in cellular protection against oxidative damage through the reduction ofperoxides36.The cytosolic isoform ofGpx3,Gpx1, has been associated with obesity

37, andrecently

the dysregulation ofGpx3

in the plasma and adipose of obese subjects has beenimplicated in the increase in inflammatory signals and oxidative stressand hence obesity-relatedmetabolic disorders38.Our study provides

by a given genetic locus and then causevariations in the obesity traits.Thus, it is not surprising to observe a limited overlap betweentheGWAS genes and the mouse causal genes we have identified. The causal genes that are affectedby DNA variation intrans

may not have been identified in the GWAS because the signals weretoo subtle to detect with the current scale, yet are still of interest because they are supported ascausal.The fact that we found a weak

phenotypic characterization and gene expression profiling, thussupporting the LCMSasa powerful tool in predicting causal genesfor diseases.Although thegenes are seemingly disparate, each appears to affect metabolic pathways that are linked to theTCA cycle.Future directions include the application of these network approaches to additionalrelevant tissues such as adipose,with incorporation ofpotential cross-tissueinteractions, as wellas environmental variations. Also,the investigation of negative predictive value of the LCMSNG-A23866

Revision 2

17

procedure

would be of value, though this

is amore complicated

problem than it appears on thesurface, given thata

listofLCMS predictedcausal genesfrom one tissueisby no meanscomprehensiveasmany tissuesareinvolved in the regulation of bodyfat.Considering that alarge number of genes influence body weight, focusing on pathways and networks

rather than

pinpointing individual genes may be more efficient inelucidating the pathogenesis of obesityand the development of noveltreatments.

NG-A23866

Revision 2

18

METHODS

Construction of mouse models

Details regarding theGas7

transgenicandMe1

knockoutmouse models

can be found in theSupplementary Methods. TheZfp90,Lactb, andGpx3

transgenic andGyk

andLpl

knockoutmouse models were constructed as described previously4,6-8.Tgfbr2

andC3ar1

knockoutmousemodels were obtained from Deltagen as described previously4.

and maintained on a 12 hour light/dark cycle.Genomic DNA was isolated from ear and tail samples using a DNeasy kit (Qiagen, CA) andgenotyped using PCR. All reactions were carried out usinginitialenzyme activation at 95°C for5 minutes, followed by 35 cycles at 95°C for 30 seconds, 56°C for 30 seconds, and 72°C for 1minute, and finished with an extension at 72°C for 7 minutes.A detailed method for breedingand genotypingMe1-/-

mice has been described by Qian et al41.

Phenotypic characterization of the mouse models

Starting at 11 weeks of age, mice (exceptMe1

ko) were fed a 6% chow diet (Harlan Teklad7013; 6.25% fat, 0% cholesterol) for 12 weeks. Each mouse was monitored for body weight andevaluated by NMR (Brucker Minispec) for body weight composition including lean mass, fatmass and water content over the course of the diet every twoweeks. At the end of the 12-weekdiet, mice were sacrificed using CO2

asphyxiation. Gonadal (fat surrounding the gonads),NG-A23866

Revision 2

19

retroperitoneal (fat beneath the kidneys), mesenteric (fat attached to the intestines), andsubcutaneous (fat below the surface of the skin on the thighs) fat pads were collected andweighed. Liver was collected for RNA profiling. All procedures were done in accordance withthe current National Research Council Guide for the Care and Use of Laboratory Animals andwere approved by theUCLA Animal Research Committee. Additional details regarding thecharacterization of theGpx3

andMe1

mice can be found in the Supplementary Methods.

Analysis of phenotypic data

The Student’s t-test was used to analyze the differences in the phenotypic traits between tg or koanimals and their wt littermate controls. The significance level was set to p<0.05. Significance ofthe difference in the growth curves of fat/muscle ratio between tg/ko and wt controls for allmouse models exceptMe1

ko mice and ingrowth curves of body weight forMe1

mice wasdetermined using an autoregressive method described previously to enhance the power ofdifference detection by leveragingmultiple repeated measures over a number of time points foreach animal

4.

RNA sample preparation and microarray processing

For the liver tissues from theZfp90

transgenics, theTgfbr2

heterozygous ko, theC3ar1

ko, theLpl

heterozygous ko, theMe1

ko, and each of their respective littermate control mice, RNApreparation and array hybridizations were performed at Rosetta Informatics. ForC3ar1,Tgfbr2

andZfp90mouse strains, the custom ink-jet microarrays used in this study were manufactured byAgilent Technologies (Palo Alto, CA) and consisted of 23,574 non-control oligonucleotidesextracted from mouse Unigene clusters and combined with RefSeq sequences and RIKEN full-NG-A23866

femaleheterozygous ko, microarry profiling of the liver tissues was performed using IlluminaMouseRef-8 beadchips. Each beadchip contained 24,886 oligonucleotide probes(849 controland 24,837 non-control) designed based on the Mouse Exonic Evidence Based Oligonucleotide(MEEBO) set, the RIKEN FANTOM 2 database, and the NCBI Reference Sequence (RefSeq)database. Additional details can be found in the Supplementary Methods. All expression datahave been submitted to GEO database under the Super-series accession number GSE12000.

Selection of active or expressed gene sets based on microarray profiling

As a limited number of mice (n=3-9) per mouse model was used for gene expression profilingand the statistical power was low considering the large number of multiple tests for tens ofthousands genes, we restricted attention to the subsets of genes that are more biologicallyrelevant. The Agilent arrays do not provide a measure of "presence" or "absence,” and therefore,we selected set of “most transcriptionally active genes” for mouse models profiled with Agilentarrays using the program Resolver3,45,46.The active genes were defined as those withsignificance level p <0.05 (as determined byerror model3,45,46) in at least 10% animals for eachstrain of mice including both tg/ko and controls.These active genes represent those whoseexpression levels vary across samples and, thus, are more biologically relevant.For mousestrains that were profiled with Illumina arrays, the program BeadArray was used to normalize theexpression intensity values across as well as within arrays using the “average” algorithmembedded in the software. Genes with detection scores of >0.99 (corresponding to detectionp<0.01) in atleast 10% animals for each strain of mice were selected as "expressed" genes. ThisNG-A23866

A Student’s t-test was used to identify genes with significant differences between tg or koanimals and the corresponding wt control mice.

These genes were defined as “signature” genes,representing the perturbed gene expression signature as a result of single gene modification. Thesignificance level was set to p<0.05. The false discovery rate at this significance level wascalculated using Q-value as reported47. All statistical analyses were carried out in the Rstatistical environment.

Pathway analysis

Each signature gene set identified above was classified using Gene Ontology (GO)10

andPanther pathway11

database assignments.

The Ingenuity Pathway Analysis software(Ingenuity® Systems, www.ingenuity.com) was used to analyze the enrichment of canonicalpathways in the signature genes identified above, and we also analyzed the enrichment of ~470functional gene sets curated from public databases using Gene Set Enrichment Analysis (GSEA)12. Additional details can be found

and B6 x CAST (BXC) were described previously18,48. Additional details can be found in theSupplementary Methods. All procedures of housing and treatment of animals were performed inaccordance with IACUC regulations.

Identification of the macrophage-enriched metabolic network (MEMN)

The construction of the coexpression network using liver and

adipose tissues from BXH crossand the identification of the MEMN has been described previously9. Briefly, both genotype andgene

expression data were utilized to construct co-expression networks that consisted of highlyconnected genes from each tissue and sex. An iterative search algorithm was then used to detecthighly interconnected subnetworks. One particular subnetwork that was highly enriched forcausal genes for all metabolic traits tested, highly conserved between tissues and sexes, andhighly enriched for macrophage genes was referred as MEMN.

Liver expression data generated from the above three mouse cross populations BXH/wt,BXH/apoE, and BXC was integrated with the genotypic data also generated in the samepopulations to reconstruct the Bayesian networks as previously described21-24. Additionaldetails can be found in the Supplementary Methods.

The construction ofasubnetwork for a set of signature genes in the network is as follows: givena set of genes, weidentified all of the nearest neighbors of these genes in the network (i.e., weidentified all nodes in the network that were either in the input set or directly connected to a nodeNG-A23866

Revision 2

23

in the input set).

This first step produced a set of node pairs connected by an edge. In the nextstep, direct connections/edges among all the nodes in these pairs were identified and added. Theresulting largest connected subnetwork (i.e, all smaller subnetworks that disconnect from thelargest connected subnetwork were removed) was used as the subnetwork to represent the inputset of signature genes.

The core subnetwork was identified by searching for genes that were present in more than half,in this case, five of thesubnetworks representing the nine liver signature gene sets of the mousemodels. The enrichment of the core subnetwork for 2283 gene sets from Gene OntologyBiological Processes category was analyzed using Fisher's exact test. A statistical cutoff of 2.2e-5 was applied to the nominal p values to reflect the Bonferroni correction of multiple testing.

Comparison of signature genes across mouse models, between signature genes and theBayesian subnetwork, and between signature genes and MEMN genes

The significance of overlap between different gene sets was estimated using Fisher's exact teststatistics under the null hypothesis that the frequency of the genes in one signature set is thesame between a reference set of 18,739 genes with Entrez Gene ID and thecomparison gene set.The background for overlapping between gene sets and subnetworks in liver transcriptionalnetwork is 14,882 genes included in the whole network.

The raw association p values between SNPs and body mass index (BMI) from the GWASconducted by the BROAD Institute31

were downloaded from the official website(http://www.broad.mit.edu/diabetes/).For each gene in the 667 core subnetwork and the102mouse MEMN module9

genes within the core subnetwork, SNPs within 100kb distance (50kbupstream and 50kb downstream), termedcis-SNPs, were selected from dbSNP database and theirassociation p values to BMI in the control population of the BROAD GWAS study wereextracted. We compared each set ofcis-SNPs of interest with 105sets ofcis-SNPs fromrandomly selected gene sets with matched size on the human 44k array, such that the number ofcis-SNPs from the random gene sets roughly matched that of the gene sets of our interest. Twodifferent tests were used to estimate the significance of enrichment for low association p values

as detailed in SupplementaryMethods.

NG-A23866

Revision 2

25

Acknowledgements

The authors thankDr. Richard Davis, Pingzi Wen, Melenie Rosales, Xiaohui Wu, KathleenRanola, and Xiaoyu Xia for helping with the tissue collection. We would also like to thank LeslieIngram-Drake

Table2. Overlap among liver signature gene sets derived from the mouse models of thecandidate causal genes as well as between the signature genes and the previously identifiedmacrophage-enriched metabolic network (MEMN).

SignatureSet

Zfp90

Gas7

Gpx3

Lactb

Me1

Gyk

Lpl

C3ar1

Tgfbr2

MEMNmodule

Zfp90

0

1.14E-03

n.s

8.66E-03

1.83E-03

n.s

n.s

n.s

1.68E-02

1.45E-02

Gas7

1.14E-03

0

1.38E-17

9.55E-06

n.s

4.35E-12

4.29E-02

1.78E-03

n.s

n.s

Gpx3

n.s

1.38E-17

0

2.72E-06

n.s

1.12E-18

7.34E-04

n.s

n.s

1.86E-03

Lactb

8.66E-03

9.55E-06

2.72E-06

0

n.s

2.86E-07

4.61E-02

n.s

n.s

n.s

Me1

1.83E-03

n.s

n.s

n.s

0

n.s

5.87E-03

2.06E-02

n.s

n.s

Gyk

n.s

4.35E-12

1.12E-18

2.86E-07

n.s

0

n.s

n.s

n.s

n.s

Lpl

n.s

4.29E-02

7.34E-04

4.61E-02

5.87E-03

n.s

0

n.s

n.s

7.15E-04

C3ar1

n.s

1.78E-03

n.s

n.s

2.06E-02

n.s

n.s

0

n.s

1.89E-02

Tgfbr2

1.68E-02

n.s

n.s

n.s

n.s

n.s

n.s

n.s

0

1.37E-15

Uncorrected p values<0.05 from Fisher's exact test are listed and the ones that pass Bonferroni-corrected p<0.05 arehighlighted in red. For the overlap among the 9 liver signature gene sets, the Bonferroni corrected p value cutoff is0.05/45=1.11e-3. For the overlap between the signature gene sets and MEMN module, the Bonferroni corrected pvalue cutoff is 0.05/9=5.56e-3.

Figure 1.Adiposity (fat/muscle ratio)or body weightgrowth curves in the mouse models. Thegrowth curves of males and females for each model are derived from the biweekly measurementof fat/muscle ratio every two weeks over the course of 14 weeks ona 6% fatdiet. ForMe1

ko,the growth curves are from weekly measurement of body weight over the course of 10 weeks onahigh fat diet. The p values are derived from the autoregressive model, which indicate thedifferences between the growth curves of tg/ko and wt, and are < 10-10

for panels a, b, c, i, and j;< 10-5

for panels d and e; <

10-2

for panel g, and > 0.05 for panels f and h.

Figure 2.Disruptionof metabolic pathways involved in fat pad mass trait in mouse models of thecandidate genes. The nine mouse models are labeled as 1 to 9 in red. Each of the metabolicpathways previously identified to be different between fat and lean mouse

is marked with theidentifiers of themouse models whose liver gene expression signatures are enriched for thespecific pathway. The number of pathways that are over-represented in the liver gene expressionsignature of each mouse model is listed in parenthesis following the name of each mouse

model.

Figure 3.A portion of the core subnetwork,

derived from the liver transcriptional subnetworksrepresentative of gene expression signatures of the mouse models of the candidate genes. Theliver transcriptional network is the union of Bayesian networks constructed from three crossesderived from B6, C3H, and CAST. This core subnetwork consists of key regulators for fattyacid and lipid metabolism,includingInsig1

andInsig2

(in red),

and is enriched for genesinvolvedinrelated GO biological processes.

A scalable image of the full subnetwork is

includedas Supplementary

Figure5.

NG-A23866

Revision 2

35

EdSumm

Thomas Drake and colleagues report the results of knocking out nine candidate genes for obesityin mice. Eight of the nine knockouts result in significant changes in obesity-related traits,validating their previously developed approach for identifying candidate genes involved inparticular phenotypes. They further identify related metabolic pathways that are altered byknockout of the eight genes.