Patent application title: TWO BIOMARKERS FOR DIAGNOSIS AND MONITORING OF ATHEROSCLEROTIC CARDIOVASCULAR DISEASE

Abstract:

The present invention identifies two circulating proteins newly shown to
be differentially expressed in atherosclerosis. Circulating levels of
these two proteins, particularly as part of a panel of proteins, can
discriminate patients with acute myocardial infarction from those with
stable exertional angina and from those with no history of
atherosclerotic cardiovascular disease. Such levels can also predict
cardiovascular events, determine the effectiveness of therapy, stage
disease, and the like. For example, these markers are useful as surrogate
biomarkers of the clinical events needed for development of
vascular-specific pharmaceutical agents.

2. A method for generating a result useful in diagnosing and monitoring
atherosclerotic disease using a sample obtained from a mammalian subject,
comprising: obtaining a dataset associated with said sample, wherein said
dataset comprises protein expression levels for at least three protein
markers selected from the group consisting of RANTES, TIMP1, MCP-1,
MCP-2, MCP-3, MCP-4, eotaxin, IP-10, M-CSF, IL-3, TNFa, Ang-2, IL-5,
IL-7, and IGF-1, wherein one of the at least three protein markers is
RANTES or TIMP1; and inputting said dataset into an analytical process
that uses said data to generate a result useful in diagnosing and
monitoring atherosclerotic disease.

3. The method of claim 1 wherein said result is a classification, a
continuous variable or a vector.

4. The method of claim 3 wherein the classification comprises two or more
classes.

5. The method of claim 4 wherein the classification is a pseudo coronary
calcium score and the two or more classes are a low coronary calcium
score and a high coronary calcium score.

6. The method of claim 1 wherein said analytical process is a linear
algorithm, a quadratic algorithm, a polynomial algorithm, a decision tree
algorithm, a voting algorithm, a Linear Discriminant Analysis model, a
support vector machine classification algorithm, a recursive feature
elimination model, a prediction analysis of microarray model, a Logistic
Regression model, a CART algorithm, a FlexTree algorithm, a LART
algorithm, a random forest algorithm, a MART algorithm, or Machine
Learning algorithms.

7. The method of claim 1, wherein said analytical process comprises use of
a predictive model.

8. The method of claim 1, wherein said analytical process comprises
comparing said obtained dataset with a reference dataset.

9. The method of claim 8, wherein said reference dataset comprises protein
expression levels obtained from one or more healthy control subjects, or
comprises protein expression levels obtained from one or more subjects
diagnosed with an atherosclerotic disease.

10. The method of claim 8, further comprising obtaining a statistical
measure of a similarity of said obtained dataset to said reference
dataset.

11. The method of claim 10, wherein said statistical measure is derived
from a comparison of at least three parameters of said obtained dataset
to corresponding parameters from said reference dataset.

12. A method for classifying a sample obtained from a mammalian subject,
comprising: obtaining a dataset associated with said sample, wherein said
dataset comprises protein expression levels for at least three protein
markers selected from the group consisting of RANTES, TIMP1, MCP-1,
MCP-2, MCP-3, MCP-4, eotaxin, IP-10, M-CSF, IL-3, TNFa, Ang-2, IL-5,
IL-7, and IGF-1, wherein one of the at least three protein markers is
RANTES or TIMP1; inputting said dataset into an analytical process that
uses said data to classify said sample, wherein said classification is
selected from the group consisting of an atherosclerotic cardiovascular
disease classification, a healthy classification, a medication exposure
classification, a no medication exposure classification, a low coronary
calcium score, and a high coronary calcium score; and classifying said
sample according to the output of said process.

13. The method of claim 1, wherein said analytical process comprises use
of a predictive model.

14. The method of claim 1, wherein said analytical process comprises
comparing said obtained dataset with a reference dataset.

15. The method of claim 14, wherein said reference dataset comprises
protein expression levels obtained from one or more healthy control
subjects, or comprises protein expression levels obtained from one or
more subjects diagnosed with an atherosclerotic disease.

16. The method of claim 14, further comprising obtaining a statistical
measure of a similarity of said obtained dataset to said reference
dataset.

17. The method of claim 16, wherein said statistical measure is derived
from a comparison of at least three parameters of said obtained dataset
to corresponding parameters from said reference dataset.

18. The method of claim 1, wherein said at least three protein markers
comprise a marker set selected from the group consisting of RANTES,
TIMP1, MCP-1, IGF-1, TNFa, M-CSF, Ang-2, and MCP-4.

23. A method for classifying a sample obtained from a mammalian subject,
comprising: obtaining a dataset associated with said sample, wherein said
dataset comprises protein expression levels for at least three protein
markers selected from the group consisting of MCP1, MCP2, MCP3, MCP4,
Eotaxin, IP10, MCSF, IL3, TNFα, ANG2, IL5, IL7, IGF1, IL10,
INFγ, VEGF, MIP1a, RANTES, IL6, IL8, ICAM-1, TIMP1, CCL19,
TCA4/6kine/CCL21, CSF3, TRANCE, IL2, IL4, IL13, Il1b, CXCL1/GRO1,
GROalpha, IL12, and Leptin, wherein one of the at least three protein
markers is RANTES or TIMP1; inputting said data into a predictive model
that uses said data to classify said sample, wherein said classification
is selected from the group consisting of an atherosclerotic
cardiovascular disease classification, a healthy classification, a
medication exposure classification, and a no medication exposure
classification, wherein said predictive model has at least one quality
metric of at least 0.7 for classification; and classifying said sample
according to the output of said predictive model.

24. The method of claim 23, wherein said predictive model has a quality
metric of at least 0.8 for classification.

25. The method of claim 24, wherein said predictive model has a quality
metric of at least 0.9 for classification.

26. The method of claim 23, wherein said quality metric is selected from
AUC and accuracy.

27. The method of claim 23, wherein the limits of said predictive model
are adjusted to provide at least one of sensitivity or specificity of at
least 0.7.

28. The method of claim 25, wherein the limits of said predictive model
are adjusted to provide at least one of sensitivity or specificity of at
least 0.7.

29. The method of claim 1, wherein said atherosclerotic cardiovascular
disease classification is selected from the group consisting of coronary
artery disease, myocardial infarction, and angina.

35. The method of claim 34, wherein said process comprises using a Linear
Discriminant Analysis model or a Logistic Regression model, and said
model comprises terms selected to provide a quality metric greater than
0.75.

36. The method of claim 1, further comprising obtaining a plurality of
classifications for a plurality of samples obtained at a plurality of
different times from said subject.

Description:

CROSS REFERENCE TO RELATED APPLICATION

[0001]This application claims the benefit of U.S. Provisional Application
No. 60/876,614, filed Dec. 22, 2006, which is hereby incorporated by
reference in its entirety.

BACKGROUND OF THE INVENTION

[0002]1. Field of the Invention

[0003]This application is directed to the fields of bioinformatics and
atherosclerotic disease. In particular this invention relates to methods
and compositions for diagnosing and monitoring atherosclerotic disease.

[0004]2. Description of the Related Art

[0005]Because of our limited ability to provide early and accurate
diagnosis followed by aggressive treatment, atherosclerotic
cardiovascular disease (ASCVD) remains the primary cause of morbidity and
mortality worldwide. Patients with ASCVD represent a heterogeneous group
of individuals, with a disease that progresses at different rates and in
distinctly different patterns. Despite appropriate evidence-based
treatments for patients with ASCVD, recurrence and mortality rates remain
high. Also, the full benefits of primary prevention are unrealized due to
our inability to accurately identify those patients who would benefit
from aggressive risk reduction.

[0006]Whereas certain disease markers have been shown to predict outcome
or response to therapy at a population level, they are not sufficiently
sensitive or specific to provide adequate clinical utility in an
individual patient. As a result, the first clinical presentation for more
than half of the patients with coronary artery disease is either
myocardial infarction or death.

[0007]Physical examination and current diagnostic tools cannot accurately
determine an individual's risk for suffering a complication of ASCVD.
Known risk factors such as hypertension, hyperlipidemia, diabetes, family
history, and smoking do not establish the diagnosis of atherosclerotic
disease. Diagnostic modalities which rely on anatomical data (such as
coronary angiography, coronary calcium score, CT or MRI angiography) lack
information on the biological activity of the disease process and can be
poor predictors of future cardiac events. Functional assessment of
endothelial function can be non-specific and unrelated to the presence of
the atherosclerotic disease process, although some data have demonstrated
the prognostic value of these measurements. Individual biomarkers, such as
the lipid and inflammatory markers, have been shown to predict outcome
and response to therapy in patients with ASCVD and some are utilized as
important risk factors for developing atherosclerotic disease.
Nonetheless, up to this point, no single biomarker is sufficiently
specific to provide adequate clinical utility for the diagnosis of ASCVD
in an individual patient.

Complex Nature of Atherosclerotic Cardiovascular Disease

[0008]In general, atherosclerosis is believed to be a complex disease
involving multiple biological pathways. Variations in the natural history
of the atherosclerotic disease process, as well as differential response
to risk factors and variations in the individual response to therapy,
reflect in part differences in genetic background and their intricate
interactions with the environmental factors that are responsible for the
initiation and modification of the disease. Atherosclerotic disease is
also influenced by the complex nature of the cardiovascular system itself
where anatomy, function and biology all play important roles in health as
well as disease. Given such complexities, it is unlikely that an
individual marker or approach will yield sufficient information to
capture the true nature of the disease process.

Single Biomarker Approach

Inflammation

[0009]Inflammation has been implicated in all stages of ASCVD and is
considered to be a major part of the pathophysiological basis of
atherogenesis, providing a potential marker of the disease process.
Elevated circulating inflammatory biomarkers have been shown to stratify
cardiovascular risk and assess response to therapy in large
epidemiological studies. Currently, while general markers of inflammation
are potentially useful in risk stratification, they are not adequate to
identify the presence of CAD in an individual, due to a lack of specificity
for many markers. For similar reasons, the general markers of
inflammation such as C-reactive protein (CRP) and erythrocyte
sedimentation rate (ESR) have long been abandoned as specific diagnostic
markers in other inflammatory diseases such as lupus and rheumatoid
arthritis, although they remain important markers for risk stratification
and response to therapy in clinical practice.

[0010]It is also possible that the heterogeneity of the individual
response to environmental risk factors induces a high variability in
ASCVD marker concentration. In this context, biological information
carried by a single inflammatory protein cannot be sufficient in
providing a comprehensive representation of the vascular inflammatory
state, and may not be able to accurately identify the presence or extent
of the disease.

Pathophysiological Basis of Atherosclerosis

[0011]Atherosclerotic plaque consists of accumulated intracellular and
extracellular lipids, smooth muscle cells, connective tissue, and
glycosaminoglycans. The earliest detectable lesion of atherosclerosis is
the fatty streak, consisting of lipid-laden foam cells, which are
macrophages that have migrated as monocytes from the circulation into the
subendothelial layer of the intima. The fatty streak later evolves into
the fibrous plaque, consisting of intimal smooth muscle cells surrounded
by connective tissue and intracellular and extracellular lipids. As plaques
develop, calcium is deposited.

[0012]Interrelated hypotheses have been proposed to explain the
pathogenesis of atherosclerosis. The lipid hypothesis postulates that an
elevation in plasma LDL levels results in penetration of LDL into the
arterial wall, leading to lipid accumulation in smooth muscle cells and
in macrophages. LDL also augments smooth muscle cell hyperplasia and
migration into the subintimal and intimal region in response to growth
factors. LDL is modified or oxidized in this environment and is rendered
more atherogenic. The modified or oxidized LDL is chemotactic to
monocytes, promoting their migration into the intima, their early
appearance in the fatty streak, and their transformation and retention in
the subintimal compartment as macrophages. Scavenger receptors on the
surface of macrophages facilitate the entry of oxidized LDL into these
cells, transforming them into lipid-laden macrophages and foam cells.
Oxidized LDL is also cytotoxic to endothelial cells and may be
responsible for their dysfunction or loss from the more advanced lesion.

[0013]The chronic endothelial injury hypothesis postulates that
endothelial injury by various mechanisms produces loss of endothelium,
adhesion of platelets to subendothelium, aggregation of platelets,
chemotaxis of monocytes and T-cell lymphocytes, and release of
platelet-derived and monocyte-derived growth factors that induce
migration of smooth muscle cells from the media into the intima, where
they replicate, synthesize connective tissue and proteoglycans, and form
a fibrous plaque. Other cells, e.g. macrophages, endothelial cells,
arterial smooth muscle cells, also produce growth factors that can
contribute to smooth muscle hyperplasia and extracellular matrix
production.

[0014]Endothelial dysfunction includes increased endothelial permeability
to lipoproteins and other plasma constituents, expression of adhesion
molecules and elaboration of growth factors that lead to increased
adherence of monocytes, macrophages and T lymphocytes. These cells may
migrate through the endothelium and situate themselves within the
subendothelial layer. Foam cells release growth factors and cytokines
that promote migration of smooth muscle cells and stimulate neointimal
proliferation; they also continue to accumulate lipid and to support
endothelial cell dysfunction. Clinical and laboratory studies have shown
that inflammation plays a major role in the initiation, progression and
destabilization of atheromas.

[0015]The "autoimmune" hypothesis postulates that the inflammatory
immunological processes characteristic of the very first stages of
atherosclerosis are initiated by humoral and cellular immune reactions
against an endogenous antigen. Heat shock protein 60 (Hsp60) is one
candidate autoantigen: human Hsp60 expression is itself a response to
injury initiated by several stress factors known to be risk factors for
atherosclerosis, such as hypertension. Oxidized LDL is another candidate
autoantigen in atherosclerosis. Antibodies to oxLDL have been detected in
patients with atherosclerosis, and they have been found in
atherosclerotic lesions. T lymphocytes isolated from human
atherosclerotic lesions have been shown to respond to oxLDL, indicating
that it is a major autoantigen in the cellular immune response. A third
autoantigen proposed to be associated with atherosclerosis is
β2-glycoprotein I (β2GPI), a glycoprotein that acts as an anticoagulant
in vitro. β2GPI is found in atherosclerotic plaques, and
hyper-immunization with β2GPI or transfer of β2GPI-reactive T cells
enhances fatty streak formation in atherosclerosis-prone transgenic mice.

[0016]Infections may contribute to the development of atherosclerosis by
inducing both inflammation and autoimmunity. A large number of studies
have demonstrated a role of infectious agents, both viruses
(cytomegalovirus, herpes simplex viruses, enteroviruses, hepatitis A) and
bacteria (C. pneumoniae, H. pylori, periodontal pathogens) in
atherosclerosis. Recently, a new "pathogen burden" hypothesis has been
proposed, suggesting that multiple infectious agents contribute to
atherosclerosis, and that the risk of cardiovascular disease posed by
infection is related to the number of pathogens to which an individual
has been exposed. Of single micro-organisms, C. pneumoniae probably has
the strongest association with atherosclerosis.

[0017]These hypotheses are closely linked and not mutually exclusive.
Modified LDL is cytotoxic to cultured endothelial cells and may induce
endothelial injury, attract monocytes and macrophages, and stimulate
smooth muscle growth. Modified LDL also inhibits macrophage mobility, so
that once macrophages transform into foam cells in the subendothelial
space they may become trapped. In addition, regenerating endothelial
cells (after injury) are functionally impaired and increase the uptake of
LDL from plasma.

[0018]Atherosclerosis is characteristically silent until critical
stenosis, thrombosis, aneurysm, or embolus supervenes. Initially,
symptoms and signs reflect an inability of blood flow to the affected
tissue to increase with demand, e.g. angina on exertion, intermittent
claudication. Symptoms and signs commonly develop gradually as the
atheroma slowly encroaches on the vessel lumen. However, when a major
artery is acutely occluded, the symptoms and signs may be dramatic.

[0019]As noted above, due to the lack of appropriate diagnostic
strategies, the first clinical presentation for more than half of
patients with coronary artery disease is either myocardial infarction or
death. Further progress in prevention and treatment depends on the
development of strategies focused on the primary inflammatory process in
the vascular wall, which is fundamental in the etiology of
atherosclerotic disease. Without good surrogate markers that accurately
report the activity and/or extent of vessel wall disease, methods cannot
be developed that completely define risk, monitor the effects of risk
reduction toward primary disease amelioration, or develop new classes of
therapies that target the vessel wall.

[0020]One promising approach is the identification of circulating proteins
that reflect the degree and character of vascular inflammation as the
hallmark of active cardiovascular disease. A number of immune modulatory
proteins have been identified to have some value as surrogate markers,
but such biomarkers have not been shown to add sufficient information to
have clinical utility. This is due to: i) the failure to consider data on
multiple markers measured in parallel, ii) the failure to integrate
individual marker data with clinical data that modulates the levels of
circulating proteins and obscures the informative patterns, iii)
inherited genetic variation that contributes to expression levels of the
genes encoding the markers and confounds the abundance measurements, and
iv) a lack of information regarding specific immune pathways activated in
ASCVD that would better inform biomarker choice. Finally, the prior art
fails to provide effective diagnostic or predictive methods using
measurements of a panel of circulating proteins.

Unmet Clinical and Scientific Need

[0021]As described above, there is an unmet need for use in clinical
medicine and biomedical research for improved tools to identify
individuals with vascular inflammation and active atherosclerotic
cardiovascular disease. At present, although insights into mechanisms and
circumstances of atherosclerosis are increasing, our methods for
identifying high-risk patients and predicting the efficacy of prevention
strategies remain inadequate. New approaches are needed to better
diagnose patients with active atherosclerotic cardiovascular disease at
risk for near-term cardiovascular complications. Identification of such
patients can lead to initiation of much needed therapies that can result
in improved clinical outcomes. The present invention addresses these and
other shortcomings of the prior art.

[0023]Another preferred set of protein markers is RANTES, TIMP1, MCP-1,
MCP-2, MCP-3, MCP-4, eotaxin, IP-10, M-CSF, IL-3, TNFa, Ang-2, IL-5,
IL-7, and IGF-1. In certain aspects, the result will be a classification,
a continuous variable or a vector. Such classifications may include two
or more classes, three or more classes, four or more classes, or five or
more classes. An exemplary classification is a pseudo coronary calcium
score where the two or more classes are a low coronary calcium score and
a high coronary calcium score.

[0024]Preferred forms of the analytical process are a linear algorithm, a
quadratic algorithm, a polynomial algorithm, a decision tree algorithm, a
voting algorithm, a Linear Discriminant Analysis model, a support vector
machine classification algorithm, a recursive feature elimination model,
a prediction analysis of microarray model, a Logistic Regression model, a
CART algorithm, a FlexTree algorithm, a LART algorithm, a random forest
algorithm, a MART algorithm, or Machine Learning algorithms. The
analytical processes may use a predictive model or may involve comparing
the obtained dataset with a reference dataset. In certain aspects, the
reference dataset may be data obtained from one or more healthy control
subjects or from one or more subjects diagnosed with an atherosclerotic
disease. Comparing the reference dataset to the obtained dataset may
include obtaining a statistical measure of a similarity of said obtained
dataset to said reference dataset, which may be a comparison of at least
three parameters of said obtained dataset to corresponding parameters
from said reference dataset.
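By way of illustration only, such a comparison of at least three parameters against a reference dataset can be sketched as a simple z-score statistic. The marker names are taken from the panels above, but the reference means, standard deviations, and the choice of a mean squared z-score as the similarity measure are all assumptions of this sketch, not values or methods disclosed in this application.

```python
# Hypothetical reference statistics (mean, standard deviation) for three
# markers, e.g. derived from healthy control subjects. Values are
# illustrative only, not from this application.
REFERENCE = {
    "RANTES": (12.0, 3.0),
    "TIMP1": (95.0, 20.0),
    "MCP-1": (150.0, 40.0),
}

def similarity_to_reference(sample: dict) -> float:
    """Return a simple similarity statistic: the mean squared z-score of
    the sample's marker levels relative to the reference dataset.
    Lower values mean the sample more closely resembles the reference."""
    zsqs = []
    for marker, (mu, sigma) in REFERENCE.items():
        z = (sample[marker] - mu) / sigma
        zsqs.append(z * z)
    return sum(zsqs) / len(zsqs)

# A sample near the reference means yields a small statistic;
# a sample far from them yields a large one.
near = {"RANTES": 12.5, "TIMP1": 100.0, "MCP-1": 160.0}
far = {"RANTES": 30.0, "TIMP1": 200.0, "MCP-1": 400.0}
print(similarity_to_reference(near))
print(similarity_to_reference(far))
```

A reference dataset drawn from diseased subjects could be handled the same way, with the class assigned to whichever reference the sample is most similar to.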

[0025]In certain aspects, the classes may be an atherosclerotic
cardiovascular disease classification, a healthy classification, a
medication exposure classification, a no medication exposure
classification, a low coronary calcium score, and a high coronary calcium
score.

[0027]Preferred analytical processes will provide a quality metric of at
least 0.7, at least 0.75, at least 0.8, at least 0.85, or at least 0.9,
where preferred quality metrics are AUC and accuracy. Additionally,
preferred analytical processes will provide at least one of sensitivity
or specificity of at least 0.65, at least 0.7, or at least 0.75.
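These quality metrics can be computed directly. In the sketch below, the scores and labels are synthetic; `auc` is the standard rank-based estimator, and `threshold_for_sensitivity` lowers the decision limit until the true positive rate reaches a stated target, mirroring the adjustment of model limits described above.

```python
import math

def auc(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen positive
    scores higher than a randomly chosen negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def threshold_for_sensitivity(scores, labels, target=0.7):
    """Return the highest decision threshold (classify score >= t as
    positive) whose sensitivity is at least `target`."""
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    needed = math.ceil(target * len(pos))  # positives that must be captured
    return pos[needed - 1]

# Synthetic model scores and true disease labels (1 = disease).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
print(auc(scores, labels))                        # 0.875, above a 0.7 metric
print(threshold_for_sensitivity(scores, labels))  # 0.6 captures 3 of 4 positives
```

Lowering the threshold raises sensitivity at the expense of specificity, which is why the claims recite adjusting the limits to meet at least one of the two.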

[0029]In addition to the protein markers disclosed herein, markers may be
selected from one or more clinical indicia, examples of which are age,
gender, LDL concentration, HDL concentration, triglyceride concentration,
blood pressure, body mass index, CRP concentration, coronary calcium
score, waist circumference, tobacco smoking status, previous history of
cardiovascular disease, family history of cardiovascular disease, heart
rate, fasting insulin concentration, fasting glucose concentration,
diabetes status, and use of high blood pressure medication.

[0030]This invention provides methods for detection of circulating protein
expression for diagnosis, monitoring, and development of therapeutics,
with respect to atherosclerotic conditions, including but not limited to
conditions that lead to angina, unstable angina, acute coronary syndrome,
myocardial infarction, and heart failure. Specifically, circulating
proteins are identified and described herein that are differentially
expressed in atherosclerotic patients, including but not limited to
circulating inflammatory markers. Circulating inflammatory markers
identified herein include MCP-1, MCP-2, MCP-3, MCP-4, eotaxin, IP-10,
M-CSF, IL-3, TNFa, Ang-2, IL-5, IL-7, and IGF-1.

[0031]The detection of circulating levels of proteins identified herein,
which are specifically produced in the vascular wall as a result of the
atherosclerotic process, can classify patients according to
atherosclerotic condition, including atherosclerotic disease, no disease,
myocardial infarction, stable angina, treatment with medication, no
treatment, and the like. Such classifications can also be used to predict
cardiovascular events and response to therapeutics, and are useful to
predict and assess complications of cardiovascular disease.

[0032]In one embodiment of the invention, the expression profile of a
panel of proteins is evaluated for conditions indicative of various
stages of atherosclerosis and clinical sequelae thereof. Such a panel
provides a level of discrimination not found with individual markers. In
one embodiment, the expression profile is determined by measurements of
protein concentrations or amounts.

[0033]Methods of analysis may include, without limitation, utilizing a
dataset to generate a predictive model, and inputting test sample data
into such a model in order to classify the sample according to an
atherosclerotic classification, where the classification is selected from
the group consisting of an atherosclerotic disease classification, a
healthy classification, a vascular inflammation classification, a
medication exposure classification, a no medication exposure
classification, and a coronary calcium score classification, and
classifying the sample according to the output of the process. In some
embodiments, such a predictive model is used in classifying a sample
obtained from a mammalian subject by obtaining a dataset associated with
a sample, wherein the dataset comprises at least three, or at least four,
or at least five protein markers selected from the group consisting of
TIMP1; RANTES; MCP1; MCP2; MCP3; MCP4; Eotaxin; IP10; MCSF; IL3; TNFa;
ANG2; IL5; IL7; IGF1; IL10; INFγ; VEGF; MIP1a; IL6; IL8; ICAM-1; IL2;
IL4; IL13; and Il1b. The data optionally includes a profile
for clinical indicia; additional protein expression profiles; metabolic
measures, genetic information, and the like.

[0034]A predictive model of the invention utilizes quantitative data, such
as protein expression levels, from one or more sets of markers described
herein. In some embodiments a predictive model provides for a level of
accuracy in classification; i.e. the model satisfies a desired quality
threshold. A quality threshold of interest may provide for an accuracy or
AUC of a given threshold, and either or both of these terms (AUC;
accuracy) may be referred to herein as a quality metric. A predictive
model may provide a quality metric, e.g. accuracy of classification or
AUC, of at least about 0.7, at least about 0.8, at least about 0.9, or
higher. Within such a model, parameters may be appropriately selected so
as to provide for a desired balance of sensitivity and specificity.

[0035]In other embodiments, analysis of circulating proteins is used in a
method of screening biologically active agents for efficacy in the
treatment of atherosclerosis. In such methods, cells associated with
atherosclerosis, e.g. cells of the vessel wall, etc., are contacted in
culture or in vivo with a candidate agent, and the effect on expression
of one or more of the markers, e.g. a panel of markers, is determined. In
another embodiment, analysis of differential expression of the above
circulating proteins is used in a method of following therapeutic
regimens in patients. At a single time point or over a time course,
measurements of expression of one or more of the markers, e.g. a panel of
markers, are determined when a patient has been exposed to a therapy,
which may include a drug, combination of drugs, non-pharmacologic
intervention, and the like.

[0036]In another method, relative quantitative measures of three or more
of the atherosclerosis-associated proteins identified herein are used to
diagnose or monitor atherosclerotic disease in an individual. This panel
of proteins identified herein can further include other clinical indicia;
additional protein expression profiles; metabolic measures, genetic
information, and the like.

[0037]In another embodiment, the invention includes methods for
classifying a sample obtained from a mammalian subject by obtaining a
dataset associated with a sample, wherein the dataset comprises protein
expression levels for at least three, or at least four, or at least five,
or at least six, or at least seven, or at least eight, or at least nine,
or more than nine protein markers selected from the group consisting of
TIMP1, RANTES, MCP-1, MCP-2, MCP-3, MCP-4, eotaxin, IP-10, M-CSF, IL-3,
TNFa, Ang-2, IL-5, IL-7, and IGF-1, inputting the data into an analytical
process that uses the data to classify the sample, where the
classification is selected from the group consisting of an
atherosclerotic disease classification, a healthy classification, a
vascular inflammation classification, a medication exposure
classification, a no medication exposure classification, and a coronary
calcium score classification, and classifying the sample according to the
output of the process.
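As a minimal sketch of this classification step, a fitted Logistic Regression model over three of the markers reduces to a weighted sum passed through the logistic function. The intercept, weights, and cutoff below are invented for illustration; actual coefficients would be fitted to a training dataset, as described for the models in FIGS. 1 and 2.

```python
import math

# Hypothetical coefficients for a Logistic Regression model over three
# of the protein markers. These values are assumptions of this sketch,
# not coefficients disclosed in this application.
INTERCEPT = -4.0
WEIGHTS = {"TIMP1": 0.02, "MCP-1": 0.01, "RANTES": 0.05}

def classify(sample: dict, cutoff: float = 0.5) -> str:
    """Map marker expression levels to a probability via the logistic
    function, then to one of two classes at the given cutoff."""
    linear = INTERCEPT + sum(WEIGHTS[m] * sample[m] for m in WEIGHTS)
    prob = 1.0 / (1.0 + math.exp(-linear))
    return "atherosclerotic disease" if prob >= cutoff else "healthy"

print(classify({"TIMP1": 300.0, "MCP-1": 200.0, "RANTES": 80.0}))
print(classify({"TIMP1": 50.0, "MCP-1": 50.0, "RANTES": 10.0}))
```

Moving the `cutoff` is the single knob that trades sensitivity against specificity for a fixed fitted model.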

[0038]In another embodiment, the invention includes methods for
classifying a sample obtained from a mammalian subject by obtaining a
dataset associated with a sample, wherein the dataset comprises protein
expression levels for at least three, or at least four, or at least five,
or at least six, protein markers that each shows a correlation between a
circulating protein concentration and an atherosclerotic vascular tissue
RNA concentration, inputting the data into an analytical process that
uses the data to classify the sample, where the classification is
selected from the group consisting of an atherosclerotic disease
classification, a healthy classification, a vascular inflammation
classification, a medication exposure classification, a no medication
exposure classification, and a coronary calcium score classification, and
classifying the sample according to the output of the process.

BRIEF DESCRIPTION OF THE DRAWINGS

[0039]FIG. 1 shows term selection for a Logistic regression model using
cross-validation. A model including TIMP1, MCP-1 and RANTES satisfies the
expected AUC threshold of 0.85.

[0040]FIG. 2 shows the term selection for a Linear discriminant analysis
model using cross-validation. A model including TIMP1, MCP-1 and RANTES
satisfies the expected AUC threshold of 0.85.

[0041]FIG. 3 shows the term selection for a Logistic regression model
using cross-validation for the classification of subjects with CCS<10
vs. those with CCS>400.

[0042]FIG. 4 shows the term selection for a Logistic regression model
using the AIC criterion for the classification of subjects with CCS<10
vs. those with CCS>400.

[0044]FIG. 5b shows the expected AUC value and S.E. for a series of
Logistic Regression models involving an increasing number of terms, in
the order given in the figure (the inverse of the order in which terms
were removed from the complete model by applying the AIC criterion in the
marker selection process).

[0046]FIG. 7 shows a Logistic regression model including alternate
clinical variables and biological markers. A model including "Beta
Blockers" (DC512) and "Statins" (DC3005) and MCP-4 produces an expected
value of AUC in excess of 0.85.

[0047]FIG. 8 shows boxplots of value distribution of the first
discriminant variate for the three groups: "Untreated," "ACE or Statins,"
and "ACE and Statins."

[0048]FIG. 9 shows the general method applied using 10-fold
cross-validation to select an optimum set of markers with an optimum
analytical process.

[0049]FIG. 10 shows a demonstration of the 10-fold cross-validation
approach to select an optimum set of markers using accuracy as a
selection criterion.

DETAILED DESCRIPTION OF THE INVENTION

Overview

[0050]The methods of this invention are useful for diagnosing and
monitoring atherosclerotic disease. Atherosclerotic disease is also known
as atherosclerosis, arteriosclerosis, atheromatous vascular disease,
arterial occlusive disease, or cardiovascular disease, and is
characterized by plaque accumulation on vessel walls and vascular
inflammation. Vascular inflammation is a hallmark of active
atherosclerotic disease, unstable plaque, or vulnerable plaque. The plaque consists of
accumulated intracellular and extracellular lipids, smooth muscle cells,
connective tissue, inflammatory cells, and glycosaminoglycans. Certain
plaques also contain calcium. Unstable or active or vulnerable plaques
are enriched with inflammatory cells.

[0051]By way of example, the present invention includes methods for
generating a result useful in diagnosing and monitoring atherosclerotic
disease by obtaining a dataset associated with a sample, where the
dataset at least includes quantitative data (typically protein expression
levels) about protein markers which Applicants have identified as
predictive of atherosclerotic disease, and inputting the dataset into an
analytic process that uses the dataset to generate a result useful in
diagnosing and monitoring atherosclerotic disease. In certain
embodiments, the dataset also includes quantitative data about other
protein markers previously identified by others as being predictive of
atherosclerotic disease and clinical indicia. This quantitative data
about other protein markers may be DNA, RNA, or protein expression
levels.

[0052]The present invention identifies expression profiles of biomarkers
of inflammation that can be used for diagnosis and classification of
atherosclerotic cardiovascular disease. The protein markers used in the
present invention are those identified using a learning algorithm as
being capable of distinguishing between different atherosclerotic
classifications, e.g., diagnosis, staging, prognosis, monitoring,
therapeutic response, prediction of pseudo-coronary calcium score. Other
data useful for making atherosclerotic classifications, such as other
protein markers previously identified as being predictive of
cardiovascular disease and various clinical indicia, may also be a part
of the dataset used to generate a result useful for atherosclerotic
classification.

[0053]Datasets containing quantitative data, typically protein expression
levels, for the various protein markers used in the present invention,
and quantitative data for other dataset components (e.g., DNA, RNA, and
protein expression levels for markers previously identified as useful by
others, measures of clinical indicia) can be inputted into an analytical
process and used to generate a result. The analytic process may be any
type of learning algorithm with defined parameters, or in other words, a
predictive model. Predictive models can be developed for a variety of
atherosclerotic classifications by applying learning algorithms to the
appropriate type of reference or control data. The result of the
analytical process/predictive model can be used by an appropriate
individual to take the appropriate course of action. For example, if the
classification is "healthy" or "atherosclerotic cardiovascular disease",
then a result can be used to determine the appropriate clinical course of
treatment for an individual.

[0054]The present invention is also useful for diagnosing and monitoring
complications of cardiovascular disease, including myocardial infarction,
acute coronary syndrome, stroke, heart failure, and angina. An example of
a common complication is myocardial infarction, which refers to ischemic
myocardial necrosis usually resulting from abrupt reduction in coronary
blood flow to a segment of myocardium. In the great majority of patients
with acute MI, an acute thrombus, often associated with plaque rupture,
occludes the artery that supplies the damaged area. Plaque rupture occurs
generally in arteries previously partially obstructed by an
atherosclerotic plaque enriched in inflammatory cells. Altered platelet
function induced by endothelial dysfunction and vascular inflammation in
the atherosclerotic plaque presumably contributes to thrombogenesis.
Myocardial infarction can be classified into ST-elevation and non-ST
elevation MI (also referred to as unstable angina). In both forms of
myocardial infarction, there is myocardial necrosis. In ST-elevation
myocardial infarction there is transmural myocardial injury, which leads
to ST-elevations on electrocardiogram. In non-ST elevation myocardial
infarction, the injury is sub-endocardial and is not associated with ST
segment elevation on electrocardiogram. Another example of a common
atherosclerotic complication is angina, a condition with symptoms of
chest pain or discomfort resulting from inadequate blood flow to the
heart.

DEFINITIONS

[0055]Terms used in the claims and specification are defined as set forth
below unless otherwise specified.

[0056]The term "monitoring" as used herein refers to the use of results
generated from datasets to provide useful information about an individual
or an individual's health or disease status. "Monitoring" can include,
for example, determination of prognosis, risk-stratification, selection
of drug therapy, assessment of ongoing drug therapy, determination of
effectiveness of treatment, prediction of outcomes, determination of
response to therapy, diagnosis of a disease or disease complication,
following of progression of a disease or providing any information
relating to a patient's health status over time, selecting patients most
likely to benefit from experimental therapies with known molecular
mechanisms of action, selecting patients most likely to benefit from
approved drugs with known molecular mechanisms where that mechanism may
be important in a small subset of a disease for which the medication may
not have a label, screening a patient population to help decide on a more
invasive/expensive test, for example, a cascade of tests from a
non-invasive blood test to a more invasive option such as biopsy, or
testing to assess side effects of drugs used to treat another indication.
In particular, the term "monitoring" can refer to atherosclerosis
staging, atherosclerosis prognosis, vascular inflammation levels,
assessing extent of atherosclerosis progression, monitoring a therapeutic
response, predicting a coronary calcium score, or distinguishing stable
from unstable manifestations of atherosclerotic disease.

[0057]The term "quantitative data" as used herein refers to data
associated with any dataset components (e.g., protein markers, clinical
indicia, metabolic measures, or genetic assays) that can be assigned a
numerical value. Quantitative data can be a measure of the DNA, RNA, or
protein level of a marker and expressed in units of measurement such as
molar concentration, concentration by weight, etc. For example, if the
marker is a protein, quantitative data for that marker can be protein
expression levels measured using methods known to those skill in the art
and expressed in mM or mg/dL concentration units.

[0058]The term "ameliorating" refers to any therapeutically beneficial
result in the treatment of a disease state, e.g., an atherosclerotic
disease state, including prophylaxis, lessening in the severity or
progression, remission, or cure thereof.

[0059]The term "mammal" as used herein includes both humans and
non-humans, including but not limited to humans, non-human primates,
canines, felines, murines, bovines, equines, and porcines.

[0060]The term "pseudo coronary calcium score" as used herein refers to a
coronary calcium score generated using the methods as disclosed herein
rather than through measurement by an imaging modality. One of skill in
the art would recognize that a pseudo coronary calcium score may be used
interchangeably with a coronary calcium score generated through
measurement by an imaging modality.

[0061]The term percent "identity" in the context of two or more nucleic
acid or polypeptide sequences, refers to two or more sequences or
subsequences that have a specified percentage of nucleotides or amino
acid residues that are the same, when compared and aligned for maximum
correspondence, as measured using one of the sequence comparison
algorithms described below (e.g., BLASTP and BLASTN or other algorithms
available to persons of skill) or by visual inspection. Depending on the
application, the percent "identity" can exist over a region of the
sequence being compared, e.g., over a functional domain, or,
alternatively, exist over the full length of the two sequences to be
compared.

[0062]For sequence comparison, typically one sequence acts as a reference
sequence to which test sequences are compared. When using a sequence
comparison algorithm, test and reference sequences are input into a
computer, subsequence coordinates are designated, if necessary, and
sequence algorithm program parameters are designated. The sequence
comparison algorithm then calculates the percent sequence identity for
the test sequence(s) relative to the reference sequence, based on the
designated program parameters.
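By way of illustration only, the percent-identity calculation for a pair of already-aligned sequences may be sketched as follows. This is a minimal stand-in for the alignment-based comparison described above; an actual comparison would typically use BLASTP or BLASTN, and the function name and example sequences here are hypothetical.

```python
# Sketch: percent identity between two pre-aligned sequences (gaps marked
# with '-'). The alignment itself is assumed to be given; gap positions
# count as mismatches.
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Return the percent identity over the aligned region."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b)
                  if a == b and a != "-")
    return 100.0 * matches / len(aligned_a)

# Example: 9 of 10 aligned residues identical -> 90% identity
print(percent_identity("MKTAYIAKQR", "MKTAYIAKQK"))  # -> 90.0
```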

[0064]One example of an algorithm that is suitable for determining percent
sequence identity and sequence similarity is the BLAST algorithm, which
is described in Altschul et al., J. Mol. Biol. 215:403-410 (1990).
Software for performing BLAST analyses is publicly available through the
National Center for Biotechnology Information (www.ncbi.nlm.nih.gov/).

[0065]The term "sufficient amount" means an amount sufficient to produce a
desired effect, e.g., an amount sufficient to alter a protein expression
profile.

[0066]The term "therapeutically effective amount" is an amount that is
effective to ameliorate a symptom of a disease. A therapeutically
effective amount can be a "prophylactically effective amount" as
prophylaxis can be considered therapy.

[0087]In addition to the other markers disclosed herein, the markers may
be selected from one or more clinical indicia, examples of which are age,
gender, LDL concentration, HDL concentration, triglyceride concentration,
blood pressure, body mass index, CRP concentration, coronary calcium
score, waist circumference, tobacco smoking status, previous history of
cardiovascular disease, family history of cardiovascular disease, heart
rate, fasting insulin concentration, fasting glucose concentration,
diabetes status, and use of high blood pressure medication. Further
markers are disclosed in U.S. patent application Ser. No. 11/473,826, which
is hereby incorporated by reference in its entirety.

[0088]Additional information regarding preferred markers is provided in
Tables 1A and 1B, which contain information taken from Genbank.

[0089]In addition to the specific biomarker sequences identified in this
application by name, accession number, or sequence, the invention also
contemplates use of biomarker variants that are at least 90% or at least
95% or at least 97% identical to the exemplified sequences and that are
now known or later discovered and that have utility for the methods of
the invention. These variants may represent polymorphisms, splice
variants, mutations, and the like.

Identification of Additional Protein Markers

[0090]Additional protein markers useful for making atherosclerotic
classifications may be identified using learning algorithms known in the
art (described in further detail in the section entitled "Learning
Algorithms") or other methods known in the art for identifying useful
markers, such as imaging or differential analysis of mRNA expression
levels.

[0091]For example, in vivo imaging may be utilized to detect the presence
of atherosclerosis associated proteins in heart tissue. Such methods may
utilize, for example, labeled antibodies or ligands specific for such
proteins. In these embodiments, a detectably-labeled moiety, e.g., an
antibody, ligand, etc., which is specific for the polypeptide is
administered to an individual (e.g., by injection), and labeled cells are
located using standard imaging techniques, including, but not limited to,
magnetic resonance imaging, computed tomography scanning, and the like.
Detection may utilize one or a cocktail of imaging reagents.

[0092]Alternatively, an mRNA sample from vessel tissue, preferably from
one or more vessels affected by atherosclerosis, can be analyzed for a
genetic signature indicating atherosclerosis in order to identify other
protein markers useful for atherosclerotic classification.

[0093]In a preferred embodiment, additional useful protein markers are
identified by determining the biological pathways which known protein
markers are a part of and identifying other markers in that pathway.

[0094]The provided patterns of circulating protein expression characterize
the inflammatory signature in atherosclerosis and further link specific
immune-related pathways to diabetes and medication therapy. While current
data suggests a significant role for inflammation in atherosclerosis,
there remains little direct data linking immune pathways in the vessel
wall to critical aspects of the disease, including the mechanisms by
which risk factors impact the primary inflammatory process, and how
medications that modify risk factors such as hypertension and
hyperlipidemia may specifically impact inflammation. The present
invention identifies expression profiles of biomarkers of inflammation
that can be used for diagnosis and classification of atherosclerotic
cardiovascular disease.

[0095]Each of the above-described markers can be used in combination with
other dataset components known to be useful for diagnosing or monitoring
cardiovascular disease.

Other Components of Dataset

[0096]The dataset may further include a variety of quantitative data about
other circulating markers, clinical indicia, metabolic measures, and
genetic assays known to those of skill in the art as being useful for
diagnosing or monitoring atherosclerotic disease.

[0099]Additional clinical indicia useful for making atherosclerotic
classifications can be identified using learning algorithms known in the
art, such as linear discriminant analysis, support vector machine
classification, recursive feature elimination, prediction analysis of
microarray, logistic regression, CART, FlexTree, LART, random forest, or
MART, which are described in further detail in the section entitled
"Learning Algorithms".

Obtaining Quantitative Data Used to Generate Dataset

[0100]Quantitative data is obtained for each component of the dataset and
inputted into an analytic process with previously defined parameters (the
predictive model) and then used to generate a result.

[0101]The data may be obtained via any technique that results in an
individual receiving data associated with a sample. For example, an
individual may obtain the dataset by generating the dataset himself by
methods known to those in the art. Alternatively, the dataset may be
obtained by receiving the dataset from another individual or entity. For
example, a laboratory professional may generate the dataset while another
individual, such as a medical professional, may input the dataset into
an analytic process to generate the result.

[0102]One of skill should understand that although reference is made to "a
sample" throughout the specification that the quantitative data may be
obtained from multiple samples varying in any number of characteristics,
such as the method of procurement, time of procurement, tissue origin,
etc.

Quantitative Data Regarding Protein Markers

[0103]In methods of generating a result useful for atherosclerotic
classification, the expression pattern in blood, serum, etc. of the
protein markers provided herein is obtained. The quantitative data
associated with the protein markers of interest can be any data that
allows generation of a result useful for atherosclerotic classification,
including measurement of DNA or RNA levels associated with the markers
but is typically protein expression patterns. Protein levels can be
measured via any method known to those of skill in the art that generates a
quantitative measurement either individually or via high-throughput
methods as part of an expression profile. For example, a blood derived
patient sample, e.g., blood, plasma, serum, etc. may be applied to a
specific binding agent or panel of specific binding agents to determine
the presence and quantity of the protein markers of interest.

[0104]Sample Procurement

[0105]Blood samples, or samples derived from blood, e.g., plasma or
serum, are assayed for the expression levels of the protein markers of
interest. Typically a blood sample is drawn, and a
derivative product, such as plasma or serum, is tested.

[0106]Expression Profiling/Patterns of Multiple Markers

[0107]The quantitative data associated with the protein markers of
interest typically takes the form of an expression pattern. Expression
profiles constitute a set of relative or absolute expression values for a
number of RNA or protein products corresponding to the plurality of
markers evaluated. In various embodiments, expression profiles containing
expression patterns for at least about two, three, four, or five markers are
produced. The expression pattern for each differentially expressed
component member of the expression profile may provide a particular
specificity and sensitivity with respect to predictive value, e.g., for
diagnosis, prognosis, monitoring treatment, etc.

[0108]Methods for Obtaining Expression Data

[0109]Numerous methods for obtaining expression data are known, and any
one or more of these techniques, singly or in combination, are suitable
for determining expression patterns and profiles in the context of the
present invention.

[0111]Protein expression patterns can be evaluated by any method known to
those of skill in the art which provides a quantitative measure and is
suitable for evaluation of multiple markers extracted from samples such
as one or more of the following methods: ELISA sandwich assays, mass
spectrometric detection, colorimetric assays, binding to a protein array
(e.g., antibody array), or fluorescence-activated cell sorting (FACS).

[0112]One preferred approach involves the use of labeled affinity reagents
(e.g., antibodies, small molecules, etc.) that recognize epitopes of one
or more protein products in an ELISA, antibody array, or FACS screen.
Methods for producing and evaluating antibodies are well known in the
art, see, e.g., Coligan, supra; and Harlow and Lane (1989) Antibodies: A
Laboratory Manual, Cold Spring Harbor Press, NY ("Harlow and Lane").
Additional details regarding a variety of immunological and immunoassay
procedures adaptable to the present embodiment by selection of antibody
reagents specific for the products of protein markers described herein
can be found in, e.g., Stites and Terr (eds.) (1991) Basic and Clinical
Immunology, 7th ed.

[0113]High Throughput Expression Assays

[0114]A number of suitable high throughput formats exist for evaluating
expression patterns. Typically, the term high throughput refers to a
format that performs at least about 100 assays, or at least about 500
assays, or at least about 1000 assays, or at least about 5000 assays, or
at least about 10,000 assays, or more per day. When enumerating assays,
either the number of samples or the number of protein markers assayed can
be considered.

[0115]Numerous technological platforms for performing high throughput
expression analysis are known. Generally, such methods involve a logical
or physical array of either the subject samples, or the protein markers,
or both. Common array formats include both liquid and solid phase arrays.
For example, assays employing liquid phase arrays, e.g., for
hybridization of nucleic acids, binding of antibodies or other receptors
to ligand, etc., can be performed in multiwell or microtiter plates.
Microtiter plates with 96, 384 or 1536 wells are widely available, and
even higher numbers of wells, e.g., 3456 and 9600 can be used. In
general, the choice of microtiter plates is determined by the methods and
equipment, e.g., robotic handling and loading systems, used for sample
preparation and analysis. Exemplary systems include, e.g., the ORCA®
system from Beckman-Coulter, Inc. (Fullerton, Calif.) and the Zymate
systems from Zymark Corporation (Hopkinton, Mass.).

[0116]Alternatively, a variety of solid phase arrays can favorably be
employed to determine expression patterns in the context of the
invention. Exemplary formats include membrane or filter arrays (e.g.,
nitrocellulose, nylon), pin arrays, and bead arrays (e.g., in a liquid
"slurry"). Typically, probes corresponding to nucleic acid or protein
reagents that specifically interact with (e.g., hybridize to or bind to)
an expression product corresponding to a member of the candidate library,
are immobilized, for example by direct or indirect cross-linking, to the
solid support. Essentially any solid support capable of withstanding the
reagents and conditions necessary for performing the particular
expression assay can be utilized. For example, functionalized glass,
silicon, silicon dioxide, modified silicon, any of a variety of polymers,
such as (poly)tetrafluoroethylene, (poly)vinylidenedifluoride,
polystyrene, polycarbonate, or combinations thereof can all serve as the
substrate for a solid phase array.

[0117]In one embodiment, the array is a "chip" composed, e.g., of one of
the above-specified materials. Polynucleotide probes, e.g., RNA or DNA,
such as cDNA, synthetic oligonucleotides, and the like, or binding
proteins such as antibodies or antigen-binding fragments or derivatives
thereof, that specifically interact with expression products of
individual components of the candidate library are affixed to the chip in
a logically ordered manner, i.e., in an array. In addition, any molecule
with a specific affinity for either the sense or anti-sense sequence of
the marker nucleotide sequence (depending on the design of the sample
labeling) can be fixed to the array surface without loss of specific
affinity for the marker. Examples include proteins that specifically
recognize the nucleic acid sequence of the marker, ribozymes, peptide
nucleic acids (PNA), or other chemicals or molecules with specific
affinity.

[0121]Quantitative data regarding other dataset components, such as
clinical indicia, metabolic measures, and genetic assays, can be
determined via methods known to those of skill in the art.

Analytic Processes used to Generate Result

[0122]The quantitative data thus obtained about the protein markers and
other dataset components is then subjected to an analytic process with
parameters previously determined using a learning algorithm, i.e.,
inputted into a predictive model, as in the examples provided herein
(Examples 1-5). The parameters of the analytic process may be those
disclosed herein or those derived using the guidelines described herein.
Learning algorithms such as linear discriminant analysis, recursive
feature elimination, a prediction analysis of microarray, logistic
regression, CART, FlexTree, LART, random forest, MART, or another machine
learning algorithm are applied to the appropriate reference or training
data to determine the parameters for analytical processes suitable for a
variety of atherosclerotic classifications.

Analytic Processes

[0123]The analytic process used to generate a result may be any type of
process capable of providing a result useful for classifying a sample,
for example, comparison of the obtained dataset with a reference dataset,
a linear algorithm, a quadratic algorithm, a decision tree algorithm, or
a voting algorithm.

[0124]Various analytic processes for obtaining a result useful for making
an atherosclerotic classification are described herein, however, one of
skill in the art will readily understand that any suitable type of
analytic process is within the scope of this invention.

[0125]Prior to input into the analytical process, the data in each dataset
is collected by measuring the values for each marker, usually in
triplicate or in multiple triplicates. The data may be manipulated, for
example, raw data may be transformed using standard curves, and the
average of triplicate measurements used to calculate the average and
standard deviation for each patient. These values may be transformed
before being used in the models, e.g. log-transformed, Box-Cox
transformed (see Box and Cox (1964) J. Royal Stat. Soc., Series B,
26:211-246), etc. This data can then be input into the analytical process
with defined parameters.
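The preprocessing steps above may be sketched as follows; this is illustrative only, with hypothetical marker names and values, and shows triplicate averaging, a per-marker standard deviation, and a log transform before model input.

```python
import math
from statistics import mean, stdev

# Sketch of the preprocessing described above: triplicate measurements per
# marker are averaged, a standard deviation is kept, and the average is
# log-transformed before entering the analytical process.
def preprocess(triplicates: dict) -> dict:
    out = {}
    for marker, values in triplicates.items():
        avg = mean(values)
        out[marker] = {
            "mean": avg,
            "sd": stdev(values),
            "log_mean": math.log(avg),  # log transform for modeling
        }
    return out

# Hypothetical triplicate readings for two markers from one patient sample
patient = {"TIMP1": [102.0, 98.0, 100.0], "RANTES": [49.0, 51.0, 50.0]}
summary = preprocess(patient)
print(summary["TIMP1"]["mean"])  # -> 100.0
```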

[0126]The analytic process may set a threshold for determining the
probability that a sample belongs to a given class. The probability
preferably is at least 50%, or at least 60% or at least 70% or at least
80% or higher.

[0127]In other embodiments, the analytic process determines whether a
comparison between an obtained dataset and a reference dataset yields a
statistically significant difference. If so, then the sample from which
the dataset was obtained is classified as not belonging to the reference
dataset class. Conversely, if such a comparison is not statistically
significantly different from the reference dataset, then the sample from
which the dataset was obtained is classified as belonging to the
reference dataset class.
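The reference-comparison embodiment above may be sketched as follows. A two-sided normal (z) approximation is used here purely for brevity; an actual implementation might instead use a t-test, and all data values are illustrative.

```python
from math import sqrt, erf
from statistics import mean, stdev

# Sketch: compare marker levels from a test sample set against a reference
# dataset; assign the sample to the reference class only when the
# difference is not statistically significant (two-sided z-approximation).
def same_class_as_reference(test, reference, alpha=0.05):
    z = (mean(test) - mean(reference)) / sqrt(
        stdev(test) ** 2 / len(test) + stdev(reference) ** 2 / len(reference))
    # two-sided p-value under a standard normal approximation
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p >= alpha  # not significantly different -> same class

# Illustrative marker levels (arbitrary units)
reference = [10.0, 10.5, 9.8, 10.2, 9.9, 10.1, 10.3, 9.7]
similar = [10.1, 9.9, 10.4, 10.0, 9.8, 10.2]
elevated = [14.8, 15.2, 15.1, 14.9, 15.0, 15.3]
print(same_class_as_reference(similar, reference))   # -> True
print(same_class_as_reference(elevated, reference))  # -> False
```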

[0128]In general, the analytical process will be in the form of a model
generated by a statistical analytical method such as those described
below. Examples of such analytical processes may include a linear
algorithm, a quadratic algorithm, a polynomial algorithm, a decision tree
algorithm, a voting algorithm. A linear algorithm may have the form:

R = C_0 + \sum_{i=1}^{N} C_i x_i

[0129]Where R is the useful result obtained, C0 is a constant that may
be zero, Ci and xi are, respectively, the coefficient and the value of
the applicable biomarker or clinical indicium, and N is the total number
of markers.

[0130]A quadratic algorithm may have the form:

R = C_0 + \sum_{i=1}^{N} C_i x_i^2

[0131]Where R is the useful result obtained, C0 is a constant that may
be zero, Ci and xi are, respectively, the coefficient and the value of
the applicable biomarker or clinical indicium, and N is the total number
of markers.

[0132]A polynomial algorithm is a more generalized form of a linear or
quadratic algorithm that may have the form:

R = C_0 + \sum_{i=1}^{N} C_i x_i^{y_i}

[0133]Where R is the useful result obtained, C0 is a constant that may
be zero, Ci and xi are, respectively, the coefficient and the value of
the applicable biomarker or clinical indicium, yi is the power to which
xi is raised, and N is the total number of markers.

Use of Reference/Training Datasets to Determine Parameters of Analytical
Process

[0134]Using any suitable learning algorithm, an appropriate reference or
training dataset is used to determine the parameters of the analytical
process to be used for classification, i.e., develop a predictive model.

[0135]The reference or training dataset to be used will depend on the
desired atherosclerotic classification to be determined. The dataset may
include data from two, three, four or more classes.

[0136]For example, to use a supervised learning algorithm to determine the
parameters for an analytic process used to diagnose atherosclerosis, a
dataset comprising control and diseased samples is used as a training
set. Alternatively, if a supervised learning algorithm is to be used to
develop a predictive model for atherosclerotic staging, then the training
set may include data for each of the various stages of cardiovascular
disease. Further detail regarding the types of the reference/training
datasets used to determine certain atherosclerotic classifications is
described in further detail in the section entitled "Use of Results
Generated by Analytic Process".
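The use of a training dataset with cross-validation, in the spirit of the general 10-fold method shown in FIGS. 9 and 10, may be sketched as follows. This is a deliberately minimal illustration: the "model" is a one-marker midpoint-threshold rule on synthetic data, whereas an actual embodiment would fit, e.g., a logistic regression model over several markers.

```python
import random

# Sketch: k-fold cross-validation of a simple threshold classifier on one
# marker, as a minimal stand-in for cross-validation-based marker and
# model selection.
def kfold_indices(n, k):
    """Yield (train, test) index lists for k roughly equal folds."""
    idx = list(range(n))
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def cv_accuracy(values, labels, k=10):
    """Cross-validated accuracy of a midpoint-threshold rule."""
    correct = 0
    for train, test in kfold_indices(len(values), k):
        pos = [values[j] for j in train if labels[j] == 1]
        neg = [values[j] for j in train if labels[j] == 0]
        thr = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        correct += sum(1 for j in test
                       if (values[j] > thr) == (labels[j] == 1))
    return correct / len(values)

# Synthetic, well-separated data: "diseased" samples (label 1) have higher
# circulating marker levels than "healthy" samples (label 0).
random.seed(0)
labels = [0] * 50 + [1] * 50
marker = [random.gauss(1.0, 0.2) for _ in range(50)] + \
         [random.gauss(2.0, 0.2) for _ in range(50)]
print(cv_accuracy(marker, labels, k=10) > 0.9)  # well separated -> high accuracy
```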

Statistical Analysis

[0137]The following are examples of the types of statistical analysis
methods that are available to one of skill in the art to aid in the
practice of the disclosed methods. The statistical analysis may be
applied for one or both of two tasks. First, these and other statistical
methods may be used to identify preferred subsets of the markers and
other indicia that will form a preferred dataset. In addition, these and
other statistical methods may be used to generate the analytical process
that will be used with the dataset to generate the result. Several of
the statistical methods presented herein or otherwise available in the art
will perform both of these tasks and yield a model that is suitable for
use as an analytical process for the practice of the methods disclosed
herein.

[0138]Biomarkers whose corresponding feature values (e.g., expression
levels) are capable of discriminating between, e.g., healthy and
atherosclerotic subjects are identified herein. The identity of these markers and
their corresponding features (e.g., expression levels) can be used to
develop an analytical process, or plurality of analytical processes, that
discriminate between classes of patients. The examples below illustrate
how data analysis algorithms can be used to construct a number of such
analytical processes. Each of the data analysis algorithms described in
the examples use features (e.g., expression values) of a subset of the
markers identified herein across a training population that includes
healthy and atherosclerotic patients. Specific data analysis algorithms
for building an analytical process, or plurality of analytical processes,
that discriminate between subjects disclosed herein will be described in
the subsections below. Once an analytical process has been built using
these exemplary data analysis algorithms or other techniques known in the
art, the analytical process can be used to classify a test subject into
one of the two or more phenotypic classes (e.g. a healthy or
atherosclerotic patient). This is accomplished by applying the analytical
process to a marker profile obtained from the test subject. Such
analytical processes, therefore, have enormous value as diagnostic
indicators.

[0139]The disclosed methods provide, in one aspect, for the comparison of
a marker profile from a test subject to marker profiles obtained from a
training population. In some embodiments, each marker profile obtained
from subjects in the training population, as well as the test subject,
comprises a feature for each of a plurality of different markers. In some
embodiments, this comparison is accomplished by (i) developing an
analytical process using the marker profiles from the training population
and (ii) applying the analytical process to the marker profile from the
test subject. As such, the analytical process applied in some embodiments
of the methods disclosed herein is used to determine whether a test
subject has atherosclerosis.

[0140]In some embodiments of the methods disclosed herein, when the
results of the application of an analytical process indicate that the
subject will likely acquire atherosclerosis, the subject is diagnosed as
an "atherosclerotic" subject. If the results of an application of an
analytical process indicate that the subject will not develop
atherosclerosis, the subject is diagnosed as a healthy subject. Thus, in
some embodiments, the result in the above-described binary decision
situation has four possible outcomes:

[0141](i) truly atherosclerotic, where the analytical process indicates
that the subject will develop atherosclerosis and the subject does in
fact develop atherosclerosis during the definite time period (true
positive, TP);

[0142](ii) falsely atherosclerotic, where the analytical process indicates
that the subject will develop atherosclerosis and the subject, in fact,
does not develop atherosclerosis during the definite time period (false
positive, FP);

[0143](iii) truly healthy, where the analytical process indicates that the
subject will not develop atherosclerosis and the subject, in fact, does
not develop atherosclerosis during the definite time period (true
negative, TN); or

[0144](iv) falsely healthy, where the analytical process indicates that
the subject will not develop atherosclerosis and the subject, in fact,
does develop atherosclerosis during the definite time period (false
negative, FN).

[0145]It will be appreciated that other definitions for TP, FP, TN, and FN can
be made. While all such alternative definitions are within the scope of
the disclosed methods, for ease of understanding, the definitions for TP,
FP, TN, and FN given by definitions (i) through (iv) above will be used
herein, unless otherwise stated.

[0146]As will be appreciated by those of skill in the art, a number of
quantitative criteria can be used to communicate the performance of the
comparisons made between a test marker profile and reference marker
profiles (e.g., the application of an analytical process to the marker
profile from a test subject). These include positive predictive value
(PPV), negative predictive value (NPV), specificity, sensitivity,
accuracy, and certainty. In addition, other constructs such as receiver
operating characteristic (ROC) curves can be used to evaluate analytical
process performance. As used herein: PPV=TP/(TP+FP), NPV=TN/(TN+FN),
specificity=TN/(TN+FP), sensitivity=TP/(TP+FN), and
accuracy=certainty=(TP+TN)/N.
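The definitions above can be sketched as a small routine (a minimal illustration; the function name and dictionary layout are ours, not part of the specification):

```python
def classifier_metrics(tp, fp, tn, fn):
    """Compute the performance criteria defined above from counts of
    true/false positives and negatives."""
    n = tp + fp + tn + fn  # total number of samples compared
    return {
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / n,  # also termed certainty herein
    }

# Example: ten test subjects with 4 TP, 1 FP, 4 TN, 1 FN
m = classifier_metrics(tp=4, fp=1, tn=4, fn=1)
```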

[0147]Here, N is the number of samples compared (e.g., the number of test
samples for which a determination of atherosclerotic or healthy is
sought). For example, consider the case in which there are ten subjects
for which this classification is sought. Marker profiles are constructed
for each of the ten test subjects. Then, each of the marker profiles is
evaluated by applying an analytical process, where the analytical process
was developed based upon marker profiles obtained from a training
population. In this example, N, from the above equations, is equal to 10.
Typically, N is a number of samples, where each sample was collected from
a different member of a population. This population can, in fact, be of
two different types. In one type, the population comprises subjects whose
samples and phenotypic data (e.g., feature values of markers and an
indication of whether or not the subject developed atherosclerosis) was
used to construct or refine an analytical process. Such a population is
referred to herein as a training population. In the other type, the
population comprises subjects that were not used to construct the
analytical process. Such a population is referred to herein as a
validation population. Unless otherwise stated, the population
represented by N is either exclusively a training population or
exclusively a validation population, as opposed to a mixture of the two
population types. It will be appreciated that scores such as accuracy
will be higher (closer to unity) when they are based on a training
population as opposed to a validation population. Nevertheless, unless
otherwise explicitly stated herein, all criteria used to assess the
performance of an analytical process (or other forms of evaluation of a
biomarker profile from a test subject) including certainty (accuracy)
refer to criteria that were measured by applying the analytical process
corresponding to the criteria to either a training population or a
validation population. Furthermore, the definitions for PPV, NPV,
specificity, sensitivity, and accuracy defined above can also be found in
Draghici, Data Analysis Tools for DNA Microarrays, 2003, CRC Press LLC,
Boca Raton, Fla., pp. 342-343, which is hereby incorporated herein by
reference.

[0148]In some embodiments, N is more than one, more than five, more than
ten, more than twenty, between ten and 100, more than 100, or less than
1000 subjects. An analytical process (or other forms of comparison) can
have at least about 99% certainty, or even more, in some embodiments,
against a training population or a validation population. In other
embodiments, the certainty is at least about 97%, at least about 95%, at
least about 90%, at least about 85%, at least about 80%, at least about
75%, at least about 70%, at least about 65%, or at least about 60%
against a training population or a validation population. The useful
degree of certainty may vary, depending on the particular method. As used
herein, "certainty" means "accuracy." In one embodiment, the sensitivity
and/or specificity is at least about 97%, at least about 95%, at
least about 90%, at least about 85%, at least about 80%, at least about
75%, or at least about 70% against a training population or a validation
population. In some embodiments, such analytical processes are used to
predict the development of atherosclerosis with the stated accuracy. In
some embodiments, such analytical processes are used to diagnose
atherosclerosis with the stated accuracy. In some embodiments, such
analytical processes are used to determine a stage of atherosclerosis
with the stated accuracy.

[0149]The number of features that may be used by an analytical process to
classify a test subject with adequate certainty is two or more. In some
embodiments, it is three or more, four or more, ten or more, or between
10 and 200. Depending on the degree of certainty sought, however, the
number of features used in an analytical process can be more or less, but
in all cases is at least two. In one embodiment, the number of features
that may be used by an analytical process to classify a test subject is
optimized to allow a classification of a test subject with high
certainty.

[0151]In one embodiment, comparison of a test subject's marker profile to
marker profiles obtained from a training population is performed, and
comprises applying an analytical process. The analytical process is
constructed using a data analysis algorithm, such as a computer pattern
recognition algorithm. Other suitable data analysis algorithms for
constructing an analytical process include, but are not limited to, logistic
regression (see below) or a nonparametric algorithm that detects
differences in the distribution of feature values (e.g., a Wilcoxon
Signed Rank Test (unadjusted and adjusted)). The analytical process can
be based upon two, three, four, five, 10, 20 or more features,
corresponding to measured observables from one, two, three, four, five,
10, 20 or more markers. In one embodiment, the analytical process is
based on hundreds of features or more. An analytical process may also be
built using a classification tree algorithm. For example, each marker
profile from a training population can comprise at least three features,
where the features are predictors in a classification tree algorithm (see
below). The analytical process predicts membership within a population
(or class) with an accuracy of at least about 70%, at least about 75%, at
least about 80%, at least about 85%, at least about 90%, at least about
95%, at least about 97%, at least about 98%, at least about 99%, or about
100%.

[0152]Suitable data analysis algorithms are known in the art, some of
which are reviewed in Hastie et al., supra. In a specific embodiment, a
data analysis algorithm of the invention comprises Classification and
Regression Tree (CART), Multiple Additive Regression Tree (MART),
Prediction Analysis of Microarrays (PAM), or Random Forest analysis. Such
algorithms classify complex spectra from biological materials, such as a
blood sample, to distinguish subjects as normal or as possessing
biomarker expression levels characteristic of a particular disease state.
In other embodiments, a data analysis algorithm of the invention
comprises ANOVA and nonparametric equivalents, linear discriminant
analysis, logistic regression analysis, nearest neighbor classifier
analysis, neural networks, principal component analysis, quadratic
discriminant analysis, regression classifiers and support vector
machines. While such algorithms may be used to construct an analytical
process and/or increase the speed and efficiency of the application of
the analytical process and to avoid investigator bias, one of ordinary
skill in the art will realize that computer-based algorithms are not
required to carry out the methods of the present invention.

[0153]Analytical processes can be used to evaluate biomarker profiles,
regardless of the method that was used to generate the marker profile.
For example, suitable analytical processes can be used to evaluate
marker profiles generated using gas chromatography, as discussed in
Harper, "Pyrolysis and GC in Polymer Analysis," Dekker, New York (1985).
Further, Wagner et al., 2002, Anal. Chem. 74:1824-1835 disclose an
analytical process that improves the ability to classify subjects based
on spectra obtained by static time-of-flight secondary ion mass
spectrometry (TOF-SIMS). Additionally, Bright et al., 2002, J. Microbiol.
Methods 48:127-38, hereby incorporated by reference herein in its
entirety, disclose a method of distinguishing between bacterial strains
with high certainty (79-89% correct classification rates) by analysis of
MALDI-TOF-MS spectra. Dalluge, 2000, Fresenius J. Anal. Chem.
366:701-711, hereby incorporated by reference herein in its entirety,
discusses the use of MALDI-TOF-MS and liquid chromatography-electrospray
ionization mass spectrometry (LC/ESI-MS) to classify profiles of
biomarkers in complex biological samples.

Artificial Neural Network

[0154]In some embodiments, a neural network is used. A neural network can
be constructed for a selected set of markers. A neural network is a
two-stage regression or classification model. A neural network has a
layered structure that includes a layer of input units (and the bias)
connected by a layer of weights to a layer of output units. For
regression, the layer of output units typically includes just one output
unit. However, neural networks can handle multiple quantitative responses
in a seamless fashion.

[0155]In multilayer neural networks, there are input units (input layer),
hidden units (hidden layer), and output units (output layer). There is,
furthermore, a single bias unit that is connected to each unit other than
the input units. Neural networks are described in Duda et al., 2001,
Pattern Classification, Second Edition, John Wiley & Sons, Inc., New
York; and Hastie et al., 2001, The Elements of Statistical Learning,
Springer-Verlag, New York.

[0157]The basic approach to the use of neural networks is to start with an
untrained network, present a training pattern, e.g., marker profiles from
training patients, to the input layer, and to pass signals through the
net and determine the output, e.g., the prognosis of the training
patients, at the output layer. These outputs are then compared to the
target values; any difference corresponds to an error. This error or
criterion function is some scalar function of the weights and is
minimized when the network outputs match the desired outputs. Thus, the
weights are adjusted to reduce this measure of error. For regression,
this error can be sum-of-squared errors. For classification, this error
can be either squared error or cross-entropy (deviation). See, e.g.,
Hastie et al., 2001, The Elements of Statistical Learning,
Springer-Verlag, New York.

[0158]Three commonly used training protocols are stochastic, batch, and
on-line. In stochastic training, patterns are chosen randomly from the
training set and the network weights are updated for each pattern
presentation. Multilayer nonlinear networks trained by gradient descent
methods such as stochastic back-propagation perform a maximum-likelihood
estimation of the weight values in the model defined by the network
topology. In batch training, all patterns are presented to the network
before learning takes place. Typically, in batch training, several passes
are made through the training data. In on-line training, each pattern is
presented once and only once to the net.

[0159]In some embodiments, consideration is given to starting values for
weights. If the weights are near zero, then the operative part of the
sigmoid commonly used in the hidden layer of a neural network (see, e.g.,
Hastie et al., 2001, The Elements of Statistical Learning,
Springer-Verlag, New York) is roughly linear, and hence the neural
network collapses into an approximately linear model. In some
embodiments, starting values for weights are chosen to be random values
near zero. Hence the model starts out nearly linear, and becomes
nonlinear as the weights increase. Individual units localize to
directions and introduce nonlinearities where needed. Use of exact zero
weights leads to zero derivatives and perfect symmetry, and the algorithm
never moves. Alternatively, starting with large weights often leads to
poor solutions.

[0160]Since the scaling of inputs determines the effective scaling of
weights in the bottom layer, it can have a large effect on the quality of
the final solution. Thus, in some embodiments, at the outset all
expression values are standardized to have mean zero and a standard
deviation of one. This ensures all inputs are treated equally in the
regularization process, and allows one to choose a meaningful range for
the random starting weights. With standardized inputs, it is typical
to take random uniform weights over the range [-0.7, +0.7].
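The standardization and starting-weight scheme described above can be sketched as follows (a minimal illustration; the use of the population standard deviation and a seeded generator are our choices, not mandated by the disclosure):

```python
import random
import statistics

def standardize(values):
    """Scale one marker's expression values to mean zero and unit
    standard deviation."""
    mu = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD -- a choice, not mandated
    return [(v - mu) / sd for v in values]

def init_weights(n_weights, rng=random.Random(0)):
    """Random uniform starting weights over [-0.7, +0.7]."""
    return [rng.uniform(-0.7, 0.7) for _ in range(n_weights)]

marker = [2.0, 4.0, 6.0, 8.0]  # toy expression values for one marker
z = standardize(marker)
w = init_weights(5)
```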

[0161]A recurrent problem in the use of networks having a hidden layer is
the optimal number of hidden units to use in the network. The number of
inputs and outputs of a network are determined by the problem to be
solved. For the methods disclosed herein, the number of inputs for a
given neural network can be the number of markers in the selected set of
markers. The number of outputs for the neural network will typically be
just one. However, in some embodiments more than one output is used so
that more than just two states can be defined by the network. If too many
hidden units are used in a neural network and it is trained too long,
the network will have too many degrees of freedom and there is a danger
that it will overfit the data. If there are too few hidden units, the
training set cannot be learned. Generally speaking, however, it is better
to have too many hidden units than too few. With too few hidden units,
the model might not have enough flexibility to capture the nonlinearities
in the data; with too many hidden units, the extra weights can be shrunk
towards zero if appropriate regularization or pruning, as described
below, is used. In typical embodiments, the number of hidden units is
somewhere in the range of 5 to 100, with the number increasing with the
number of inputs and number of training cases.

[0162]One general approach to determining the number of hidden units to
use is to apply a regularization approach. In the regularization
approach, a new criterion function is constructed that depends not only
on the classical training error, but also on classifier complexity.
Specifically, the new criterion function is the error on the training
set plus a regularization term that expresses constraints or desirable
properties of solutions; searching for the minimum of this criterion
balances fit to the training data against model complexity:

J=Jpat+λJreg.

[0163]The parameter λ is adjusted to impose the regularization more
or less strongly. In other words, larger values for λ will tend to
shrink weights towards zero: typically cross-validation with a validation
set is used to estimate λ. This validation set can be obtained by
setting aside a random subset of the training population. Other forms of
penalty can also be used, for example the weight elimination penalty
(see, e.g., Hastie et al., 2001, The Elements of Statistical Learning,
Springer-Verlag, New York).
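As one concrete illustration of the criterion J=Jpat+λJreg, the following sketch uses the common weight-decay penalty (sum of squared weights) for Jreg; the penalty choice and function name are illustrative assumptions, not the only form contemplated above:

```python
def regularized_criterion(j_pat, weights, lam):
    """J = J_pat + lambda * J_reg, using the common weight-decay
    penalty J_reg = sum of squared weights (one possible penalty)."""
    j_reg = sum(w * w for w in weights)
    return j_pat + lam * j_reg

# The same weights are penalized more strongly as lambda grows,
# favoring solutions with weights shrunk towards zero.
weights = [0.5, -0.3, 0.2]
j_small = regularized_criterion(1.0, weights, lam=0.01)
j_large = regularized_criterion(1.0, weights, lam=1.0)
```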

[0164]Another approach to determine the number of hidden units to use is
to eliminate--prune--weights that are least needed. In one approach, the
weights with the smallest magnitude are eliminated (set to zero). Such
magnitude-based pruning can work, but is nonoptimal; sometimes weights
with small magnitudes are important for learning the training data. In
some embodiments, rather than using a magnitude-based pruning approach,
Wald statistics are computed. The fundamental idea in Wald Statistics is
that they can be used to estimate the importance of a hidden unit
(weight) in a model. Then, hidden units having the least importance are
eliminated (by setting their input and output weights to zero). Two
algorithms in this regard are the Optimal Brain Damage (OBD) and the
Optimal Brain Surgeon (OBS) algorithms that use second-order
approximation to predict how the training error depends upon a weight,
and eliminate the weight that leads to the smallest increase in training
error.

[0165]Optimal Brain Damage and Optimal Brain Surgeon share the same basic
approach of training a network to a local minimum in error at weight w,
and then pruning the weight that leads to the smallest increase in the
training error. The predicted functional increase in the error for a
change in the full weight vector δw is:

δJ=(∂J/∂w)Tδw+(1/2)δwTHδw+higher order terms,

[0166]where H=∂2J/∂w2 is the Hessian matrix. The first term vanishes
because we are at a local minimum in error; third and higher order terms
are ignored. The general solution for minimizing this function given the
constraint of deleting one weight is:

δw=-(wq/[H-1]qq)H-1uq and Lq=wq2/(2[H-1]qq).

[0167]Here, uq is the unit vector along the qth direction in weight
space and Lq is the approximation to the saliency of the weight q--the
increase in training error if weight q is pruned and the other weights
updated by δw. These equations require the inverse of H. One method to
calculate this inverse matrix is to start with a small value,
H0-1=α-1I, where α is a small parameter--effectively a weight decay
constant. Next the matrix is updated with each pattern according to

Hm+1-1=Hm-1-(Hm-1Xm+1Xm+1THm-1)/((n/am+1)+Xm+1THm-1Xm+1),

[0168]where the subscripts correspond to the pattern being presented,
Xm denotes the gradient of the network output with respect to the
weights for that pattern, and am decreases with m. After the full
training set has been presented, the inverse Hessian matrix is given by
H-1=Hn-1. In algorithmic form, the Optimal Brain Surgeon method is:

1 begin initialize nH, w, θ
2 train a reasonably large network to minimum error
3 do compute H-1
4 q*<-arg minq wq2/(2[H-1]qq) (the saliency Lq)
5 w<-w-(wq*/[H-1]q*q*)H-1uq*
6 until J(w)>θ
7 return w
8 end

[0169]The Optimal Brain Damage method is computationally simpler because
the calculation of the inverse Hessian matrix in line 3 is particularly
simple for a diagonal matrix. The above algorithm terminates when the
error is greater than a criterion initialized to be θ. Another
approach is to change line 6 to terminate when the change in J(w) due to
elimination of a weight is greater than some criterion value.

[0171]In some embodiments of the present invention, support vector
machines (SVMs) are used to classify subjects using feature values of the
markers described herein. SVMs are a relatively new type of learning
algorithm, which are generally described, for example, in Cristianini and
Shawe-Taylor, 2000, An Introduction to Support Vector Machines, Cambridge
University Press, Cambridge; Boser et al., 1992, "A training algorithm
for optimal margin classifiers," in Proceedings of the 5th Annual ACM
Workshop on Computational Learning Theory, ACM Press, Pittsburgh, Pa.,
pp. 142-152; Vapnik, 1998, Statistical Learning Theory, Wiley, New York;
Mount, 2001, Bioinformatics: sequence and genome analysis, Cold Spring
Harbor Laboratory Press, Cold Spring Harbor, N.Y., Duda, Pattern
Classification, Second Edition, 2001, John Wiley & Sons, Inc.; and
Hastie, 2001, The Elements of Statistical Learning, Springer, New York;
and Furey et al., 2000, Bioinformatics 16, 906-914, each of which is
hereby incorporated by reference in its entirety. When used for
classification, SVMs separate a given set of binary-labeled training
data with a hyper-plane that is maximally distant from them. For cases
in which no linear separation is possible, SVMs can work in combination
with the technique of `kernels`, which automatically realizes a
non-linear mapping to a feature space. The hyper-plane found by the SVM
in feature space corresponds to a non-linear decision boundary in the
input space.

[0172]In one approach, when a SVM is used, the feature data is
standardized to have mean zero and unit variance and the members of a
training population are randomly divided into a training set and a test
set. For example, in one embodiment, two thirds of the members of the
training population are placed in the training set and one third of the
members of the training population are placed in the test set. The
expression values for a combination of markers described herein are used
to train the SVM. Then the ability of the trained SVM to correctly
classify members in the test set is determined. In some embodiments, this
computation is performed several times for a given combination of
markers. In each iteration of the computation, the members of the
training population are randomly assigned to the training set and the
test set. Then, the quality of the combination of biomarkers is taken as
the average of each such iteration of the SVM computation.
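The repeated random-split protocol described above can be sketched generically (a minimal illustration; the stand-in threshold classifier merely lets the harness run end to end and is not an SVM, which would be plugged in via train_fn):

```python
import random

def evaluate_by_repeated_splits(samples, labels, train_fn, n_iter=10, seed=0):
    """Repeatedly split a training population into a training set (two
    thirds) and a test set (one third), train a classifier on the
    former, score it on the latter, and average the accuracies.
    train_fn(train_samples, train_labels) must return a predict(sample)
    callable."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    accuracies = []
    for _ in range(n_iter):
        rng.shuffle(idx)
        cut = 2 * len(idx) // 3  # two thirds train, one third test
        train, test = idx[:cut], idx[cut:]
        predict = train_fn([samples[i] for i in train],
                           [labels[i] for i in train])
        correct = sum(predict(samples[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)

# Stand-in classifier (a fixed threshold on the first marker) so the
# harness runs; a trained SVM would be plugged in the same way.
def threshold_classifier(train_samples, train_labels):
    return lambda s: 1 if s[0] > 0.5 else 0

samples = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]] * 3
labels = [0, 0, 0, 1, 1, 1] * 3
quality = evaluate_by_repeated_splits(samples, labels, threshold_classifier)
```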

Prediction Analysis of Microarrays (PAM)

[0173]One approach to developing an analytical process using expression
levels of markers disclosed herein is the nearest centroid classifier.
Such a technique computes, for each class (e.g., healthy and
atherosclerotic), a centroid given by the average expression levels of
the markers in the class, and then assigns new samples to the class whose
centroid is nearest. This approach is similar to k-means clustering
except clusters are replaced by known classes. This algorithm can be
sensitive to noise when a large number of markers are used. One
enhancement to the technique uses shrinkage: for each marker, differences
between class centroids are set to zero if they are deemed likely to be
due to chance. This approach is implemented in Prediction Analysis of
Microarrays, or PAM. See, for example, Tibshirani et al., 2002,
Proceedings of the National Academy of Sciences USA 99: 6567-6572, which
is hereby incorporated by reference in its entirety. Shrinkage is
controlled by a threshold below which differences are considered noise.
Markers that show no difference above the noise level are removed. A
threshold can be chosen by cross-validation. As the threshold is
decreased, more markers are included and estimated classification errors
decrease, until they reach a bottom and start climbing again as a result
of noise markers--a phenomenon known as overfitting.
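A simplified sketch of the shrunken-centroid idea follows (illustrative only: real PAM standardizes centroid differences by within-class standard deviations before soft-thresholding, which this sketch omits; names and toy data are ours):

```python
def shrink(d, t):
    """Soft-threshold: move d towards zero by t, clipping at zero."""
    return max(abs(d) - t, 0.0) * (1 if d >= 0 else -1)

def nearest_centroid_fit(profiles, classes, threshold=0.0):
    """Per-class centroids of marker expression, with PAM-style
    shrinkage: each centroid component's difference from the overall
    mean is soft-thresholded, so small (likely chance) differences
    collapse to the overall mean."""
    n = len(profiles[0])
    overall = [sum(p[j] for p in profiles) / len(profiles) for j in range(n)]
    centroids = {}
    for c in set(classes):
        members = [p for p, y in zip(profiles, classes) if y == c]
        cent = [sum(p[j] for p in members) / len(members) for j in range(n)]
        centroids[c] = [overall[j] + shrink(cent[j] - overall[j], threshold)
                        for j in range(n)]
    return centroids

def nearest_centroid_predict(centroids, profile):
    """Assign the profile to the class whose centroid is nearest
    (squared Euclidean distance)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(c, profile))
    return min(centroids, key=lambda cls: dist2(centroids[cls]))

profiles = [[1.0, 5.0], [1.2, 5.2], [3.0, 1.0], [3.2, 1.2]]
classes = ["healthy", "healthy", "athero", "athero"]
cents = nearest_centroid_fit(profiles, classes, threshold=0.1)
```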

Multiple Additive Regression Trees

[0174]Multiple additive regression trees (MART) represents another way to
construct an analytical process that can be used in the methods disclosed
herein. A generic algorithm for MART is:

1. Initialize f0(x)=arg minγΣi L(yi,γ).
2. For m=1 to M:
(a) For i=1, . . . , N compute the negative gradient
rim=-[∂L(yi,f(xi))/∂f(xi)] evaluated at f=fm-1.
(b) Fit a regression tree to the targets rim, giving terminal regions
Rjm, j=1, 2, . . . , Jm.
(c) For j=1, 2, . . . , Jm compute
γjm=arg minγΣxiεRjm L(yi,fm-1(xi)+γ).
(d) Update fm(x)=fm-1(x)+Σjγjm I(xεRjm).
3. Output f(x)=fM(x).

[0181]Specific algorithms are obtained by inserting different loss
criteria L(y,f(x)). The first line of the algorithm initializes to the
optimal constant model, which is just a single terminal node tree. The
components of the negative gradient computed in line 2(a) are referred to
as generalized pseudo residuals, r. Gradients for commonly used loss
functions are summarized in Table 10.2, of Hastie et al., 2001, The
Elements of Statistical Learning, Springer-Verlag, New York, p. 321,
which is hereby incorporated by reference. The algorithm for
classification is similar and is described in Hastie et al., Chapter 10,
which is hereby incorporated by reference in its entirety. Tuning
parameters associated with the MART procedure are the number of
iterations M and the sizes of each of the constituent trees Jm, m=1,
2, . . . , M.

Analytical Processes Derived by Regression

[0182]In some embodiments, an analytical process used to classify subjects
is built using regression. In such embodiments, the analytical process
can be characterized as a regression classifier, preferably a logistic
regression classifier. Such a regression classifier includes a
coefficient for each of the markers (e.g., the expression level for each
such marker) used to construct the classifier. In such embodiments, the
coefficients for the regression classifier are computed using, for
example, a maximum likelihood approach. In such a computation, the
features for the biomarkers (e.g., RT-PCR, microarray data) is used. In
particular embodiments, molecular marker data from only two trait
subgroups is used (e.g., healthy patients and atherosclerotic patients)
and the dependent variable is absence or presence of a particular trait
in the subjects for which marker data is available.

[0183]In another specific embodiment, the training population comprises a
plurality of trait subgroups (e.g., three or more trait subgroups, four
or more specific trait subgroups, etc.). These multiple trait subgroups
can correspond to discrete stages in the phenotypic progression from
healthy, to mild atherosclerosis, to medium atherosclerosis, etc. in a
training population. In this specific embodiment, a generalization of the
logistic regression model that handles multicategory responses can be
used to develop a decision that discriminates between the various trait
subgroups found in the training population. For example, measured data
for selected molecular markers can be applied to any of the
multi-category logit models described in Agresti, An Introduction to
Categorical Data Analysis, 1996, John Wiley & Sons, Inc., New York,
Chapter 8, hereby incorporated by reference in its entirety, in order to
develop a classifier capable of discriminating between any of a plurality
of trait subgroups represented in a training population.

Logistic Regression

[0184]In some embodiments, the analytical process is based on a regression
model, preferably a logistic regression model. Such a regression model
includes a coefficient for each of the markers in a selected set of
markers disclosed herein. In such embodiments, the coefficients for the
regression model are computed using, for example, a maximum likelihood
approach. In particular embodiments, molecular marker data from the two
groups (e.g., healthy and diseased) is used and the dependent variable is
the status of the patient from which the marker data were obtained.
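A minimal sketch of fitting such a two-class logistic regression follows (gradient ascent on the log-likelihood is one simple route to the maximum likelihood coefficients; dedicated solvers typically use Newton-type methods, and the toy single-marker data and function names are illustrative):

```python
import math

def fit_logistic(xs, ys, lr=0.5, n_iter=2000):
    """Fit a two-class logistic regression by gradient ascent on the
    log-likelihood, yielding approximate maximum likelihood
    coefficients for the markers."""
    n_features = len(xs[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(n_iter):
        gw = [0.0] * n_features
        gb = 0.0
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # modeled P(diseased | x)
            for j in range(n_features):
                gw[j] += (y - p) * x[j]     # gradient of log-likelihood
            gb += y - p
        w = [wi + lr * g / len(xs) for wi, g in zip(w, gw)]
        b += lr * gb / len(xs)
    return w, b

def predict_prob(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: a single marker elevated in the diseased group (label 1)
xs = [[0.1], [0.3], [0.2], [0.8], [0.9], [0.7]]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
```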

[0185]Some embodiments of the disclosed methods provide generalizations of
the logistic regression model that handle multicategory (polychotomous)
responses. Such embodiments can be used to classify an organism into
one of three or more classifications. Such regression models use
multicategory logit models that simultaneously refer to all pairs of
categories, and describe the odds of response in one category instead of
another. Once the model specifies logits for (J-1) pairs of
categories, the rest are redundant. See, for example, Agresti, An
Introduction to Categorical Data Analysis, John Wiley & Sons, Inc., 1996,
New York, Chapter 8, which is hereby incorporated by reference.

Linear Discriminant Analysis

[0186]Linear discriminant analysis (LDA) attempts to classify a subject
into one of two categories based on certain object properties. In other
words, LDA tests whether object attributes measured in an experiment
predict categorization of the objects. LDA typically requires continuous
independent variables and a dichotomous categorical dependent variable.
For use with the disclosed methods, the expression values for the
selected set of markers across a subset of the training population serve
as the requisite continuous independent variables. The group
classification of each of the members of the training population serves
as the dichotomous categorical dependent variable.

[0187]LDA seeks the linear combination of variables that maximizes the
ratio of between-group variance and within-group variance by using the
grouping information. Implicitly, the linear weights used by LDA depend
on how the expression of a marker across the training set separates in
the two groups (e.g., a group that has atherosclerosis and a group that
does not have atherosclerosis) and how this expression correlates with
the expression of other markers. In some embodiments, LDA is applied to
the data matrix of the N members in the training sample by K genes in a
combination of genes described in the present invention. Then, the linear
discriminant of each member of the training population is plotted.
Ideally, those members of the training population representing a first
subgroup (e.g. those subjects that do not have atherosclerosis) will
cluster into one range of linear discriminant values (e.g., negative) and
those members of the training population representing a second subgroup
(e.g. those subjects that have atherosclerosis) will cluster into a
second range of linear discriminant values (e.g., positive). The LDA is
considered more successful when the separation between the clusters of
discriminant values is larger. For more information on linear
discriminant analysis, see Duda, Pattern Classification, Second Edition,
2001, John Wiley & Sons, Inc; and Hastie, 2001, The Elements of
Statistical Learning, Springer, New York; Venables & Ripley, 1997, Modern
Applied Statistics with s-plus, Springer, New York.
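The LDA computation described above can be sketched for two markers as follows (an illustrative Fisher discriminant with a midpoint threshold; the toy data, the hard-coded two-marker case, and the explicit 2x2 inversion are our simplifications):

```python
def fisher_lda(group0, group1):
    """Fisher's linear discriminant for two groups of two-marker
    profiles: w = Sw^-1 (m1 - m0), where Sw is the pooled
    within-group scatter matrix."""
    def mean(g):
        return [sum(p[j] for p in g) / len(g) for j in range(2)]
    m0, m1 = mean(group0), mean(group1)
    s = [[0.0, 0.0], [0.0, 0.0]]  # pooled within-group scatter
    for g, m in ((group0, m0), (group1, m1)):
        for p in g:
            d = [p[0] - m[0], p[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    dm = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [inv[0][0] * dm[0] + inv[0][1] * dm[1],
         inv[1][0] * dm[0] + inv[1][1] * dm[1]]
    mid = [(m0[j] + m1[j]) / 2 for j in range(2)]  # midpoint threshold
    c = w[0] * mid[0] + w[1] * mid[1]
    return w, c

def discriminant(w, c, profile):
    """Negative values indicate group 0, positive values group 1."""
    return w[0] * profile[0] + w[1] * profile[1] - c

healthy = [[1.0, 2.0], [1.2, 2.2], [0.8, 1.9]]
diseased = [[3.0, 4.0], [3.2, 4.1], [2.9, 3.8]]
w, c = fisher_lda(healthy, diseased)
```

Members of the first subgroup project to negative discriminant values and members of the second to positive values, as described above.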

Quadratic Discriminant Analysis

[0188]Quadratic discriminant analysis (QDA) takes the same input
parameters and returns the same results as LDA. QDA uses quadratic
equations, rather than linear equations, to produce results. LDA and QDA
are roughly interchangeable (though there are differences related to the
number of subjects required), and which to use is a matter of preference
and/or availability of software to support the analysis. Logistic
regression takes the same input parameters and returns the same results
as LDA and QDA.

Decision Trees

[0189]One type of analytical process that can be constructed using the
expression level of the markers identified herein is a decision tree.
Here, the "data analysis algorithm" is any technique that can build the
analytical process, whereas the final "decision tree" is the analytical
process. An analytical process is constructed using a training population
and specific data analysis algorithms. Decision trees are described
generally by Duda, 2001, Pattern Classification, John Wiley & Sons, Inc.,
New York. pp. 395-396, which is hereby incorporated by reference.
Tree-based methods partition the feature space into a set of rectangles,
and then fit a model (like a constant) in each one.

[0190]The training population data includes the features (e.g., expression
values, or some other observable) for the markers across a training set
population. One specific algorithm that can be used to construct an
analytical process is a classification and regression tree (CART). Other
specific decision tree algorithms include, but are not limited to, ID3,
C4.5, MART, and Random Forests. CART, ID3, and C4.5 are described in
Duda, 2001, Pattern Classification, John Wiley & Sons, Inc., New York.
pp. 396-408 and pp. 411-412, which is hereby incorporated by reference.
CART, MART, and C4.5 are described in Hastie et al., 2001, The Elements
of Statistical Learning, Springer-Verlag, New York, Chapter 9, which is
hereby incorporated by reference in its entirety. Random Forests are
described in Breiman, 1999, "Random Forests--Random Features," Technical
Report 567, Statistics Department, U.C. Berkeley, September 1999, which
is hereby incorporated by reference in its entirety.

[0191]In some embodiments of the disclosed methods, decision trees are
used to classify patients using expression data for a selected set of
markers. Decision tree algorithms belong to the class of supervised
learning algorithms. The aim of a decision tree is to induce an
analytical process (a tree) from real-world example data. This tree can
be used to classify unseen examples which have not been used to derive
the decision tree.

[0192]A decision tree is derived from training data. An example contains
values for the different attributes and the class to which the example belongs.
In one embodiment, the training data is expression data for a combination
of markers described herein across the training population.

[0193]The following algorithm describes a decision tree derivation:

[0194]Tree (Examples,Class,Attributes)

[0195]Create a root node

[0196]If all Examples have the same Class value, give the root this label

[0197]Else if Attributes is empty label the root according to the most
common value

[0198]Else begin

[0199]Calculate the information gain for each attribute

[0200]Select the attribute A with highest information gain and make this
the root attribute

[0201]For each possible value, v, of this attribute

[0202]Add a new branch below the root, corresponding to A=v, and let
Examples(v) be those examples with A=v

[0203]If Examples(v) is empty, make the new branch a leaf node labeled
with the most common value among Examples

[0204]Else let the new branch be the tree created by Tree
(Examples(v),Class,Attributes-{A})

[0205]end
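
The recursion above can be sketched in a few lines of Python. The marker names, example dataset, and function names here are hypothetical illustrations, not part of the claimed methods:

```python
import math
from collections import Counter

def entropy(labels):
    """I(P(v1), . . . , P(vn)): sum over classes of -P(vi) log2 P(vi)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr):
    """Entropy of the class labels minus the remainder for attribute attr."""
    labels = [label for _, label in examples]
    remainder = 0.0
    for v in {feats[attr] for feats, _ in examples}:
        subset = [label for feats, label in examples if feats[attr] == v]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(examples, attributes):
    """Tree(Examples, Class, Attributes): examples are (features, class) pairs."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:          # all examples share one class: leaf
        return labels[0]
    if not attributes:                 # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a))
    tree = {best: {}}
    for v in {feats[best] for feats, _ in examples}:
        subset = [(f, c) for f, c in examples if f[best] == v]
        rest = [a for a in attributes if a != best]
        tree[best][v] = build_tree(subset, rest)
    return tree
```

For example, a few subjects described by one hypothetical marker taking "high" and "low" values would yield a one-split tree keyed on that marker.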

[0206]A more detailed description of the calculation of information gain
is shown in the following. If the possible classes vi of the examples
have probabilities P(vi) then the information content I of the actual
answer is given by:

I(P(v1), . . . , P(vn)) = Σ(i=1 to n) −P(vi) log2 P(vi)

[0207]The I-value shows how much information is needed in order to be able
to describe the outcome of a classification for the specific dataset
used. Supposing that the dataset contains p positive (e.g. has
atherosclerosis) and n negative (e.g. healthy) examples (e.g.
individuals), the information contained in a correct answer is:

I(p/(p+n), n/(p+n))

[0208]where log2 is the logarithm using base two. By testing single
attributes the amount of information needed to make a correct
classification can be reduced. The remainder for a specific attribute A
(e.g. a marker) shows how much the information that is needed can be
reduced.

Remainder(A) = Σ(i=1 to v) [(pi + ni)/(p + n)] · I(pi/(pi + ni), ni/(pi + ni))

[0209]where "v" is the number of unique attribute values for attribute A
in a certain dataset, "i" is a certain attribute value, "pi" is the
number of positive (e.g. atherosclerotic) examples having attribute value
i, and "ni" is the number of negative (e.g. healthy) examples having
attribute value i.

[0210]The information gain of a specific attribute A is calculated as the
difference between the information content for the classes and the
remainder of attribute A:

Gain(A) = I(p/(p+n), n/(p+n)) − Remainder(A)

[0211]The information gain is used to evaluate how important the different
attributes are for the classification (how well they split up the
examples), and the attribute with the highest information gain is
selected for the split.
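
For the binary p-positive/n-negative case, the three quantities above translate directly into code. This is a minimal sketch with hypothetical counts, not data from the application:

```python
import math

def info(p, n):
    """I(p/(p+n), n/(p+n)): information content, in bits, of a dataset
    with p positive and n negative examples (0*log2(0) taken as 0)."""
    total = p + n
    return sum(-(c / total) * math.log2(c / total) for c in (p, n) if c)

def remainder(splits):
    """Remainder(A): splits is a list of (pi, ni) pairs, one pair per
    unique value of attribute A."""
    p = sum(pi for pi, _ in splits)
    n = sum(ni for _, ni in splits)
    return sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in splits)

def gain(splits):
    """Gain(A) = I(p/(p+n), n/(p+n)) - Remainder(A)."""
    p = sum(pi for pi, _ in splits)
    n = sum(ni for _, ni in splits)
    return info(p, n) - remainder(splits)

# Hypothetical marker with two values: value 1 covers 9 atherosclerotic
# and 1 healthy subject; value 2 covers 1 atherosclerotic and 9 healthy.
print(round(gain([(9, 1), (1, 9)]), 3))  # ≈ 0.531 bits
```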

[0212]In general there are a number of different decision tree algorithms,
many of which are described in Duda, Pattern Classification, Second
Edition, 2001, John Wiley & Sons, Inc. Decision tree algorithms often
require consideration of feature processing, impurity measure, stopping
criterion, and pruning. Specific decision tree algorithms include, but
are not limited to, classification and regression trees (CART),
multivariate decision trees, ID3, and C4.5.

[0213]In one approach, when an exemplary embodiment of a decision tree is
used, the expression data for a selected set of markers across a training
population is standardized to have mean zero and unit variance. The
members of the training population are randomly divided into a training
set and a test set. For example, in one embodiment, two thirds of the
members of the training population are placed in the training set and one
third of the members of the training population are placed in the test
set. The expression values for a select combination of markers described
herein are used to construct the analytical process. Then, the ability of
the analytical process to correctly classify members in the test set is
determined. In some embodiments, this computation is performed several
times for a given combination of markers. In each iteration of the
computation, the members of the training population are randomly assigned
to the training set and the test set. Then, the quality of the
combination of molecular markers is taken as the average of each such
iteration of the analytical process computation.
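
The repeated random-split procedure described above can be written generically. The harness below, and the one-marker threshold classifier it is demonstrated with, are hypothetical sketches assuming only the Python standard library:

```python
import random
import statistics

def repeated_split_score(data, labels, fit, predict, iters=10,
                         train_frac=2 / 3, seed=0):
    """Average test-set accuracy over repeated random 2/3-train,
    1/3-test splits of the training population."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    scores = []
    for _ in range(iters):
        rng.shuffle(idx)
        cut = int(len(idx) * train_frac)
        train, test = idx[:cut], idx[cut:]
        model = fit([data[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, data[i]) == labels[i] for i in test)
        scores.append(correct / len(test))
    return statistics.mean(scores)

def fit(xs, ys):
    """Toy classifier: threshold midway between the two class means."""
    m0 = statistics.mean(x for x, y in zip(xs, ys) if y == 0)
    m1 = statistics.mean(x for x, y in zip(xs, ys) if y == 1)
    return (m0 + m1) / 2

def predict(cut, x):
    return 1 if x > cut else 0
```

The quality of a marker combination is then the returned average accuracy over the iterations.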

[0214]In addition to univariate decision trees in which each split is
based on an expression level for a corresponding marker, among the set of
markers disclosed herein, or the expression level of two such markers,
multivariate decision trees can be implemented as an analytical process.
In such multivariate decision trees, some or all of the decisions
actually comprise a linear combination of expression levels for a
plurality of markers. Such a linear combination can be trained using
known techniques such as gradient descent on a classification or by the
use of a sum-squared-error criterion. To illustrate such an analytical
process, consider the expression: 0.04x1+0.16x2<500

[0215]Here, x1 and x2 refer to two different features for two
different markers from among the markers disclosed herein. To poll the
analytical process, the values of features x1 and x2 are
obtained from the measurements obtained from the unclassified subject.
These values are then inserted into the equation. If a value of less than
500 is computed, then a first branch in the decision tree is taken.
Otherwise, a second branch in the decision tree is taken. Multivariate
decision trees are described in Duda, 2001, Pattern Classification, John
Wiley & Sons, Inc., New York, pp. 408-409, which is hereby incorporated
by reference.

[0216]Another approach that can be used in the present invention is
multivariate adaptive regression splines (MARS). MARS is an adaptive
procedure for regression, and is well suited for the high-dimensional
problems addressed by the methods disclosed herein. MARS can be viewed as
a generalization of stepwise linear regression or a modification of the
CART method to improve the performance of CART in the regression setting.
MARS is described in Hastie et al., 2001, The Elements of Statistical
Learning, Springer-Verlag, New York, pp. 283-295, which is hereby
incorporated by reference in its entirety.

[0217]Clustering

[0218]In some embodiments, the expression values for a selected set of
markers are used to cluster a training set. For example, consider the
case in which ten markers are used. Each member m of the training
population will have expression values for each of the ten markers. Such
values from a member m in the training population define the vector:

(X1m, X2m, X3m, X4m, X5m, X6m, X7m, X8m, X9m, X10m)

[0219]where Xim is the expression level of the ith marker in
subject m. If there are M subjects in the training set, the selected
markers will define M such vectors. Note that the methods disclosed
herein do not require that the expression value of every marker used in
the vectors be represented in every vector. In other words, data
from a subject in which one of the markers is not measured can still
be used for clustering. In such instances, the missing expression value
is assigned either a "zero" or some other normalized value. In some
embodiments, prior to clustering, the expression values are normalized to
have a mean value of zero and unit variance.

[0220]Those members of the training population that exhibit similar
expression patterns across the training group will tend to cluster
together. A particular combination of markers is considered to be a good
classifier in this aspect of the methods disclosed herein when the
vectors cluster into the trait groups found in the training population.
For instance, if the training population includes healthy patients and
atherosclerotic patients, a clustering classifier will cluster the
population into two groups, with each group uniquely representing either
healthy patients or atherosclerotic patients.
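
One concrete way to test whether the vectors cluster into the trait groups is a small k-means partition. This is an illustrative sketch (function name and data hypothetical), not the only clustering method contemplated:

```python
import math
import random

def kmeans(vectors, k=2, iters=20, seed=0):
    """Partition marker-expression vectors into k clusters by Euclidean
    distance, returning a cluster index for each subject."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assign each subject's vector to its nearest cluster center.
        assign = [min(range(k), key=lambda c: math.dist(v, centers[c]))
                  for v in vectors]
        # Move each center to the mean of its assigned vectors.
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members)
                              for col in zip(*members)]
    return assign
```

With a well-separated population, the returned assignments fall into two groups that match the trait groups.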

[0221]Clustering is described on pages 211-256 of Duda and Hart, Pattern
Classification and Scene Analysis, 1973, John Wiley & Sons, Inc., New
York, which is hereby incorporated by reference in its entirety for such
teachings. In Section 6.7 of Duda, the clustering problem is
described as one of finding natural groupings in a dataset. To identify
natural groupings, two issues are addressed. First, a way to measure
similarity (or dissimilarity) between two samples is determined. This
metric (similarity measure) is used to ensure that the samples in one
cluster are more like one another than they are to samples in other
clusters. Second, a mechanism for partitioning the data into clusters
using the similarity measure is determined.

[0222]Similarity measures are discussed in Section 6.7 of Duda, where it
is stated that one way to begin a clustering investigation is to define a
distance function and to compute the matrix of distances between all
pairs of samples in a dataset. If distance is a good measure of
similarity, then the distance between samples in the same cluster will be
significantly less than the distance between samples in different
clusters. However, as stated on page 215 of Duda, clustering does not
require the use of a distance metric. For example, a nonmetric similarity
function s(x, x') can be used to compare two vectors x and x'.
Conventionally, s(x, x') is a symmetric function whose value is large
when x and x' are somehow "similar." An example of a nonmetric similarity
function s(x, x') is provided on page 216 of Duda.

[0223]Once a method for measuring "similarity" or "dissimilarity" between
points in a dataset has been selected, clustering requires a criterion
function that measures the clustering quality of any partition of the
data. Partitions of the data set that extremize the criterion function
are used to cluster the data. See page 217 of Duda. Criterion functions
are discussed in Section 6.8 of Duda.

[0225]Principal component analysis (PCA) has been proposed to analyze
biomarker data. More generally, PCA can be used to analyze feature value
data of markers disclosed herein in order to construct an analytical
process that discriminates one class of patients from another (e.g.,
those who have atherosclerosis and those who do not). Principal component
analysis is a classical technique to reduce the dimensionality of a data
set by transforming the data to a new set of variables (principal
components) that summarize the features of the data. See, for example,
Jolliffe, 1986, Principal Component Analysis, Springer, New York, which
is hereby incorporated by reference.

[0226]A few properties of PCA are as follows. Principal components (PCs) are
uncorrelated and are ordered such that the kth PC has the kth
largest variance among PCs. The kth PC can be interpreted as the
direction that maximizes the variation of the projections of the data
points such that it is orthogonal to the first k-1 PCs. The first few PCs
capture most of the variation in the data set. In contrast, the last few
PCs are often assumed to capture only the residual `noise` in the data.

[0227]PCA can also be used to create an analytical process as disclosed
herein. In such an approach, vectors for a selected set of markers can be
constructed in the same manner described for clustering. In fact, the set
of vectors, where each vector represents the expression values for the
select markers from a particular member of the training population, can
be considered a matrix. In some embodiments, this matrix is represented
in a Free-Wilson method of qualitative binary description of monomers
(Kubinyi, 1990, 3D QSAR in drug design theory methods and applications,
Pergamon Press, Oxford, pp 589-638), and distributed in a maximally
compressed space using PCA so that the first principal component (PC)
captures the largest amount of variance information possible, the second
principal component (PC) captures the second largest amount of all
variance information, and so forth until all variance information in the
matrix has been accounted for.

[0228]Then, each of the vectors (where each vector represents a member of
the training population) is plotted. Many different types of plots are
possible. In some embodiments, a one-dimensional plot is made. In this
one-dimensional plot, the value for the first principal component from
each of the members of the training population is plotted. In this form
of plot, the expectation is that members of a first group (e.g. healthy
patients) will cluster in one range of first principal component values
and members of a second group (e.g., patients with atherosclerosis) will
cluster in a second range of first principal component values (one of
skill in the art would appreciate that the distribution of the marker
values must exhibit no elongation in any of the variables for this to
be effective).

[0229]In one example, the training population comprises two groups:
healthy patients and patients with atherosclerosis. The first principal
component is computed using the marker expression values for the selected
markers across the entire training population data set. Then, each member
of the training set is plotted as a function of the value for the first
principal component. In this example, those members of the training
population in which the first principal component is positive are the
healthy patients and those members of the training population in which
the first principal component is negative are atherosclerotic patients.
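
This one-dimensional use of the first principal component can be sketched with a standard singular value decomposition. NumPy availability and the toy two-marker data are assumptions of this illustration:

```python
import numpy as np

def first_pc_scores(X):
    """Project mean-centered marker data (subjects x markers) onto the
    first principal component, returning one score per subject."""
    Xc = X - X.mean(axis=0)
    # Rows of vt are the principal components, ordered by the variance
    # of the projections they capture.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[0]

# Hypothetical training population: four "healthy" subjects with low
# expression of both markers, four "atherosclerotic" subjects with high.
X = np.array([[1.0, 1.2], [0.9, 1.1], [1.1, 0.8], [1.0, 0.9],
              [3.0, 3.1], [2.9, 3.3], [3.2, 2.8], [3.1, 3.0]])
scores = first_pc_scores(X)
# One group scores on one side of zero and the other group on the other
# side (the sign of a principal component is arbitrary).
```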

[0230]In some embodiments, the members of the training population are
plotted against more than one principal component. For example, in some
embodiments, the members of the training population are plotted on a
two-dimensional plot in which the first dimension is the first principal
component and the second dimension is the second principal component. In
such a two-dimensional plot, the expectation is that members of each
subgroup represented in the training population will cluster into
discrete groups. For example, a first cluster of members in the
two-dimensional plot will represent subjects with mild atherosclerosis, a
second cluster of members in the two-dimensional plot will represent
subjects with moderate atherosclerosis, and so forth.

[0231]In some embodiments, the members of the training population are
plotted against more than two principal components and a determination is
made as to whether the members of the training population are clustering
into groups that each uniquely represents a subgroup found in the
training population. In some embodiments, principal component analysis is
performed by using the R mva package (Anderson, 1973, Cluster Analysis
for applications, Academic Press, New York 1973; Gordon, Classification,
Second Edition, Chapman and Hall, CRC, 1999.). Principal component
analysis is further described in Duda, Pattern Classification, Second
Edition, 2001, John Wiley & Sons, Inc.

[0232]Nearest Neighbor Classifier Analysis

[0233]Nearest neighbor classifiers are memory-based and require no model
to be fit. Given a query point x0, the k training points x(r),
r=1, . . . , k, closest in distance to x0 are identified, and then the
point x0 is classified using the k nearest neighbors. Ties can be
broken at random. In some embodiments, Euclidean distance in feature
space is used to determine distance as:

d(i) = ∥x(i) − x0∥

[0234]Typically, when the nearest neighbor algorithm is used, the
expression data is standardized to have mean zero and variance
one. For the disclosed methods, the members
of the training population are randomly divided into a training set and a
test set. For example, in one embodiment, two thirds of the members of
the training population are placed in the training set and one third of
the members of the training population are placed in the test set.
Profiles of a selected set of markers disclosed herein represent the
feature space into which members of the test set are plotted. Next, the
ability of the training set to correctly characterize the members of the
test set is computed. In some embodiments, nearest neighbor computation
is performed several times for a given combination of markers. In each
iteration of the computation, the members of the training population are
randomly assigned to the training set and the test set. Then, the quality
of the combination of markers is taken as the average of each such
iteration of the nearest neighbor computation.
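
A minimal k-nearest-neighbor vote over the distance d(i) above looks as follows; the training profiles and labels are hypothetical:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify a query point by majority vote among the k training
    points nearest in Euclidean distance; train is a list of
    (feature_vector, label) pairs."""
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Hypothetical two-marker profiles for six training subjects.
train = [((0.2, 0.1), "healthy"), ((0.3, 0.2), "healthy"),
         ((0.1, 0.3), "healthy"), ((2.1, 1.9), "athero"),
         ((1.8, 2.2), "athero"), ((2.0, 2.0), "athero")]
```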

[0235]The nearest neighbor rule can be refined to deal with issues of
unequal class priors, differential misclassification costs, and feature
selection. Many of these refinements involve some form of weighted voting
for the neighbors. For more information on nearest neighbor analysis, see
Duda, Pattern Classification, Second Edition, 2001, John Wiley & Sons,
Inc; and Hastie, 2001, The Elements of Statistical Learning, Springer,
New York, each of which is hereby incorporated by reference in its
entirety.

Evolutionary Methods

[0236]Inspired by the process of biological evolution, evolutionary
methods of classifier design employ a stochastic search for an analytical
process. In broad overview, such methods create several analytical
processes--a population--from measurements such as the biomarker
generated datasets disclosed herein. Each analytical process varies
somewhat from the other. Next, the analytical processes are scored on
data across the training datasets. In keeping with the analogy with
biological evolution, the resulting (scalar) score is sometimes called
the fitness. The analytical processes are ranked according to their score
and the best analytical processes are retained (some portion of the total
population of analytical processes). Again, in keeping with biological
terminology, this is called survival of the fittest. The analytical
processes are stochastically altered in the next generation--the children
or offspring. Some offspring analytical processes will have higher scores
than their parent in the previous generation, some will have lower
scores. The overall process is then repeated for the subsequent
generation: The analytical processes are scored and the best ones are
retained, randomly altered to give yet another generation, and so on. In
part, because of the ranking, each generation has, on average, a slightly
higher score than the previous one. The process is halted when the single
best analytical process in a generation has a score that exceeds a
desired criterion value. More information on evolutionary methods is
found in, for example, Duda, Pattern Classification, Second Edition,
2001, John Wiley & Sons, Inc.

Bagging, Boosting, and the Random Subspace Method

[0237]Bagging, boosting, the random subspace method, and additive trees
are data analysis algorithms known as combining techniques that can be
used to improve weak analytical processes. These techniques are designed
for, and usually applied to, decision trees, such as the decision trees
described above. In addition, such techniques can also be useful in
analytical processes developed using other types of data analysis
algorithms; for example, Skurichina and Duin provide evidence to suggest
that such techniques can improve linear discriminant analysis.

[0238]In bagging, one samples the training datasets, generating random
independent bootstrap replicates, constructs the analytical processes on
each of these, and aggregates them by a simple majority vote in the final
analytical process. See, for example, Breiman, 1996, Machine Learning 24,
123-140; and Efron & Tibshirani, An Introduction to the Bootstrap, Chapman &
Hall, New York, 1993, each of which is hereby incorporated by reference
in its entirety.
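
A bare-bones sketch of this bootstrap-and-vote scheme, using a hypothetical one-marker decision stump as the base learner:

```python
import random
from collections import Counter

def bag(train_x, train_y, fit, n_models=25, seed=0):
    """Fit n_models base learners, each on a bootstrap replicate
    (sampling the training set with replacement)."""
    rng = random.Random(seed)
    n = len(train_x)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit([train_x[i] for i in idx],
                          [train_y[i] for i in idx]))
    return models

def bagged_predict(models, predict, x):
    """Aggregate the individual predictions by simple majority vote."""
    return Counter(predict(m, x) for m in models).most_common(1)[0][0]

def fit_stump(xs, ys):
    """Toy base learner: threshold at the midpoint of the class means
    (assumes both classes appear in the bootstrap sample)."""
    m0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    m1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (m0 + m1) / 2

def predict_stump(cut, x):
    return 1 if x > cut else 0
```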

[0239]In boosting, analytical processes are constructed on weighted
versions of the training set, which are dependent on previous analytical
process results. Initially, all objects have equal weights, and the first
analytical process is constructed on this data set. Then, weights are
changed according to the performance of the analytical process.
Erroneously classified objects get larger weights, and the next
analytical process is boosted on the reweighted training set. In this
way, a sequence of training sets and classifiers is obtained, which is
then combined by simple majority voting or by weighted majority voting in
the final decision. See, for example, Freund & Schapire, "Experiments
with a new boosting algorithm," Proceedings 13th International Conference
on Machine Learning, 1996, 148-156.

[0240]To illustrate boosting, consider the case where there are two
phenotypic groups exhibited by the population under study, phenotype 1
(e.g., poor prognosis patients), and phenotype 2 (e.g., good prognosis
patients). Given a vector of molecular markers X, a classifier G(X)
produces a prediction taking one of the type values in the two value set:
{phenotype 1, phenotype 2}. The error rate on the training sample is

err = (1/N) Σ(i=1 to N) I(yi ≠ G(xi))

[0241]where N is the number of subjects in the training set (the sum total
of the subjects that have either phenotype 1 or phenotype 2). For
example, if there are 35 healthy patients and 46 atherosclerotic patients, N is
81.

[0242]A weak analytical process is one whose error rate is only slightly
better than random guessing. In the boosting algorithm, the weak
analytical process is repeatedly applied to modified versions of the
data, thereby producing a sequence of weak classifiers Gm(x), m=1,
2, . . . , M. The predictions from all of the classifiers in this
sequence are then combined through a weighted majority vote to produce
the final prediction:

G(x) = sign( Σ(m=1 to M) αm Gm(x) )

[0243]Here α1, α2, . . . , αM are
computed by the boosting algorithm and their purpose is to weigh the
contribution of each respective Gm(x). Their effect is to give
higher influence to the more accurate classifiers in the sequence.

[0244]The data modifications at each boosting step consist of applying
weights w1, w2, . . . , wN to each of the training
observations (xi, yi), i=1, 2, . . . , N. Initially all the
weights are set to wi=1/N, so that the first step simply trains the
analytical process on the data in the usual manner. For each successive
iteration m=2, 3, . . . , M the observation weights are individually
modified and the analytical process is reapplied to the weighted
observations. At step m, those observations that were misclassified by
the analytical process Gm-1(x) induced at the previous step have
their weights increased, whereas the weights are decreased for those that
were classified correctly. Thus as iterations proceed, observations that
are difficult to correctly classify receive ever-increasing influence.
Each successive analytical process is thereby forced to concentrate on
those training observations that are missed by previous ones in the
sequence.

[0245]The boosting algorithm proceeds as follows:

[0246]1. Initialize the observation weights wi=1/N, i=1, 2, . . . , N.

[0247]2. For m=1 to M:

[0248](a) Fit an analytical process Gm(x) to the training set using
weights wi.

[0249](b) Compute

errm = [ Σ(i=1 to N) wi I(yi ≠ Gm(xi)) ] / [ Σ(i=1 to N) wi ]

[0250](c) Compute αm=log((1-errm)/errm).

[0251](d) Set wi ← wi · exp[αm I(yi ≠ Gm(xi))], i=1, 2, . . . , N.

[0252]3. Output

G(x) = sign( Σ(m=1 to M) αm Gm(x) )

[0253]In the algorithm, the current classifier Gm(x) is induced on
the weighted observations at line 2a. The resulting weighted error rate
is computed at line 2b. Line 2c calculates the weight αm given
to Gm(x) in producing the final classifier G(x) (line 3). The
individual weights of each of the observations are updated for the next
iteration at line 2d. Observations misclassified by Gm(x) have their
weights scaled by a factor exp(αm), increasing their relative
influence for inducing the next classifier Gm+1(x) in the sequence.
In some embodiments, modifications of the Freund and Schapire, 1997,
Journal of Computer and System Sciences 55, pp. 119-139, boosting method
are used. See, for example, Hastie et al., The Elements of Statistical
Learning, 2001, Springer, New York, Chapter 10. In some embodiments,
boosting or adaptive boosting methods are used.
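
The numbered algorithm above maps directly onto a short implementation. The candidate stump set and data here are hypothetical, and this is a sketch of discrete AdaBoost rather than any preferred embodiment of the application:

```python
import math

def adaboost(xs, ys, stumps, M=10):
    """Discrete AdaBoost: ys take values in {-1, +1}; stumps is a list
    of candidate weak classifiers (functions mapping x to -1 or +1)."""
    n = len(xs)
    w = [1.0 / n] * n                      # 1. initialize weights to 1/N
    ensemble = []                          # collected (alpha_m, G_m) pairs
    for _ in range(M):                     # 2. for m = 1 to M:
        # (a) fit: pick the weak classifier with lowest weighted error
        def weighted_error(g):
            return sum(wi for wi, x, y in zip(w, xs, ys) if g(x) != y)
        G = min(stumps, key=weighted_error)
        err = weighted_error(G) / sum(w)   # (b) weighted error rate
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = math.log((1 - err) / err)  # (c) classifier weight
        # (d) scale up the weights of misclassified observations
        w = [wi * math.exp(alpha) if G(x) != y else wi
             for wi, x, y in zip(w, xs, ys)]
        ensemble.append((alpha, G))
    def classify(x):                       # 3. sign of the weighted vote
        return 1 if sum(a * g(x) for a, g in ensemble) >= 0 else -1
    return classify
```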

[0254]In some embodiments, modifications of Freund and Schapire, 1997,
Journal of Computer and System Sciences 55, pp. 119-139, are used. For
example, in some embodiments, feature preselection is performed using a
technique such as the nonparametric scoring methods of Park et al., 2002,
Pac. Symp. Biocomput. 6, 52-63. Feature preselection is a form of
dimensionality reduction in which the markers that discriminate between
classifications the best are selected for use in the classifier. Then,
the LogitBoost procedure introduced by Friedman et al., 2000, Ann Stat
28, 337-407 is used rather than the boosting procedure of Freund and
Schapire. In some embodiments, the boosting and other classification
methods of Ben-Dor et al., 2000, Journal of Computational Biology 7,
559-583 are used in the disclosed methods. In some embodiments, the
boosting and other classification methods of Freund and Schapire, 1997,
Journal of Computer and System Sciences 55, 119-139, are used.

[0256]As indicated at the beginning of this section, the statistical
techniques described above are merely examples of the types of algorithms
and models that can be used to identify a preferred group of markers to
include in a dataset and to generate an analytical process that can be
used to generate a result using the dataset. Further, combinations of the
techniques described above and elsewhere can be used either for the same
task or each for a different task. Some combinations, such as decision
trees with boosting, have been described.
However, many other combinations are possible. By way of example, other
statistical techniques in the art such as Projection Pursuit and Weighted
Voting can be used to identify a preferred group of markers to include in
a dataset and to generate an analytical process that can be used to
generate a result using the dataset.

Determining Optimum Number of Dataset Components to be Evaluated in
Analytical Process

[0257]When using the learning algorithms described above to develop a
predictive model, one of skill in the art may select a subset of markers,
i.e. at least 3, at least 4, at least 5, at least 6, up to the complete
set of markers, to define the analytical process. Usually a subset of
markers will be chosen that provides for the needs of the quantitative
sample analysis, e.g. availability of reagents, convenience of
quantitation, etc., while maintaining a highly accurate predictive model.

[0258]The selection of a number of informative markers for building
classification models requires the definition of a performance metric and
a user-defined threshold for producing a model with useful predictive
ability based on this metric. For example, the performance metric may be
the AUC, the sensitivity and/or specificity of the prediction as well as
the overall accuracy of the prediction model.

[0259]The predictive ability of a model may be evaluated according to its
ability to provide a quality metric, e.g. AUC or accuracy, of a
particular value, or range of values. In some embodiments, a desired
quality threshold is a predictive model that will classify a sample with
an accuracy of at least about 0.7, at least about 0.75, at least about
0.8, at least about 0.85, at least about 0.9, at least about 0.95, or
higher. As an alternative measure, a desired quality threshold may refer
to a predictive model that will classify a sample with an AUC (area under
the curve) of at least about 0.7, at least about 0.75, at least about
0.8, at least about 0.85, at least about 0.9, or higher.

[0260]As is known in the art, the relative sensitivity and specificity of
a predictive model can be "tuned" to favor either the specificity metric
or the sensitivity metric, where the two metrics have an inverse
relationship. The limits in a model as described above can be adjusted to
provide a selected sensitivity or specificity level, depending on the
particular requirements of the test being performed. One or both of
sensitivity and specificity may be at least about 0.7, at
least about 0.75, at least about 0.8, at least about 0.85, at least about
0.9, or higher.

[0261]As described in Examples 5, 11 and 12, various methods can be used to
train a model. The selection of a subset of markers may be via a forward
selection or a backward selection of a marker subset. The number of
markers to be selected is that which will optimize the performance of a
model without the use of all the markers. One way to define the optimum
number of terms is to choose the number of terms that produce a model
with desired predictive ability (e.g. an AUC>0.75, or equivalent
measures of sensitivity/specificity) that lies no more than one standard
error from the maximum value obtained for this metric using any
combination and number of terms used for the given algorithm.
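
The one-standard-error rule described above reduces to a few lines; the example numbers are hypothetical:

```python
def one_se_choice(sizes, means, ses):
    """Return the smallest marker-subset size whose performance metric
    (e.g. cross-validated AUC) lies within one standard error of the
    best value observed over all candidate sizes."""
    best = max(range(len(sizes)), key=lambda i: means[i])
    cutoff = means[best] - ses[best]
    return min(s for s, m in zip(sizes, means) if m >= cutoff)

# Hypothetical AUCs for subsets of 3-6 markers: the 6-marker model is
# best (0.865 ± 0.02), but the 5-marker model is within one SE of it.
print(one_se_choice([3, 4, 5, 6],
                    [0.78, 0.84, 0.86, 0.865],
                    [0.03, 0.025, 0.02, 0.02]))  # → 5
```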

Use of Results Generated by Analytic Process

[0262]As described above, datasets containing quantitative data for
components of the dataset are input into an analytic process and used
to generate a result. The result can be any type of information useful
for making an atherosclerotic classification, e.g. a classification, a
continuous variable, or a vector. For example, the value of a continuous
variable or vector may be used to determine the likelihood that a sample
is associated with a particular classification.

[0264]Further details regarding the appropriate type of reference or
training data to be used to develop predictive models for various
atherosclerotic classifications and how to use such models to predict
certain types of atherosclerotic classifications is described below.

[0265]In a preferred embodiment, the result is used for diagnosis or
detection of the occurrence of atherosclerosis, particularly where
such atherosclerosis is indicative of a propensity for myocardial
infarction, heart failure, etc. In this embodiment, a reference or
training set containing "healthy" and "atherosclerotic" samples is used
to develop a predictive model. A dataset, preferably containing protein
expression levels of markers indicative of the atherosclerosis, is then
inputted into the predictive model in order to generate a result. The
result may classify the sample as either "healthy" or "atherosclerotic".
In other embodiments, the result is a continuous variable providing
information useful for classifying the sample, e.g., where a high value
indicates a high probability of being an "atherosclerotic" sample and a
low value indicates a low probability of being a "healthy" sample.

[0266]In other embodiments, the result is used for atherosclerosis
staging. In this embodiment, a reference or training dataset containing
samples from individuals with disease at different stages is used to
develop a predictive model. The model may be a simple comparison of an
individual dataset against one or more datasets obtained from disease
samples of known stage or a more complex multivariate classification
model. In certain embodiments, inputting a dataset into the model will
generate a result classifying the sample from which the dataset is
generated as being at a specified cardiovascular disease stage. Similar
methods may be used to provide atherosclerosis prognosis, except that the
reference or training set will include data obtained from individuals who
develop disease and those who fail to develop disease at a later time.

[0267]In other embodiments, the result is used to determine response to
atherosclerotic disease treatments. In this embodiment, the reference or
training dataset and the predictive model are the same as those used to
diagnose atherosclerosis (samples from individuals with disease and
those without). However, instead of inputting a dataset composed of
samples from individuals with an unknown diagnosis, the dataset is
composed of samples from individuals with known disease who have been
administered a particular treatment, and it is determined whether the
samples trend toward or lie within a normal, healthy classification
versus an atherosclerotic disease classification.

[0268]In another embodiment, the result is used for drug screening, i.e.,
identifying compounds that act via similar mechanisms as known
atherosclerotic drug treatments (Examples 6-7). In this embodiment, a
reference or training set containing individuals treated with a known
atherosclerotic drug treatment and those not treated with the particular
treatment can be used to develop a predictive model. A dataset from
individuals treated with a compound with an unknown mechanism is input
into the model. If the result indicates that the sample can be classified
as coming from a subject dosed with a known atherosclerotic drug
treatment, then the new compound is likely to act via the same mechanism.

[0269]In preferred embodiments, the result is used to determine a
"pseudo-coronary calcium score," which is a quantitative measure that
correlates to coronary calcium score (CCS). CCS is a clinical
cardiovascular disease screening technique which measures overall
atherosclerotic plaque burden. Various different types of imaging
techniques can be used to quantitate the calcium area and density of
atherosclerotic plaques. When electron-beam CT and multidetector CT are
used, CCS is a function of the x-ray attenuation coefficient and the area
of calcium deposits. Typically, a score of 0 is considered to indicate no
atherosclerotic plaque burden; >0 to 10 indicates minimal evidence of
plaque burden; 11 to 100 indicates at least mild evidence of plaque
burden; 101 to 400 indicates at least moderate evidence of plaque
burden; and over 400 indicates extensive evidence of plaque burden. CCS
used in conjunction with traditional risk factors improves predictive
ability for complications of cardiovascular disease. In addition, the CCS
can also act as an independent predictor of cardiovascular disease
complications. Budoff et al., "Assessment of Coronary Artery
Disease by Cardiac Computed Tomography," Circulation 113: 1761-1791
(2006).
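The CCS interpretation bands described above can be expressed as a simple lookup. The band boundaries are taken from the text; the function name is our own, added for illustration.

```python
# Sketch of the CCS interpretation bands described above (band boundaries
# as given in the text; function name is illustrative).
def ccs_category(score: float) -> str:
    """Map a coronary calcium score to a plaque-burden category."""
    if score == 0:
        return "no plaque burden"
    if score <= 10:
        return "minimal evidence of plaque burden"
    if score <= 100:
        return "at least mild evidence of plaque burden"
    if score <= 400:
        return "at least moderate evidence of plaque burden"
    return "extensive evidence of plaque burden"
```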

[0270]A reference or training set containing individuals with high and low
coronary calcium scores can be used to develop a model, e.g., Example 8, for
predicting the pseudo-coronary calcium score of an individual. This
predicted pseudo-coronary calcium score is useful for diagnosing and
monitoring atherosclerosis. In some embodiments, the pseudo-coronary
calcium score is used in conjunction with other known cardiovascular
diagnosis and monitoring methods, such as actual coronary calcium score
derived from imaging techniques to diagnose and monitor cardiovascular
disease.

[0271]One of skill will also recognize that the results generated using
these methods can be used in conjunction with any number of the various
other methods known to those of skill in the art for diagnosing and
monitoring cardiovascular disease.

Reagents and Kits

[0272]Also provided are reagents and kits thereof for practicing one or
more of the above-described methods. The subject reagents and kits
thereof may vary greatly. Reagents of interest include reagents
specifically designed for use in production of the above described
expression profiles of circulating protein markers associated with
atherosclerotic conditions.

[0273]One type of such reagent is an array or kit of antibodies that bind
to a marker set of interest. A variety of different array formats are
known in the art, with a wide variety of different probe structures,
substrate compositions and attachment technologies. Representative array
or kit compositions of interest include or consist of reagents for
quantitation of at least two, at least three, at least four, at least
five or more protein markers selected from M-CSF, eotaxin, IP-10,
MCP-1, MCP-2, MCP-3, MCP-4, IL-3, IL-5, IL-7, IL-8, MIP1a, TNFa, and
RANTES.

[0274]In other embodiments, a representative array or kit includes or
consists of reagents for quantitation of at least three protein markers
selected from the following group: MCP-1, MCP-2, MCP-3, MCP-4, eotaxin,
IP-10, M-CSF, IL-3, TNFa, Ang-2, IL-5, IL-7, and IGF-1. The at least
three protein markers may comprise or consist of a marker set selected
from the following group: MCP-1, IGF-1, TNFa; MCP-1, IGF-1, M-CSF; ANG-2,
IGF-1, M-CSF; and MCP-4, IGF-1, M-CSF.

[0277]The kits may further include a software package for statistical
analysis of one or more phenotypes, and may include a reference database
for calculating the probability of classification. The kit may include
reagents employed in the various methods, such as devices for withdrawing
and handling blood samples, second stage antibodies, ELISA reagents;
tubes, spin columns, and the like.

[0278]In addition to the above components, the subject kits will further
include instructions for practicing the subject methods. These
instructions may be present in the subject kits in a variety of forms,
one or more of which may be present in the kit. One form in which these
instructions may be present is as printed information on a suitable
medium or substrate, e.g., a piece or pieces of paper on which the
information is printed, in the packaging of the kit, in a package insert,
etc. Yet another means would be a computer readable medium, e.g.,
diskette, CD, etc., on which the information has been recorded. Yet
another means that may be present is a website address which may be used
via the internet to access the information at a removed site. Any
convenient means may be present in the kits.

EXAMPLES

[0279]Below are examples of specific embodiments for carrying out the
present invention. The examples are offered for illustrative purposes
only, and are not intended to limit the scope of the present invention in
any way. Efforts have been made to ensure accuracy with respect to
numbers used (e.g., amounts, temperatures, etc.), but some experimental
error and deviation should, of course, be allowed for.

[0280]To investigate the multimarker approach in distinguishing subjects
with active coronary artery disease from those without disease, we
utilized a large clinical epidemiological study which included 400 cases
of clinically significant ASCVD and 930 control subjects. The study was
designed to examine risk factors and other novel determinants of
atherosclerosis. Serum samples collected at the time of enrollment were
used for simultaneous measurement of multiple inflammatory markers using
a protein microarray. The exact methodology used for the pilot studies
was utilized here (discussed in detail in the examples in WO97/002677
"Methods and Compositions for Diagnosis and Monitoring of Atherosclerotic
Cardiovascular Disease"). Concentrations of a subset of the analytes
tested were significantly higher in case subjects. Classification
algorithms using the serum expression profile of these markers accurately
stratified CAD subjects compared to controls. Moreover, the unique
signature pattern of the biomarkers significantly improved the predictive
capacity of other known markers of CAD. This larger trial replicated our
prior findings and also provided more examples of the use of a
multimarker approach for accurate prediction and diagnosis of
atherosclerotic cardiovascular disease and its various clinical sequelae.

[0281]The selection of a number of informative markers for building
classification models requires the definition of a performance metric and
a user-defined threshold for producing a model with useful predictive
ability based on this metric. In the following section we defined the
target quantity to be the "area under the curve" (AUC), the sensitivity
and/or specificity of the prediction as well as the overall accuracy of
the prediction model. This is the approach we used for selecting the
number of terms for building a predictive model in the absence of any
clinical variables and/or adjusting factors. The process was as follows:
We first randomly split our training data into ten groups, each group
containing subjects identified as "Healthy" or "Diseased" in proportion
to the number of these labels in the complete sample. Each subject was
represented by its 26 marker measurements and the label that identifies
the state of disease (absent, i.e. "Healthy", or present, i.e.
"Diseased"). We chose nine of the groups and for each of the 26 markers
(TIMP1, RANTES, MCP-1, IGF-1, TNFa, IL-5, M-CSF, MCP-2, IP10, MCP-4, IL3,
IFNg, Ang-2, IL-7, IL-10, Eotaxin, IL-2, IL-4, ICAM-1, IL-6, IL-12p40,
MIP1a, IL-5, MCP-3, IL13, IL1b) we trained a model using a given
supervised algorithm, e.g., Linear Discriminant Analysis, Quadratic
Discriminant Analysis, Logistic Regression on all the data of the 9
groups (i.e. we created a training supergroup). We then applied the model
to the tenth group that was excluded from the training procedure and we
estimated the testing error "e" and or a number of prediction quality
measures described earlier. We repeated the same process 10 times,
sampling randomly 9 groups each time for generating a training sample and
using the 10th group for estimating the testing error "e" and the
prediction quality measures. From the sample of the 10 numbers we then
estimated the expected value for each of the prediction quality measures
and/or error, as well as the variance of our estimates. Given these
values, the marker that most improved the average prediction ability of
the model was chosen as the first term in the model.
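The first forward-selection step described above can be sketched as follows. The data here are simulated stand-ins (five markers instead of the 26 in the text), and the use of cross-validated AUC as the quality measure is one of the options the text names; all names and values are illustrative.

```python
# Sketch (hypothetical data) of the forward-selection step described above:
# for each candidate marker, estimate the cross-validated quality measure
# (here AUC over 10 folds) and pick the marker with the best average as
# the first model term.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(1)
n, n_markers = 200, 5                      # the text uses 26 markers
y = np.array([0] * 100 + [1] * 100)        # 0 = "Healthy", 1 = "Diseased"
X = rng.normal(0, 1, (n, n_markers))
X[:, 2] += y * 1.5                         # make marker 2 informative

# Stratified folds keep the "Healthy"/"Diseased" proportions, as described.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = {}
for j in range(n_markers):
    auc = cross_val_score(LogisticRegression(), X[:, [j]], y,
                          scoring="roc_auc", cv=cv)
    scores[j] = (auc.mean(), auc.var())    # expected value and variance

# The marker with the best average prediction ability is the first term.
first_term = max(scores, key=lambda j: scores[j][0])
```

The stored variance estimates also support the alternative criterion mentioned next, i.e., ranking markers by the ratio of the expected quality measure to its variance.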

[0282]As an alternative, we also used another measure of improvement
instead of the average value of the prediction quality measure, for
example we instead selected the term with the highest value of the ratio
of the expected quality measure to its variance estimate. Once the first
term was added to the model, we repeated the process for the remaining
markers that were not selected in the current step. Thus, in the
second step we repeated the aforementioned calculations for the remaining
markers. The selection of the second model term was accomplished by
choosing the term that most improved our target prediction quality
measure, or by using some combination of the expected value of the current
model minus that of the new model, normalized by the errors of those measures.

[0283]FIG. 1 shows the results of applying this process to a set of 1300
subjects. We selected the threshold of AUC>0.85 as our target
prediction quality measure and we selected the terms using a Logistic
Regression model. The quality threshold was satisfied using the following
markers: TIMP1, MCP-1 and RANTES.

[0284]FIG. 2 shows the results of selecting the terms using a Linear
Discriminant Analysis model while keeping the discovery sample and
quality thresholds the same. The comparison with the previous example
indicates that the two models agree on the selected terms that satisfy
our performance criteria.

[0285]Another option for adding terms to each model in a forward fashion
is to use the misclassification error, accuracy, or log-likelihood of the
data. The process was started by adding the first term to the
model. This term was selected so that (i) the misclassification rate was
the smallest from all the rates obtained with any single marker, (ii) the
accuracy was the highest or (iii) the log-likelihood of the data was the
highest. Using 10-fold cross-validation the expected value of this metric
and its standard error was estimated. Once the model with the first term
was created, we again selected the next term by: a) creating a two term
model where the best term from the previous step was combined with each
one of the remaining available markers and b) by finding the marker that
in combination with the term that was already in the model provided the
smallest misclassification error among the remaining markers, the highest
accuracy or the highest increase in log-likelihood. The expected
out-of-sample value and its standard error for the model of size two were
again estimated using 10-fold cross-validation. We continued adding terms
until all terms had been used, estimating the expected value and standard
error for all nested models. Then we chose the smallest model that was
within one standard error of the best value of the quality measure used
for the term selection. The overall approach is
summarized in FIG. 9. In this figure, Model 1,2, . . . N represents any
of the classification algorithms described earlier. The 10-fold cross
validation can be any of 3-fold, 5-fold, 10-fold, . . . (N-1)-fold
(leave-one-out) cross-validation. A demonstration of this approach using
accuracy as the quality criterion is shown in FIG. 10.
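The final model-size choice described above (the one-standard-error rule) can be sketched as follows. The (mean, standard-error) pairs are made up for illustration; in practice they would come from the cross-validation procedure just described.

```python
# Sketch of the one-standard-error rule described above: among the nested
# models, choose the smallest whose cross-validated quality measure is
# within one standard error of the best value. The numbers below are
# illustrative stand-ins for 10-fold cross-validation estimates.
# model size -> (expected accuracy, standard error)
cv_results = {1: (0.78, 0.02), 2: (0.84, 0.02), 3: (0.86, 0.015),
              4: (0.87, 0.015), 5: (0.87, 0.02)}

best_size = max(cv_results, key=lambda k: cv_results[k][0])
best_mean, best_se = cv_results[best_size]
threshold = best_mean - best_se

# Smallest model within one standard error of the best value.
chosen_size = min(k for k, (m, _) in cv_results.items() if m >= threshold)
```

With these illustrative numbers, the four-term model is best (0.87), but the three-term model (0.86) lies within one standard error of it and is therefore chosen.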

[0286]Based on the literature, subjects with CCS<10 are at low risk for
adverse events while subjects with CCS>400 are at high risk for
adverse events. Based on these criteria we built classification models
for these two populations to predict high and low pseudo-coronary calcium
score. We assigned the label "upper" for the subjects with CCS>400 and
the label "lower" for the subjects with CCS<10. We then used the AIC
criterion to identify the terms of the Logistic Regression model that
best separates the two groups. For this application, we allowed clinical
variables to be included in the model if selected based on the AIC
criterion. FIG. 3 shows the order in which terms were dropped. The
clinical variables are the most significant predictors but the minimum of
the selection path is obtained only when protein markers are included
(MCP-1, IFNg). FIG. 4 shows the selection process for the same
classification problem using the cross-validation approach.

Additional Examples

[0287]The following Examples demonstrate various applications using
twenty-four of the markers from Example 1 (excluding RANTES and TIMP1).
the following Examples can be performed using RANTES and/or TIMP1 as
additional biomarkers.

Example 3

AIC Selection Criteria

[0288]As an example of a different selection criterion, we present the
results obtained using the AIC criterion within the framework of a
Logistic Regression model. This criterion is usually used in the context
of selecting the optimum number of terms for a Logistic Regression model.
The criterion balances the error increase due to the removal of a term
with the reduction of the number of degrees of freedom that this term
contributed to the model. Usually, the process of term elimination starts
with the full model and terminates when the removal of a term increases
the AIC value. The results of term elimination as a function of the AIC
criterion are presented in FIG. 5a (the term elimination process is
presented past the optimum point). The AUC predictions for a model
incorporating an increasing number of terms are presented in FIG. 5b. The
addition of terms in the aforementioned model is performed in the reverse
order of their removal from the complete model (i.e., a model including
all 24 of the above markers), as dictated by the AIC criterion in the
term selection process. The latter approach produces a Logistic
Regression model with expected AUC>0.75 using at least one
marker (MCP-1).
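The AIC-driven backward elimination described above can be sketched as follows. The data are simulated (four stand-in markers, only two informative), and the AIC is computed by hand from an effectively unregularized logistic fit; all names and values are assumptions for illustration.

```python
# Sketch (hypothetical data) of AIC-based backward elimination for a
# Logistic Regression model: starting from the full model, repeatedly drop
# the term whose removal most lowers the AIC; stop when every removal
# would increase the AIC.
import numpy as np
from sklearn.linear_model import LogisticRegression

def aic(X, y):
    """AIC = 2k - 2*logL for an (effectively unregularized) logistic fit."""
    model = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
    p = model.predict_proba(X)[:, 1]
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    k = X.shape[1] + 1                     # coefficients plus intercept
    return 2 * k - 2 * loglik

rng = np.random.default_rng(2)
n = 300
X = rng.normal(0, 1, (n, 4))
# Only the first two synthetic "markers" carry signal in this example.
logit = 1.2 * X[:, 0] - 0.9 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

terms = list(range(X.shape[1]))
current_aic = aic(X[:, terms], y)
while len(terms) > 1:
    candidates = [(aic(X[:, [t for t in terms if t != drop]], y), drop)
                  for drop in terms]
    best_aic, best_drop = min(candidates)
    if best_aic >= current_aic:            # any removal would increase AIC
        break                              # optimum reached; terminate
    terms.remove(best_drop)
    current_aic = best_aic
```

This balances, as the text describes, the error increase from removing a term against the degrees of freedom it frees: a term survives only if its contribution to the log-likelihood outweighs its AIC penalty.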

[0289]The process of term selection can be accomplished either with a
forward selection (first, second and third examples within this working
example) or a backward selection (fourth example within this working
example), or a forward/backward selection strategy. This strategy allows
for testing of all the terms that have been removed in a previous step in
the current reduced model.

[0290]The same selection process can be extended to include both markers
and clinical variables. The next two figures present the results for the
case that the candidate variables for a Logistic Regression model include
case that the candidate variables for a Logistic Regression model include
"Hyperlipidemia" (DC912) and "Use of lipid-lowering medication within 160
days before index day" (FIG. 6) or "Statin use," "ACE blockers use" (FIG.
7) along with all 16 markers. These examples demonstrate that the markers
in the set of at least 3 markers required for obtaining an AUC>0.75
can be replaced with clinical variables in the set. The combination of
Hyperlipidemia (DC912) and MCP-4 produces a model with expected value of
AUC˜0.85.

[0291]Using the aforementioned methods we can also select the number of
markers that will optimize the performance of a model without the use of
all the markers. One way to define the optimum number of terms is to
choose the number of terms that produce a model with average predictive
ability (measured as AUC, or equivalent measures of
sensitivity/specificity) that lies no more than one standard error from
the maximum value obtained for any combination and number of terms used
for the given algorithm. Looking back at FIG. 7, a Logistic Regression
model that includes the following markers satisfies these requirements:
Beta Blockers ("DC512"), Statins ("DC3005"), MCP-4, IGF-1, M-CSF, IL-5,
MCP-2, IP-10.

Example 4

ACE Inhibitor Response Prediction Models

[0292]Using the methods described in Examples 1 and 3, we derived models
using Logistic Regression or Linear Discriminant Analysis that classify
samples according to the use of ACE inhibitors. These models were
adjusted for the status of the subject (Control or Case) since the
overall level of the markers depends on whether we deal with a healthy
individual or not. The models find use in a variety of methods such as,
e.g., screening compounds to identify other agents that act as ACE
inhibitors or on convergent pathways, and for monitoring the efficacy of
ACE inhibitor therapy. In the first example, the compound is provided to
a mammalian subject, one or more samples are taken from the subject and
datasets are obtained from the sample(s). The datasets are run through an
ACE Inhibitor Response Prediction model and the results are used to
classify the sample. If the sample is classified as coming from a subject
dosed with an ACE inhibitor, then the compound is likely to be a
presumptive ACE inhibitor. In the second example, one or more samples are
obtained from a subject and datasets from those samples are run through
an ACE Inhibitor Response Prediction model. If the sample is classified
as coming from a subject dosed with an ACE inhibitor then the therapy is
likely to be efficacious. If multiple samplings over time indicate time
dependent changes in the value of a predictor obtained from the model,
then the therapeutic efficacy of the medication therapy is likely
changing, the direction of the change being indicated by a predictor
value trending more toward the medication use classification or the
no-medication use classification. The protein markers used in the
exemplified models are set out in Tables 2 and 3, below, along with the
models' performance characteristics.

[0293]Using the methods described in Examples 1 and 3, we derived models
using Logistic Regression or Linear Discriminant Analysis that classify
samples according to the use of ACE inhibitors or statins. These models
were adjusted for the status of the subject (Control or Case) since the
overall level of the markers depends on whether we deal with a healthy
individual or not. The models find use in a variety of methods such as,
e.g., screening compounds to identify other agents that act as ACE
inhibitors or statins or on convergent pathways, and for monitoring the
efficacy of ACE inhibitor or statin therapy. In the first example, the
compound is provided to a mammalian subject, one or more samples are
taken from the subject and datasets are obtained from the sample(s). The
datasets are run through an ACE Inhibitor or Statin Use Prediction model
and the results are used to classify the sample. If the sample is
classified as coming from a subject dosed with an ACE inhibitor or
statin, then the compound is likely to be a presumptive ACE inhibitor or
statin. In the second example, one or more samples are obtained from a
subject and datasets from those samples are run through an ACE Inhibitor
or Statin Use Prediction model. If the sample is classified as coming
from a subject dosed with an ACE inhibitor or statin then the therapy is
likely to be efficacious. If multiple samplings over time indicate time
dependent changes in the value of a predictor obtained from the model,
then the therapeutic efficacy of the medication therapy is likely
changing, the direction of the change being indicated by a predictor
value trending more toward the medication use classification or the
no-medication use classification. The protein markers used in the
exemplified models are set out in Tables 4 and 5, below, along with the
models' performance characteristics.

Biomarker Profile for Medication Use Responsiveness

[0294]We demonstrate that a panel of markers can be used for monitoring
the medication effect on the level of inflammation of a subject.
Inspecting the distribution of values for a number of markers
(IL-2,IL-5,IL-4) we demonstrate a dosage effect as a function of the
number of medications that a control subject is treated with (i.e. no
medication vs. one medication vs. two medications). As an example for
this approach, we use three medication responsive markers as a panel
(IL-2,IL-4 and IL-5). In order to create a single combined score, we
create a linear discriminant analysis model where the response variable
takes the following levels: "Untreated", "ACE or Statin", "ACE and
Statin" and we use the first discriminant variate as a surrogate for a
combined score. FIG. 8 presents the results from the subjects that are
considered "Healthy" ("Controls") as boxplots for each of the three
"treatment" groups. The grey sections of each boxplot extend from the
first to the third quartile of the value distribution for each class. The
"notches" around the medians are included to facilitate visual
inspection of differences in the level of the median between the classes.
The whiskers extend to 1.5 times the interquartile distance. The outliers
have not been included in the graph. Clearly the combined score shows a
downward trend with increased number of medications. The fact that the
notches for the groups are barely overlapping indicates that the
differences in the median are rather significant. A panel of biomarkers
performs better than any single biomarker alone.
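The combined-score construction described above can be sketched as follows. The data are simulated stand-ins for the three-marker panel, with an artificial dosage effect built in; the marker values and effect sizes are assumptions, not the study's data.

```python
# Sketch (hypothetical data) of the combined score described above: a
# linear discriminant analysis with a three-level response ("Untreated",
# "ACE or Statin", "ACE and Statin"), using the first discriminant
# variate as a single surrogate score for the panel (e.g., IL-2, IL-4, IL-5).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
levels = ["Untreated", "ACE or Statin", "ACE and Statin"]
# Illustrative dosage effect: marker levels shift as medications increase.
X = np.vstack([rng.normal(2.0 - 0.4 * i, 0.3, (40, 3)) for i in range(3)])
y = np.repeat(levels, 40)

lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
combined_score = lda.transform(X)[:, 0]    # first discriminant variate

# Group means of the combined score should trend with medication count.
group_means = [combined_score[y == g].mean() for g in levels]
```

The sign of the discriminant variate is arbitrary, so the trend across treatment groups may be upward or downward; what matters, as in FIG. 8, is the monotonic separation of the group medians.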

[0295]A similar analysis can be performed by creating a single score from
multiple markers using Hotelling's T2 method. In this case we can
estimate the covariance matrix from the data for the untreated group and
calculate the "distance" of each subject based on Hotelling's formula.
The latter approach can be used not only for creating a "combined
distance" from many markers for monitoring medication dosage effect but
also for hypothesis testing of the dosage effect. (See Hotelling, H.
(1947), "Multivariate Quality Control," in C. Eisenhart, M. W. Hastay, and
W. A. Wallis, eds., Techniques of Statistical Analysis, New York:
McGraw-Hill, herein incorporated by reference.)
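The combined-distance computation just described can be sketched as follows, with simulated marker values for the untreated group; all numbers are illustrative.

```python
# Sketch (hypothetical data) of a Hotelling's T2 "combined distance": the
# covariance matrix is estimated from the untreated group, and each
# subject's distance from the untreated mean is computed with
# Hotelling's formula d' * S^{-1} * d.
import numpy as np

rng = np.random.default_rng(4)
untreated = rng.normal(1.0, 0.2, (60, 3))  # marker panel, untreated group
mean = untreated.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(untreated, rowvar=False))

def t2_distance(x):
    """Hotelling's T2 distance of one subject from the untreated mean."""
    d = x - mean
    return float(d @ cov_inv @ d)

# Subjects far from the untreated distribution get larger distances.
near = t2_distance(np.array([1.0, 1.0, 1.0]))
far = t2_distance(np.array([2.0, 2.0, 2.0]))
```

Because the T2 statistic has a known sampling distribution, the same quantity also supports the hypothesis testing of the dosage effect mentioned above.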

[0296]Using the methods described in Examples 1 and 3, we derived models
using Logistic Regression or Linear Discriminant Analysis that classify
samples according to a predicted coronary calcium score. The protein
markers used in the exemplified models are set out in Tables 6 and 7,
below, along with the models' performance characteristics.

[0297]Using the methods described in Examples 1 and 3, we derived models
using Logistic Regression or Linear Discriminant Analysis that classify
samples into stable (i.e., angina) or unstable (i.e., myocardial
infarction) categories. The protein markers used in the exemplified
models are set out in Tables 8 and 9, below, along with the models'
performance characteristics.

[0298]Using the methods described in Examples 1 and 3, we derived models
using Logistic Regression or Linear Discriminant Analysis that classify
samples into disease (i.e., angina or myocardial infarction) or healthy
control categories. The protein markers used in the exemplified models
are set out in Tables 10 and 11, below, along with the models'
performance characteristics. Tables 10 and 11 also indicate how the
performance of the models change as combinations of markers are
substituted.

[0299]We classified a patient into a "Control" or "Disease" category based
on the values of the following markers MCP-1, IGF-1 and TNFa. The costs
of misclassification are taken to be equal for the two classes. Based on
an LDA approach, a new subject with values x of the aforementioned
markers is categorized into the "Disease" category if the left side of
equation (1) is greater than the right side of the equation where:

[0300]a) index 2 corresponds to the "Disease" state

[0301]b) index 1 corresponds to the "Control" state

[0302]c) N is the total size of the training set

[0303]d) N1,N2 are the number of "Control" and "Disease" subjects in the
training set

[0304]e) Σ is the covariance matrix as estimated from the training
set

[0305]f) μ1,2 are the mean vectors of the "Control" and "Disease"
sample respectively

[0306]In order to build an LDA model for the prediction we used a training
set containing the three marker values for 398 subjects that were
identified as "Control" and 398 subjects that were identified as
"Disease." The marker values are first log10-transformed and the
resulting values are used to estimate the required terms of Eq. 1. The
covariance matrix and mean marker vectors for the training set are equal
to:

[0309]We classified a subject with the following values (transformed using
a log10 transformation):

TABLE-US-00016
Subject 1:
MCP-1 IGF-1 TNFa
0.716998 1.316101 0.287882

[0310]Based on these values and Eq. 1, the left side of the equation is
equal to: 0.5291794 while the right side of the equation is equal to
3.232524. Based on the fact that the left side is less than the right
side, the subject was classified into the "Control" category.
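The LDA decision rule used above can be sketched in one standard form; the patent's Eq. 1 and its estimated covariance matrix and mean vectors are not reproduced here, so the numbers below are illustrative stand-ins in the log10-transformed marker space, not the trained model.

```python
# Sketch of an LDA decision rule of the kind described above (equal
# misclassification costs): classify as "Disease" when the discriminant
# left side exceeds the right-side threshold. Covariance and means are
# illustrative, not the patent's estimates.
import numpy as np

Sigma = np.array([[0.04, 0.01, 0.00],      # illustrative covariance matrix
                  [0.01, 0.09, 0.01],
                  [0.00, 0.01, 0.05]])
mu1 = np.array([0.70, 1.30, 0.30])         # "Control" mean vector
mu2 = np.array([0.90, 1.10, 0.45])         # "Disease" mean vector
N1 = N2 = 398                              # class sizes in the training set

Sinv = np.linalg.inv(Sigma)
w = Sinv @ (mu2 - mu1)

def classify(x):
    left = x @ w
    # With N1 = N2 the log prior-odds term is zero.
    right = 0.5 * (mu1 + mu2) @ w + np.log(N1 / N2)
    return "Disease" if left > right else "Control"
```

A subject at the "Control" mean falls below the threshold and a subject at the "Disease" mean falls above it, mirroring the two worked classifications in the text.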

[0311]We classified a second subject with the following log10-transformed
marker values:

Based on these values and using equation 1, the left side is equal to
4.461167 and the right hand side remains 3.232524. Based on this
comparison the subject was classified into the "Disease" category.

[0312]Reference for this and the following example is made to Hastie, T.,
Tibshirani, R., and Friedman, J., The Elements of Statistical Learning:
Data Mining, Inference and Prediction, Springer Series in Statistics
(2001), herein incorporated by reference.

Example 10

Classification using a Logistic Regression Model

[0313]We classified a patient into a "Control" or "Disease" category based
on the values of the following markers MCP-1, IGF-1 and M-CSF. The costs
of misclassification are taken to be equal for the two classes. Based on
a Logistic Regression approach, a new subject with values x of the
aforementioned markers will be categorized as Disease if the log ratio of
the posterior probabilities of class k (=Disease) to class K(=Control) is
greater than zero, otherwise it is categorized as Control (Equation 2).

[0314]In order to fit a Logistic Regression model we used a training set
composed of 398 subjects identified as "Control" and 398 subjects
identified as "Disease." The values of the three markers for each subject
were first log10-transformed. The Logistic Regression fit provides the
following coefficients:

TABLE-US-00018
b0 b1 b2 b3
-4.95059 3.334 -1.27675 1.279328

[0315]A new subject with the following values for the three markers was
classified:

TABLE-US-00019
MCP-1 IGF-1 M-CSF
Subject 1 1.679931 3.493781 1.169145

[0316]The following calculation b0+b1*`MCP-1`+b2*`IGF-1`+b3*`M-CSF` equals
-2.031. Based on the previous discussion this subject has a linear
predictor value less than zero and was classified into the "Control"
category.

[0317]Another subject was classified, based on the following values:

TABLE-US-00020
MCP-1 IGF-1 M-CSF
Subject 2 2.108252 1.7149 0.539566

[0318]Using the same coefficients and formula the linear predictor equals
0.5799186 and Subject 2 was classified into the "Disease" category.
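The linear-predictor calculation in this Example can be reproduced directly from the coefficients and subject values given in the text (small numerical differences from the printed results may arise from the rounding of the printed coefficients, but the sign, and hence the classification, is unaffected).

```python
# Reproducing the logistic-regression linear predictor
# b0 + b1*MCP-1 + b2*IGF-1 + b3*M-CSF with the coefficients and
# log10-transformed subject values given in the text.
b0, b1, b2, b3 = -4.95059, 3.334, -1.27675, 1.279328

def linear_predictor(mcp1, igf1, mcsf):
    return b0 + b1 * mcp1 + b2 * igf1 + b3 * mcsf

# Subject 1: negative linear predictor -> classified "Control".
lp1 = linear_predictor(1.679931, 3.493781, 1.169145)
# Subject 2: positive linear predictor -> classified "Disease".
lp2 = linear_predictor(2.108252, 1.7149, 0.539566)
```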

[0319]Each publication cited in this specification is hereby incorporated
by reference in its entirety for all purposes. In addition to those
publications listed throughout the body of this specification, the
following also is hereby incorporated by reference in its entirety for
all purposes: Tabibiazar R, Wagner R A, Deng A, Tsao P S, Quertermous T.
Proteomic profiles of serum inflammatory markers accurately predict
atherosclerosis in mice. Physiol Genomics. 2006 Apr. 13; 25(2):194-202.