The most challenging problem for physicians treating AIDS patients with anti-HIV drugs is that the virus almost inevitably evolves toward resistance against any administered drug therapy. Once resistance is manifest, the physician must change the therapy regimen, which typically consists of a combination of anti-HIV drugs. Here, we describe bioinformatical methods supporting the choice of an effective follow-up therapy. Using underlying clinical-resistance databases and statistical-learning methods, we identify as-yet-undescribed resistance mutations, predict the level of resistance of a viral variant extracted from the blood of an AIDS patient against anti-HIV drugs, and estimate the expected mutational path of the virus toward resistance against specific combination drug therapies. This computational method enables us to rank possible therapies with respect to their expected effectiveness. We also offer a computational test for the expected effectiveness of a new drug capable of blocking viral cell entry.

Our analyses, which are freely available on the Internet via the server http://www.geno2pheno.org, are used routinely for treating about two-thirds of AIDS patients in Germany.

AIDS is a major scourge worldwide, causing millions of deaths annually. Whereas education and preventive measures keep the number of new infections in the developed world comparatively limited, other parts of the world (notably Sub-Saharan Africa) exhibit very high infection rates. The disease is on the rise globally.20

The AIDS pathogen—the Human Immunodeficiency Virus, or HIV—crossed over to humans from apes as recently as 100 years ago. The pathogen and its new host apparently have not yet adapted through co-evolution. Consequently, HIV is highly pathogenic in humans, unlike chimpanzees, which exhibit very high infection rates with the Simian Immunodeficiency Virus, or SIV, without presenting debilitating symptoms.

AIDS is especially lethal for a number of reasons. For the human population, one danger is that symptoms develop slowly, so hosts can be infectious for extended periods without their contacts knowing. For infected patients, one problem is that the virus inserts its genome into the genome of the infected cell, so these patients cannot be cleared of the virus. Moreover, as we describe here, the virus evolves dynamically, which makes it difficult to produce vaccines against HIV; no vaccine is in sight. Since there are major obstacles to curing AIDS, the objectives of drug therapy are to ease symptoms and delay progress of the disease by suppressing viral replication.

Since the virus continually changes in a patient, physicians are chasing a moving target. Given a particular drug therapy, the virus evolves toward resistance. The drug therapy then has to be changed to suppress what is now the prevalent viral variant in the patient. The underlying biological relationships between the viral genotype—the particular genome sequence of the virus—and the viral resistance phenotype—its ability to escape antiviral drugs—are complex and not well understood. Therefore, drug therapies are selected not so much on the basis of understanding the underlying biology as they are on the basis of clinical experience.

Clinical experience in treating AIDS patients with antiviral drugs has been collected for the past 20 years and assembled in sizeable resistance databases. The complexity of the relationship between viral genotype and resistance phenotype suggests using statistical-learning methods to support computational models for predicting the resistance phenotype from the viral genotype. For this purpose, we have developed the Web server geno2pheno (http://www.geno2pheno.org), offering such analysis for free on the Web.

Replication Cycle of HIV

HIV is not an autonomous organism but rather an enveloped piece of genome, roughly 10,000 letters of genomic text (bases) in protein packaging. This tiny genomic text (compared to three billion letters of the human genome) defines one of the most vicious biological killers. The structure of the HIV virus particle (virion) is known in detail.9

As with all viruses, to replicate, HIV uses the cells it infects, usually those of the human immune system (such as T-lymphocytes). Knowledge of the replication cycle of HIV (see Figure 1) is the basis for all drug therapies in use today. The genome of HIV does not consist of DNA (as in humans) but of the close relative RNA that in humans is used for translating genomic information and regulating cellular processes. The replication cycle of HIV begins with HIV using its surface protein gp120 to bind to surface proteins of the host cell. This binding event triggers a cascade of structural changes of the participating proteins that result in HIV entering the host cell. Once inside, HIV sheds its molecular envelope and uses a special viral protein—the reverse transcriptase (RT)—to copy its RNA genome to DNA. The DNA is then transported into the cell nucleus where it is spliced into the genome of the host cell with the help of a second viral protein—the integrase (IN). At this stage, the viral DNA is called a "provirus." Once the cell begins to divide, as it does within an immune response, it manufactures all components of the virus. These components assemble near the cell surface, and a new still-immature virion buds from the cell. In a final maturation step, strings of viral proteins in the immature virion (the so-called polyproteins) are cleaved to yield the functional viral proteins. This renders the virion infectious. The protein performing the cleavage is the viral protease (PR). Each host cell is able to produce thousands of virions for a long period before inevitably dying.

Drug Therapies Against HIV

More than two dozen drugs against HIV are in clinical use; see http://www.fda.gov/oashi/aids/virals.html for the current list of anti-HIV drugs approved by the U.S. Food and Drug Administration. All are small molecules that block (inhibit) the function of a specific protein involved in the viral replication cycle, the so-called target protein. One way to block a protein is to bind to it in a place that deactivates the protein, either by replacing its natural binding partner or by interfering with essential protein movements. Target proteins can be viral or human. The classical target proteins are viral, namely RT and PR. Originally, viral proteins were preferred because one does not want to interfere with unknown functions of human target proteins. However, viral target proteins have the disadvantage that the virus can quickly change them through mutation and thus evolve toward drug resistance. More recently, human proteins have also been targeted by antiviral drugs.

Toward Resistance

If the virus were not so variable, one or two AIDS drugs would suffice. But the virus changes its genome with practically every copy. The reason for such flexibility is that RT lacks a proofreading mechanism and does not repair copy errors. Mutations in the HIV genome can result in changes in the composition of its proteins. Most of these changes are detrimental or even lethal to the virus, but with many millions to even billions of virus copies produced daily in the same patient, chances are high that a viral variant will arise quickly whose target protein remains functional even in the presence of a drug. Such a virus is resistant to the drug.

Suppressing viral replication means reducing the number of experiments the virus can perform to produce a resistant variant. To raise the barrier the virus must overcome to escape toward resistance, several drugs targeting different viral proteins are given simultaneously. This scheme, called highly active antiretroviral therapy, or HAART, renders therapies effective for much longer periods of time. Even so, the virus always wins in the end: most current therapies remain effective for only months to a few years.

Antiviral Therapies

Once the virus is resistant, the treating physician must select a new drug therapy that effectively suppresses the present viral variant. The standard of care today is to use diagnostic tools for selecting a new therapy regimen. There are two fundamental approaches toward this goal:

Phenotypic resistance testing. Phenotypic resistance testing is a lab test that exposes virus taken from a patient's blood serum to increasing drug concentrations in cell culture and measures how quickly the replication rate of the virus declines. The decline is compared with that of a nonresistant reference virus. The comparison yields a quantitative measure of viral resistance against individual drugs, the resistance factor: the drug concentration that cuts the replication rate of the patient's virus in half, divided by the drug concentration that cuts the replication rate of the reference virus in half. Larger resistance factors mean greater resistance.
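As a minimal sketch, the resistance factor defined above is a simple ratio of two half-inhibition concentrations. The function name and the example concentrations below are hypothetical and serve only to illustrate the arithmetic.

```python
def resistance_factor(ic50_patient: float, ic50_reference: float) -> float:
    """Ratio of the drug concentration that halves replication of the
    patient's virus to the concentration that halves replication of the
    reference virus. Both values must use the same unit."""
    if ic50_patient <= 0 or ic50_reference <= 0:
        raise ValueError("concentrations must be positive")
    return ic50_patient / ic50_reference


# Hypothetical measurements: the patient's virus needs 8 times more drug.
print(resistance_factor(ic50_patient=2.0, ic50_reference=0.25))
```

A value near 1 means the patient's virus behaves like the reference virus; larger values indicate stronger resistance.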

Phenotypic resistance testing meets with major obstacles when used in clinical practice, mainly because such testing is restricted to labs with high security levels and is thus difficult to standardize and not sufficiently accessible. Cost is another issue.

Genotypic resistance testing. In contrast, genotypic resistance testing determines the genomic sequences of the relevant parts of the viral genome taken from a patient's blood serum. The relevant genome sequence can be obtained cheaply, quickly, and with standardized procedures by many laboratories. However, it is not easy to infer the resistance phenotype from the viral genotype. Virologists used to perform this interpretation by hand with the help of a so-called mutation table; mutation tables are offered and continually updated by such authorities as the International AIDS Society,10 collecting the global knowledge on mutations observed to cause resistance against specific drugs. Figure 2 is an excerpt from a mutation table covering three protease inhibitors. The blue bar represents the protein sequence, here the protease with 99 amino-acid positions. Numbers inside the blue bar indicate protein-sequence positions. The amino acid of the reference virus at that position is given above the number. Resistance mutations at that position are indicated below the number. Each row pertains to a single drug named to the left of the row. Mutations enter the table as a result of committee consensus. More recently, the tables have been turned into expert systems that provide more complex rules. These systems can also express interactions between different mutations that result in resistance or susceptibility of the virus to a given drug.16

Computational Biology

One problem with mutation tables and expert systems is they are the result of a consensus among human experts, rather than being systematically derived from the underlying clinical data. This is where the contribution of computational biology comes in. If we can render the clinical resistance databases computer-readable, we can apply statistical-learning methods to systematically derive estimates of the resistance phenotype from the viral genotype. We can also assess not only the level of resistance of the virus present in the patient but also estimate the path the virus will take toward resistance in the future if presented with a specific drug therapy, along with the time the virus will take to get there.

Since 1988, we have been partners in a number of consortia collecting HIV-resistance data comprising viral genotypes, associated clinical markers (such as counts of virus and immune cells in the blood), and phenotypic-resistance data where available. We did this nationally in Germany through the Arevir database.17 In 2004, we co-founded the EuResist consortium, whose database is the result of integrating several large resistance databases for all of Europe.18 To our knowledge, the EuResist database is the largest HIV-resistance database worldwide, harboring data on just under 100,000 therapies for almost 34,000 patients. Paired data on viral mutations and clinical response to treatment are available for more than 5,000 therapies.

Identifying new resistance mutations. Given an HIV-resistance database, we use statistical methods to systematically find resistance mutations. A resistance mutation is one for which viruses resistant to a given drug are highly enriched among the viral variants carrying the mutation, compared with the variants lacking it. The "information content" a mutation harbors on viral resistance against a given drug can be quantified in various ways, including mutual information and distance from the decision boundary in a discriminatory classifier. Using such methods, we have uncovered new, that is, as-yet-undescribed resistance mutations.19 That study won a Best Presentation award at the Third European HIV Resistance Workshop, Athens, Greece, in 2005. This peer recognition reflects how much virologists and clinicians are interested in approaches to identifying resistance mutations beyond the classical mutation tables.
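To make the mutual-information criterion concrete, the following sketch computes the mutual information (in bits) between the presence of one mutation and a binary resistance label. It uses the textbook definition on toy data; the function name and the data are assumptions for illustration, not the procedure of the cited study.

```python
from collections import Counter
from math import log2


def mutual_information(has_mutation, is_resistant):
    """Mutual information in bits between two equally long binary sequences:
    mutation present/absent and virus resistant/susceptible."""
    n = len(has_mutation)
    if n == 0 or n != len(is_resistant):
        raise ValueError("need two non-empty sequences of equal length")
    joint = Counter(zip(has_mutation, is_resistant))
    p_mut = Counter(has_mutation)
    p_res = Counter(is_resistant)
    mi = 0.0
    for (m, r), count in joint.items():
        p_joint = count / n
        mi += p_joint * log2(p_joint / ((p_mut[m] / n) * (p_res[r] / n)))
    return mi


# Toy data: the mutation perfectly predicts resistance (1 bit) ...
print(mutual_information([1, 1, 0, 0], [1, 1, 0, 0]))
# ... or carries no information about it (0 bits).
print(mutual_information([1, 1, 0, 0], [1, 0, 1, 0]))
```

Ranking all observed mutations by such a score is one simple way to surface candidates that a mutation table does not yet list.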

Resistance prediction based on complete viral genomes. The second class of models incorporates multivariate analysis to systematically deduce the kind of information offered less systematically by rule-based expert systems. We have produced many such models, including classifiers (into the resistance classes resistant and susceptible) and regression models that estimate the numerical value of the resistance factor. All models are trained on the data available in our resistance databases, notably genotype-phenotype pair data, that is, viral variants for which we have both the viral genotype and the resistance factor. We employed decision trees6 and random forests to determine the classifications. For regression we found support-vector machines are most effective.5 Our statistical-learning methods are state-of-the-art and adapted to the respective problem; the sidebar "Statistical Learning Methods" outlines two such methods: mutagenetic trees and support-vector machines. Modeling and feature selection are the focus of the effort. Appropriate statistical validation of the resulting models represents another major aspect of our research.

Figure 3 is a decision tree for the resistance of HIV against the PR inhibitor saquinavir. The branching nodes are labeled with amino-acid positions in the target protein PR. Terminal nodes are labeled with the classes "resistant" and "susceptible," respectively. Edges leaving a node are labeled with amino acids found at these positions. The amino acid of the reference virus (no mutation) is in red. The path leading from the root of the tree (top) to the blue arrow indicates a single mutation at position 54 from the reference Isoleucine (I) to Valine (V). (All other edges along the path represent the reference virus.) The resulting virus is resistant according to the model (red terminal node). However, if in position 72, there is also a mutation from the reference Isoleucine (I) to Valine (V) (red arrow), then the virus is susceptible (green terminal node) to treatment with the drug. Such resensitization events present interactions between different mutations and are derived systematically from the procedure of learning decision trees for drug resistance. Cross-validation helps us show that our decision trees make accurate predictions in approximately 90% of all cases.
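The resensitization described above can be written down as a pair of nested tests. The sketch below encodes only the two branches spelled out in the text (positions 54 and 72 of PR, reference amino acid I at both); it is not the full tree of Figure 3, and it returns `None` for every case the text does not cover. The function name and input format are hypothetical.

```python
def saquinavir_branch_sketch(pr_residues):
    """pr_residues maps protease positions to amino acids; positions that
    are missing are assumed to carry the reference amino acid."""
    at_54 = pr_residues.get(54, "I")
    at_72 = pr_residues.get(72, "I")
    if at_54 == "V" and at_72 == "I":
        return "resistant"      # single mutation I54V
    if at_54 == "V" and at_72 == "V":
        return "susceptible"    # additional I72V resensitizes the virus
    return None                 # not described in the text


print(saquinavir_branch_sketch({54: "V"}))           # resistant
print(saquinavir_branch_sketch({54: "V", 72: "V"}))  # susceptible
```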

Our resistance models are the basic service of the geno2pheno server. Practicing physicians and laboratory virologists paste in the nucleic acid sequence of the relevant genes of the viral variant extracted from a patient's blood. The analysis responds with the kind of output listed in Figure 4, where each row represents a drug. Column 1 names the drug. Column 2 gives the estimated resistance factor. Column 3 gives a normalized value reflecting the significance of the resistance value. Column 4 lists mutations found in the input sequence, red if they strengthen the resistance of the virus and green if they weaken it. The data in the figure points to strong resistance against many inhibitors of RT and therapy options targeting PR. The geno2pheno server is the basis for supporting treatment decisions in about two-thirds of HIV-infected patients treated in Germany.13 This means at least 12,000 decisions for treatment selection per year in Germany involve geno2pheno or its findings.

Chasing the virus. This analysis treats each drug separately. Given the output in Figure 4, the physician assesses the resistance level of the virus against each individual drug and manually composes the combination drug therapy that is (hopefully) effective against the present virus. We also look into the future of the virus. Presented with a given combination drug therapy, how will it react? What are its mutational escape paths and how long will the therapy stay effective? The virus does not just randomly introduce mutations. Rather, it follows more-or-less established mutational escape paths; Figure 5 outlines two favored paths from a therapy with the single AIDS drug zidovudine (ZDV, AZT). (The notation is analogous to that of Figure 3.) We denote with K70R the mutation of K to R in position 70 (of RT). Hence, one escape path is K70R followed by K219E/Q.
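The mutation notation used above (reference amino acid, position, one or more mutant amino acids separated by slashes) is easy to handle programmatically. The sketch below parses labels such as K70R or K219E/Q and checks whether a viral variant carries the mutation; the helper names and the dictionary representation of a variant are assumptions made for illustration.

```python
import re

# Reference amino acid, position, then one or more alternatives: K219E/Q
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z](?:/[A-Z])*)$")


def parse_mutation(label):
    """Split a label like 'K219E/Q' into ('K', 219, ['E', 'Q'])."""
    match = MUTATION_RE.match(label)
    if match is None:
        raise ValueError(f"not a mutation label: {label!r}")
    reference, position, alternatives = match.groups()
    return reference, int(position), alternatives.split("/")


def carries(residues, label):
    """residues maps positions to amino acids; missing positions are
    assumed to carry the reference amino acid named in the label."""
    reference, position, alternatives = parse_mutation(label)
    return residues.get(position, reference) in alternatives


variant = {70: "R", 219: "Q"}  # hypothetical RT variant
print(parse_mutation("K70R"))         # ('K', 70, ['R'])
print(carries(variant, "K219E/Q"))    # True
```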

The biological reasons for the virus following these paths are not well understood, but the paths show up in a clinical HIV-resistance database. Finding them is simple with longitudinal data, that is, sequences of viral genotypes and clinical parameters from the same patient over long periods of time. However, such data is difficult to come by. Our databases are dominated by cross-sectional data involving only one or a few data points for each patient. Nevertheless, we are still able to identify favored escape paths from cross-sectional data, as in Figure 5.

A database of cross-sectional data on therapies with zidovudine will not contain many viruses having mutation M41L but not the mutation T215F/Y. This mutational pattern indicates the direction of the escape path. We have developed statistical models that pinpoint the paths, so-called mixtures of mutagenetic trees, from the database4; Figure 6 outlines the trees derived from the database concerning zidovudine therapy.
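The co-occurrence argument above can be illustrated with a simple count. If one mutation is rarely seen without the other, it probably arises later on the escape path. The sketch below is a crude pairwise heuristic on made-up samples, not the mixture-of-mutagenetic-trees estimation itself; all names and data are hypothetical.

```python
def count_patterns(samples, a, b):
    """Count how often mutations a and b occur together, alone, or not at all.
    Each sample is the set of mutations seen in one viral genotype."""
    counts = {"both": 0, "only_a": 0, "only_b": 0, "neither": 0}
    for mutations in samples:
        has_a, has_b = a in mutations, b in mutations
        if has_a and has_b:
            counts["both"] += 1
        elif has_a:
            counts["only_a"] += 1
        elif has_b:
            counts["only_b"] += 1
        else:
            counts["neither"] += 1
    return counts


def likely_order(samples, a, b):
    """Return (earlier, later), or None if the counts do not favor an order."""
    counts = count_patterns(samples, a, b)
    if counts["only_a"] < counts["only_b"]:
        return (b, a)  # a is rarely seen without b, so b likely comes first
    if counts["only_b"] < counts["only_a"]:
        return (a, b)
    return None


# Made-up cross-sectional samples under zidovudine therapy.
samples = [set(), {"T215F/Y"}, {"T215F/Y"}, {"T215F/Y"},
           {"T215F/Y", "M41L"}, {"T215F/Y", "M41L"}, {"M41L"}]
print(likely_order(samples, "M41L", "T215F/Y"))  # ('T215F/Y', 'M41L')
```

A real analysis would have to account for sampling noise and for more than two mutations at once, which is what the tree models are for.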

Figure 6 outlines a mixture model of two mutagenetic trees, the bottom one expressing the two thymidine analogue mutations (TAM) escape paths and the top one (noise tree) expressing an unstructured escape to resistance. The mixture model indicates that 78% of the data is explained by the escape via the TAM paths; 22% can be viewed as noise. The sidebar explores mixtures of the mutagenetic trees model.
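Reading the 78% and 22% above as mixture weights, the probability of a mutational pattern under the mixture is the weighted sum of its probabilities under the two trees. The sketch below shows this combination and the resulting share each tree takes in explaining one pattern; the per-tree probabilities are hypothetical placeholders.

```python
def mixture_probability(weights, component_probs):
    """Weighted sum of the probabilities a pattern has under each tree."""
    return sum(w * p for w, p in zip(weights, component_probs))


def responsibilities(weights, component_probs):
    """Share of each tree in explaining one observed pattern."""
    total = mixture_probability(weights, component_probs)
    if total == 0:
        raise ValueError("pattern has zero probability under the mixture")
    return [w * p / total for w, p in zip(weights, component_probs)]


weights = [0.78, 0.22]          # TAM tree, noise tree (from the text)
pattern_probs = [0.0, 0.1]      # hypothetical: pattern violates the TAM paths
print(responsibilities(weights, pattern_probs))  # [0.0, 1.0]
```

A pattern that is impossible under the structured tree is thus attributed entirely to the noise tree, which is what lets the mixture absorb genotypes that do not follow the favored paths.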

The analysis of viral escape is available on the geno2pheno server via the applet known as THEO (therapy optimization). In the Web-server version of the software, THEO ranks all reasonable therapies by the probability of their staying effective for six months or longer. The statistical method for doing this is discussed in the sidebar section on support-vector machines. Figure 7 outlines the results of THEO on the same data as in Figure 5.

Training the model requires data encompassing the viral genotype, the drugs involved in the therapy, and clinical follow-up data on the effectiveness of the therapy. How to characterize a successful therapy is complex. We do not, for example, need the resistance factor to be input for each query. We can supply it through our computational-resistance prediction method discussed earlier. Also, the expected future viral evolution can be estimated through mutagenetic trees.

THEO, which has been validated extensively, improves the accuracy of therapy selection substantially1,2; for example, approximately 24% of the therapy selections reported in the 2006 version of our Arevir database turned out to be ineffective. Using THEO could have helped reduce the error rate in selecting effective therapies below 15%.

The EuResist project (http://www.euresists.org) adds two qualities to the research we discuss here: Data collection includes data from several European countries; and, on the EuResist prediction server, three independently developed prediction engines are executed and return individual results and a consensus prediction.18

New Drugs

Using sophisticated methods to administer antiviral combination drug therapies does not obviate demand for continually developing new drugs. For an individual patient, administering a drug provokes resistance mutations that accumulate within the virus genome. Eventually, only new drugs with new modes of action or even new target proteins will deliver additional effective drug therapies. Moreover, AIDS drugs age as resistance mutations accumulate in the global viral population, necessitating continuous development of new drugs. And clinical side effects enforce the development of new drugs with the same mode of action as existing "old" drugs. Such new drugs might replace the "old" drugs but might also provoke slightly different resistance mutations.

Drugs targeting RT and PR were the basis of AIDS therapy until the early 2000s. Since 2003, drugs targeting other proteins have come onto the market. Especially attractive targets for anti-HIV drugs are proteins facilitating cell entry of HIV. Such targets are chosen because blocking viral cell entry helps prevent integration of the viral DNA into the cellular genome. To understand how we block viral cell entry we must look at the process of HIV entering the cell in more detail (see Figure 8). First, the viral surface protein gp120 binds to the cellular receptor protein CD4. This leads to a conformational change in gp120 so it can then bind to an additional cellular protein, the so-called co-receptor. The binding of gp120 to the cellular co-receptor triggers the actual viral cell entry, during which the helical (corkscrew-like) viral surface protein gp41 penetrates the cellular membrane, and the hull of HIV fuses with the cellular membrane. HIV can use one of two cellular surface proteins, CCR5 or CXCR4, as a co-receptor; some viral strains use either. The co-receptor specificity of HIV is also called viral tropism. A virus using CCR5 is called an R5-virus. Analogously, a virus using CXCR4 is called an X4-virus. A virus using either co-receptor is called dual-tropic, or an R5/X4-virus.

Viral tropism has important clinical consequences. For example, the initial infection results almost exclusively in an R5-virus population; we assume that X4-viruses may infect the patient but can be controlled initially by the immune system. Approximately 1% of the Caucasian population worldwide lacks a functional gene for CCR5, has no apparent symptoms, and is resistant to being infected by HIV. As the disease progresses, a virus using CXCR4 can become dominant.

Targets for drugs that block cell entry are the viral surface protein gp41 and the cellular co-receptor CCR5. The latter is targeted by the drug Selzentry/Celsentri, which contains the active substance maraviroc (developed by pharmaceutical manufacturer Pfizer). Regulatory agencies in Europe and the U.S. require viral tropism testing before administration of this drug. As with resistance analysis, there are again two options for a viral tropism test: One is a lab-based phenotypic test, the other a genotypic test with computer-based interpretation. The advantages and disadvantages of each are similar to those in resistance testing; for example, phenotypic tests are accurate but take a long time and are expensive and not always easily accessible. Moreover, and in contrast to phenotypic resistance tests, phenotypic tropism tests provide only a classification into X4-capable or not-X4-capable and no quantification of the risk of using the wrong co-receptor.

The main problem with genotypic tests is the elucidation of the genotype-phenotype relationship. The geno2pheno server offers a prediction for viral tropism from genotype. As with resistance analysis it is based on careful modeling of the input and on the development of a multivariate statistical model trained on genotype-phenotype pair data.12 In this instance, the phenotype is the viral tropism, not the resistance against a drug, though the co-receptor switch can be viewed as a way for HIV to evade drugs blocking CCR5. Three notable advantages of this genotypic approach are lower costs, wider availability, and a quantification of the risk of using the CXCR4 co-receptor.

Measuring the Viral Quasi-Species

A problem with genotypic data that seems more relevant for predicting viral tropism than for predicting drug resistance is that the patient harbors not a single viral variant but rather a diverse viral population, or so-called quasi-species. Classical genotypic measurements reduce the quasi-species to a single viral variant (the dominant one) or to a sequence consensus of a few frequent viral variants. However, minorities of X4-virus present in the viral quasi-species (but not detected by the genotypic test) can accumulate in the patient under therapy with CCR5-blockers. Detecting such minorities may be clinically important, and phenotypic tests are able to detect them. To enable genotypic tests to also detect them, we use new deep sequencing technology called pyrosequencing14 to generate data from which appropriate computational procedures reconstruct (with great accuracy) the profile of the whole quasi-species. One of our current research activities targets predicting viral tropism and its clinical consequences based on such data.


Outlook

The work described here can now be extended in several directions. For example, a multitude of questions pertain to the statistical-modeling procedure, including those involving the representativeness of the clinical databases, how to improve prediction accuracy when sufficient training data is unavailable, and how to follow different notions of therapy success.

More fundamentally, the technology applies to other viral infections whose pathogens exhibit dynamic evolutionary development, a property shared by Hepatitis C (caused by HCV) and Hepatitis B (caused by HBV). In both cases, drug development and the collection of resistance data have not advanced as far as they have for HIV. We are involved in projects that collect such data, intending to transfer our technology to these diseases. We have also gone beyond infectious diseases and applied the mutagenetic-trees technology to assessing the status of tumor progression in cancers from data on the evolutionary degeneration of the genomes of the related tumor cells.15

Thus far, our analysis is based mostly on pattern matching with limited concrete biology in the form of mechanistic models of the creation of the viral phenotype. Methods from experimental virology and systems biology can be used to generate data that facilitates development of such models. Incorporating them into the prediction of viral resistance and therapy effectiveness should increase the accuracy of the relevant prediction procedures and help further our understanding of how the viral phenotype develops.

Finally, though not included in our present analysis, host factors, including a patient's immunotype, also play a role in disease development and the effectiveness of drug therapy. For instance, it is under debate whether the immune system initially suppresses the enrichment of preexisting X4-viruses in the viral quasi-species. If this is the case, solely detecting X4 minorities need not be clinically significant; such detection does not necessarily predict the breakthrough of the viral variants, as long as the immune system is intact. Indeed, we and others have observed that the risk of X4-virus emerging rises with decreasing immune-cell count, reflecting the decreased intensity of the patient's immune response. Such observations strongly encourage construction of a comprehensive model that includes information on all three players—pathogen, drug, and host.

17. Roomp, K. et al. Arevir: A secure platform for designing personalized antiretroviral therapies against HIV. In Proceedings of the Third International Workshop on Data Integration in the Life Sciences (Hinxton, U.K. July 20–22). Springer Verlag, Berlin, Heidelberg, 2006, 185–194.

Mixtures of mutagenetic trees. A mutagenetic tree is a tree-shaped Bayesian model; two are included in Figure 6. The tree is rooted, and its root represents the viral wildtype, or the absence of mutation. Each other tree node represents a mutation. The edges of the tree are directed downward and labeled with conditional probabilities. Given the presence of all mutations along the path from the root of the tree to the source node of an edge, the label of the edge indicates the probability that the mutation at its target node takes place.
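Under the definition above, the probability that a single mutagenetic tree assigns to a mutational pattern is a product over its edges. The sketch below implements that product for a tree stored as a dictionary; the representation, the function name, and the edge probabilities are assumptions for illustration, while the path K70R followed by K219E/Q is taken from the text.

```python
ROOT = "wildtype"


def pattern_probability(tree, pattern):
    """tree maps each mutation to (parent, conditional probability).
    pattern is the set of mutations observed in one viral variant and is
    assumed to contain only mutations that appear in the tree."""
    present = set(pattern) | {ROOT}
    probability = 1.0
    for mutation, (parent, p) in tree.items():
        if mutation in present:
            if parent not in present:
                return 0.0          # pattern skips a required predecessor
            probability *= p        # edge was taken
        elif parent in present:
            probability *= 1.0 - p  # edge was available but not taken
        # if neither parent nor mutation is present, the edge contributes 1
    return probability


# Hypothetical edge probabilities on one escape path.
tree = {"K70R": (ROOT, 0.5), "K219E/Q": ("K70R", 0.5)}
print(pattern_probability(tree, set()))                # 0.5
print(pattern_probability(tree, {"K70R"}))             # 0.25
print(pattern_probability(tree, {"K70R", "K219E/Q"}))  # 0.25
print(pattern_probability(tree, {"K219E/Q"}))          # 0.0
```

The four probabilities sum to 1, and the pattern that carries K219E/Q without K70R is impossible under this tree, which is why a separate noise component is useful in practice.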

In principle, a mutagenetic tree can be used to generate a set of viral variants by performing a random experiment based on the probabilities at the edges of the tree. We are not interested in explicitly performing such an experiment. Rather, given a set of viral variants (such as the subset of viral genotypes in our resistance database that has seen a certain drug, like saquinavir) we are looking for the mutagenetic tree that generates that set with greatest probability (maximum likelihood model). This tree best represents the escape of the virus toward resistance against the drug saquinavir.

Desper et al.8 presented a method for finding a mutagenetic tree that is optimal under restricted circumstances and good (only) in the general case; the result is derived not in a viral context but in the context of cancer research. We extended the method to be able to generate several trees,4 because viral escape paths do not usually submit to a single tree model, as reflected in Figure 6. Again, this method is heuristic; it does not find the best model but just a reasonably good model. Rather than labeling the edges of a mutagenetic tree with conditional probabilities, we can also annotate them with expected times for the relevant mutation to occur. Labeling affords a route to analyzing the times the virus takes to escape toward resistance.3 We use this model to assess therapy effectiveness.

Classifying therapy success with support-vector machines. Different versions of THEO have used different multivariate statistical-learning methods to come up with accurate classifiers. Among them are logistic model trees11 and support-vector machines.7 Support-vector machines are a recent, popular method for classifying data that regards data as points in a (usually high-dimensional) Euclidean space. In our case, each data point represents a therapy change episode, or event where physicians assign a new therapy based on a viral genotype seen in a patient. Some therapy selections are successful, others are not. This dichotomy represents our binary classification problem. The question of what is a successful therapy and what is a failure, both medically and methodically, is beyond our scope here.

A linear support-vector machine defines a hyperplane that best separates the set of points indicating therapy successes from the points indicating therapy failures. The hyperplane divides the Euclidean space into two half-spaces, one for therapy success, one for therapy failure. What is the "best" hyperplane (for minimizing risk of wrong predictions) is defined in terms of two criteria:

Discriminating between therapy successes and failures. As few therapy data points as possible should be located "on the wrong side" of the hyperplane, that is, we do not want to see therapy failures in the half-space for the successes and vice versa. The farther a point lies from the hyperplane on the wrong side, the more it reduces the quality of the model; and

Maximizing prediction reliability. The hyperplane should be as distant as possible from the closest correctly classified points. Since the hyperplane represents the "decision boundary," points lying close to it represent uncertain decisions, and small changes in the data or in the location of the hyperplane can reverse their classification.

Quadratic programming techniques are used to find the optimal hyperplane according to these criteria.
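The two criteria correspond to the two terms of the textbook soft-margin objective: a margin term that shrinks as the hyperplane moves away from the closest points, and a hinge penalty that grows with the distance of points on the wrong side. The sketch below evaluates this objective for a given hyperplane; it is the standard formulation, not necessarily the exact variant used in THEO, and the trade-off parameter `c` and the toy points are assumptions.

```python
def decision_value(w, b, x):
    """Signed score of point x relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def soft_margin_objective(w, b, points, labels, c=1.0):
    """0.5 * ||w||^2 (margin term) + c * sum of hinge losses (misfit term).
    labels are +1 (therapy success) or -1 (therapy failure)."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge = sum(max(0.0, 1.0 - y * decision_value(w, b, x))
                for x, y in zip(points, labels))
    return margin_term + c * hinge


# Toy data: the third point is a failure lying on the success side.
points = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
labels = [1, -1, -1]
print(soft_margin_objective([1.0, 0.0], 0.0, points, labels))  # 2.0
```

A quadratic-programming solver searches over `w` and `b` for the hyperplane that minimizes this quantity.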

While we have taken state-of-the-art versions of support-vector machines developed by others, our main objective here is to define the Euclidean space to which we apply the support-vector machine. We must therefore address the following issues:

Representing viral genotypes. Should we use binary indicator variables? Which mutations should we consider? Considering all possible mutations leads to a high-dimensional space and is thus infeasible; and

Additional information for the method. The therapy we want to apply is a necessary input. Additional input includes predictions of resistance factors against single drugs, the probability that the virus will achieve resistance against a drug in a certain time interval (estimated via the mutagenetic trees), and previous antiretroviral drugs to which the patient was exposed.
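One simple answer to the representation questions above is a vector of binary indicators: one coordinate per mutation in a preselected list and one per drug in the candidate therapy. The sketch below shows such an encoding; the chosen mutation list, the drug abbreviations, and the function name are hypothetical, and the additional inputs named above (predicted resistance factors, escape probabilities, treatment history) would be appended as further coordinates.

```python
def encode_episode(genotype_mutations, therapy_drugs, mutation_vocab, drug_vocab):
    """Map one therapy change episode to a point in Euclidean space using
    binary indicators for selected mutations and for the drugs given."""
    features = [1.0 if m in genotype_mutations else 0.0 for m in mutation_vocab]
    features += [1.0 if d in therapy_drugs else 0.0 for d in drug_vocab]
    return features


mutation_vocab = ["M41L", "K70R", "T215F/Y", "K219E/Q"]  # preselected mutations
drug_vocab = ["ZDV", "SQV"]                              # zidovudine, saquinavir
point = encode_episode({"K70R", "K219E/Q"}, {"ZDV"}, mutation_vocab, drug_vocab)
print(point)  # [0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
```

Restricting the vocabulary to a preselected list keeps the dimension of the space manageable relative to the number of available training episodes.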

Addressing them is difficult, as we must balance the amount of information we present to the method against the available data. The more information we present, the more complex are the resulting models. However, we must find the best model on the basis of limited data. If models are too complex we incur the risk of overtraining the model. An overtrained model incorporates not only patterns pertaining to the phenomenon or process we want to analyze and whose results we want to predict (here viral resistance) but also idiosyncrasies of the particular data set on which we derived the model. Such idiosyncrasies do not generalize to future data. Thus an overtrained model suffers from reduced predictive power. We have performed several studies and reported our choices.1,2,18
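A common guard against overtraining is cross-validation: the model is repeatedly fitted on one part of the data and evaluated on the held-out remainder, so that idiosyncrasies of the training set do not inflate the accuracy estimate. The sketch below only generates the index splits for k-fold cross-validation; the function name and the interleaved fold assignment are assumptions, and real studies may need to keep all episodes of one patient in the same fold.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k-fold cross-validation.
    Sample i is assigned to fold i mod k."""
    if not 2 <= k <= n_samples:
        raise ValueError("need 2 <= k <= n_samples")
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(5, 2):
    print(train, test)
# [1, 3] [0, 2, 4]
# [0, 2, 4] [1, 3]
```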
