An Examination of the Clinical Research on the Medical Treatment of ADHD

In 1998, the National Institutes of Health held a consensus panel to examine the issue of attention deficit hyperactivity disorder (ADHD). The consensus panel concluded that "at the present time, there is a paucity of data providing information on long-term treatment beyond 14 months. Although trials combining drugs and behavioral modalities are under way, conclusive recommendations concerning treatment for the long term cannot be made easily."(1)

Children diagnosed with ADHD are typically medicated for years. The medications used, including Ritalin®, have been studied for over 40 years. Why, then, is there no long-term, definitive study? Is it sufficient to study treatment outcomes for about a year or less? In the absence of definitive evidence that medical treatment is effective in the long term, is it ethical to continue to treat children with potent stimulants for years on end?

In 1999, the results of the Multimodal Treatment Study of Children with ADHD (MTA) were published. The National Institute of Mental Health (NIMH) regarded the MTA as having critical importance in the study of the treatment of ADHD.(2) This study followed children for 14 months. More recently, the NIMH indicated:

Effective treatments for children with attention deficit hyperactivity disorder (ADHD) exist, but a major gap in our knowledge is the lack of adequate data on the long-term effects of these treatments. For instance, it is not known if effective treatment of ADHD symptoms results in improved educational achievement, decreased antisocial behavior, reduced substance abuse, or better occupational status. Likewise, it is not known if exposure to amphetamine-like stimulant medications for extended periods of time during child development may carry negative consequences, as manifested by an increased use of illicit drugs, higher incidence of mania, psychosis, or other manifestations of psychopathology. Data from naturalistic follow-up of clinical samples are limited by lack of appropriate controls and self-selection biases that are difficult to account for.(3)

Despite this caution from the National Institute of Mental Health, the organization Children and Adults with Attention Deficit Disorder (CHADD) considers the MTA study to be definitive and conclusive. The CEO of CHADD, John Heavener, states:

The NIMH's Multimodal Treatment Study of Children with Attention Deficit Hyperactivity Disorder (MTA), released in the December issue of the American Medical Association's Archives of General Psychiatry, is the longest and most thorough study ever completed comparing treatments for AD/HD. The study found that medication alone, or medication in combination with intensive behavioral therapy, [was] significantly superior to other types of treatment. The MTA study is the first major clinical trial to look at childhood mental illness and the largest NIMH clinical trial to date.

These results allow the AD/HD community to move on from the ongoing debate about best types of treatment, and make real progress by ensuring that every individual with AD/HD is actually receiving the best type of treatment.(4)

The Center for the Study of Psychiatry and Psychology provides a different perspective. It lists numerous adverse effects of the use of stimulant medication in the treatment of ADHD. Peter Breggin, the founder of the Center, states:

Hundreds of animal studies and human clinical trials leave no doubt about how the medication works. First, the drugs suppress all spontaneous behavior. In healthy chimpanzees and other animals, this can be measured with precision as a reduction in all spontaneous or self-generated activities. In animals and in humans, this is manifested in a reduction in the following behaviors: (1) exploration and curiosity; (2) socializing, and (3) playing.(5)

Considering the number of children who regularly receive prescription stimulants, it must be asked why no children have been followed in clinical trials to examine the long-term outcomes of continuous medication. There have been an enormous number of studies of ADHD (at one time called hyperkinetic disorder), as shown in Figure 1. There have also been a large number of studies on the use of Ritalin® (Figure 2).

The use of stimulant medication also shows racial and gender disparities. The prevalence rate for males over females varies from 9:1 to 6:1.(6) Caucasian boys are 2.5 times more likely to receive medication than African-American boys, and the disparity increases as the children age: Caucasian high school students are 5.2 times more likely to receive stimulant medication than African-American students.(7) To the best of our knowledge, no study examines the reasons for this disparity. Socioeconomic differences may also exist among the children diagnosed with ADHD, although this, too, does not appear to have been studied.

This paper gives a qualitative analysis of the studies of the use of Ritalin® to determine trends in clinical trials and methods and to examine the extent of knowledge of the treatment for ADHD. First, we examine the improper use of statistical methods that leads to incorrect conclusions. Then, we discuss the MTA study and the details of its design.

Statistical Problems

Most studies on the use of Ritalin® in children tend to have a very small sample size and a very short time period. If the most advertised reason to use Ritalin® is to improve school performance, studies should be long enough to measure actual school achievement. Instead, artificial tasks are often given to the children with no acknowledgment that children often identify the artificial nature of the task and modify their behaviors accordingly.

Consider a study by Sykes et al. published in 1972:

Using a double-blind crossover design, the effect of methylphenidate on the performance of 23 hyperactive children on four tasks measuring different aspects of attention was investigated. While receiving methylphenidate the hyperactive children showed a significant improvement in all aspects of their performance which had, in comparison to a control group of normal children, been initially impaired. Furthermore, methylphenidate produced a significant improvement in performance in those behaviors which had not been initially impaired.(8)

However, the method involved the use of push-button tasks to receive a stimulus card. Certain weak aspects of this study are continually reproduced in other studies: the small sample size, and the short period of time over which the children were studied (the administration of four tests of attention, plus four days to set medication doses). Moreover, the children were given artificial tasks to perform for the purpose of measuring their performance, and they were never asked about their interest in such artificial activities.

Next consider a study by Swanson et al. published 26 years later:

In a randomized double-blind crossover study of children with attention deficit hyperactivity disorder (ADHD), the time course effects of four doses of Adderall (5,10,15, and 20 mg) and an inactive (placebo) control, and an active (Ritalin) control were evaluated. A laboratory classroom setting was established in which subjective (teacher ratings of deportment and attention) and objective (scores on math tests) measurements were taken every 1.5 hours across the day.(9)

Again, the sample size was small, with 36 children, and the attention tasks were artificial. The study continued for 7 weeks. In this case, children were required to spend 8 hours on a weekend taking math tests. Most children, particularly after a full week of school, would be totally unwilling to spend 8 hours working math tests; this would be boring and would lead to considerable passive-aggressive non-compliance. The children themselves did not volunteer for the task; they were "volunteered" by parents and medical researchers. It would appear that children on Ritalin become more docile in performing inane tasks.

Lack of Power

A power analysis is routinely performed to determine the sample size needed to establish conclusive outcomes.(10) For example, to identify a difference between two treatments (or a treatment and a placebo), a sample size of at least 25 is required for the statistical analysis to find a 40 percent difference in outcomes. To find a 20 percent difference in outcomes between two treatments, at least 58 subjects are needed, evenly divided between the two groups. To identify as little as a 10 percent difference, at least 250 subjects are needed. Yet the majority of studies examined concerning the use of Ritalin® used small samples. The tendency is to under-power studies. Consider the following conclusion:
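The arithmetic behind such sample-size requirements can be sketched with the standard normal-approximation formula for comparing two proportions. The exact numbers depend on the assumed baseline response rates, significance level, and power, so this illustrative sketch need not reproduce the cited figures exactly:

```python
import math
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Subjects needed per group to detect a true difference between
    response rates p1 and p2 (two-sided z-test, normal approximation)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)   # critical value of the test
    z_beta = nd.inv_cdf(power)            # quantile for the desired power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# Shrinking the detectable difference sharply increases the required n:
# n_per_group(0.5, 0.1) < n_per_group(0.5, 0.3) < n_per_group(0.5, 0.4)
```

The key qualitative point survives any choice of assumptions: halving the difference one wishes to detect roughly quadruples the required sample size, which is why studies of a few dozen children can detect only very large effects.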

Methylphenidate improved attentional performance for children who had poor predrug scores on the vigilance task, but did not produce a statistically significant change on the scores of children with normal predrug performance.(11)

With only 45 children in the above study, the difference in outcomes would have to be greater than 20 percent for the study to be able to conclude that a difference exists. Yet it concluded that the result was not statistically significant, when all that can be said is that any difference was too small for a sample of this size to detect. Because of the small sample size, the study is, in fact, inconclusive.
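To see concretely why a nonsignificant result from a sample this small is inconclusive, one can simulate underpowered trials. The 50 percent and 30 percent response rates below are purely hypothetical figures chosen to represent a true 20-point difference:

```python
import random
from statistics import NormalDist

def two_prop_p(x1, n1, x2, n2):
    """Two-sided p-value for a z-test comparing two proportions."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = (p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)) ** 0.5
    if se == 0:
        return 1.0
    z = abs(x1 / n1 - x2 / n2) / se
    return 2 * (1 - NormalDist().cdf(z))

def simulated_power(p1, p2, n_per_arm, trials=2000, seed=1):
    """Fraction of simulated trials in which a true p1-vs-p2
    difference is detected at p < 0.05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x1 = sum(rng.random() < p1 for _ in range(n_per_arm))
        x2 = sum(rng.random() < p2 for _ in range(n_per_arm))
        hits += two_prop_p(x1, n_per_arm, x2, n_per_arm) < 0.05
    return hits / trials

# With a true 20-point difference (50% vs. 30% response) and roughly
# 22 subjects per arm, most simulated trials fail to reach p < 0.05.
```

Under these assumptions the small trial detects the real difference only a minority of the time, so a "not significant" result says almost nothing about whether the treatments differ.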

Introduction of Bias

Bias can be introduced into any study by the manner in which subjects are recruited, by the choice of outcome measures, and by the methods and parameters used for measurement. This is emphasized in a paper by Forness et al.:

Optimal response to this drug was determined in double-blind, placebo, crossover trials, and measurement of response focused on procedures similar to those in actual practice. Response ranged from approximately 18 to 71 percent across the six measures, suggesting that whether a child can be considered a responder to methylphenidate depends greatly on choice of outcome measure.(12)

Rating scales are important to the diagnosis of ADHD, and also to the examination of outcomes. If the raters are aware of the type of treatment, the outcomes can be biased by that knowledge. For this reason, a placebo control needs to be in place. However, many of the studies on Ritalin® did not use placebos or blinding. For example, Kempton et al. report a study concluding that "ADHD is associated with deficits in executive function. Stimulant medication is associated with better executive function performance."(13) The procedure included the following:

The ADHD children were tested at the Maroondah Hospital Child and Adolescent Mental Health Service. Testing took approximately 2 to 3 hours and was conducted over two sessions: during the first session, a trained clinical psychologist administered the achievement tests to the child. The computerized neuropsychological tests were administered in a second session. The control children were assessed in their own homes and usually received both forms of testing on the same day.

Hence there was an immediate difference in the way the two groups were tested (at home vs. at a hospital), and no blinding was used in the experiment. It also stands to reason that the control children would be more relaxed in their own homes, with a greater ability to focus in the familiar environment; hospitals can make many people, particularly children, jittery and nervous.

Reliance on a p-value of 0.05

The p-value is widely used with little real understanding of its statistical meaning. Consider the p-value in a slightly different context. In a criminal trial, a defendant is presumed innocent until proven guilty. The assumption of innocence represents a null hypothesis, the status quo. It is not the hypothesis the prosecutor wants to conclude, but the one he must start with. The jury must determine whether the evidence is beyond a reasonable doubt. In other words, assuming the defendant is innocent, how likely is the actual evidence to occur? If that chance is less than 0.05 (the working p-value), then the obvious conclusion is that the null hypothesis is false, and the defendant is declared guilty. In a civil case, only a preponderance of evidence is necessary (raising the working p-value to, say, 0.10).

P-values represent just one measure of difference. Without a statistical power analysis to demonstrate that the number of patients is adequate, a p-value greater than 0.05 yields an inconclusive outcome, not evidence of no difference.(10) Nevertheless, investigators frequently conclude that there is no difference between groups when their small samples fail to yield a p-value of 0.05 or less. In addition, not all statistical tests lend themselves to power analysis.

The p-value of 0.05 is not magical. It means that the probability of making an error when rejecting a null hypothesis is less than 0.05. The standard null hypothesis tested is that there is no difference in outcome between a group of patients receiving a treatment and a group not receiving it. The p-value is the chance of getting the observed value if the null hypothesis is true. Although most research papers do not specifically identify the null hypothesis, it is always understood in the context of a statistical test. A comparison among several different treatments may also be considered. Such a test, called analysis of variance, is valid when the study is randomized and the patient group is sufficiently large. The study of multiple treatments is not particularly valid when the study is observational only; there are too many possible confounding factors that can distort the outcome.

If an error rate of 5 percent (p-value of 0.05) is acceptable, why not 6 percent or 5.5 percent? Acceptable error should vary with the need. In a life-threatening illness, a 10 percent chance that the medication is not effective could be acceptable. However, in a relatively non-threatening illness, only a 0.1 percent chance that the medication will have very serious side effects may be acceptable.

Still, the main problem with using a p-value to come to a conclusion is that the conclusion can rarely be applied to an individual. To show why conclusions driven by p-values cannot be applied to individual patients, consider the following exercise of applying a screening test, such as a child behavior rating scale, to a single child:

Suppose that a hypothetical screening test has a sensitivity of 95 percent and a specificity of 95 percent, and that the prevalence of the illness is 5 percent. Then, by the standard Bayes' formula, the probability that a child with a positive result actually has the illness is:

P(illness | positive) = (0.95 × 0.05) / (0.95 × 0.05 + 0.05 × 0.95) = 0.50

Even though the test has very high sensitivity and specificity, there is only about a 50-50 chance that the child will be correctly diagnosed. Therefore, the application of the screening test to any one individual is questionable without additional verification. Yet additional verification does not exist in the diagnosis process for ADHD. One study demonstrated that 57 percent of the children using stimulant medications are in fact false positives without ADHD.(14)
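The Bayes calculation for the hypothetical screen can be checked directly. Here the sensitivity is assumed to equal the stated 95 percent specificity, which is the assumption that reproduces the quoted 50-50 result:

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(illness | positive screen), by Bayes' formula."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# At 95% sensitivity, 95% specificity, and 5% prevalence, the value is
# 0.5: at low prevalence, even an excellent screen is a coin flip.
```

Raising the assumed prevalence raises the predictive value sharply, which is exactly why a screen that performs well in a specialty clinic can mislabel many children when applied broadly.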

Human beings are very complex. It is next to impossible to reduce this complexity to one score or risk factor. However, this reduction is frequently done in medical research because statistical models are used that exclude the possibility of examining complexity.

Surrogate Endpoints

Because some results are difficult to examine in the short term, surrogate endpoints are often used. Temple's definition of a surrogate endpoint is clear:

A surrogate endpoint of a clinical trial is a laboratory measurement or a physical sign used as a substitute for a clinically meaningful endpoint that measures directly how a patient feels, functions or survives. Changes induced by a therapy on a surrogate endpoint are expected to reflect changes in a clinically meaningful endpoint.(15)

Fleming and DeMets discuss problems with the use of surrogate endpoints:

A correlate does not a surrogate make. It is a common misconception that if an outcome is a correlate (that is, correlated with the true clinical outcome) it can be used as a valid surrogate endpoint (that is, a replacement for the true clinical outcome). However, proper justification for such replacement requires that the effect of the intervention on the surrogate endpoint predicts the effect on the clinical outcome --- a much stronger condition than correlation.(16)

One of the primary reasons to prescribe Ritalin® to children with ADHD is to improve their academic achievement. However, achievement is a long-term outcome that is often difficult to measure. One of the few attempts to examine the true outcome was performed with only 13 subjects, making the study extremely under-powered:

This study is a continuation of the J. L. Alto and W. Frankenberger (1995) study that reported the effects of Ritalin on academic achievement from 1st to 2nd grade. The objectives of the current study were to identify the long-term effects of Ritalin on cognitive ability and academic achievement. A retrospective/longitudinal design was utilized with dependent measures being scores from the Iowa Test of Basic Skills (ITBS). The study included 13 experimental Ss [students] who were identified with attention deficit hyperactivity disorder (ADHD) and placed on Ritalin between first and second grade (ages 9-11). For each experimental child, a contrast child without ADHD was matched based on gender, Verbal IQ score, and family structure. Results of the study revealed that, generally, the Ritalin group's cognitive and achievement scores were lower before medication and the groups tended to diverge after medication was administered.(17)

However, although this study used actual rather than surrogate markers, the sample size was far too small to be conclusive. In addition, two factors were tested simultaneously, and it is difficult, if not impossible, to determine whether any difference is due to the use of Ritalin or to the ADHD itself. If the two groups diverge after the use of medication, this study actually indicates that Ritalin has no real impact on academic achievement. However, such a conclusion can only be tentative because of the small sample size. A far better design would have randomized children with ADHD into Ritalin versus placebo groups, with follow-up and a sufficient number of subjects.

Improper Use of Repeated Measures

The greater the sample size, the greater the number of degrees of freedom, and the greater the likelihood of finding a significant p-value. The total number of degrees of freedom in an analysis of variance is the number of observations minus one. If measurements can be taken on the same subjects any number of times, it becomes relatively simple to inflate the number of degrees of freedom by taking as many measurements as possible. Enough measurements can be taken on the same subjects to virtually guarantee a significant p-value to demonstrate that the treatment is effective. The most extreme example is the use of one subject with multiple time measurements:

A single-case alternating treatments experimental design was employed for a total of 82 days. The dependent variable was the Conners' Abbreviated Symptom Questionnaire. Antecedent exercise failed to reduce hyperactive behavior. Methylphenidate produced significantly less hyperactive behavior than both placebo and antecedent exercise (p=0.0238).(18)

It is not possible to use statistical methods with one subject. The statistical tests all depend upon comparisons of variability, and variability implies a minimum of two subjects.

Swanson's study, discussed previously, made a similar error. With 30 children completing the study, multiple measurements on multiple days for each subject resulted in up to 812 degrees of freedom in the statistical model. The most possible with 30 children is 29 degrees of freedom. The result of this inflation in the number of degrees of freedom is a p-value of 0.001. This problem is clearly stated in Milliken and Johnson:

Replication of a treatment is an independent observation of the treatment, and thus two replications of a treatment must involve two experimental units. It is very important that this definition be observed during an experiment. Too often researchers use duplicate or split samples to generate two observations and call them replicates, when, in reality, they are actually subsamples or repeated measures. They certainly are not replicates. The variation measured by subsamples is an index of the within-[unit] variation and not experimental-unit-to-experimental-unit variation. Thus, more and more parts do not aid in detecting differences between the means. One test for determining whether a part of an observation is a true replication is the following: If the researcher could have just as easily obtained more "replications" by splitting, then she is not obtaining true replications but is obtaining subsamples or repeated measures. Thus results from [analysis of variance measures] constructed by using the error variance computed from subsamples will be much larger than they should be, leading the experimenter to determine more differences as being significant than she should.(19)
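The effect Milliken and Johnson describe can be illustrated with a short simulation: ten hypothetical subjects, each measured thirty times, where the repeats differ only by small within-subject noise. Treating the 300 readings as 300 replicates shrinks the apparent standard error far below what ten true replicates support:

```python
import math
import random
import statistics

random.seed(0)

n_subjects, n_repeats = 10, 30
# Each subject has one stable underlying score...
true_scores = [random.gauss(0.2, 1.0) for _ in range(n_subjects)]
# ...and the repeated measures add only small within-subject noise.
observations = [s + random.gauss(0.0, 0.1)
                for s in true_scores for _ in range(n_repeats)]

# Correct unit of analysis: one mean per subject (10 true replicates).
subject_means = [statistics.mean(observations[i * n_repeats:(i + 1) * n_repeats])
                 for i in range(n_subjects)]
se_correct = statistics.stdev(subject_means) / math.sqrt(n_subjects)

# Pseudoreplicated analysis: all 300 readings treated as independent.
se_inflated = statistics.stdev(observations) / math.sqrt(len(observations))

# se_inflated is several times smaller than se_correct, so any
# t-statistic built on it looks spuriously significant.
```

The subsamples tell us how noisy each subject's measurements are, not how subjects differ from one another, which is the comparison a treatment effect requires.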

Similarly, Biederman et al. used paired t-tests to examine repeated measures, even though outcomes were collected at baseline, 1 year, and 4 years.(20) Using multiple t-tests only inflates the potential for error, a problem usually ignored in such studies. In addition, the study was not randomized; instead it compared children with ADHD to identified "normals." The baseline scores indicate that the "normals" were functioning at a higher cognitive level than the children with ADHD, with an average difference in IQ of 13 points.

The Multimodal Treatment Study of Children with ADHD (MTA) Study

The MTA was considered a long-term study; follow-up of patients continued for 14 months after enrollment in the trial, and the sample size was large enough to provide sufficient power. The design of the study is described at length in Arnold et al.,(21) with considerable discussion in Breggin.(22) One of the major problems with this study was that medication treatment lasted for the full 14 months, but psychosocial treatment was substantially truncated beyond the first six months of enrollment.(23)

The study demonstrated that parent and teacher ratings improved more with medication than with behavior therapy. However, blinded raters did not indicate any difference between the two groups. Parents and teachers not only knew which treatment the children received, but were also active participants in the training aspects of the treatment. This disagreement between blinded and unblinded raters calls the outcome of the study into question, particularly since no other measure indicated a difference in outcome between medication and behavioral treatment. Because of the influence of such knowledge on outcomes, it has been suggested that all parents have an initial double-blinded trial to determine the effectiveness of treatment.(24) It is very troubling that the MTA study did not blind the raters.

Most importantly, the study did not demonstrate an improvement in academic performance as measured by standardized tests. Although this lack of significance (p-value greater than 0.05), as well as the lack of significance in the blinded ratings, is listed in the table of findings, it is not specifically discussed in the results paper.(25,26)

However, the results of the MTA study are considered by many to be definitive. For example, Jensen states: "Does this mean that the MTA results or conclusions are changing, or will change with additional analyses? Certainly not."(27) The improvement based on rating scales is considered a sufficient measure of success, independent of any other measure.(28) Yet none of the subsequent papers mention the absence of a statistical difference in the standardized measures of academic performance, or in the ratings provided by the blinded raters. The researchers do not give this absence of a statistically significant difference adequate importance.

Conclusion

Since 1966, clinical investigations into the use of Ritalin and other drugs to treat ADHD have varied only a little. With the exception of the MTA study, the clinical trials have used small samples and short-term follow-up. The major outcome examined is that of parent and teacher ratings with no blinded verification. The lack of long-term follow-up means that there is no real indication that the use of medication improves academic achievement. Although several explanations for this lack of progress in the clinical study of ADHD have been outlined above, it remains difficult to understand the snail's pace of progress and the continued use of medication treatment in the face of so little evidence of effectiveness, apart from the flawed use of non-blinded rating scales. If no study demonstrates improved academic achievement, support organizations should not use this outcome as a primary reason to medicate children.

It would be beneficial to see more long-term studies of ADHD treatments, including comparative studies of non-medication treatments. It would also be beneficial to conduct studies examining possible socioeconomic differences among children diagnosed with ADHD, and possible corresponding differences in treatments. There seems to be a disparity in the diagnosis of ADHD based on race, and it would be useful to have studies analyzing the reasons for this disparity.

Dr. Barnes is a Professor of Mathematics at the University of Louisville in Louisville, Kentucky. His e-mail is: george.barnes@louisville.edu. Dr. Cerrito is a Professor of Mathematics and Biostatistician at the University of Louisville and is also affiliated with the Jewish Hospital Center for Advanced Medicine in Louisville, Kentucky. Her e-mail is: cerrito@louisville.edu. Dr. Levi is an Associate Dean and a Professor of Mathematics at the University of Louisville also. Her e-mail is: inessa.levi@louisville.edu.