[Larry Light is Director of Research at Batten, Barton, Durstine & Osborn, Inc.]

A variety of copy testing systems are available for pretesting commercials. Most of these systems can be described as falling into one of two major categories: systems primarily designed to measure communication and systems primarily designed to measure persuasion. Over the years, there has been considerable controversy regarding which methodology offers the best measure of commercial effectiveness--i.e., which can validly predict the sales potential of a commercial. An extensive investigation of traditional testing techniques led to the conclusion that, relative to other measures, communication testing is the most reliable, valid, and practical measure of a commercial's effectiveness. However, communication testing is useful only when it is used as the final step in a disciplined copy development system.

INTRODUCTION: COMMUNICATION VS. PERSUASION

The purpose of advertising is to increase sales. And the purpose of the advertising test--specifically, the commercial test--is to produce a yes or no decision on the ability of a commercial to increase sales before that commercial is released for broadscale, on-air use. The question then becomes: What is the best way to test a commercial's effectiveness? What kind of a test will determine whether a commercial has the potential to stimulate purchases?

The theoretical answer is relatively easy to state. First, we need a technique which is reliable--i.e., one which is reproducible so that we can be confident that the results observed are due to the performances of the commercials being tested rather than random fluctuations in the system. We also need a technique which is valid. A valid technique will provide predictions that both correlate with real world observations and relate to future purchase behavior.

How do we assess whether a system is reliable and valid? Reliability (or lack of it) can be demonstrated without great difficulty once sufficient test-retest data have been collected. Validity, however, poses a problem because of the time, cost and complexity of conducting the type of controlled experiment necessary to verify a technique. How, then, are we to know whether or not a system is valid--whether it will accurately separate the good commercials from the bad?

A variety of methods are currently available for testing commercials. Some use the criterion of communication (brand name and/or copy recall), some use persuasion (increased brand interest), and still others use a combination of both; even then, there is usually heavier reliance on one measure or the other, depending upon the philosophy under which the system or the user of the system is operating. All systems claim to be reliable and all systems claim to be valid. Yet, among members of the advertising community there is certainly no agreement that all systems are reliable and valid...and therefore equally acceptable; nor is there agreement as to which one system is most reliable and valid.

Therefore, after many years of experimentation and discussion, the debate about commercial testing continues--specifically, which is a more accurate measure of commercial effectiveness: communication or persuasion?

The following pages represent the results of BBDO's comprehensive search to answer the "communication vs. persuasion" question. Our own experimentation with various techniques, combined with the experiences of our Clients, provided us with a considerable bank of primary data (over 2,300 separate commercial tests among more than 460,000 respondents). The purpose of this document is to share some of the findings that resulted from our review of the major types of commercial testing systems and to present some of the conclusions based on these findings.

HOW ACCURATE ARE COMMERCIAL TESTING SYSTEMS?

Reliability

How accurate are commercial testing systems? By accurate, we mean reliable and valid. Taking reliability first, the question, put simply, is: How sure can we be of a commercial's test score? Will a commercial initially judged as good (or poor), when retested on the same system, again be evaluated as good (or poor)? If not--if a scoring system involves unpredictable fluctuations that result from factors other than differences in the commercials being tested--then the results are highly suspect.

We approached the issue of reliability by first examining our experience with our own testing system. In the past, much of BBDO's commercial testing was conducted via its Channel One facility. This was an on-air system. Test commercials were run in the context of an actual, regularly scheduled TV program which the Agency purchased in one or two cities. Within three hours after the program broadcast, random telephone calls were made and program viewers were questioned regarding awareness and recall of the test commercial(s). In the seven years BBDO used Channel One, 633 commercial tests were undertaken--including 106 retest cases where the same commercial was tested twice. This is one of the industry's largest sources of test-retest information and therefore allows us to look with confidence at the reliability data developed.

In order to assess the reliability of our own Channel One technique, all 633 commercials were spread out on a scale from low-scoring to high-scoring. The result was a normal (bell-shaped) curve with an average awareness score of 33.4%. Next, we examined the 106 commercials that had been tested twice and found that the difference between the first and second test scores averaged 6.4 points. In other words, a commercial's second test score, on average, was likely to have varied up to 6.4 points, either above or below its first test score. By taking the test-retest variation of ±6.4 points and superimposing it back on the distribution of all Channel One scores, we were able to mark out an area of uncertainty--the unreliability zone. Figure 1 shows the unreliability zone for the Channel One system--an area extending from 27.0% to 39.8%, around the 33.4% mean.

What were the implications of this finding? The evaluation of commercials scoring either below or above the 27.0% to 39.8% unreliability zone would not be expected to change when retested. For example, if a commercial scoring above 39.8% on its initial test were retested, and the second score dropped the average 6.4 points, that commercial would still fall on the positive side of the mean. The same reasoning holds for a poor commercial whose original score was lower than 27.0%. Even with an increase of 6.4 points in a retest, it would still remain on the negative side of the distribution. But a score within the unreliability zone, differing by as much as ±6.4 points, could be expected to swing from the "good" side of the mean to the "poor" side in any test-retest situation. Therefore, only those scores falling outside the unreliability zone could be used with confidence for a go/no-go decision. And those commercials--i.e., those falling outside the unreliability zone--represented 50% of all commercials tested. Only 50% of all commercial tests yielded a definitive decision, as is shown in Figure 2.
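The unreliability-zone arithmetic described above can be sketched briefly. The 33.4% mean and ±6.4-point test-retest variation are the article's figures; the normal model, and the hypothetical standard deviation chosen so that roughly half of all scores fall outside the zone (matching the 50% figure), are assumptions for illustration only.

```python
# Sketch of the Channel One unreliability zone. Mean and retest
# variation are from the article; sigma is a hypothetical value.
from statistics import NormalDist

mean_score = 33.4        # average Channel One awareness score (%)
retest_variation = 6.4   # average test-retest difference (points)

lower = mean_score - retest_variation
upper = mean_score + retest_variation
print(f"Unreliability zone: {lower:.1f}% to {upper:.1f}%")  # 27.0% to 39.8%

# Under a normal model, the share of scores outside the zone depends on
# the distribution's spread; sigma = 9.5 is a hypothetical choice that
# yields roughly the 50% share reported in the article.
sigma = 9.5
dist = NormalDist(mean_score, sigma)
outside = 1 - (dist.cdf(upper) - dist.cdf(lower))
print(f"Share of scores outside the zone: {outside:.0%}")
```

Only scores landing outside the printed bounds would support a confident go/no-go decision.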

PERCENTAGE OF COMMERCIALS FALLING OUTSIDE THE CHANNEL ONE UNRELIABILITY ZONE

How do other testing techniques compare in terms of statistical reliability? Following the reliability analysis of Channel One data, BBDO conducted the same analysis where sufficient data were either supplied by our Clients or available from our own files. We examined two kinds of measures: (1) communication data and (2) persuasion data. The communication data could be further subdivided into conditions of exposure:

- Natural (on-air), and

- Forced-exposure (theater)

Similarly, persuasion data could be further subdivided; for while measures of persuasion most commonly utilize forced exposure viewing, one technique we reviewed involved attitude shift (pre to post) and the other used behavior data. The two methods of data collection for persuasion, then, are:

- Pre-post attitude shift (theater), and

- Behavior (coupon redemption)

Table 1 summarizes the results obtained after examining various testing techniques for reliability. Also shown in the table is the time interval used by each system between exposure of a commercial and measurement of effects.

What did we learn from the reliability analysis? Three very important observations resulted: (1) Commercial systems, in general, have a wide unreliability zone. (2) Some measuring techniques are statistically more reliable than others. (3) The critical difference separating the more reliable systems from the others analyzed is the delay factor between time of exposure to the commercial and time of measurement. Thus, 24-hour on-air recall and 72-hour forced exposure recall can be viewed as the most reliable testing systems of those we studied.

Validity

If we consider that the basic purpose of advertising is to produce sales, then the basic aim of the ideal commercial test is to give the decision-maker an accurate prediction of the selling effectiveness of his commercial. A valid measuring system must meet two criteria:

- Predictions should correlate with real world observations.

- What is measured should relate to future purchase behavior.

As a first step, we can get some idea of comparative validity among various systems by investigating the claim, made by each commercial testing system, to represent the most valid way of testing commercials. In other words, if all testing systems are valid, then regardless of which is used, a good commercial on one system should also be a good commercial on another and vice versa. Since sales results attributable to commercial exposure are limited, we began by looking at the question of validity in terms of the comparative ability of various testing systems to agree as to how good or bad the same commercial is.

To accomplish this, we collected the experience BBDO and its Clients have had in testing the same commercial on different commercial testing systems. Due to the great variation among the different systems, we classified each commercial's test result in terms of its score relative to the norm for the particular system used to evaluate it. The test data were divided into three levels of classification: commercials characterized as good, as average, and as poor.
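The norm-relative classification just described can be sketched as a small function. The three labels are the article's; the cutoffs (one band-width above or below the system's norm) are a hypothetical choice, since the article does not state the exact thresholds used.

```python
# Sketch of norm-relative classification: each score is judged against
# its own system's norm, so results from different systems can be
# compared. The band width used as a cutoff is hypothetical.
def classify(score: float, norm: float, band: float) -> str:
    """Label a score good/average/poor relative to its system's norm."""
    if score > norm + band:
        return "good"
    if score < norm - band:
        return "poor"
    return "average"

# e.g. against the Channel One norm of 33.4% with a 6.4-point band:
print(classify(45.0, norm=33.4, band=6.4))  # good
print(classify(30.0, norm=33.4, band=6.4))  # average
print(classify(20.0, norm=33.4, band=6.4))  # poor
```

Classifying every result this way is what makes "good on system A, poor on system B" a meaningful comparison.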

Validity: On-air communication vs. theater communication. First we compared results from on-air tests vs. theater tests--using communication as the criterion measure. On-air testing has the advantage of being a real-world observation in that respondents view the commercial in a natural, at-home setting and are unaware at the time of exposure that questioning will occur. The price paid for this advantage is the cost of a finished film (unfinished film may be used in theater testing) plus the media expenses incurred. Our question was: Does the same evaluation occur when two different techniques are used to measure a commercial on a single criterion--communication? Our case histories consisted of 31 observations in which a given commercial was evaluated on two systems--on-air and theater. The result: in 24 out of 31 instances, the evaluation of a commercial on one system was inconsistent with its evaluation on another system. That is, 77% of the time, an advertiser would have come to a different decision about the communication effectiveness of a commercial, depending on the system chosen to aid in the evaluation.

We next analyzed the results of 11 commercials which had been measured for communication within one on-air system (24-hour recall) and one theater system (immediate recall). The correlation of on-air recall to theater recall was +.06. Not very encouraging!

Finally, comparing the results of on-air recall (24-hour) to theater recall when delayed (72-hour), for eight separate commercials, we found a correlation of +.59. Delay, again, seemed an important variable. Its use in the theater test had the effect of bringing the two systems (on-air and theater) closer together.

Validity: Communication vs. persuasion. Since 24-hour communication data from on-air tests were inconsistent with immediate recall data from theater tests, we next examined persuasion measures. Theater testing, after all, was developed primarily to provide a reading on the persuasiveness of a commercial. This reading, in most theater testing systems, is the pre-post shift--i.e., the relative ability of a commercial to generate increased interest in a given brand. Thus, proponents of the theater testing system generally acknowledge that memorability of commercial elements under forced exposure conditions may not correlate perfectly with memorability under natural viewing conditions. However, an effective commercial must do two things: it must communicate something about the brand (if only the brand name), and it must persuade. Consequently, communication levels are looked at diagnostically (certain minimum levels of communication are required) and persuasion tends to be regarded as the primary criterion measure.

Realizing the different premise behind each system, however, does not help to answer our basic question. On-air communication techniques claim to be valid and theater persuasion techniques claim to be valid. Our next step, then, was to compare the results of both techniques. Was there any relationship between a high-scoring commercial on one system and a high-scoring commercial on another system? Would both systems agree on what is a good commercial and what is a poor one?

Three separate analyses suggest that the two measures, in fact, produce different conclusions regarding commercial effectiveness.

1. One Client tested each of eight commercials via on-air (to produce a 24-hour recall score) and via theater (to produce an immediate pre-post attitude shift score). The results from the two systems, measured against their respective norms, did show a correlation, but a negative one, of -.60.

2. Next we examined additional pairs of tests in which a given commercial was tested for both on-air recall and theater pre-post shift (a different system than the one discussed under Point 1). Here the correlation of on-air recall to pre-post shift was +.26.

3. These findings are very much in line with a study reported by Young (1972) which compared the relationship of 15 recall scores to 15 attitude shift scores. In this case, both sets of data were derived from on-air tests; still, the resulting correlation of +.05 indicated no relationship.

The conclusion: there is no positive correlation between recall and persuasion; the two seem to be measuring different effects.

Validity: Is an on-air persuasion measure feasible? Can a commercial's effectiveness be predicted through the use of an on-air persuasion measure? An on-air measure would meet the criterion of real-world observation and have the advantage of combining two kinds of data: brand preference and communication. Unfortunately, an on-air persuasion measure is truly "easier said than done."

BBDO conducted a study to determine whether a change in reported brand preference could predict the effect of one commercial on future purchase. Overall, approximately 3,500 respondents were questioned about 16 product categories. A test group and a control group were drawn from the same cities. The test group was screened for program and commercial viewership; the control group saw neither program nor commercials. Both groups were probed for (1) brand preference and (2) last purchase. After a three-week interval, respondents were called back and asked about brands actually purchased. A comparison was then made of the brands they originally said they would buy and those they subsequently reported having bought. Only consumers having purchased a brand of a product in the interim between interviews were included in the analytical sample.

The study found the most important factor in determining future purchase to be previous purchase. Eighty-three percent (83%) of future purchases were predicted by previous purchase. Fourteen percent (14%) of the purchases could not have been forecast by either past purchase or brand preference. And, importantly, only 3% of future purchases could have been forecast by brand preference alone. In other words, among the 17% who switched brands, the preference measure correctly predicted less than one out of five purchases.
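The "less than one out of five" conclusion follows directly from the three percentages reported for the callback study, as a short check shows. The 83%, 14%, and 3% figures are the article's; the rest is arithmetic.

```python
# Checking the callback-study arithmetic: of all future purchases,
# 83% were predicted by previous purchase, 14% by neither measure,
# and 3% by brand preference alone (figures from the article).
predicted_by_previous_purchase = 0.83
predicted_by_preference_only = 0.03

switchers = 1.0 - predicted_by_previous_purchase  # the 17% who switched
hit_rate = predicted_by_preference_only / switchers
print(f"Switchers: {switchers:.0%}")                           # 17%
print(f"Preference hit rate among switchers: {hit_rate:.0%}")  # ~18%
```

Three points out of seventeen is about 18%, which is indeed less than one purchase in five.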

A BBDO Client tested over 100 commercials in an effort to validate on-air preference measures. Results showed that one exposure to an average commercial converted 4% of non-users. The preference measures (both lottery and constant sum were used) again detected less than one out of five of those people who were actually converted to trial of the test brand. What does this mean to the advertiser? If his brand has 20% usage, then one exposure of his commercial will convert 4% of the remaining 80% of households (4% X 80%)--or 3.2% of the households. Using an on-air persuasion test, he will validly predict the change in behavior of one out of five of the converted households (20% X 3.2%)--or 0.64% of the households. To do this reliably would require a sample size of 25,000.
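The conversion arithmetic in the paragraph above can be laid out step by step. The 20% usage, 4% conversion rate, and one-in-five detection rate are the article's figures; the sample-size conclusion is simply restated in the final comment.

```python
# The brand-conversion arithmetic, step by step (figures from the
# article). Shows why an on-air persuasion measure is uneconomical.
brand_usage = 0.20       # households already using the brand
conversion_rate = 0.04   # non-users converted by one exposure
detection_rate = 0.20    # share of converts a preference measure detects

non_users = 1.0 - brand_usage            # 80% of households
converted = conversion_rate * non_users  # 4% x 80% = 3.2%
detected = detection_rate * converted    # 20% x 3.2% = 0.64%

print(f"Converted households: {converted:.1%}")              # 3.2%
print(f"Detectable via preference measure: {detected:.2%}")  # 0.64%
# Measuring an effect this small reliably would require a sample
# on the order of 25,000 respondents.
```

An effect visible in only 0.64% of households is the reason the article calls the on-air persuasion measure "easier said than done."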

Is an on-air persuasion measure feasible? After reviewing our own experience with traditional measurements, the answer was negative. Consumers bring to any viewing situation their own brand loyalties, brand images, biases, etc. Even if a commercial is remembered, one real world exposure produces very small delayed changes in attitude and/or purchase behavior. These changes, after only one exposure, are too small to measure economically.

Validity and sales. What have we learned so far? Traditional on-air persuasion measures are not feasible--at least not economically feasible. Results from communication tests vary depending on whether the methodology is on-air exposure or forced exposure. Communication and persuasion measures-regardless of exposure conditions--do not correlate. In short, different systems produce different conclusions. And, if two systems come to different conclusions, they cannot both be correctly predicting the same thing--in this case, potential sales effectiveness.

The sales test, under controlled conditions, must therefore be the final arbiter of the "communication vs. persuasion" question. Unfortunately, because of the difficulty, time, and expense of conducting such tests under carefully controlled conditions, there is no massive body of data available to provide a ready answer. However, some case histories do offer noteworthy insights.

One Client attempted to determine the validity of on-air testing. Four commercials were tested for 24-hour recall; two were classified as high-scoring commercials and two were average-scoring efforts. Each commercial was subsequently run, over time, in a controlled experiment and changes in sales were measured. Table 2 shows the results: the high-scoring commercials produced 2 to 3 times greater increases in sales than the average-scoring commercials.

The on-air tests of these commercials also included a persuasion measure, using a pre-post design. As indicated in Table 3, the persuasion measure proved less predictive of sales than the communication measure.

Another case history involves two food commercials--commercials Y and Z. Both were on-air tested for recall, and each was judged to be average. Subsequently, the two commercials were tested via three different theater systems and evaluated against the appropriate norm. On Theater System #1, though both commercials continued to perform equally, they were regarded as high-scoring efforts. Theater System #2 broke the tie; commercial Y, with a low score, was judged inferior to commercial Z which scored high. On Theater System #3, the results were reversed. Commercial Y, with a high score, was evaluated as superior to Z which scored at the average. Not knowing which was the "true" reading, the Client went to a sales test--using commercial Y in one market and commercial Z in another market. The result: no sales increase for the brand in commercial Y's test market and no sales increase for the brand in commercial Z's test market. As Table 4 shows, on-air recall testing had been predictive of sales in both cases.

As final evidence, a study by Bogart, Tolley and Orenstein (1970) examined the relationship between measures of sales response and measures of communication response. Although their experiment used newspaper advertising, the results are relevant to this discussion. Twenty-four (24) ads for packaged goods were measured for sales and recall within a 30-hour period. The conclusion, based on over 25,000 observations, was that respondents who could prove recall reported buying far more of the advertised brand than those who did not read anything. Further, purchase was directly correlated with degree of recall, as shown in Table 5.

PURCHASE OF THE ADVERTISED BRAND, AMONG THOSE WHO RECALL IT TO VARYING DEGREES

Taken together, the above data provide a strong argument favoring the validity of on-air communication testing. In all three cases, recall and sales were highly correlated.

What have we learned about copy testing systems? We first saw that delayed communication measures provided the most statistically reliable results. On-air testing is a delayed measurement system. In addition, it is conducted under natural (i.e., real) viewing conditions and it appears to have sales validity.

On-air communication, therefore, seems to offer the advertiser the best predictive tool when it is used as the final step in evaluating commercials before broadscale on-air use.

CONCLUSION: PERSUASION AND COMMUNICATION

Are we then concluding that communication is a better tool for sales prediction than persuasion? Yes and no! First, we would suggest that the traditional question of persuasion vs. communication is inappropriate. Experience indicates that the issue requires two questions: (1) persuasion, when? and (2) communication, when? A persuasive idea which is not executed in a memorable and understandable way is as wasteful of advertising dollars as a memorable and understandable execution that says nothing motivating. We need both: persuasion and communication. But, instead of looking for a single, multi-purpose copy testing technique, what is needed is a reliable and valid copy development system. We need not one measure, but a series of systematically-related measures.

A disciplined copy development system is based on a sound philosophy of how advertising works. BBDO's philosophy is that effective advertising memorably communicates to prime prospects that the advertised product or service solves a problem they have with the category. At BBDO, this philosophy has been translated into four copy development steps:

The key to this system is knowing that a strategy is motivating before developing advertising executions. Valid procedures exist for accomplishing this objective. Persuasion, i.e., the ability of an idea to generate purchase interest, should indeed be measured. It should be measured each and every time a new concept is considered. But, persuasion should be determined before, not after, copy is written. Persuasion, then, should be measured at the strategy development stage.

Now, the question remains: Has the strategy been accurately translated in the advertising execution? Is the message clear and easy to understand? Several screening techniques are available to tell us which of various executional approaches does the best job in delivering the intended message.

Finally, on-air communication testing is the last step in a total copy development system. Why do we need this last step, if we already know the strategy is motivating and the execution communicates that strategy? We need on-air testing because it provides information on how a commercial performs in the environment in which it is to be used. On-air testing tells us whether a commercial memorably communicates the intended message in the name of the brand.

The validity of any particular technique must be evaluated in the context of the development system within which it will be used. The system that is adopted is a direct function of a company's philosophy regarding how advertising works. While there is considerable variation in corporate philosophies, there is no disagreement regarding what advertising is supposed to accomplish. Effective advertising, regardless of what it says or how it affects attitudes, must ultimately produce increases in sales. How can we discuss validity without discussing sales effectiveness? How can we discuss validity and sales effectiveness without measuring real world effects? Certain companies are investing in this kind of research--comparing measurement results to real world results. Without this kind of information, BBDO's review would not have been possible.