
23 Fair Empl. Prac. Cas. 909, 23 Empl. Prac. Dec. P 31,154. The Guardians Association of the New York City Police Department, Inc., the Hispanic Society of the New York City Police Department, Inc., Nydia I. Diaz, James Michael Hidalgo, Wilfred Cebellero, Andre Lopez, Reinaldo Salgado, Denise Santos, Deborah Holmes and Pamela Obey, Individually and on Behalf of All Those Similarly Situated, Plaintiffs-Appellees, v. Civil Service Commission of the City of New York, Department of Personnel of the City of New York, and the New York City Police Department, Defendants-Appellants, 630 F.2d 79 (2d Cir. 1980)

H. Elliot Wales, New York City, submitted a brief for Seven Civil Service Organizations as amicus curiae.

Before MANSFIELD and NEWMAN, Circuit Judges, and SIFTON,* District Judge.

NEWMAN, Circuit Judge:

This employment discrimination suit pursuant to Title VII of the Civil Rights Act of 1964, 42 U.S.C. § 2000e-2, once again requires this Court to venture into the complex realm of testing and test validation. The test at issue was designed by New York City officials and administered on June 30, 1979 to 36,797 applicants for positions on the City's police force. Plaintiffs are the Guardians Association of the New York City Police Department, Inc., an organization of Black police officers, the Hispanic Society of the New York City Police Department, Inc., an organization of Hispanic police officers, and eight individual Black or Hispanic applicants. Defendants are the New York City Department of Personnel, which performed much of the test preparation, the New York City Civil Service Commission, and the New York City Police Department. The United States District Court for the Southern District of New York (Robert L. Carter, Judge) found that use of the test unjustifiably discriminates against Blacks and Hispanics in violation of Title VII. Guardians Ass'n v. Civil Service Commission, 484 F. Supp. 785 (S.D.N.Y. 1980). The Court ordered a broad remedy, including a 50% minority hiring quota. We affirm the District Court's finding that the City's specific use of the test violates Title VII, but vacate the remedy and remand for entry of a revised decree.

The test in question, designated Exam No. 8155, was designed to select candidates for hiring as entry-level police officers. Those who pass the exam are selected, in rank order of their test scores, to complete the other aspects of the hiring process: a medical examination, a physical agility test, a psychological test, and a character investigation. These last four components of the hiring process are scored only on a pass/fail basis. Thus, an applicant's score on Exam No. 8155 is a major determinant of his prospects for becoming a police officer. It is also the only feature of the process alleged to have a discriminatory impact. Once an applicant scores high enough to be selected for the final four hiring steps and successfully completes those steps, he or she becomes a sworn police officer and enters the police academy for five months of training. While successful completion of the training program is a requirement of continuing as a police officer, the Department does not use the training program as a selection device, but anticipates that nearly all academy entrants will go on to active duty.

The exam was developed by a fairly elaborate two-stage process; the first stage was an analysis of the police officer's job, and the second was construction of the test itself. The job analysis consisted of five separate steps. First, the Department of Personnel identified 71 tasks that police officers generally perform, based on interviews with 49 police officers and 49 supervisors. Second, a panel of seven officers and supervisors reviewed the list to add any tasks that had been omitted, and to eliminate those items that were duplicative, or too specialized to be performed by entry-level officers. The result was a consolidated list of 42 entry-level tasks.

Third, a questionnaire was distributed to 5,600 police officers, requesting them to rate each of the 42 tasks on the basis of its frequency of occurrence, its importance, and the amount of time normally spent in performing it. The 2,600 responses that were received were then analyzed by computer to yield a ranking of the 42 tasks, according to the combined rating of all the responses. In addition, faculty members of John Jay College were asked to observe police officers during an entire tour of duty and record the tasks that they performed; their survey generally confirmed the identification of the 42 tasks.

In the fourth step of the job analysis, the Department of Personnel divided the list of 42 ranked tasks into clusters of related activities. Five such clusters were established: the arrest process, providing assistance to people, police operations, stationhouse activities, and handling unusual and other occurrences. The fifth step was an analysis of all five clusters, each one by a separate panel of police officers, to identify the "knowledge, skills and abilities" required to perform these tasks at the entry level, and to assign percentages reflecting the relative importance of each of the identified knowledges, skills, and abilities for the cluster as a whole. One panel listed five such qualities for its cluster, all of which are properly characterized as "abilities" or "skills" (hereafter referred to as "abilities"): recalling facts, filling out forms, understanding and applying statutory definitions of crimes, understanding written instructions and applying appropriate procedures, and human relations skills, including communication techniques. Each of the other four panels used the first panel's list of abilities, but developed its own percentages to express the relative importance of each ability to the tasks within its cluster.

The second major stage in developing Exam No. 8155, the process of test construction, consisted of four identifiable steps. First, the percentages of the five abilities necessary to perform each of the five task clusters were multiplied by the weightings that had been given to each task in Step 3 of the job analysis on the basis of frequency, importance, and time spent. This yielded a general measurement for the importance of each of the five abilities for performance of the job of police officer. As a result of this computation, the Department of Personnel concluded that on a test with 100 questions, 15 questions should test for the ability to recall facts, 9 questions for filling out forms, 14 questions for understanding and applying sections of the criminal law, 32 questions for understanding written instructions and applying appropriate procedures, and 30 questions for human relations skills. Next, a group of eleven police officers was selected to write multiple-choice questions that tested for the five abilities, as they related to the 42 identified tasks. The officers wrote many of these questions from Police Academy materials and similar sources, however, without having access to descriptions of the five identified abilities, or the 42 ranked tasks. In the third step, Department of Personnel staff members who did have access to the description of abilities and the ranking of tasks reviewed the questions written by the police officers to assure that the questions were not ambiguous, overly complex, overly specialized, or dependent on prior knowledge. As a result of this review, some questions were discarded, others were revised, and still others were added. Finally, the resulting questions were subjected to a further review by a panel of six police experts, and by various members of the Department of Personnel.
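The weighting computation described in the first step of test construction can be sketched as a weighted average followed by rounding to whole questions. This is only an illustration: the opinion does not report the per-cluster weights or the per-cluster ability percentages, so the inputs below are hypothetical, as are the function name `allocate_questions` and the largest-remainder rounding rule. The sanity check uses the one fact the opinion does give, the final 15/9/14/32/30 split: if every cluster had assigned exactly those percentages, any set of cluster weights would reproduce that split.

```python
import math

def allocate_questions(cluster_weights, ability_pcts, num_questions=100):
    """Distribute test questions across abilities.

    cluster_weights: {cluster: weight}, assumed here to aggregate the
        frequency/importance/time ratings of the tasks in each cluster.
    ability_pcts: {cluster: {ability: percentage}}, each cluster summing to 100.
    """
    total_weight = sum(cluster_weights.values())
    abilities = list(next(iter(ability_pcts.values())))
    # Weighted-average percentage of each ability across all clusters.
    importance = {
        a: sum(cluster_weights[c] * ability_pcts[c][a] for c in cluster_weights)
        / total_weight
        for a in abilities
    }
    exact = {a: num_questions * importance[a] / 100 for a in abilities}
    counts = {a: math.floor(exact[a]) for a in abilities}
    # Hand out any questions lost to rounding, largest remainder first.
    leftover = num_questions - sum(counts.values())
    by_remainder = sorted(abilities, key=lambda a: exact[a] - counts[a], reverse=True)
    for a in by_remainder[:leftover]:
        counts[a] += 1
    return counts

# Hypothetical inputs: the weights are invented, and every cluster is given
# the same percentages so that the result must equal the reported split.
reported = {"recall": 15, "forms": 9, "law": 14, "procedures": 32, "human relations": 30}
clusters = ["arrest", "assistance", "operations", "stationhouse", "unusual"]
weights = dict(zip(clusters, [3, 1, 2, 2, 2]))
pcts = {c: dict(reported) for c in clusters}
allocation = allocate_questions(weights, pcts)
```

With real data the clusters would disagree, and the more heavily weighted clusters would pull the allocation toward their own percentages.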

The test that resulted consisted of 100 multiple-choice questions, designed so that the candidate could answer correctly without knowledge of any information beyond what was provided on the test itself. The test materials were determined by the Department of Personnel to require an eighth-grade reading level, on the average, although the 14 questions on law required college-level reading ability. The estimated time for completing the exam was 1 1/2 hours, but 3 1/2 hours were allowed.

The first part of the exam, designed to measure the ability to recall facts, consisted of a page-and-a-half description of a burglary, and a series of 15 questions to be answered without referring back to the description. In the second part, testing ability to fill out forms, the candidates were given a simplified arrest form, and a page-long description of both a robbery and an arrested suspect, and then asked 9 questions about the proper entries to be made in filling out the form. Part three, intended to test ability to apply provisions of law, consisted of 14 questions, each briefly presenting the facts of an incident, and then requiring the candidate to identify the precise criminal offense involved on the basis of definitions provided in the test materials.1 The remaining 62 questions, of which 32 were intended to measure the ability to follow appropriate procedures and 30 were intended to measure human relations skills, consisted of general instructions as to procedures or appropriate responses for certain types of situations, a description of a specific situation, and then one or more questions asking the proper response to the situation presented. Three of the questions dealing with appropriate procedures, for example, involved the proper response to a bomb threat. Four of the questions in the human relations section involved the proper way to deal with a person who appears to be mentally ill.

The test was scored from zero to one hundred, with one point given for each correct answer, and bonus points given for veterans.2 The candidates were then rank-ordered on the basis of their scores. Scores were generally high, with 13% of the applicants scoring 98 or above, and fully 50% scoring 91 or above. Because of the number of candidates taking the test and the bunching of candidates in the upper range of scores, each point a candidate achieved made a substantial difference in his position on the rank-ordering list. More than 2,000 applicants achieved a score at each numerical grade from 92 to 97.

The passing grade was determined in the following manner. The Police Department first estimated that 4,000 police officers would be hired during the four-year period for which the eligibility list resulting from Exam No. 8155 would be valid. The Department further estimated that only one out of three applicants who passed Exam No. 8155 would successfully complete all the remaining steps in the hiring process. Therefore, if this eligibility list was to meet the Department's needs, 12,000 persons had to pass the exam to provide the 4,000 needed police officers. With all this in mind, the Department simply set the passing grade at the score achieved by the 12,000th highest scoring candidate, which turned out to be 94. Because of the bunching phenomenon, a large number of candidates, 2,124, achieved this same score, so that the actual number who received a passing grade was 13,749.
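The cutoff procedure just described can be sketched in a few lines. The opinion does not give a full score distribution, so this sketch assumes a single bucket for every score above 94, whose size (11,625) is derived by subtracting the 2,124 candidates tied at 94 from the 13,749 who passed. The function name `passing_cutoff` is hypothetical.

```python
def passing_cutoff(counts_by_score, needed):
    """Walk down from the top score until at least `needed` candidates
    are included; everyone tied at the cutoff score passes."""
    running = 0
    for score in sorted(counts_by_score, reverse=True):
        running += counts_by_score[score]
        if running >= needed:
            return score, running
    raise ValueError("fewer candidates than needed")

hires_needed = 4000           # Department's four-year hiring estimate
applicants_per_hire = 3       # one in three passers expected to finish the process
needed = hires_needed * applicants_per_hire

# Assumption: key 95 stands for all scores of 95 and above combined.
counts = {95: 13749 - 2124, 94: 2124}
cutoff, passing = passing_cutoff(counts, needed)
```

Because the 12,000th candidate falls inside the group tied at 94, the whole group passes, which is why the passing total exceeds the target by 1,749.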

Of the 36,797 applicants who took the test, 6,142 identified themselves as Black, 5,239 identified themselves as Hispanic, 19,798 identified themselves as White, and 4,847, or 13.2%, did not identify their race. Thus, identified Blacks constituted 16.7% of the total applicants, and 19.7% of all those who identified their race, while the equivalent percentages for identified Hispanics were 14.2% and 16.8%, and for identified Whites, 53.8% and 63.5%.

Of those who passed the exam, i. e., scored 94 or better, 7.6% had identified themselves as Black and 7.8% had identified themselves as Hispanic, for a known minority population in the passing group of 15.4%, against a known minority population in the applicant pool of 30.9%. In contrast, 66.6% of the passing applicants had identified themselves as White, although Whites comprised only 53.8% of the applicant pool.

Viewed in another and more revealing way, the figures show that, among those who had identified themselves by race, the passing rate for Whites was 45.9% compared to 17% for Blacks and 20.5% for Hispanics. The combined minority pass rate was thus about two-fifths of the pass rate for Whites.

The Police Department accepted 415 candidates from the list in November, 1979 and planned to hire another 380 in January, 1980. Of this group of 795 candidates, 89.2% were White, 3.5% were Black, and 6.8% were Hispanic. The selection rates (number chosen compared to number of applicants) for these first two uses of the list were 0.5% for Blacks, 1% for Hispanics, and 3.6% for Whites.

II. The District Court's Prior Proceedings and Decision

The plaintiffs filed their complaint in this suit, together with a motion for a preliminary injunction, in October, 1979, before any candidates had been accepted on the basis of the list. They charged that the intended use of the lists by the Police Department constituted discrimination against Blacks and Hispanics in violation of the Fourteenth Amendment, Title VII, and various other Federal and state laws. The District Court, by consent of the parties, consolidated the hearing on the preliminary injunction and the trial on the merits. This proceeding was held on November 13, 14, and 15, 1979, shortly after the Police Department's first use of the list. On January 11, 1980, three days before the Department intended to use the list to accept a second group of trainees, the District Court held a second hearing. That same day, the Court issued an opinion, which was subsequently re-issued in revised form on January 23.3

The Court's basic conclusion was that Exam No. 8155 violated Title VII. In reaching this conclusion, the Court used the common mode of Title VII analysis, in which the plaintiff is first required to establish a prima facie case on the basis of disparate impact, and then the defendant is required to rebut the plaintiff's case by proving that the disparity results from legitimate, job-related selection procedures. The Court first found that the disparity between the percentage of minority group members who achieved a passing score and the percentage of minority group members in the applicant pool was sufficient to establish a prima facie case. It based this finding of disparate impact on the standards developed by the Supreme Court in Castaneda v. Partida, 430 U.S. 482, 97 S. Ct. 1272, 51 L. Ed. 2d 498 (1977), and by the Equal Employment Opportunity Commission (EEOC) in its Uniform Guidelines on Employee Selection Procedures, 29 C.F.R. § 1607 (1979). Castaneda stated that, in cases involving large samples, "if the difference between the expected value (from a random selection) and the observed number is greater than two or three standard deviations," a prima facie case is established. 430 U.S. at 496 n.17, 97 S. Ct. at 1281 n.17.4 The Uniform Guidelines provide that "(a) selection rate for any race, sex, or ethnic group which is less than four-fifths ( 4/5) (or eighty percent) of the rate for the group with the highest rate will generally be regarded by the Federal enforcement agencies as evidence of adverse impact." § 1607.4(D) (hereinafter Guideline sections are cited only by the subdivisions of 29 C.F.R. § 1607). The District Court then noted that the discrepancy between the percentage of minority group members in the applicant pool and the percentage of minority group members who passed the test was 39 standard deviations. The evidence also showed that the passing rate of the minority group members was 44.3% of the passing rate of Whites, or about two-fifths.
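Both measures can be checked against the figures reported earlier in the opinion. The sketch below assumes the binomial model used in Castaneda's footnote 17, treating the 13,749 passers as draws from a pool that is 30.9% minority and comparing the expected minority count with the 15.4% actually observed. Whether the District Court computed its figure in exactly this way is an assumption. The function names are hypothetical.

```python
import math

def standard_deviations(n_selected, pool_share, observed_share):
    """Gap between expected and observed counts, in binomial standard deviations."""
    expected = n_selected * pool_share
    observed = n_selected * observed_share
    sd = math.sqrt(n_selected * pool_share * (1 - pool_share))
    return (expected - observed) / sd

def impact_ratio(group_rate, highest_rate):
    """Group's selection rate as a fraction of the highest group's rate."""
    return group_rate / highest_rate

# Figures from the opinion: 13,749 passers, 30.9% known minority share of the
# applicant pool, 15.4% known minority share of the passing group.
z = standard_deviations(13749, 0.309, 0.154)

# Pass rates among self-identified applicants: 45.9% White, 17% Black, 20.5% Hispanic.
black_ratio = impact_ratio(17.0, 45.9)
hispanic_ratio = impact_ratio(20.5, 45.9)
FOUR_FIFTHS = 0.8
```

Under these assumptions `z` comes out between 39 and 40, in line with the District Court's figure and far beyond the "two or three" that Castaneda treats as sufficient, and both ratios fall well below the four-fifths threshold.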

Having concluded that the plaintiffs had established a prima facie case, the Court next concluded that the test was not sufficiently valid to constitute a legitimate attempt to choose those applicants who would become better police officers. Relying primarily on the EEOC Guidelines, the Court stated that a determination of validity based on the content of the test would be inappropriate, first because the test purported to measure abilities that the accepted applicants would be trained to acquire, see Guidelines, § 14(C) (1), and second, because the test actually measured constructs, not abilities, see id.5 Moreover, the Court concluded that the job analysis was not sufficiently precise to satisfy the Guideline requirement even for content validation, see Guidelines § 14(C) (2). Since content validation was the only method of validation which the City attempted, the Court concluded that the test was invalid, and thus an inadequate rebuttal to the plaintiffs' prima facie case.

In fashioning relief, the District Court noted that a previous examination administered by the New York City Police Department had been found to be in violation of Title VII. Guardians Association v. Civil Service Commission, 431 F. Supp. 526 (S.D.N.Y.), vacated and remanded on other grounds, 562 F.2d 38 (2d Cir. 1977). Concluding that the defendants had "persisted in devising and utilizing testing procedures that continue to discriminate against blacks and hispanics," the District Court found that the defendants' "studied adherence to discriminatory procedures must at this point be deemed conscious and deliberate." 484 F. Supp. at 798. On this basis, the Court held that "affirmative action is mandated as an interim measure either until such discrimination has been totally eliminated or until defendants proceed to select police officers under procedures that are in full compliance with Title VII." Id. at 799. It enjoined the City from using Exam No. 8155, although permitting it to use the eligibility list from that exam for purposes designated by the Court. In its order, also issued on January 11, the Court ordered the Police Department to achieve at least 30% minority composition of the force, a level comparable to the percentage of minorities in the labor force of the relevant hiring area. To achieve this goal, the Court further ordered that the defendants should "as an interim goal appoint 50% of their entry level police officers from among qualified black and hispanic applicants." Finally, the Court awarded the plaintiffs attorneys' fees and costs, and retained jurisdiction "for such further relief or other orders as may be necessary or appropriate to enforce and insure rights to equal employment opportunity within the New York City Police Department."

The City moved to stay the District Court's order pending consideration of its appeal from the decision. This Court denied the motion. However, we granted a conditional stay, in view of the City's declared need to hire new police officers6 and set an expedited schedule for the appeal.

As the District Court concluded, the accepted procedure for Title VII cases is to require the plaintiffs to establish a prima facie case, and then to require the defendants to rebut this showing with proof that the test was legitimately job-related. See Albemarle Paper Co. v. Moody, 422 U.S. 405, 95 S. Ct. 2362, 45 L. Ed. 2d 280 (1975); McDonnell Douglas Corp. v. Green, 411 U.S. 792, 93 S. Ct. 1817, 36 L. Ed. 2d 668 (1973); Griggs v. Duke Power Co., 401 U.S. 424, 91 S. Ct. 849, 28 L. Ed. 2d 158 (1971). The Court correctly concluded that a prima facie case had been established. By any reasonable measure, including the standard deviation rule of Castaneda, supra, or the four-fifths rule of the EEOC Guidelines, Exam No. 8155 had a disparate racial impact.

The City also claims that finding a prima facie Title VII violation by state or local governments without a showing of discriminatory intent violates the Tenth Amendment. This view has been persuasively rejected by the Seventh Circuit in United States v. City of Chicago, 573 F.2d 416, 422-24 (7th Cir. 1978), and we agree with that analysis. Congress may enforce the Fourteenth Amendment by legislation that prohibits practices the Amendment might not of its own force condemn. See Katzenbach v. Morgan, 384 U.S. 641, 86 S. Ct. 1717, 16 L. Ed. 2d 828 (1966).

The real issue in this case, therefore, is whether the defendants have rebutted the plaintiffs' prima facie case by proving that their test was job-related: that the test accurately selected applicants who would be better police officers. Adjudication of this issue presents a more complex problem in the present case than it has in many previous Title VII suits. Many of the previous suits involved tests that were so artlessly constructed that they could be judged invalid without extensive inquiry, fine distinctions, or a precise notion of where the line between validity and invalidity was located. See, e. g., Griggs, supra, 401 U.S. at 431, 91 S. Ct. at 853 (intelligence tests used "on the Company's judgment that they generally would improve the overall quality of the work force"); United States v. N. L. Industries, Inc., 479 F.2d 354, 371 (8th Cir. 1973) (test given to one applicant "consisted of four or five mathematical problems which a Company employee jotted down on a sheet of yellow paper"); Brito v. Zia Co., 478 F.2d 1200, 1205-06 (10th Cir. 1973) (test based almost entirely on subjective judgments of supervisors, not administered or scored under controlled and standardized conditions); Vulcan Society, supra, 490 F.2d at 396-98 (no job analysis, test measured abilities that were clearly of secondary importance to job).

Exam No. 8155, in contrast, is a "second generation" selection procedure. Despite the various flaws in construction of the test, it is clear that some attempt was made to develop the test with recognition of at least some of the standards that courts had established in the first wave of Title VII cases. Aware that the validity of the test would likely have to be demonstrated, the City performed an extensive job analysis, consciously used Guideline concepts in determining the qualities that were being tested for, and attempted to eliminate extraneous variables, such as the applicant's prior knowledge, his reading level, and his ability to complete the test in a relatively short amount of time.

Nevertheless the plaintiffs have alleged and the District Court has concluded that the construction and use of Exam No. 8155 failed in several respects to meet test validity standards, particularly those specified in the Guidelines. Whether or not these deficiencies are fatal, they are plainly more substantial than the defects deemed not to defeat validity in prior cases. Detroit Police Officers Association v. Young, 446 F. Supp. 979, 990-91, 1007-08 (E.D. Mich. 1978); Bridgeport Guardians v. Bridgeport Police Department, 431 F. Supp. 931 (D. Conn. 1977); cf. Washington v. Davis, 426 U.S. 229, 248-52, 96 S. Ct. 2040, 2051-53, 48 L. Ed. 2d 597 (1976) (Fourteenth Amendment case involving some Title VII concepts). Consequently, assessment of Exam No. 8155 necessarily carries this Court into difficult areas of judging test validity. We must determine, with some care, what the general standards are for judging validity, and how these standards are to be applied in a specific factual situation.

The study of employment testing, although it has necessarily been adopted by the law as a result of Title VII and related statutes, is not primarily a legal subject. It is part of the general field of educational and industrial psychology, and possesses its own methodology, its own body of research, its own experts, and its own terminology. The translation of a technical study such as this into a set of legal principles requires a clear awareness of the limits of both testing and law. It would be entirely inappropriate for the law to ignore what has been learned about employment testing in assessing the validity of these tests. At the same time, the science of testing is not as precise as physics or chemistry, nor its conclusions as provable. While courts should draw upon the findings of experts in the field of testing, they should not hesitate to subject these findings to both the scrutiny of reason and the guidance of Congressional intent.

The need to modify rigid technical conclusions from the field of testing is indicated by the view of certain testing experts, including those who testified for the plaintiffs in this case, that there is no test that can be considered completely valid to select candidates for any but the most rudimentary tasks. If this view guided interpretation of Title VII, then at the current stage of the technology of testing, no test that produces a disparate racial impact could be used for positions such as police officers.

The danger of too rigid an application of technical testing principles is that tests for all but the most mundane tasks would lack sufficient validity to permit their use. At least that is the risk given the current state of the art of employment testing. This risk can be appreciated by considering the one example even most test critics acknowledge to have substantial, though not complete, validity. This is a typing test given to a group of applicants for jobs as typists.9 Such a test substantially meets all the criteria suggested by plaintiffs' experts for content validation, but the very success of this test casts doubt on the usefulness of the example. To begin with, typing is a task that readily yields to quantitative measurement. The quality of a typist's job performance depends on two factors, both of which can be captured with precision in numbers: how fast he types, and how many errors he commits. Most jobs involve tasks whose performance can be evaluated only in the more subjective light of judgment. Surely this is true of nearly all the tasks required to be performed by police officers. In addition, there is a more basic problem with the typing test example. Typing is one of the few activities that a test-taker can perform in virtually the same manner as he will be required to perform on the job. That is obviously an ideal testing situation, but it is not one that is frequently available, and such "on-the-job" testing could not possibly be done to select police officers. Yet the force of the typing test example easily leads to one of the conclusions of the District Court in this case: that Exam No. 8155 lacked validity because it measured performance in an artificial classroom setting and did not necessarily indicate who would perform well on the job.10

Closely related to the question of the proper weight to be given to technical conclusions of testing theory is the question of the proper weight to be given to the EEOC Uniform Guidelines, which are largely based on these technical conclusions. See Guidelines § 5(C). The District Court drew its methodology from the Guidelines, concluding that the City's test was invalid because it failed to satisfy all of the Guidelines. The Supreme Court has relied upon some of the Guidelines in several of the leading cases, see Albemarle, supra, 422 U.S. at 431, 95 S. Ct. at 2378; Espinoza v. Farah Manufacturing Co., 414 U.S. 86, 94, 94 S. Ct. 334, 339, 38 L. Ed. 2d 287 (1973); Griggs, supra, 401 U.S. at 433-34, 91 S. Ct. at 854-55, but the Court has not ruled that every deviation from any of the Guidelines automatically results in a violation of Title VII. The Court appears to have applied the Guidelines only to the extent that they are useful, in the particular setting of the case under consideration, for advancing the basic purposes of Title VII. See Espinoza, supra, 414 U.S. at 94, 94 S. Ct. at 339; Guardians Association v. Civil Service Commission, 490 F.2d 400, 403 n.1 (2d Cir. 1973); United States v. Georgia Power Co., 474 F.2d 906, 913 (5th Cir. 1973). To the extent that the Guidelines reflect expert, but non-judicial opinion, they must be applied by courts with the same combination of deference and wariness that characterizes the proper use of expert opinion in general. See Albemarle, supra, 422 U.S. at 449, 95 S. Ct. at 2390 (Blackmun, J., concurring) (Guidelines "have never been subjected to the test of adversary comment. Nor are the theories on which the Guidelines are based beyond dispute."). Thus, the Guidelines should always be considered, but they should not be regarded as conclusive unless reason and statutory interpretation support their conclusions. As this Court has previously stated: "If the EEOC's interpretations go beyond congressional intent, the Guidelines must give way." Guardians Association, supra, 490 F.2d at 403 n.1.

In addition to their force as the expression of expert opinion, the Guidelines also possess legal force. But here too, it is necessary to keep their limits in mind. The primary purpose of the Guidelines is to indicate the standards that various Federal agencies, such as the EEOC, the Civil Service Commission, and the Department of Justice, are to use in enforcing Title VII and related statutes. See Guidelines § 2(A). But the fact that an agency or group of agencies has announced the standards they will use does not convert those standards into mandatory legal rules.

A second legal basis for following the Guidelines is that they represent the "administrative interpretation of the Act by the enforcing agency," and are "entitled to great deference" on that basis. Griggs, supra, 401 U.S. at 433-34, 91 S. Ct. at 854-55; see Albemarle, supra, 422 U.S. at 431, 95 S. Ct. at 2378. However, the Court has also recognized that the Guidelines "are not administrative 'regulations' promulgated pursuant to formal procedures established by Congress." Ibid. They are entitled to deference, not obedience. See Espinoza, supra, 414 U.S. at 94, 94 S. Ct. at 339 (1973) (Guideline rule on discrimination against non-citizen "is no doubt entitled to great deference . . ., but that deference must have limits where, as here, application of the guideline would be inconsistent with an obvious congressional intent"). Moreover, the Court in Griggs was following the Guidelines only to make the straightforward distinction between general intelligence tests and job-related tests; it is not at all clear that Griggs requires observance of all the intricate details of the Guidelines. It might be desirable for all employers to follow the more careful practices required of the Federal Government, but there is no reason to think that Congress intended to impose such practices, in their full rigor, when it enacted Title VII.

With these considerations in mind, we turn to the validity of Exam No. 8155.

The threshold task in determining the validity of a challenged examination is to select the appropriate method for assessing its job-relatedness. The Guidelines describe three techniques: content validation, construct validation, and criterion-related validation. Guidelines §§ 5(B), 14. The Guidelines specify when each technique is appropriate and also specify the requirements for successfully validating an exam by use of each technique. Defendants have attempted to justify Exam No. 8155 by content validation, a technique appropriate for tests that measure "knowledges, skills or abilities" representative of the "content" of the job. Guidelines § 14(C) (1). Plaintiffs contend that construct validation must be used to assess this exam because, in their view, the exam attempts to measure "constructs," that is, inferences about mental processes or traits, such as "intelligence, aptitude, personality, commonsense, judgment, leadership and spatial ability." Ibid.

This content-construct distinction has a significance beyond just selecting the proper technique for validating the exam; it frequently determines who wins the lawsuit. Content validation is generally feasible while construct validation is frequently impossible. Even the Guidelines acknowledge that construct validation requires "an extensive and arduous effort." Guidelines § 14(D) (1). The principal difficulty with construct validation is that it requires a technique that includes a criterion-related study, Guidelines § 14(D) (4)-a demonstration from empirical data that the test successfully predicts job performance.11 Developing such data is difficult, and tests for which it is required have frequently been declared invalid.12 As a result, a conclusion that construct validation is required would often decide a case against a test-maker, once a disparate racial impact has been demonstrated.

To determine whether defendants are entitled to use content validation, we examine the Guidelines' criteria for that technique, but we do so bearing in mind our cautionary approach to the Guidelines, previously expressed. The Guidelines specify two basic conditions that must be met before content validation may be used. First, it must appear that what the test attempts to measure is knowledge or an ability, and not a general trait, such as intelligence. Guidelines § 14(C) (1). Second, the test must not measure knowledge or ability that an employee will be expected to learn on the job. Ibid.; see also § 5(F). The District Court rejected content validation, concluding both that Exam No. 8155 measures constructs, not abilities, and that, even if what was tested for could be considered abilities, they could be learned in the five-month training program.

In specifying how the selection of validation techniques is to be made, the Guidelines adopt too rigid an approach, one that is inconsistent with Title VII's endorsement of professionally developed tests. Taken literally, the Guidelines would mean that any test for a job that included a training period is almost inevitably doomed: if the attributes the test attempts to measure are too general, they are likely to be regarded as constructs, in which event validation is usually too difficult to be successful; if the attributes are fairly specific, they are likely to be appropriate for content validation, but this too will prove unsuccessful because the specific attributes will usually be learned in a training program or on the job.

The origin of this dilemma is not any inherent defect in testing, but rather the Guidelines' definition of "content." This definition makes too sharp a distinction between "content" and "construct," while at the same time blurring the distinction between the two components of "content": knowledge and ability. The knowledge covered by the concept of "content" generally means factual information. The abilities refer to a person's capacity to carry out a particular function, once the necessary information is supplied. Unless the ability requires virtually no thinking, the "ability" aspect of "content" is not closely related to the "knowledge" aspect of "content"; instead it bears a closer relationship to a "construct." Some researchers regard content tests as nothing more than assessments of particular kinds of constructs, e. g., Tenopyr, Content-Construct Confusion, 30 Personnel Psych. 47 (1977); others regard any ability that is evidenced by observable behavior as sufficiently non-inferential to be considered content, see Ebel, Comments on Some Problems of Employment Testing, 30 Personnel Psych. 55 (1977). See generally Cattell, Validity and Reliability: A Proposed More Basic Set of Concepts, 55 J.Ed.Psych. 1 (1964). Whichever view is adopted, it would seem that abilities, at least those that require any thinking, and constructs are simply different segments along a continuum reflecting a person's capacity to perform various categories of tasks. This continuum starts with precise capacities and extends to increasingly abstract ones, from the capacity for filling out forms to the capacity for exercising judgment.

Recognition that abilities and constructs are not entirely distinct leads to a conclusion that a validation technique for purposes of determining Title VII compliance can best be selected by a functional approach that focuses on the nature of the job. The crucial question under Title VII is job relatedness: whether or not the abilities being tested for are those that can be determined by direct, verifiable observation to be required or desirable for the job. See Griggs, supra, 401 U.S. at 431, 91 S. Ct. at 853; Vulcan Society, supra, 490 F.2d at 394-95; Chance, supra, 458 F.2d at 1177. If the job in question involves primarily abilities that are somewhat abstract, content validation should not be rejected simply because these abilities could be categorized as constructs. However, if the test attempts to measure general qualities such as intelligence or commonsense, which are no more relevant to the job in question than to any other job, then insistence on the rigorous standards of construct validation is needed. Since tests of this kind are often biased in favor of a person's familiarity with the dominant culture, permitting them to be used without a showing of predictive validity would perpetuate the effects of prior discrimination. But as long as the abilities that the test attempts to measure are no more abstract than necessary, that is, as long as they are the most observable abilities of significance to the particular job in question, content validation should be available. To lessen the risks of perpetuating cultural disadvantages, the degree to which content validation must be demonstrated should increase as the abilities tested for become more abstract.

This functional approach, which adjusts the distinction between content and construct to the nature of the job being tested for, expands the opportunity for both employers and courts to rely on content validation. It also avoids making a threshold choice between content and construct validation based solely on the nature of the quality tested for unrelated to the job, a choice that might make content validation seem inappropriate. To base the content-construct determination on the nature of the job, it is necessary first to analyze the job to see if it requires abilities appropriate for content validation. Instead of choosing between content and construct validation at the outset, as the Guidelines seem to require, employers and courts can start the content validation inquiry and use its results to determine both whether content validation is appropriate and whether it has been achieved. Should the attempted content validation be found inadequate, the reason may be that this method of validation was not appropriate because of the pertinent job abilities revealed by the job analysis. On the other hand, this approach will sometimes indicate that content validation is appropriate, even though the abilities tested for could be considered constructs.13

Just as lessening the severity of the Guidelines' distinction between content and construct reduces the likelihood that a test is invalid because it measures constructs, so sharpening the distinction between knowledge and ability, now obscured by the Guidelines, reduces the problem that the test is invalid because it duplicates the training period, i. e., tests for what will later be learned. Unlike knowledge, some abilities are appropriate for testing confirmed by content validation despite their overlap with post-selection training. A valid measurement of some abilities can select applicants who will ultimately use their training to perform their tasks more effectively or who will more effectively perform similar tasks for which they have not been specifically trained. On the other hand, content validation remains inappropriate for tests that measure knowledge of factual information if that knowledge will be fully acquired in a training program. Approval of such tests, without predictive validation, risks favoring applicants with prior exposure to the information, a course likely to discriminate against a disadvantaged minority. For example, it would be duplicative of the Police Department's training program, and thus invalid, to test applicants for their knowledge of the Department's arrest form. Testing for their ability to fill out the form, however, can be expected to select applicants who can be successfully trained to perform well at that task and others like it.

Applying the approach just outlined, we conclude, at least as an initial matter, that content validation may properly be selected as the appropriate technique for assessing Exam No. 8155. The exam tests for three basic abilities (although it purports to test for five): the ability to remember details, the ability to fill out forms, and the ability to apply general principles to specific facts. This third ability is assessed in three contexts: the application of general statements of criminal offenses to the facts of specific events, the application of procedures and standards to the facts of specific policing activities, and the application of procedures and standards to the facts of specific situations involving human relations problems. These three basic abilities are not so abstract, on their face, as to preclude content validation, provided subsequent consideration of the job analysis does not demonstrate that important and more concrete abilities necessary for the job were needlessly omitted from those considered for measurement. Though all three abilities involve some inference about mental processes, they are based on observable behaviors and are far less abstract than such traits as intelligence, leadership, or judgment. Moreover, testing for these three abilities sufficiently avoids the objection that the test duplicates the Department's training program. Though all three abilities can be trained to some extent, the test-makers were entitled to select applicants with existing ability so that training would both enhance their abilities and prepare them for other tasks requiring similar talents. The vice of testing for knowledge readily taught in the training program was totally avoided.

B. Assessing the Content Validity of Exam No. 8155

Since content validation appears to be an appropriate method for assessing Exam No. 8155, we proceed to consider whether the use of this method indicates that the exam has sufficient validity to select applicants for the job of police officer. The Guidelines describe various aspects of content validation, but do not neatly list ingredients of an adequate exam. From our study of the Guidelines, we distill five attributes of an exam with sufficient content validity to be used notwithstanding its disparate racial impact. The first two concern the quality of the test's development: (1) the test-makers must have conducted a suitable job analysis, and (2) they must have used reasonable competence in constructing the test itself.14 The next three attributes are more in the nature of standards that the test, as produced and used, must be shown to have met. The basic requirement, really the essence of content validation, is (3) that the content of the test must be related to the content of the job. In addition, (4) the content of the test must be representative of the content of the job. Finally, the test must be used with (5) a scoring system that usefully selects from among the applicants those who can better perform the job. We consider each of these five matters in turn.

The Job Analysis

According to the Guidelines, a job analysis involves an assessment "of the important work behavior(s) required for successful performance and their relative importance." § 14(C) (2). The job analysis performed by the City, while somewhat flawed as the District Court pointed out, is nonetheless adequate to meet this standard. As far as the first part of the standard is concerned, the work behaviors involved in being a police officer were identified by extensive interviewing, and subjected to serious review (Job Analysis, Steps 1 and 2). The District Court found that these work behaviors "were not delineated with precision." 484 F. Supp. at 795. In fact, the descriptions of the 42 tasks that ultimately appeared on the job analysis list vary considerably in the level of precision. Some are complete and unambiguous, such as "1. Checks the condition of personal and department equipment such as radio, patrol car, weapons, etc."; "35. Attends training sessions." Others are more open-ended, but do manage to fulfill their function by defining the behaviors associated with the task, such as "3. Performs foot patrol"; "40. Controls various types of crowds." Still others are so vague that they communicate very little real information, such as "10. Interacts with juveniles in non-arrest situations"; "39. Performs duties in hostage situations."

While greater precision might have been achieved, a complete description of the observable tasks associated with being a police officer would be a reworded version of the entire training manual. The Police Department's list of tasks, despite some lapses in specificity, contains a sufficient amount of meaningful information to satisfy the relevant requirement.

The second part of the Guideline standard for a job analysis requires determination of the relative importance of the identified work behaviors. The City performed this function by means of an extensively distributed questionnaire, specifying the criteria to be used in ranking the 42 tasks (Job Analysis, Step 3). The process as a whole appears to be reasonably accurate, and neither the plaintiffs nor the District Court raised any serious objection to it.

Having determined the work behaviors and established their relative importance, the City then grouped the 42 tasks into five clusters and asked panels of police officers to identify the knowledges, skills, or abilities necessary to the effective performance of these tasks. (Job Analysis, Steps 4 and 5). This function was implemented in a much less satisfactory manner. Only one of the panels identified the abilities; the other four used the list of five abilities that the first panel had developed. This lessened the value of having five independent panels make this complicated and subjective determination. Moreover, no effort was made to explain the relationship between any of the five abilities and the 42 job tasks from which they were ostensibly derived.15

The plaintiffs criticize the required abilities identified by the City for being undefined. But the type of definition suggested by the Guidelines, one that describes the abilities in terms of "observable behaviors and outcomes," Guidelines § 15(C) (3), seems repetitive, since the work behaviors are already defined in this way. The five identified abilities, with the possible exception of "human relations, including communication techniques," are comprehensible enough. Their appropriateness for measurement would have been considerably clearer, however, if each panel had explained which tasks required which abilities. While the Guidelines may be unnecessarily stringent in regarding the identification of this relationship of ability to task as "essential," see ibid., such identification does go far toward eliminating the ambiguities that are otherwise inherent in generalized descriptions of abilities. Only if the relationship of abilities to tasks is clearly set forth can there be confidence that the pertinent abilities have been selected for measurement.

The Test Construction Process

With a job analysis of questionable sufficiency, the City then proceeded to the test construction stage. As an initial matter, we note that Exam No. 8155 was developed "in-house," by staff members of New York City's Police Department and Department of Personnel; there was little input from any outside source, and no participation by anyone specializing in test preparation. Of course, the law should not be designed to subsidize specialists. But employment testing is a task of sufficient difficulty to suggest that an employer dispenses with expert assistance at his peril. Certainly, the decision to forgo such assistance should require a Court to give the resulting test careful scrutiny. See Kirkland, supra, 520 F.2d at 425-26; Vulcan Society, supra, 490 F.2d at 395-96.

While the determination of how many questions should be included for each identified ability was made by a fairly careful numerical analysis (Test Construction, Step 1), the process of writing the questions themselves was rather haphazard. The questions were initially framed by police officers, who may have had expertise in identifying tasks involved in their job but were amateurs in the art of test construction. In addition, the officers did not have access to the job analysis material during much of the process. Finally, the questions, although they were reviewed, were not tested on a sample population. To be sure, a complete determination of the questions' accuracy in measuring the identified abilities would be equivalent in its complexity to a criterion-related study. But the City did not even perform the minimal sample testing to ensure that the questions were comprehensible and unambiguous.

Not surprisingly, the test construction process did not fully succeed in meeting even its own goal of testing for all the identified abilities. As previously indicated, Exam No. 8155 does appear to test for the three identified abilities of remembering details, filling out forms, and applying general principles to specific facts. However, the fourth identified ability, human relations skill, proved more troublesome. In deciding how to test for this ability, the City faced a dilemma inherent in testing for all but the most mundane jobs. To be fully representative of the job, a test should measure all the significant abilities needed for successful job performance, yet some abilities, especially in jobs of any complexity, are far along the construct end of the content-construct continuum where successful validation is difficult. If a test tries to be representative and measure all significant abilities, including those that are clearly constructs, it risks the use of inadequate assessment devices, because the rigorous standard for construct validation will rarely be met. On the other hand, if the test-makers acknowledge the difficulty of satisfactorily measuring constructs and test only for those abilities that are appropriate for content validation, they encounter the objection that the test is not sufficiently representative of the job.

Recognizing the difficulty of construct validation, yet reluctant to omit assessment of an important characteristic of successful job performance, the City attempted to resolve the dilemma by treating human relations skill as an ability suitable for content validation and devoting 30 questions, nearly one-third of the exam, to an effort to assess this ability. Mindful of an important requirement of content validity, the City carefully avoided rewarding a test-taker's prior knowledge and, instead, supplied in the test itself all the information necessary to select the correct answers to the human relations questions. Included before each group of questions was a set of appropriate standards (essentially "do's" and "don'ts") for handling a particular type of human relations matter. But supplying this guidance rendered the 30 questions primarily a further assessment of a candidate's ability to apply written standards to specific fact situations, and only slightly a measure of his talent for human relations. Anyone with minimal analytic ability needed to apply the standards to the various fact situations could select the one correct answer, even if his intuitive reaction to a human relations problem might be woefully inadequate.

Assessing human relations skill will always be a difficult enterprise, but the deficiency of the City's attempt does not mean that a content validation approach is necessarily impermissible or impossible to achieve. As indicated above, at least within the middle range of the content-construct continuum, the distinction between content and construct should be determined functionally, in relation to the job. If the quality measured is not unduly abstract, and if it constitutes a significant aspect of the job, content validation of the test component used to measure that quality should be permitted. But that component must be designed in an extremely careful way. Test-makers will be well advised to obtain highly qualified assistance in constructing this portion of an exam.

One desirable approach would be to confront applicants with simulated real life situations and assess the appropriateness of their volunteered response. See Firefighters Institute for Racial Equality v. City of St. Louis, 616 F.2d 350 (8th Cir. 1980). That technique is normally too costly for large numbers of applicants, but might have usefulness as a testing device to be used toward the end of the overall selection procedure, after an initially large group of applicants has been narrowed down by the results of a written exam and a background check. If the test component is limited to traditional pencil and paper methods, it may be preferable to forgo any pretense of being able to make fine differentiation among candidates' human relations skills and instead adopt a pass/fail approach, rejecting those whose demonstrably inappropriate responses to human relations questions mark them as unsuitable for police work. Another possibility is to recognize that questions in this area for which only one answer is correct are likely to be too easy, as were most of the questions on Exam No. 8155, and therefore of little use in making selections from among applicants. Instead questions can be designed for which some answers are appropriate responses and others are inappropriate. As feasible techniques in this area evolve, employers will be expected to use them.

With these strengths and weaknesses of the job analysis and the test construction in mind, we now consider how well the test, as constructed and used, met the basic requirements of content validity.

The Direct Relationship Requirement

The central requirement of Title VII, relationship of test content to job content, was sufficiently satisfied by Exam No. 8155. The job analysis procedure provides adequate assurance that the identified tasks are in fact the tasks that a police officer performs. While the procedure for identifying the abilities required for those tasks was less satisfactory, the three abilities that were actually tested for appear adequately related to most of the identified tasks. The list of tasks confirms one's intuitive assumption that police officers are required to fill out forms (see, e. g., "16. Processes arrests using appropriate police department forms and notifications"), to remember facts (see, e. g., "18. Gives testimony in court (oral and written)"), and to apply general principles to specific fact situations (see, e. g., "26. Executes warrants.").

Moreover, these abilities are among the most concrete ones that can be derived from the list; they are certainly more concrete than human relations skills, which the test purported to measure, but did not. Two of the abilities tested for, filling out forms and remembering facts, are as specifically stated as they could be without resort to trivial distinctions about particular kinds of forms and facts. The ability to apply general standards is somewhat more problematical, since it is a relatively abstract skill that is relevant to many jobs. However, if there is any job for which ability in applying and following rules is an especially important requirement, it is the job of a law enforcement officer.16

The Representativeness Requirement

The second requirement established by the Guidelines is that the test must be a "representative sample of the content of the job." As presented by the Guidelines, this representativeness requirement has two different meanings. The first is that the content of the test must be representative of the content of the job; the second is that the procedure, or methodology, of the test must be similar to the procedures required by the job itself. The Guidelines express this dual requirement in the following somewhat inscrutable language: "For any selection procedure measuring a knowledge, skill, or ability the user should show that (a) the selection procedure measures and is a representative sample of that knowledge, skill, or ability. . . ." Guidelines § 14(C) (4) (emphasis added).

Both aspects of the representativeness requirement, if interpreted rigorously, would once again foreclose any possibility of constructing a valid test. The United States, as amicus, argues that the requirement that the content of the exam be representative means that all the knowledges, skills, or abilities required for the job be tested for, each in its proper proportion. This is not even theoretically possible, since some of the required capacities cannot be tested for in any valid manner. Even if they could be, the task of identifying every capacity and determining its appropriate proportion is a practical impossibility.

It is similarly impossible for the procedures of the test to be truly representative of the actual job procedures. Tests, by their nature, are a controlled, simplified version of the job activities, not the activities themselves. As a practical matter, virtually any realistic test, except one that directly measures a physical skill, like lifting 50-pound sacks, is likely to be a pencil and paper activity, quite different from the job it tests for. An elaborate effort to simulate the actual work setting would be beyond the resources of most employers, and perhaps beyond the capacities of even the most professional test-makers.

More reasonable interpretations of the representativeness requirement are appropriate in light of Title VII's basic purposes. The reason for a requirement that the content of the exam be representative is to prevent either the use of some minor aspect of the job as the basis for the selection procedure or the needless elimination of some significant part of the job's requirements from the selection process entirely; this adds a quantitative element to the qualitative requirement that the content of the test be related to the content of the job. Thus, it is reasonable to insist that the test measure important aspects of the job, at least those for which appropriate measurement is feasible, but not that it measure all aspects, regardless of significance, in their exact proportions. The reason for a requirement that the test's procedure be representative is to prevent distorting effects that go beyond the inherent distortions present in any measuring instrument. For example, although all pencil and paper tests are dependent on reading, even if many aspects of the job are not, the reading level of the test should not be pointlessly high. Similarly, the instructions should not be overly complex, and the exam should not place candidates under excessive time pressure unless such time pressure is an identifiable aspect of the job.

Exam No. 8155 meets these representativeness requirements to an adequate degree. While it did not test for all the skills involved in being a police officer nor adequately test for the human relations skill that the job analysis identified as important, the ones it did measure (memory, the ability to fill out forms, and the ability to apply rules to factual situations) are all significant aspects of entry-level police work. To be sure, this conclusion would have been easier to reach if the City had spelled out the relationship between the abilities that were tested for and the job behaviors that had been identified. But the relationship is sufficiently apparent to indicate that the City was not seizing on minor aspects of the police officer's job as the basis for selection of candidates. The inadequate assessment of human relations skill lessens the representativeness of the exam and consequently lessens its degree of content validity, but this deficiency is not fatal, especially in light of the difficulty of assessing such an abstract ability. Though human relations skill was deemed so important as to warrant 30 of the exam's 100 questions, the City could just as plausibly have concluded that equally important for the job of policing are such other abstract qualities as common sense, leadership potential, sound judgment, or ability to resist provocation. When a police exam inadequately tests for any of these abstract abilities, it simply recognizes the limits of the art of testing. Indeed, the more a test concerns itself with relatively concrete abilities identified as necessary for successful job performance, the more likely it is to achieve a sound basis for assessment of applicants.

Similarly, the procedure that the test employed was not needlessly unrepresentative of the job itself. In electing to use a pencil and paper test, the City did not forgo any readily available and realistically feasible alternative procedure that would have been more representative of the job. Moreover, the risks of using a written test were substantially minimized. The reading level necessary to understand the questions was in some cases equal to, but generally well below, the training materials used in the Police Academy. The instructions were clear enough, and employed an ordinary four-answer, multiple-choice format, perhaps the most familiar standardized test technique. In addition, ample time was allowed for taking the exam, thereby avoiding an unnecessarily pressured situation.

Thus the exam is adequately related to the content of the police officer's job, and adequately representative. The combined effect of this assessment might support a conclusion that the exam as a whole has content validity, though it would be a close question whether a test with the disparate racial impact of this one can be validated when its development departs in some significant respects even from reasonably attainable requirements of the Guidelines. However, even if the construction of the exam passes muster, the way in which it was used to distinguish among candidates seriously departs from the third requirement for content validity and defeats any claim of validity for a testing process that produces disparate racial results.

The Scoring Requirement

Essentially, the City used the results of the exam to compile a rank-ordering of all the applicants, and then selected a passing score sufficient to generate the required number of potential trainees. Neither the rank-ordering nor the passing score conforms to even the most minimal standards for these two devices.

Rank-Ordering. The Guidelines provide that rank-ordering should be used only if it can be shown that "a higher score . . . is likely to result in better job performance." Guidelines § 14(C) (9). This requirement is reasonable and consistent with Title VII's provision that the "results" of a test may not be "used to discriminate." 42 U.S.C. § 2000e-2(h). If test scores do not vary directly with job performance, ranking the candidates on the basis of their scores will not select better employees. It is possible to read the Guidelines' standard for rank-ordering as if the required relationship between better scores and better job performance had to be demonstrated by a criterion-related study. However, the EEOC's interpretation of the Guidelines disclaims such a high standard. The relationship between higher scores and better job performance may permissibly rest on an inference, but where, as here, the test scores reveal a disparate racial impact, and that disparity is greater at high passing scores than at low passing scores, the appropriateness of inferring that higher scores closely correlate with better job performance must be closely scrutinized.

This close scrutiny is required because rank-ordering makes such a refined use of the test's basic power to distinguish between those who are qualified to perform the job and those who are not. If a test is content valid, it may be reasonable to infer that the test scores make some useful gross distinctions between candidates. Candidates with high scores may well be expected to perform the job better than candidates with low scores. See Science Research Associates, Validation: Procedures and Results (1972) (use of criterion "tails" identifying best and worst candidates more justifiable than continuous rating). And it may even be that within some range of scores, some incremental improvements in scores show some positive correlation with improvements in job performance. But neither of these propositions provides confidence for inferring that one-point increments among those who took Exam No. 8155 are a valid basis for making job-related hiring decisions, especially in the range of scores between 94 and 100. The reason such a precise inference cannot be so readily drawn is that content validity is not an all or nothing matter; it comes in degrees. A test may have enough validity for making gross distinctions between those qualified and unqualified for a job, yet may be totally inadequate to yield passing grades that show positive correlation with job performance.

Overlooking this point, the City earnestly contends that if the appropriate abilities were tested for, it makes eminent sense to select candidates strictly on the basis of ranked scores, even to the extent of concluding that a candidate scoring 98 will perform better as a police officer than a candidate scoring 97. The frequency with which such one-point differentials are used for important decisions in our society, both in academic assessment and civil service employment, should not obscure their equally frequent lack of demonstrated significance. Rank-ordering satisfies a felt need for objectivity, but it does not necessarily select better job performers. In some circumstances the virtues of objectivity may justify the inherent artificiality of the substantively deficient distinctions being made. But when test scores have a disparate racial impact, an employer violates Title VII if he uses them in ways that lack significant relationship to job performance.

Permissible use of rank-ordering requires a demonstration of such substantial test validity that it is reasonable to expect one- or two-point differences in scores to reflect differences in job performance. Our prior conclusion that the test itself may have had enough validity to be used does not, therefore, lead to approval of using its results for rank-ordered selections. On the contrary, the defects we noted in the job analysis and the test construction are substantial enough to preclude an inference that passing scores will correlate with job performance closely enough to justify rank-ordered selections. While we do not criticize the City's efforts as extensively as did the District Court, we agree that the identification of pertinent abilities, the demonstration of their relationship to the job tasks, and the process of developing the questions, were flawed.

These shortcomings take on added significance when it is recognized that the test just barely satisfied even our lenient construction of the Guidelines requirement of procedural representativeness. As the EEOC has advised, it is "easier" to make the inference of a relationship between higher scores and better job performance "(t)he more closely and completely the selection procedure approximates the important work behaviors." EEOC Questions and Answers, supra, Q. 62. Unlike the District Court, we are not willing to reject any use of a police exam simply because the pencil and paper procedure of the test is not a close approximation of the job. Nor are we willing to preclude rank-ordering because a pencil and paper procedure was used. Given the current state of the art in employment testing, we think it would be unrealistic to condemn pencil and paper tests. Alternative procedures have not been shown to be readily available within the limitations of time and resources confronting most employers. Nevertheless, we cannot ignore the Guidelines' criticism of assessing ability to perform complex tasks by a test procedure so different from the work setting. When the selection procedure does not closely approximate the important job tasks, it becomes especially important to insist upon a strong showing that other aspects of content validity have been demonstrated. And that demonstration must be very substantial when a test procedure that does not closely approximate the job is sought to reflect the fine gradations required for rank-ordering. In short, while we might not agree with the District Judge that the defects in the test preclude a finding of sufficient content validity to permit its use, we agree that content validity has not been shown to the extent necessary for rank-ordering.

In addition to inadequate demonstration of validity, the test may not be used for rank-ordered selections because of the total absence of any evidence that the exam possessed another vital feature: reliability, that is, the extent to which the exam would produce consistent results if applicants repeatedly took it or similar tests. Of course, there is no expectation that applicants will take any given test more than once. But if an exam lacks reliability to such an extent that results would be significantly inconsistent if the same applicants were to take it again, that is an important indication that the test is not especially useful in measuring their abilities. Although not explicitly mentioned in the Guidelines, reliability is prominently identified in the APA Standards (to which the Guidelines refer in § 5(C)) to be as basic for evaluating an exam as validity itself. See APA Standards, supra at 48-55. Like content validity, reliability is not an all or nothing matter. It too comes in degrees. What is required is not perfect reliability, but rather a sufficient degree of reliability to justify the use being made of the test results. Without some substantial demonstration of reliability it is wholly unwarranted to make hiring decisions, with a disparate racial impact, for thousands of applicants that turn on one-point distinctions among their passing grades.

Two aspects of reliability deserve consideration in assessing the use of rank-ordering. The first is the quality of the exam questions. The more skillfully they have been formulated, the more likely it is that results on one question will correlate with results on the other questions and that successive test scores would be consistent. This will avoid the tendency of scores to vary because of extraneous factors such as test administration. Whether this aspect of reliability has been achieved to an extent sufficient to justify rank-ordering need not be left to general consideration of the quality of the test construction process. A basic demonstration of this aspect of reliability can easily be made by the test-maker, before the test is administered to job applicants. The test-maker can pre-test his exam by giving it twice to a sample of persons generally approximating the characteristics of the population where the test is expected to be used for employment selection. To avoid distortion due to recollection, the test given at a later date to such a sample can use similar but not identical questions. Another somewhat useful indicator of reliability is a technique known as a split-half correlation: dividing each component of the test into equal halves and observing how consistent were an individual's scores on each half.17 This technique can also be used in the process of pre-testing the exam, before it is administered to job applicants. The technique also is easily used on actual test results to provide some minimal evidence of reliability. In this case the City offered no evidence to demonstrate the quality of the questions used in Exam No. 8155.
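The split-half technique described above can be illustrated with a minimal sketch. It is offered only as an illustration and is not drawn from the record: the odd/even division of items, the function names, and the sample responses are all hypothetical assumptions. The Spearman-Brown adjustment included at the end is a standard psychometric step that the opinion does not mention; it is added here only to show how a half-test correlation is commonly projected to a full-length test.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def split_half_correlation(responses):
    """Correlate each person's score on one half of the items with the other half.

    responses: one list per test-taker, 1 for a correct item and 0 otherwise.
    Assumption: the halves are formed by alternating items (1st, 3rd, ... versus
    2nd, 4th, ...), which is only one of several possible ways to split a test.
    """
    first_half = [sum(r[0::2]) for r in responses]
    second_half = [sum(r[1::2]) for r in responses]
    return pearson(first_half, second_half)


def spearman_brown(r_half):
    """Project a half-test correlation to the full-length test (not in the opinion)."""
    return 2 * r_half / (1 + r_half)


# Hypothetical pre-test sample: five people, six items each.
sample = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

r = split_half_correlation(sample)
print(f"split-half correlation: {r:.2f}")
print(f"projected full-test reliability: {spearman_brown(r):.2f}")
```

A correlation near 1 would suggest that the two halves rank the sample consistently; a low value would suggest that scores depend heavily on which questions happen to be asked.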

The second aspect of reliability concerns what testing experts call the error of measurement. See generally, H. Gulliksen, Theory of Mental Tests (1950). This is a statistical phenomenon indicating the degree to which scores on successive tests will be subject to inevitable random variation, no matter how carefully the test-makers have eliminated or at least lessened the effects of extraneous factors within their control. The error of measurement can be calculated by use of the standard deviation concept. For any test, regardless of how carefully it was prepared, statistical analysis, based on the normal distribution curve, shows that there is 68% probability that successive scores would fall within a range of one standard deviation from an actual score and a 95% probability (generally a satisfactory confidence level) that successive scores would fall within a range of two standard deviations from the actual score. It is also possible to estimate, again for any test, how many raw score points above and below the applicant's actual score are within the range of one or more standard deviations. This calculation, as explained in the margin,18 depends upon the applicant's score and the number of items on the test. Thus, though the test-maker can never eliminate the error of measurement, he can minimize its effect for all scores by increasing the number of questions.

NOTE: OPINION CONTAINS TABLE OR OTHER DATA THAT IS NOT VIEWABLE

The inevitable error of measurement for a test consisting of 100 items, like Exam No. 8155, has significance in assessing the use of rank-ordering. At the passing score of 94, one standard deviation is equivalent to a range between 2.4 points above and below 94. The range narrows as actual scores approach 100. At 97, for example, the range is plus or minus 1.7. Thus, to have 95% confidence that an applicant's grade has statistical reliability, grades within two standard deviations of his grade should theoretically be treated as equivalent to his grade, for in fact there is a 95% likelihood that each applicant at each grade would score within such a range on successive takings of equivalent tests. This means that the range in which a satisfactory confidence level is achieved for an applicant who scores 94 lies between 89 and 99, and even for one who scored 97, the range extends from 94 to 100. Care must be taken not to over-emphasize the significance of the error of measurement. Though grounded on sound principles of statistics, it remains an estimate, and it need not prevent the usual use of test scores that do not have a disparate racial impact. At a minimum, however, it should serve to illustrate the risks of making hiring decisions turn on one-point increments at scores where even a single standard deviation covers a raw score range greater than one point.
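The figures in the preceding paragraph can be approximated with a short sketch. The calculation set out in the margin of the opinion is not reproduced in this text, so the formula below is an assumption: it uses the binomial error model, under which the standard error for a raw score x on an n-item test is sqrt(x(n - x)/(n - 1)). That model is chosen here only because it depends on the score and the number of items, as the opinion says the calculation does, and because it yields values close to the 2.4 and 1.7 points quoted. Function names are hypothetical.

```python
from math import sqrt


def binomial_sem(score, n_items):
    """Estimated error of measurement in raw-score points.

    Assumption: binomial error model, sqrt(x * (n - x) / (n - 1)).
    The opinion's own formula is in a footnote not reproduced here.
    """
    return sqrt(score * (n_items - score) / (n_items - 1))


def confidence_band(score, n_items, n_sd=2):
    """Range of n_sd standard deviations around a score, clipped to 0..n_items."""
    half_width = n_sd * binomial_sem(score, n_items)
    low = max(0.0, score - half_width)
    high = min(float(n_items), score + half_width)
    return low, high


def sem_percent(score, n_items):
    """Error of measurement expressed as a percentage of the whole test."""
    return 100 * binomial_sem(score, n_items) / n_items


for score in (94, 97):
    sem = binomial_sem(score, 100)
    low, high = confidence_band(score, 100)
    print(f"score {score}: one SD = {sem:.1f} points, two-SD range = {low:.1f} to {high:.1f}")

# Same proportion correct (94%) on a 100-item and a 200-item test.
print(f"100 items: {sem_percent(94, 100):.2f}% of the test")
print(f"200 items: {sem_percent(188, 200):.2f}% of the test")
```

Under this assumed model, a 200-item test at the same proportion correct has a larger error in raw points but a smaller error as a share of the test, which is consistent with the observation that adding questions reduces the effect of random variation.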

The most serious implication of error of measurement for Exam No. 8155 arises from the extraordinary extent to which high test scores were closely bunched.19 Each score from 94 to 97 was achieved by over 2,000 applicants.20 If the test questions had sufficient differentiating power to produce a somewhat even distribution of scores, or at least to avoid excessive bunching among the high scores, the error of measurement would not have affected the ultimate selection of such a significant portion of the applicants. But when 8,928 applicants, two-thirds of all who passed, are bunched between 94 and 97, the error of measurement makes the use of rank-ordering an extremely unreliable basis for hiring decisions.

If test scores produce disparate racial results, an employer who wants to use rank-ordering of the scores for hiring decisions faces a substantial task in demonstrating that rank-ordering is sufficiently justified to be used. But the task is by no means impossible. Even without resorting to a criterion-related study, the test-maker still has several ways to increase the justification for rank-ordering sufficiently to use it. First, he can conduct a job analysis and construct the test with a high degree of adherence to Guideline requirements. That would produce a much stronger showing of content validation than the City was able to demonstrate in this case. Even content validity sufficient for rank-ordering does not require literal compliance with every aspect of the Guidelines. But there must be a substantial demonstration of job relatedness and representativeness to show a sound basis for making rank-ordering hiring decisions. Second, the test-maker can achieve an adequate degree of reliability by careful design of the exam so that the questions will yield a satisfactory degree of consistent results. To guard against inconsistency based on extraneous factors, the test-maker can pre-test the exam by successive applications to an appropriate sample or at least analyze the results of split-half correlations. Inconsistencies revealed by these techniques can be lessened by redesign of needlessly unreliable questions or components of the exam. To reduce inconsistency based on random variation, the number of questions can be increased. Of course, the size of an exam must observe realistic limits of cost and time of administration, but in the case of Exam No. 8155, using 200 instead of 100 questions would have significantly increased reliability. Because some error of measurement is inevitable, even an increase in the number of questions will not eliminate all random variation. However, the effect of such random variation can be reduced by using questions that are shown to have significant differentiating power, so that scores are not bunched at the high end of the scale.21

Alternatively, the employer can acknowledge his inability to justify rank-ordering and resort to random selection from within either the entire group that achieves a properly determined passing score, or some segment of the passing group shown to be appropriate. The City itself, perhaps unwittingly, has acknowledged the reasonableness of this second alternative. Since each of the scores between 94 and 97 was achieved by more than 2,000 candidates, and since each training class can accommodate slightly more than 400 candidates, the test scores provide no basis for selecting from among candidates at each of these scoring levels. At oral argument, the City acknowledged that random selection would be used; for example, if all candidates scoring 98 or above have been selected, and 400 academy trainees are needed from the 2,000 candidates scoring 97, a random drawing from among all 2,000 would be used.22 Thus, even the City recognizes that when the test scores afford no job-related basis for making selections from within a group that passed the test, random selection is appropriate.
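The random drawing described above can be sketched as follows. This is an illustration only: the candidate identifiers, the seed, and the function name are hypothetical, and nothing in the record indicates how the City would actually conduct such a drawing.

```python
import random


def draw_class(candidate_ids, seats, seed=None):
    """Draw `seats` candidates at random, without replacement.

    Every candidate in the group has the same chance of selection,
    so the drawing makes no distinction among equal scores.
    A fixed seed makes the drawing reproducible for later audit.
    """
    rng = random.Random(seed)
    return rng.sample(candidate_ids, seats)


# Hypothetical identifiers for 2,000 candidates who all scored 97.
candidates_at_97 = list(range(1, 2001))

# One academy class of 400 drawn from that group.
academy_class = draw_class(candidates_at_97, 400, seed=1980)
print(len(academy_class), "candidates selected")
```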

We do not conclude that Title VII requires random selection from among those who pass a content valid test. In some instances rank-ordering may be shown to be justified. But where it is not, random selection from within a group validly determined to have passed a content valid exam is simply an available option. See Association Against Discrimination in Employment v. City of Bridgeport, 594 F.2d 306, 313 n.19 (2d Cir. 1979). The City may prefer not to use it. However that may be, the City cannot use rank-ordering not shown to be job-related when test scores produce a disparate racial impact. Nor can the City justify the use of rank-ordering by reliance on what it contends are requirements of state law. See N.Y.Const. art. 5, § 6; N.Y.Civil Service Law § 61(1). Title VII explicitly relieves employers from any duty to observe a state hiring provision "which purports to require or permit" any discriminatory employment practice. 42 U.S.C. § 2000e-7 (1976).

If rank-ordering were the only unjustified use of test scores, it would be possible to limit a Title VII remedy to the elimination of this device. In other words, if an exam has adequate content validity and a passing score has been adequately determined, the employer could still limit selections to those within the group that passed, provided only that he abandons rank-ordered choices. But in this case, the impermissible use of the test scores extends beyond rank-ordering to the setting of the cutoff score.

Cutoff Score. The Guidelines state that a cutoff score "should normally be set so as to be reasonable and consistent with normal expectations of acceptable proficiency within the work force." Guidelines § 5(H). This also makes sense. No matter how valid the exam, it is the cutoff score that ultimately determines whether a person passes or fails. A cutoff score unrelated to job performance may well lead to the rejection of applicants who were fully capable of performing the job. When a cutoff score unrelated to job performance produces disparate racial results, Title VII is violated. See Association Against Discrimination, supra, 594 F.2d at 312-13; Bridgeport Guardians, Inc. v. Bridgeport Civil Service Commission, 482 F.2d 1333, 1338 (2d Cir. 1973). Consequently, there should generally be some independent basis for choosing the cutoff. As with rank-ordering, a criterion-related study is not necessarily required; the employer might establish a valid cutoff score by using a professional estimate of the requisite ability levels, or, at the very least, by analyzing the test results to locate a logical "break-point" in the distribution of scores. The City offered no such basis in this case. It merely chose as many candidates as it needed, and then set the cutoff score so that the remaining candidates would fail.

If it had been shown that the exam measures ability with sufficient differentiating power to justify rank-ordering, it would have been valid to set the cutoff score at the point where rank-ordering filled the City's needs. The justification would be that each incremental change in score represents an incremental change in job-related ability, so that, for any given cutoff (even one determined solely by hiring needs), those who passed would likely perform the job better than those who failed. But the City can make no such claim, since it never established a valid basis for rank-ordering.

Indeed, the problems of both validity and reliability, which prevent the justified use of rank-ordering, also cast serious doubt on the justification for the cutoff score of 94. Of all these problems, the unreliability attributable to the error of measurement has special significance for the cutoff score. As previously noted, the error of measurement had an especially extensive impact on the applicants because of the bunching of scores at the high end. The bunching occurred not only at passing scores from 94 to 97, but also at failing scores of 92 and 93, each of which was also achieved by more than 2,000 applicants. Scores within a range of two points above and below the passing grade of 94 were achieved by 10,731 applicants, 29% of the total. Had the scores been evenly distributed, 1,800 applicants, only 5%, would have fallen within this range. Selecting a cutoff score in the middle of the range in which the test scores were closely bunched meant that the inevitable error of measurement led to a much higher number of mistaken passes and failures than would otherwise have occurred.23 Perhaps an even distribution of scores cannot be readily achieved, but the impact of the error of measurement could have been held to acceptable limits if a cutoff score had been selected within some range where scores were not closely bunched. This does not mean that every person who fails a test by a single point necessarily has a claim for legal redress. A cutoff score, properly selected, is not impermissible simply because there will always be some error of measurement associated with it. But when an exam produces disparate racial results, a cutoff score requires adequate justification and cannot be used at a point where its unreliability has such an extensive impact as occurred in this case.
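The bunching figures in the preceding paragraph can be checked with simple arithmetic. The applicant counts below come from the opinion; the assumption that an even distribution would spread applicants across the 101 possible scores from 0 through 100 is an illustrative reading, since the opinion does not state how its even-distribution estimate was derived.

```python
TOTAL_APPLICANTS = 36_797  # applicants who took Exam No. 8155 (from the opinion)
NEAR_CUTOFF = 10_731       # applicants scoring 92 through 96 (from the opinion)

# Share of all applicants within two points of the cutoff score of 94.
share_near_cutoff = NEAR_CUTOFF / TOTAL_APPLICANTS

# Assumption: an even distribution over the 101 possible scores, 0 through 100.
SCORES_IN_BAND = 5         # 92, 93, 94, 95, 96
POSSIBLE_SCORES = 101
expected_if_even = TOTAL_APPLICANTS * SCORES_IN_BAND / POSSIBLE_SCORES

print(f"actual share near the cutoff: {share_near_cutoff:.0%}")
print(f"expected count if evenly spread: about {expected_if_even:,.0f}")
print(f"actual count is about {NEAR_CUTOFF / expected_if_even:.1f} times the even-spread count")
```

On these assumptions, roughly six times as many applicants sat within two points of the cutoff as an even distribution would place there, which is why the error of measurement affected so many pass and fail outcomes.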

Primarily on the basis of the improper use of rank-ordering and of the cutoff score for Exam No. 8155, we affirm the conclusion of the District Court that the exam as used was invalid. Since we agree with the District Court that the exam had a significant disparate racial impact, we hold that the City's use of the exam violated Title VII.

The fashioning of relief in employment discrimination cases is always a sensitive matter, especially in cases like this one where the District Court endeavors to order some form of affirmative action, including the use of a quota. Our task of determining whether the District Court's remedy conforms to prevailing standards for Title VII relief has been made somewhat more difficult than usual because the precise effect of the Court's order is not clear and, in some respects, the order is not adequately supported by necessary findings or sufficient evidence.

The District Court's order is set out in full in the margin.24 It deals with several topics including the use of Exam No. 8155, the development and approval of a new selection procedure, hiring in the interim until a new selection procedure receives court approval, and long-term hiring. The City objects most strenuously to the provisions concerning interim and long-term hiring, since these provisions involve the use of a quota.

In considering the District Court's order, we find it useful to distinguish between those aspects of the order that are designed to assure compliance with Title VII and those aspects that provide affirmative relief as a remedy for past discrimination. Compliance involves restricting the use of an invalid exam, specifying procedures and standards for a new valid selection procedure, and authorizing interim hiring that does not have a disparate racial impact.25 Affirmative relief involves interim hiring at any ratio greater than what is necessary just to avoid a disparate racial impact and any required long-term hiring targets or ratios. Though standards in this difficult area are only beginning to emerge and have been a source of disagreement within this and other courts, we distill from the case law the following general principles, applicable to remedies for discrimination in entry-level hiring.26

2. Initial consideration should be given to relief for the plaintiffs and those similarly situated, that is, Black and Hispanic applicants who took Exam No. 8155. While relief in Title VII cases need not necessarily be limited to the applicant class nor framed in specific relation to that class, their interests obviously deserve consideration. See Castro v. Beecher, 459 F.2d 725, 736-37 (1st Cir. 1972); Carter v. Gallagher, supra, 452 F.2d at 328-31.

3. Interim hiring provisions, for the period prior to use of a valid selection procedure, should be considered and formulated separately from long-term hiring provisions.

4. Since interim hiring provisions, where needed to satisfy immediate personnel requirements, are to be used prior to the development and approval of a valid selection procedure, such provisions cannot meet Title VII standards by demonstrated job relatedness. Therefore, one appropriate way to assure Title VII compliance on an interim basis is to avoid a disparate racial impact. This means selecting from among adequately qualified applicants either on a random basis, see, e. g., Association Against Discrimination, supra, 594 F.2d at 313, n.19, or according to some appropriately noncompensatory ratio, see, e. g., Kirkland, supra, 520 F.2d at 429-30; Vulcan Society, supra, 490 F.2d at 398-99, normally reflecting the minority ratio of the applicant pool or the relevant work force.

5. Any use of a hiring ratio during the interim period to compensate for prior discrimination, that is, a ratio greater than the minority percentage in the applicant pool or the relevant work force, should be imposed only upon clear evidence and appropriate findings of the need to redress demonstrated prior discrimination of long standing that has had a significant impact on minority employment. See Association Against Discrimination, supra, 594 F.2d at 312; Patterson, supra, 514 F.2d at 776 (Feinberg, J., concurring); Vulcan Society, supra, 490 F.2d at 398-99.

6. If a hiring ratio is imposed beyond the interim period in which a valid selection procedure is developed in order to reach a required long-term target, the justification for its use must be especially compelling.27 See Bridgeport Guardians, supra, 482 F.2d at 1340. The prior discrimination warranting such a remedy must either be intentional, or it must plainly appear that significant discrimination has persisted for a substantial time. Gross disparity between minority employment and minority percentage in the relevant work force may imply such discrimination, especially when the minority employment is extremely low. Otherwise, the instances, impact, and duration of prior discrimination must be established.

With these principles in mind we turn to consideration of the order's provisions for compliance remedies and for affirmative relief.

Compliance Remedies

Paragraph 2 of the order enjoins the use of Exam No. 8155 as a selection procedure, except in connection with the implementation of the interim and long-term hiring provisions. Deferring for the moment the exception concerning the permissible use of the exam, we readily affirm the District Court's prohibition against the unqualified use of the exam. The exam as used violated Title VII, and it is obviously appropriate to bar its continued use, except on an interim basis with adjustments that eliminate its disparate racial impact and thereby avoid its unlawful effect.

The order prescribes four requirements for the development and approval of a new selection procedure. We affirm the requirement, in paragraph 6 of the order, that the City make extensive efforts in its search for a new procedure, including consideration of "all reasonably available alternative selection procedures" and broad consultation with appropriate professionals.

We also affirm the procedural requirement in paragraph 4 of the order that the new selection device must be approved by the District Court prior to its use. Once an exam has been adjudicated to be in violation of Title VII, it is a reasonable remedy to require that any subsequent exam or other selection device receive court approval prior to use. See, e. g., Bridgeport Guardians, supra, 482 F.2d at 1339. This situation is to be contrasted with a case like Guardians Ass'n v. Civil Service Commission, 490 F.2d 400 (2d Cir. 1973) (Guardians I or "the '68-'70 exams case"), where the exams had not yet been found to be invalid. In that case the City was obliged only to show the new exam to plaintiffs and afford them an opportunity to criticize it.

However, we reject the District Court's principal substantive standard for approval of the new selection procedure to the extent that it requires any new procedure to be validated in accordance with the Guidelines and consistent with the APA Standards. As discussed in part III of this opinion, we have concluded that literal compliance with the Guidelines and with professional testing criteria is not required by Title VII and can, in some instances, lead to results inconsistent with Title VII's explicit endorsement of "any professionally developed ability test." 42 U.S.C. § 2000e-2(h). We therefore conclude that the District Court, in determining the legality of a new selection procedure, should not require that it must conform in all respects to the Guidelines and the APA Standards; it will be sufficient if the new procedure conforms to the essential purposes of Title VII. We have endeavored to outline the extent to which the Guidelines are useful in carrying out those purposes and some of the respects in which excessive rigidity in application of the Guidelines may undermine those purposes. No all-encompassing formula is possible. The Guidelines remain useful as a source of guidance, but they need not be adhered to in every detail as if they were substantive regulations.

We also reject the District Court's substantive requirement, as expressed in paragraph 6 of the order, that the new selection procedure must have "the least adverse impact on minority applicants." This requirement appears to be an attempt to implement the principle expressed in Albemarle that once an employer has established the job-relatedness of a selection procedure that has a disparate racial impact, the plaintiff may still establish a Title VII violation by proving that "other tests or selection devices, without a similarly undesirable racial effect, would also serve the employer's legitimate interest in 'efficient and trustworthy workmanship.' " Albemarle Paper Co. v. Moody, supra, 422 U.S. at 425, 95 S. Ct. at 2375, quoting McDonnell Douglas Corp. v. Green, supra, 411 U.S. at 801, 93 S. Ct. at 1823. Of course, a decree may incorporate this principle into the standard for approving any new selection procedure, but the phrasing in paragraph 6 imposes a stricter and impermissible burden. To comply with Title VII a new selection procedure need not have the least adverse impact on minority applicants. That requirement would prohibit any exam with any disparate racial impact because random selection would always be a procedure with less adverse impact. What Albemarle contemplates, and what the decree may require, is that a selection procedure proposed by the City may not be used if the plaintiffs can establish the existence of an alternative procedure with an equivalent degree of job relatedness and a lesser disparate racial impact.

Affirmative Relief

Our initial concern with the provisions of the order concerning affirmative relief arises from our uncertainty as to precisely what the District Court has required. The District Court's January 11 order states, and its January 23 revised opinion appears to require, that the City take the affirmative action of hiring 50% of entry-level police officers from among qualified Black and Hispanic applicants. But the opinion and the order seem to contain different provisions about the length of time for which this 50% minority hiring ratio is to apply. The opinion characterizes the affirmative action required as an "interim" measure. 484 F. Supp. at 799. In the context of Title VII testing cases, "interim" has meant the time period between the date of a decree and the subsequent use of a valid selection procedure. See, e. g., EEOC v. Local 638, Sheet Metal Workers' Association, 532 F.2d 821, 829 (2d Cir. 1976); Kirkland, supra, 520 F.2d at 423, 429-30; Vulcan Society, supra, 490 F.2d at 398. It is to be contrasted with a long-term or permanent hiring requirement, which specifies a minority composition of the employer's work force that must be achieved, even if a valid testing procedure has been developed and approved before the targeted minority composition is reached.28

In contrast to the District Court's opinion, its order, which is the operative document brought here for review, appears to require numerical quotas that continue beyond interim relief. Paragraph 3 states that the defendants "shall" seek to achieve minority (Black and Hispanic) representation in the Police Department "comparable to that of the minority composition of the labor force in the relevant hiring area," a representation stated to be at least 30%. Paragraph 4 also indicates that the required 50% quota hiring29 may well last beyond the interim that ends with approval of a valid test. Paragraph 4 prescribes 50% minority hiring either until minority representation in the Department equals minority representation in the relevant labor force or until the District Court has both approved a valid selection procedure and, in addition, found that 50% quota hiring is no longer "appropriate." That such a finding might not be made until sometime after approval of a valid selection procedure and perhaps not until the long-term hiring goal has been reached is indicated by the specific reservation in Paragraph 4 of the plaintiffs' right to advocate the continued use of hiring quotas because "the continuing effects of past discrimination have not been eliminated."

In addition to creating uncertainty whether affirmative relief has been ordered on an interim or long-term basis, the record contains inadequate findings and, more significantly, inadequate evidence to support the hiring provisions of the order. The District Court determined that affirmative relief was warranted based upon conclusions concerning the prior employment practices of the defendants and their state of mind in preparing and using Exam No. 8155. The prior practices concern the defendants' use of an eligibility list compiled from the results of police exams given between 1968 and 1970. The continued use of the results of those exams after 1972, when Title VII was amended to include municipal employers, had previously been found to violate Title VII because the exams had a disparate racial impact and were not job-related. That conclusion had been reached in litigation concerning the Police Department's layoff policy, Guardians Ass'n v. Civil Service Commission, 431 F. Supp. 526 (S.D.N.Y.) (Guardians II or "the first layoff policy case"), vacated and remanded for reconsideration, 562 F.2d 38 (2d Cir. 1977), and reaffirmed in Guardians Ass'n v. Civil Service Commission, 466 F. Supp. 1273 (S.D.N.Y. 1979) (Guardians III or "the second layoff policy case"), aff'd in part, remanded in part, No. 79-7377, --- F.2d ---- (2d Cir. July 25, 1980).30

The District Court grounded its decision to impose affirmative relief on the conclusion that the defendants designed Exam No. 8155 "either with a deliberate intention to discriminate against blacks and hispanics or with reckless disregard of whether the test would have that result." 484 F. Supp. at 798-99. This serious indictment of responsible city and police administrators is unsupported and indeed contradicted by the record. The conclusion is based virtually exclusively on the fact that the Police Department failed to assemble a valid eligibility list from the '68-'70 exams and failed again in 1979.31 It would be contrary to Title VII's provision allowing the use of valid exams to hold that once an employer tries to construct such an exam and fails, any further failure to develop a valid exam constitutes intentional discrimination. Such a second attempt, by itself, is evidence only of a desire to make use of a technique that the law explicitly allows. Persistent use of exams with disparate racial effects would support an inference of intentional discrimination if proper test construction were not even attempted. But the record here indicates that the City's police and personnel officials made extensive efforts to understand and apply the Guidelines and develop a test they hoped would have the requisite validity.32 Their failure entitles the plaintiffs to some relief, but does not justify a remedy based upon an unwarranted inference of deliberate discrimination.

In the absence of intentional discrimination, affirmative relief requires some demonstrated pattern of significant prior discrimination. There are no adequate findings concerning such a pattern, and the record lacks sufficient evidence on which such findings could be based. The District Court referred to the City's prior Title VII violation in using the eligibility list resulting from the '68-'70 exams and concluded that the minority "imbalance" on the City's police force is "directly caused by past and current discriminatory practices." 484 F. Supp. at 799. Obviously Exam No. 8155 has had no significant effect upon the current minority proportion of the police force, because its results have been used to select only one class for the training academy. As to the use of the eligibility list from the '68-'70 exams, there are no findings and no evidence to indicate the extent to which use of that list has affected the minority proportion of the police force. The first layoff policy case provides some data as to the numbers of Whites and minority members hired as a result of two of the '68-'70 exams, 431 F. Supp. at 552-53, Tables 4 and 6, but even with that data, the record in this case does not disclose the minority percentage in the police department before and after the '68-'70 exams. Nor is there any evidence of the impact of hiring resulting from Exam No. 3014, administered in 1973, the validity of which has not been challenged. The '68-'70 exams undoubtedly made some contribution to the current racial imbalance of the police force, but the record does not contain even estimates of how the hiring prior to the 1973 exam currently affects the composition of the police force. Plaintiffs have failed to prove that prior use of discriminatory exams has created a situation warranting affirmative relief.

In the absence of proof of the specific impact of such prior discrimination, the only probative evidence in the record is that the current minority proportion of the police force is 12.7% compared to a relevant work force percentage of at least 30%. That is cause for some concern, but does not reveal the flagrant disparity shown in prior cases where long-term hiring quotas were in issue. Cf. Association Against Discrimination, supra, 594 F.2d at 308 (minorities constituted 0.2% of employees, 41% of population; quota vacated for reconsideration and findings); EEOC v. Local 14, International Union of Operating Engineers, 553 F.2d 251, 256 (2d Cir. 1977) (minorities constituted 2.8% of union members, at least 16.2% of relevant labor force; judgment including quota vacated for further findings); Patterson, supra, 514 F.2d at 770, 772 (minorities constituted 2.45% of union and union-affected job-seekers, 30% of relevant labor force; quota sustained); Bridgeport Guardians, supra, 482 F.2d at 1335 (minorities constituted 3.6% of employees, 25% of population; hiring quota sustained). If the disparity between existing minority employment and relevant work force percentage were extreme and long-standing, that circumstance alone might justify some affirmative relief, especially if minority employment were low. But where, as here, the disparity is not extreme and minority employment is not insubstantial, an affirmative hiring remedy must be based on detailed findings, supported by evidence, that there exists a pattern of prior discrimination warranting such relief. Cf. Association Against Discrimination, supra, 594 F.2d at 312-13; Kirkland, supra, 520 F.2d at 427-28.

We therefore conclude that the affirmative hiring provisions of the order must be set aside. This will require elimination of the affirmative hiring quota of 50%, both as interim and long-term relief, and elimination of the long-term hiring goal of 30%, a goal that obviously could be achieved only by affirmative hiring at a ratio above the minority percentage of the relevant work force. The only hiring remedy justified by this record is a compliance remedy, one designed to make sure that the City complies with the requirements of Title VII in making appointments to the police force. Such a remedy should permit the City, in the interim period prior to development of a new, valid selection procedure, to use the results of Exam No. 8155 in a way that avoids any disparate racial impact. This means selecting candidates from the eligibility list subject to the minority proportion of either the applicant pool or the relevant work force, a determination to be made upon remand.

To accomplish such interim hiring the City may assemble a minority pool and a majority pool of qualified candidates from the eligibility list. In assembling these pools, the City may use a cut-off score somewhat lower than 94, a score that was originally determined by the City's manpower needs, rather than by an independent estimate of adequate ability. Within the majority and minority pools, the City may choose candidates, maintaining the requisite proportion of minority candidates and taking into account those already hired as a result of this exam. The City is not obliged to hire on an interim basis, but it should have the option of doing so in order to meet its manpower needs.

We remand this case to the District Court for the entry of a revised decree consistent with this opinion. Pending the entry of that decree, we continue in effect the provisions of the stay order we previously entered, under which the City is afforded the option of hiring from those who scored 94 or above on Exam No. 8155 provided such hiring achieves a minority ratio of 33%, taking into account those already hired as a result of this exam.

Affirmed in part, vacated in part, and remanded.

SIFTON, District Judge, concurring:

I concur in the result and in the reasoning of Judge Newman's thorough opinion with the exception of his conclusion that Exam No. 8155 adequately tested a representative sample of the skills, knowledges and abilities required for police work in New York City. While I join in rejecting the argument of the United States as amicus, that the requirement of representativeness means that all the knowledges, skills and abilities needed for police work must be tested, each in its proper proportions, I disagree with the conclusion that an exam which contains omissions as extensive as the present exam meets the "representativeness requirements to an adequate degree."

As Judge Newman correctly points out, Exam No. 8155 failed to test for human relations skills despite the City's job analysis which found, not surprisingly, that such skills constitute a significant part of police work in New York City, sufficiently important to warrant devoting 30 of the exam's 100 questions to determining whether the applicants possessed them and in what degree. Unfortunately, as Judge Newman so well explains, the portion of the exam which attempted to test for human relations skills in fact simply replicated other portions of the exam which tested the applicants' abilities to remember details, fill out forms, and apply general principles in specifically described factual settings. In my view, an exam which does not test for a skill or ability determined to constitute close to one-third of the talents required for success on the job cannot be called representative in any meaningful sense.

The legal effect of excusing a lack of representativeness of these dimensions, as Judge Newman's opinion appears to do, because the record does not contain evidence that a test for the relevant skills or abilities is "readily available and realistically feasible," appears to me, with all due respect, to shift to plaintiffs the burden of explaining the exam's racially disparate impact. The practical effect of validating a test for New York City police work which does not examine for human relations skills is to leave out of the entry level employment decision an area of qualifications in which minority groups would, one must assume, perform well despite educational and other deprivations. The immediate consequence of the decision is that high performance in the area of human relations skills will not be available to alter the overall assessment of applicants who perform less well in other areas.

Nor will refinements in rank-ordering or in the determination of an appropriate cut-off score overcome the effect of an omission of these dimensions. Selection at random or in rank order without the benefit of any assessment of the applicants' performance in one large area of police work will reflect the same racial disparities as exist in the pool from which random selection is made or in the ranks established by the unrepresentative test.1

Chris Hart and Larry Burns are walking by a big museum late one night. Hart notices that, although the museum is closed to the public, one of the doors is unlocked. Hart suggests that for a prank they go into the museum to see the exhibits. Hart has a flashlight with him while Burns has an illegal gun hidden under his clothing. Hart does not know that Burns has a gun illegally in his possession. About five minutes after they enter the museum, they hear footsteps and leave the museum the same way they entered. According to the definitions given,

(A) Burns committed the crime of criminal trespass, but Hart did not

(B) Hart committed the crime of criminal trespass, but Burns did not

(C) both Hart and Burns committed the crime of criminal trespass

(D) neither Hart nor Burns committed the crime of criminal trespass.

The definition provided for criminal trespass is as follows:

The crime of criminal trespass is committed when a person knowingly enters or remains in a building in which he has no right to be and while in the building possesses, or knows that another person accompanying him possesses, an explosive or a gun.

It will be noted, simply as an indication of the difficulty of the test construction enterprise, that there is a slight ambiguity in this question. The applicant is told that Hart is unaware that "Burns has a gun illegally in his possession," but not whether Hart was unaware that Burns had a gun at all. This could affect the answer, since, according to the definition given, knowledge of any gun, even of a legal one, would render Hart's action a criminal trespass. While it would be unreasonable to suggest that a test must be free of every possible ambiguity in order to be acceptable, the ease with which such ambiguities can appear emphasizes the value of confirming the test's reliability by some empirical procedure. See Section IV, infra.
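The ambiguity noted above can be made concrete by writing the quoted definition as a predicate. The following is only an illustrative sketch: the `Person` fields and the two readings of Hart's knowledge are assumptions introduced here, and it is assumed that entering the closed museum counts as knowingly entering a building without a right to be there.

```python
from dataclasses import dataclass


@dataclass
class Person:
    # Hypothetical fields mirroring the three elements of the quoted definition.
    knowingly_entered_without_right: bool
    possesses_weapon: bool            # an explosive or a gun, legal or not
    knows_companion_has_weapon: bool


def criminal_trespass(p: Person) -> bool:
    """Apply the definition as quoted: unlawful presence plus a weapon element."""
    weapon_element = p.possesses_weapon or p.knows_companion_has_weapon
    return p.knowingly_entered_without_right and weapon_element


burns = Person(True, True, False)

# Reading 1: Hart does not know Burns has any gun at all.
hart_unaware = Person(True, False, False)
# Reading 2: Hart knows of the gun but not that it is illegally possessed.
hart_aware = Person(True, False, True)

print(criminal_trespass(burns), criminal_trespass(hart_unaware))  # reading 1 -> answer (A)
print(criminal_trespass(burns), criminal_trespass(hart_aware))    # reading 2 -> answer (C)
```

Under the first reading only Burns satisfies the definition; under the second both do, which is why the wording of the question matters.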

By the time of the November hearing, the Police Department had already proceeded to use the list to accept 415 trainees, as described above. In order to forestall any further hiring on the basis of the list, and to assist the City in making alternative arrangements, the District Court informed the parties on December 17, 1979 that the test violated Title VII, and orally enjoined the defendants from its further use. On December 27, 1979, the City filed an order to show cause and a motion for a stay, in which it stated that the Police Department needed to accept an additional 380 trainees from the list on January 14, 1980. When the Court denied its motion, the City filed a petition in this Court for a writ of mandamus. This Court granted the petition, ordering the District Court to issue findings of fact and conclusions of law, pursuant to Fed. R. Civ. P. 52(a), at least 48 hours before any injunction against the City was to take effect. The District Court then held its second hearing, which was devoted to the issue of relief. Since the Court's decision, issued that day, preceded the January 14 action by more than 48 hours, it fulfills the requirements of this Court's mandamus

The standard deviation for a particular set of data provides a measure of how much the particular results of that data differ from the expected results. In essence, the standard deviation is a measure of the average variance of the sample, that is, the amount by which each item differs from the mean. The number of standard deviations by which the actual results differ from the expected results can be compared to the normal distribution curve, yielding the likelihood that this difference would have been the result of chance. The likelihood that the actual results will fall more than one standard deviation beyond the expected results is about 32%. For more than two standard deviations, it is about 4.6% and for more than three standard deviations, it is about 0.3%. On this basis, the Supreme Court concluded in Castaneda that when actual results fell more than three standard deviations from the expected result (that is, a race-neutral selection), the deviation could be regarded as caused by some factor other than chance
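The tail likelihoods described in this note can be checked against the normal curve directly. The sketch below assumes a two-sided comparison (results falling more than k standard deviations from the expected result in either direction); the function name is introduced here for illustration only.

```python
import math


def two_tailed_probability(k: float) -> float:
    """P(|Z| > k) for a standard normal variable Z."""
    return math.erfc(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"more than {k} standard deviation(s): {two_tailed_probability(k):.2%}")
# more than 1 standard deviation(s): 31.73%
# more than 2 standard deviation(s): 4.55%
# more than 3 standard deviation(s): 0.27%
```

On these figures, a result more than three standard deviations from the expected value would arise by chance fewer than three times in a thousand.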

A construct is generally defined as "an idea developed or 'constructed' as a work of informed, scientific imagination; that is, it is a theoretical idea developed to explain and to organize some aspects of existing knowledge." American Psychological Association, Inc., Standards for Educational & Psychological Tests 29 (1974) (hereinafter APA Standards). Neither the APA Standards nor the Guidelines appear to include a definition of content, apart from the concept of content validity, although the Guidelines do describe content as involving "knowledges, skills, or abilities." Guidelines § 14(C) (1)

The terms of this conditional stay were that, if the City wished to hire, pending appeal, it should establish two pools of candidates, one consisting of all the minority applicants who passed the exam, and the second consisting of all others who passed, and select trainees from these pools in the ratio of one minority applicant for every two others. The 415 applicants already hired were to be counted in determining whether the new hires conformed to this 1 to 2 ratio
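The arithmetic of such a ratio can be sketched as follows, reading the stay as requiring that cumulative hiring, counting those already hired, reach one minority trainee for every two others. That reading, the function name, and the example split of the 415 earlier hires are assumptions made purely for illustration; the composition of the 415 is not stated here.

```python
def min_minority_hires(prior_minority: int, prior_other: int, new_hires: int,
                       minority_part: int = 1, other_part: int = 2) -> int:
    """Fewest minority hires in the next class so the cumulative ratio is met.

    If the result exceeds new_hires, the ratio cannot be reached in this class.
    """
    total = prior_minority + prior_other + new_hires
    parts = minority_part + other_part
    # Ceiling division in integers: minority share must be at least 1 in 3.
    required_minority_total = -(-total * minority_part // parts)
    return max(0, required_minority_total - prior_minority)


# Hypothetical example: suppose 50 of the 415 earlier trainees were minority
# applicants and the City wishes to accept a further class of 380.
print(min_minority_hires(prior_minority=50, prior_other=365, new_hires=380))
```

With those hypothetical inputs the cumulative class of 795 would need 265 minority trainees, so 215 of the 380 new trainees would have to come from the minority pool.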

The City hypothesizes some situations in which statistics could be misleading (e.g., if some of the candidates taking the test had not been eligible to apply for the position) but presents no evidence to show that this occurred. To accept such unsupported possibilities, and require the plaintiffs to refute every circumstance that could explain the disparate impact shown by the statistics, would create an onerous burden of proof, far in excess of the Title VII standards as interpreted by the Supreme Court. See Dothard v. Rawlinson, 433 U.S. 321, 329-30, 97 S. Ct. 2720, 2726-27, 53 L. Ed. 2d 786 (1977); Jones v. New York City Human Resources Administration, 528 F.2d 696, 698 (2d Cir. 1976)

Notwithstanding any other provision of this subchapter, it shall not be an unlawful employment practice . . . for an employer to give and to act upon the results of any professionally developed ability test provided that such test, its administration or action upon the results is not designed, intended or used to discriminate because of race, color, religion, sex or national origin.

In criticizing the questions involving application of the law, the District Court stated: "In a real situation, the officer sees activity and must determine rather quickly whether the activity is illegal, with no definitional aids before him. He must operate on instinct and experience." 484 F. Supp. at 797. That is true, but it is a criticism of all testing. In any situation, there are generally at least two steps that are necessary to produce the correct behavioral response. The first is to know what to do, and the second is to act accordingly. Clearly the limit of any test, no matter how well designed, is to determine whether the applicant knows or can determine what to do. Only a probationary period can determine if the applicant will act correctly in a real life situation

Such data may be obtained by studying correlations of test scores of accepted candidates with their subsequent job performances, or correlations of the test scores of present employees with their current job performances

A rare example of a criterion-related study that was found acceptable is Washington v. Davis, supra, 426 U.S. at 249-52, 251 n. 17, 96 S. Ct. at 2052-53, 2053 n. 17. This was not a Title VII case, however, and the Court's use of Title VII concepts, in dictum, to assess the validity of the test in question under the Fourteenth Amendment does not indicate that the Court was reviewing the test with the stringency that Title VII requires. See Guardians Ass'n v. Civil Service Commission, 633 F.2d 232, 245-246 (2d Cir. 1980). In fact, a less demanding standard was almost certainly being used, as is clear from the comparison between Davis and the Court's Title VII decision the preceding term in Albemarle.

There will be some tests whose character is sufficiently clear so that the content-construct distinction can be applied at the threshold, without the need to place the test in the context of the job it tests for. A general intelligence test, see Griggs, supra (Wonderlic Personnel Test), will almost always need to be assessed by construct validation, since it necessarily measures for an inferred ability, regardless of the context. The much-vaunted typing test, in contrast, can always be regarded as amenable to content validation. However, there are a large number of tests, including virtually all the "second generation" tests for jobs such as the one considered here, that will fall into the middle range

These considerations are particularly crucial to an exam validated on content grounds. See APA Standards at 29: "Content validity is determined by a set of operations, and one evaluates content validity by the thoroughness and care with which these operations have been conducted."

In some instances the relationship is obvious. Plainly the ability to fill out forms is needed for task 16, "Processes arrests using appropriate police department forms and notifications." But it is not evident, for example, why any of the five listed abilities are critical to task 32, "Searches for lost children, runaways, etc." or task 19, "Guards and transports prisoners."

The fact that the factual subject matter of the exam questions (as opposed to their purpose in measuring abilities) was related to the subject matter of the job is not a major indicator of the test's validity, since the test measured abilities, not knowledge. But it does suggest that the test has avoided the dangers inherent in using irrelevant factual material. Such material could skew the test for ability in directions unrelated to the job, a phenomenon that even the best designed test might not be able to avoid. The present test avoids that problem by ensuring that any distorting effects resulting from the subject matter of the questions are themselves job-related

See APA Standards, supra at 48-50; Kuder & Richardson, The Theory of Estimation of Test Reliability in Principles of Educational and Psychological Measurement 95 (W. Mehrens & R. Ebel, eds. 1967); Rulon, A Simplified Procedure for Determining the Reliability of a Test by Split Halves, in id. at 104

This effect can be demonstrated more precisely in psychometric terms, through the use of the standard error concept. The standard error is the raw score variance corresponding to a single standard deviation of the scores that would be obtained by a test-taker on successive, equivalent tests. It can be approximated by the quantity

$$\sqrt{\frac{t(n-t)}{n}}$$

where t is the test-taker's score and n is the number of items on the test. This formula is derived from the general formula for a standard deviation. See Lord, Do Tests of the Same Length Have the Same Standard Error of Measurement?, in Principles of Educational and Psychological Measurement 192 (W. Mehrens & R. Ebel, eds. 1967). The formula is an approximation because a particular applicant's error of measurement as defined by Lord, supra, is a function of his "true" score, that is, the score he would have obtained had no error been present. However, as the number of items on the test increases, the observed score will approach the true score, so that the approximation is permissible. In substituting observed score for actual score Lord introduces the refinement of reducing the denominator of his function by one to eliminate sampling bias, but this has an insignificant effect when the numbers are rounded off to the extent that they are in this discussion.
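A brief numerical sketch may help. It assumes the approximation takes the form sqrt(t(n - t)/n), a form inferred from the standard errors quoted elsewhere in these notes for a 100-item test (about 2.4 at a score of 94, about 1 at 99, and 3.6 at 85). The variant with the denominator reduced by one is included only to show how small that refinement is; both function names are introduced here for illustration.

```python
import math


def standard_error(t: float, n: int) -> float:
    """Approximate standard error for observed score t on an n-item test."""
    return math.sqrt(t * (n - t) / n)


def standard_error_reduced_denominator(t: float, n: int) -> float:
    """Same quantity with the denominator reduced by one, as the note describes."""
    return math.sqrt(t * (n - t) / (n - 1))


for score in (99, 94, 85):
    plain = standard_error(score, 100)
    refined = standard_error_reduced_denominator(score, 100)
    print(f"score {score}: {plain:.2f} (denominator n) vs {refined:.2f} (denominator n - 1)")
```

At these scores the two versions differ only in the second decimal place, consistent with the statement that the refinement is insignificant after rounding.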

The reason this bunching occurred was that the exam was too easy. An exam that was too difficult might have had the same effect, except that the bunching would have occurred at the lower end of the scale. Neither excessive easiness nor excessive difficulty is necessarily fatal, but each magnifies effects that may make scoring arrangements unjustified

Set forth below are the total number of applicants who achieved each score from 110 to 70 and the number of White and minority (Black and Hispanic-surnamed) applicants at each of these scores. The White and minority figures do not always equal the total because some applicants were members of other minority groups and some applicants were not identified

The differentiating power of a question can be easily determined by means of an item analysis. The City could have tested each question on a sample population to determine its power to distinguish between different levels of ability. The most simple item analysis would have quickly revealed that this exam would produce a large number of closely bunched high scores. See Englehart, A Comparison of Several Item Discrimination Indices, reprinted in Principles of Educational and Psychological Measurement 387 (W. Mehrens & R. Ebel eds. 1967); Findley, A Rationale for Evaluation of Item Discrimination Statistics, reprinted in id. at 381

This comparison can be expressed mathematically by considering the statistical probability that a person who passed with a score of 94 or failed with a score of 93 would have achieved that same score on successive exams. If one were to administer successive exams and average the results, the test-taker would have to achieve an average score no more than one-half point above or below his original score to be regarded as achieving the same score. For a test-taker who scored 93 or 94 on Exam No. 8155, the standard error is approximately 2.4. This means that 2.4 points above or below 93 or 94 represents one standard deviation. Similarly, one-half a point above or below these scores represents .5/2.4 or .21 standard deviations. Assuming that the normal distribution curve applies, and ignoring the effect of bonus points, it can be derived from the table of normal curve distributions that .21 standard deviations represent a 17% level of certainty. In other words about 17% of the test takers who scored 93 or 94 would achieve that same score, on the average, if they had taken successive equivalent exams. Of the remainder, we can assume that half would have scored higher and half would have scored lower. Since 94 was passing, these figures mean that 41.5% of those who scored 93 (41.5% being half the remainder, which is 100%-17%, or 83%), would have achieved an average score of 94 on a series of successive equivalent tests, and thereby passed the exam, while 41.5% of those who scored 94 would have failed
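The calculation in this note can be reproduced in a few lines, under the same assumptions the note states: a normal distribution of scores on successive exams, a standard error of 2.4, and no bonus points. The function and variable names are introduced here for illustration.

```python
import math


def prob_same_score(standard_error: float, half_width: float = 0.5) -> float:
    """Probability that the average score lands within half a point of the original."""
    z = half_width / standard_error
    # For a standard normal Z, P(|Z| < z) equals erf(z / sqrt(2)).
    return math.erf(z / math.sqrt(2))


same = prob_same_score(2.4)
flip = (1 - same) / 2  # half of the remainder falls on the other side of the cutoff

print(f"same score: {same:.1%}; crossing the cutoff: {flip:.1%}")
```

Computed without intermediate rounding, the results come out within a fraction of a percentage point of the 17% and 41.5% figures given above.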

If the test-takers had been evenly distributed by score, which is ideal, or at least approximated an even distribution in the cutoff region, which is generally feasible, 729 persons would have scored 93 or 94, and 41.5% of these, or 300, would have been incorrectly placed. In fact, some 4,148 achieved these two scores, so that at least 1,721 of the test-takers were incorrectly placed, just counting those who achieved these two sets of scores. As one moves away from the cutoff, the percentage of incorrect placements becomes much less. Of those who scored 99, for example, virtually none would have failed on successive tests, since the standard error for a score of 99 on a test with 100 questions is about 1, and 93, the highest failing grade, is thus more than three standard deviations below 99. Similarly, of those who scored 85, only one half of one percent would have passed on successive tests, since the standard error for a score of 85 is 3.6, which is 2.5 standard deviations below 94. For these scores, the test had a very low error of measurement. But relatively few of the test-takers in Exam No. 8155 scored 85 or 99. In contrast, each score in the 92 to 97 range, where the error of measurement was greatest, was achieved by over 2,000 test-takers. And the selected cutoff score was right in the middle of that range.
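The counts and distances in this note follow from simple arithmetic, sketched below. The 41.5% rate and the head counts are taken from the text; the standard error formula sqrt(t(n - t)/n) is assumed as the approximation the notes rely on, and all names are introduced here for illustration.

```python
import math

FLIP_RATE = 0.415  # share of those scoring 93 or 94 placed on the wrong side of the cutoff


def standard_error(t: float, n: int = 100) -> float:
    """Approximate standard error for observed score t on an n-item test (assumed form)."""
    return math.sqrt(t * (n - t) / n)


def upper_tail(z: float) -> float:
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


misplaced_if_even = 729 * FLIP_RATE    # even distribution of scores near the cutoff
misplaced_actual = 4148 * FLIP_RATE    # actual number who scored 93 or 94

z_99 = (99 - 93) / standard_error(99)  # distance from 99 down to the highest failing score
z_85 = (94 - 85) / standard_error(85)  # distance from 85 up to the passing score
p_85_would_pass = upper_tail(z_85)

print(int(misplaced_if_even), int(misplaced_actual))
print(f"z at 99: {z_99:.1f}; z at 85: {z_85:.1f}; share of 85s passing: {p_85_would_pass:.2%}")
```

The comparison shows why bunching at the cutoff matters: the rate of misplacement is the same in both scenarios, but it applies to more than five times as many test-takers.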

Upon consideration of the evidence presented at the liability and relief stages of this case, and consideration of the briefs and oral arguments of the parties and amicus curiae United States of America and Policewomen's Endowment Association of New York City, Inc., and entry of findings of fact and conclusions of law, it is hereby ORDERED:

Defendants, their officers, officials, agents, employees, successors, and all persons in active concert or participation with them or any of them are hereby permanently enjoined from engaging in any act or practice with respect to the selection of candidates for appointment to and training for the position in the New York City Police Department of entry-level police officer, which act or practice has the purpose or effect of discriminating against such persons because of race or national origin

The defendants shall seek to achieve as a long-term goal black and hispanic (hereinafter "minority") representation in the sworn ranks of the Police Department comparable to that of the minority composition of the labor force in the relevant hiring area. As of 1978, the labor force of the relevant hiring area was at least 30% black and hispanic

To achieve the long-term goal set forth in paragraph 3, supra, the defendants shall as an interim goal appoint 50% of their entry level police officers from among qualified black and hispanic applicants. The interim hiring goal for minorities shall remain in effect until the minority representation in the sworn ranks of the Police Department is at least equal to the percentage of minorities in the labor force of the relevant hiring area as described in paragraph 3, supra, or until this court has found, after a hearing, that all proposed selection procedures for police officer positions have been validated in accordance with the Uniform Guidelines on Employee Selection Procedures, 28 C.F.R. § 50.14, 29 C.F.R. § 1607, effective September 25, 1978 ("Uniform Guidelines"), and that no further interim goals are appropriate. Nothing herein shall preclude plaintiffs from advocating the continuance of the interim goals on the basis that the continuing effects of past discrimination have not been eliminated

To satisfy the goals set forth above in paragraphs 3 and 4, defendants may use an eligibility list derived from Examination No. 8155 as the pool from which it selects police officers. At such time as that eligibility list or any future eligibility list for the position of police officer does not contain sufficient minority candidates to meet the interim goals set out in paragraph 4, supra, the City shall take whatever steps are necessary to achieve the interim hiring goals

Defendants shall make reasonable efforts to develop within a reasonable period of time a procedure for the selection of candidates for the entry level position of police officer which shall be lawful and validated in accordance with the Uniform Guidelines or successor guidelines similarly promulgated, and which is consistent with generally accepted psychological standards as defined by the American Psychological Association from time to time, and which has the least adverse impact on minority applicants. Consistent with this requirement defendants (a) shall examine all reasonably available alternative selection procedures on the subject of testing of police officer applicants, and (b) shall consult with industrial psychologists, psychometricians and/or others who have experience in the field of selection testing, and preferably who have performed or have knowledge of analyses of the job of police officers

Defendants may continue to use the current qualifications and selection criteria for police officer positions. However, no such qualification or selection criterion shall be a valid basis for or defense for failure to meet the interim hiring goals set out in paragraph 4, supra, unless the court has ruled that such qualification or selection criterion has been validated in accordance with the Uniform Guidelines and has determined that there is no further basis for continuing the interim goals

Plaintiffs are entitled to their court costs and reasonable attorneys' fees to date. The amount of such costs and fees shall be set by the court after a hearing. Costs and attorneys' fees for work done in the future shall be fixed in such manner as the court may determine

Within thirty (30) days after the entry of this order, and every six (6) months thereafter, the defendant city shall submit to the plaintiffs the following reports:

(a) A list of its then current uniformed employees in the Police Department showing for each person: name, address, race or national origin, police station or other place of assignment, date of appointment, rank, and date such rank was achieved.

(b) The total number of uniformed personnel employed by the Police Department, by rank, race and national origin.

(c) A list of all minority applicants for all vacancies, including date of application, names, addresses, and telephone numbers, whether the applicant was accepted or rejected and reason(s) for rejection.

(d) The name, address, and telephone number of any minority employee involuntarily terminated prior to the completion of the probationary period, and reason(s) for termination.

(e) List of all hires, promotions and voluntary and involuntary terminations showing race and national origin.

The court retains jurisdiction of this action for such further relief or other orders as may be necessary or appropriate to enforce and insure rights to equal employment opportunity within the New York City Police Department

An example would be an interim minority hiring ratio of 30%, where the minority percentage of the applicant pool or the relevant work force is also 30%. In this case, the District Court's provision for interim hiring specifies a minority ratio of a remedial nature, i.e., greater than needed simply to avoid a disparate racial impact. We therefore consider this provision as part of affirmative relief

The formulation of remedies for discrimination in promotion of employees requires even greater caution than is appropriate for entry-level hiring cases, since such remedies inevitably impact adversely identifiable employees who have committed themselves to a particular career track. See Kirkland, supra, 520 F.2d at 429; Bridgeport Guardians, supra, 482 F.2d at 1341.

This requirement of reaching a target should be contrasted with the far more modest use of a target figure merely to limit the extent of interim relief. The latter occurs when a remedy provides that the interim relief will continue until an acceptable selection procedure is developed or until a particular target figure for minority employment is achieved. In such a case, the target figure does not function as an absolute requirement; it simply serves as a means of assuring that the interim requirements will end at some point, even if development of a valid selection procedure is unduly delayed. The use of a target figure for this limiting purpose does not require the same compelling justification as a target figure prescribed as an absolute requirement, since it imposes no additional obligation on the defendant.

Though the District Court's opinion refers to affirmative action as an "interim" measure, it prescribes that it will last either until "discrimination has been totally eliminated" or until valid selection procedures are being used. (484 F. Supp. at 799). The meaning of the first phrase is not entirely clear; however, even if it means a long-term objective of minority police employment equal to the minority percentage in the work force, the opinion still requires no more than an interim remedy because of the provision in the second phrase that the affirmative action can be terminated once a valid selection procedure has been approved.

The order unfortunately refers to the 50% quota as an "interim goal." It is not a goal at all. It is a procedure that the District Court has required to be used at least until the true interim goal (approval of a valid selection procedure) has been reached, and perhaps until the long-term goal (minority employment percentage equal to minority work force percentage) has been reached.

In 1973 the defendants gave still another exam whose validity has not been adjudicated. See 431 F. Supp. at 545 n.36. It is interesting to note, however, that even plaintiffs' well-known testing expert, Dr. Richard Barrett, acknowledged on cross-examination during the first layoff policy case that he could not be certain whether or not this exam is content valid. Id. n.37.

The only evidence referred to by the District Court, in addition to the invalidity of the '68-'70 exams, is the testimony of some police officers who warned the test makers that the test would disfavor minorities. (484 F. Supp. at 798). Such a caution would be significant if the test makers had made no effort to satisfy Guideline standards. But when test makers have undertaken the elaborate process of job analysis and test construction revealed by this record, their willingness to disregard a prediction of disparate racial impact does not indicate that they lacked a good faith belief that their exam could nonetheless be shown to be adequately job-related.

It is true that the City's previous unsuccessful attempt to design an exam on its own renders somewhat questionable its apparent enthusiasm for another "in-house" effort. However, the decision to produce Exam No. 8155 "in-house" may well have been motivated by a bureaucratic preference for internal procedures, a need to save money, a naive self-confidence, or simply a desire to try again. None of these motives is a basis for inferring a conscious intention or even a reckless willingness to violate the law.

Of course, an unrepresentative exam (as is the case with exams at other stages of the qualification process for New York City police work) can be administered on a pass/fail basis as part of a larger qualification process provided the cut-off score represents that level of skills shown to be the level below which an applicant is disqualified for police work. There is, in other words, no objection on grounds of discriminatory impact to a test which eliminates those whose skills and abilities in the areas actually tested for are so low as to disqualify the applicant no matter how well he or she might perform in other untested areas.