27. David Bell, Permanent Secretary at the DCSF,
has set out the Department's view of the key purposes of national
tests:

We want them to provide objective, reliable information
about every child and young person's progress. We want them to
enable parents to make reliable and informative judgments about
the quality of schools and colleges. We want to use them at the
national level, both to assist and identify where to put our support,
and also, we use them to identify the state of the system and
how things are moving. As part of that, both with national tests
and public examinations, we are very alive to the need to have
in place robust processes and procedures to ensure standards over
time.[28]

28. The written evidence of the DfES similarly set
out a variety of purposes of testing, stating that National Curriculum
testing was developed to complement existing public examinations
for the 16+ age group and that it is geared towards "securing
valid and reliable data about pupil performance, which is used
for accountability, planning, resource allocation, policy development
and school improvement".[29]

29. The DfES elaborated on the uses to which data
derived from examination results are put. National performance
data are used to develop government policy and allocate resources.
Local performance data are used for target-setting and to identify
"areas of particular under-performance". School performance
data form the basis for the findings of inspectors and interventions
from School Improvement Partners. Parents make use of school data
to make choices about their children's education. The DfES considered
that school performance data are an important mechanism for improving
school performance and for assisting schools to devise their own
improvement strategies. Finally, the DfES stated that examination
results for each individual child are "clear and widely-understood
measures of progress", which support a personalised approach
to teaching and learning and the realisation of each child's potential.[30]

30. The QCA suggested that the 'purpose' of an assessment can be
understood in a number of ways. An assessment may be designed:

1. to generate a particular kind of result,
such as ranking pupils in terms of end-of-course level of attainment;

2. to enable a particular kind of decision,
such as deciding whether a pupil has learned enough of a particular
subject to allow them to move on to the next level;

3. to bring about a particular kind of educational
or social impact: for example, to compel pupils to learn
a subject thoroughly and to compel teachers to align their teaching
with the National Curriculum; or, in the case of GCSE science,
to support progression to a higher level of study for some pupils
and to equip all pupils with sufficient scientific literacy to
function adequately as 21st century citizens.[32]

31. Clearly, interpretations of the purposes of assessment
may be very wide or very narrow, but the important point is that
there are a large number of possible purposes. The QCA asks us
to consider the uses to which assessment results are put (interpretation
2 above) and distinguishes the four uses set out in the classification
scheme established by the Task Group on Assessment and Testing
in its 1988 report[33]: formative uses, which inform the next steps
in a pupil's teaching and learning; diagnostic uses, which identify
a pupil's particular learning needs; summative uses, which record
overall achievement in a systematic way; and evaluative uses, which
report on the performance of schools and of the education system
as a whole.

32. This classification scheme has been used widely
in evidence submitted to this inquiry and we, likewise, rely on
it extensively in our Report. It should be noted that these categories
are not necessarily discrete, and the QCA notes many examples of
uses to which the results of the national testing system are put
which may fall under more than one of the headings of the broad,
four-limb classification. The QCA's non-exhaustive list of examples,
reproduced at Figure 1, sets out 22 possible uses of assessment
results.

Figure 1 Some examples of the uses to
which assessment results can be put[35]

Source: QCA.

33. Each one of these possible uses of assessment
results can, in itself, be seen as a purpose of assessment, depending
on the context. Where an assessment instrument is designed and
used only for one purpose, the answer to the question "is
it fit for purpose?" is the result of a relatively straightforward
process of evaluation. However, the Government's evidence, set
out in paragraphs 27-29 above, highlights the fact that national
tests are used for a wide variety of purposes at a number of different
levels: national, local, school and individual.

34. Each instrument of assessment is (or should be)
designed for a specific purpose or related purposes. It will only
be fit (or as fit as a test instrument can be) for those purposes
for which it is designed. The instrument will not necessarily
be fit for any other purposes for which it may be used and, if
it is relied upon for these other purposes, then this should be
done in the knowledge that the inferences and conclusions drawn
may be less justified than inferences and conclusions drawn from
an assessment instrument specifically designed for those purposes.[36]

35. The DfES recognised that an assessment system
inevitably makes trade-offs between purposes, validity, reliability
and manageability. However, the evidence from the DfES and the
DCSF has been consistent: that the data derived from the current
testing system "equips us with the best data possible to
support our education system".[37]
David Bell, Permanent Secretary at the DCSF, told us that:

I think that our tests give a good measure of
attainment and the progress that children or young people have
made to get to a particular point. It does not seem to be incompatible
with that to then aggregate up the performance levels to give
a picture of how well the school is doing. Parents can use that
information, and it does not seem to be too difficult to say that,
on the basis of those school-level results, we get a picture of
what is happening across the country as a whole. While I hear
the argument that is often put about multiple purposes of testing
and assessment, I do not think that it is problematic to expect
tests and assessments to do different things.[38]

36. Dr Ken Boston of the QCA told us that the current
Key Stage tests were fit for the purpose for which they were designed,
that is, "for cohort testing in reading, writing, maths and
science for our children at two points in their careers and for
reporting on the levels of achievement".[39]
The primary purpose of Key Stage tests was "to decide the
level that a child has reached at the end of a Key Stage".[40]
He explained that Key Stage tests are developed over two and
a quarter years, that they are pre-tested and run through teacher
panels twice and that the marking scheme is developed over a period
of time. He considered that Key Stage tests are as good as they
can be and entirely fit for their design purpose.[41]
Dr Boston noted, however, that issues were raised when, having
achieved a test which is fit for one purpose, it is then used
for other purposes. Figure 1 above lists 22 purposes currently
served by assessments and, of those, 14 are being served by Key
Stage tests.

My judgment is that, given that there are so
many legitimate purposes of testing, and [Figure 1 above] lists
22, it would be absurd to have 22 different sorts of tests in
our schools. However, one serving 14 purposes is stretching it
too far. Three or four serving three or four purposes each might
get the tests closer to what they were designed to do. […]
when you put all of these functions on one test, there is the
risk that you do not perform any of those functions as perfectly
as you might. What we need to do is not to batten on a whole lot
of functions to a test, but restrict it to three or four prime
functions that we believe it is capable of delivering well.[42]

37. Similarly, Hargreaves et al argue that one test
instrument cannot serve all the Government's stated purposes of
testing because they conflict to a certain extent, so that some
must be prioritised over others. According to them, the purpose
of assessment for learning has been neglected in favour of the
other stated purposes whereas, in their view, it should have priority.[43]
The conflicts between the different purposes are not, perhaps,
inherent, but arise because of the manner in which people change
their behaviour when high stakes are attached to the outcomes
of the tests. Many others have raised similar points, claiming
that two purposes in particular, school accountability on the
one hand and promoting learning and pupil progress on the other,
are often incompatible within the present testing system.[44]
The practical effects of this phenomenon will be discussed further
in Chapter 4. However, we have been struck by the depth of feeling
on this subject, particularly from teachers.

38. The GTC (General Teaching Council for England)
argues that reliance on a single assessment instrument for too
many purposes compromises the reliability and validity of the
information obtained. It claims that the testing system creates
tensions that "have had a negative impact upon the nature
and quality of the education" received by some pupils. It
concludes that "These tensions may impede the full realisation
of new approaches to education, including more personalised learning".[45]

39. The NUT (National Union of Teachers) stated that
successive governments have ignored the teaching profession's
concerns about the impact of National Curriculum testing on teaching
and learning and it believes that this is "an indictment
of Government attitudes to teachers' professional judgment".[46]
The NUT argues further that:

It is the steadfast refusal of the Government
to engage with the evidence internationally about the impact of
the use of summative test results for institutional evaluation
which is so infuriating to the teaching profession.[47]

40. An NUT study, published in 2003, found that the
use of test results for the purpose of school accountability had
damaging effects on teachers and pupils alike. Teachers felt that
the effect was to narrow the curriculum and distort the education
experience of pupils. They thought that the "excessive time,
workload and stress for children was not justified by the accuracy
of the test results on individuals"[48].

41. Others have argued that the use of national testing
for the twin aims of pupil learning and school accountability
has had damaging effects on children's education experience. Hampshire
County Council accepts that tests are valuable in ascertaining
pupil achievement but is concerned that their increasingly extensive
use for the purposes of accountability "has now become a
distraction for teachers, headteachers and governing bodies in
their core purpose of educating pupils".[49]
The Council continues:

Schools readily acknowledge the need to monitor
pupil progress, provide regular information to parents and use
assessment information evaluatively for school improvement. The
key issue now is how to balance the need for accountability with
the urgent need to develop a fairer and more humane assessment
system that genuinely supports good learning and teaching.[50]

42. It is not a necessary corollary of national testing
that schools should narrow the curriculum or allow the tests to
dominate the learning experience of children, yet, despite evidence
that this does not happen in all schools, there was widespread concern
that it is common. We return to these concerns in Chapter 4.

43. The NUT highlighted evidence which suggests that
teachers feel strongly that test results do not accurately reflect
the achievements of either pupils or a school.[51]
The NAHT considers that Key Stage tests provide one source of
helpful performance data for both students and teachers, but that
it is hazardous to draw too many conclusions from those data alone.
They argue that "A teacher's professional knowledge of the
pupil is vital; statistics are no substitute for professional
judgment".[52]
On the subject of school performance, the NAHT states that Key
Stage test results represent only one measure of performance amongst
a wide range, from financial benchmarking through to full Ofsted
inspections. It considers that self-evaluation, taken with other
professional educational data, "is far more reliable than
the one-dimensional picture which is offered by the SATs".[53]
The Association of Colleges stated that performance tables constructed
from examination results data do not adequately reflect the actual
work of a school and that the emphasis on performance tables risks
shifting the focus of schools from individual need towards performance
table results.[54]

44. The evidence we have received strongly favours
the view that national tests do not serve all of the purposes
for which they are, in fact, used. The fact that the results of
these tests are used for so many purposes, with high stakes attached
to the outcomes, creates tensions in the system leading to undesirable
consequences, including distortion of the education experience
of many children. In addition, the data derived from the testing
system do not necessarily provide an accurate or complete picture
of the performance of schools and teachers, yet they are relied
upon by the Government, the QCA and Ofsted to make important decisions
affecting the education system in general and individual schools,
teachers and pupils in particular. In short, we consider that
the current national testing system is being applied to serve
too many purposes.

[…] there is considerable obligation on the
designer of tests or assessments to make them as efficient and
meaningful as possible. Assessment opportunities should be seen
as rare events during which the assessment tool must be finely
tuned, accurate and incisive. To conduct a test that is inaccurate,
excessive, unreliable or inappropriate is unpardonable.[56]

46. Although there is no consensus in the evidence
on the precise meanings of the terms 'validity' and 'reliability',
we have had to come to working definitions for our own purposes.
'Validity' is at the heart of this inquiry and we take
it to refer to an overall judgment of the adequacy and appropriateness
of inferences and actions based on test scores or other modes
of assessment. This judgment is based on the premise that the
tests in fact measure what it is claimed that they measure or,
as the NFER (National Foundation for Educational Research) puts
it, "the validation of a test consists of a systematic investigation
of the claims being made for it".[57]

47. Our definition of validity is a broad definition
precisely because it includes the concept of reliability: an assessment
system cannot be valid without being reliable. 'Reliability'
we define as the ability to produce the same outcome for learners
who reach the same level of performance.
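
A convenient way to see the relationship between these two concepts
is the formalisation used in classical test theory; we offer it here
as an illustrative sketch, not as part of the written evidence. An
observed score X is modelled as a 'true score' T plus a random error
E, and reliability is the proportion of observed-score variance that
is attributable to true scores:

X = T + E, \qquad \text{reliability} = \frac{\operatorname{Var}(T)}{\operatorname{Var}(T) + \operatorname{Var}(E)}

On this view, a reliability of 0.9 (a hypothetical figure) means that
a tenth of the variation in reported results is measurement error;
validity is the further and broader judgment that T itself reflects
what the test claims to measure.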

The tests do have limited coverage of the total
curriculum: the English tests omit Speaking and Listening, the
science tests formally omit the attainment target dealing with
scientific enquiry (though questions utilising aspects of this
are included) and mathematics formally omits using and applying
mathematics. Outside of these the coverage of content is good.
The fact that the tests change each year means that the content
is varied and differing aspects occur each year.[59]

48. The NFER stated that the current tests adequately
serve the accountability purposes of testing. They may not meet
so successfully the standards of validity necessary for the purpose
of national monitoring, although the NFER believed that the tests
are as good as they can be for this purpose. The NFER said that,
in principle, if there were to be an assessment system with the
sole purpose of national monitoring of standards using comparable
measures, then a low-stakes, lightly-sampled survey was probably
the most valid form of assessment.

49. The validity of the current testing system has
elsewhere been repeatedly challenged in the evidence to this inquiry.
Whilst asserting that the Key Stage tests are fit for purpose,
the QCA has acknowledged that:

Like any tests, however well designed, they can
measure only a relatively narrow range of achievement in certain
subjects on a single occasion and they cannot adequately cover
some key aspects of learning.[60]

50. Many witnesses are less content than the NFER
with coverage of the National Curriculum and have challenged the
validity of national tests on grounds that they test only a narrow
part of the set curriculum and a narrow range of a pupil's wider
skills and achievements.[61]
It is also argued that existing tests measure recall rather than
knowledge[62] and neglect
skills which cannot easily be examined by means of an externally-marked,
written assessment.[63]
Furthermore, to enhance the ability of pupils to recall relevant
knowledge in an examination, thereby improving test scores, teachers
resort to coaching, or 'teaching to the test',[64]
and to teaching only that part of the curriculum which is likely
to be tested in an examination.[65]
The Government does not intend it, but it is undeniable that
the high stakes associated with achieving test benchmarks have
led schools and teachers to deploy inappropriate methods to maximise
the achievement of benchmarks. This is examined in Chapter 4.
For now, we note that these phenomena affect the validity of the
examination system as a whole, not just test instruments in particular,
because the education experience of a child is arguably directly
affected by the desire of some teachers and schools to enhance
their pupils' test results at the expense of a more rounded education.

[…] I simply do not accept that there is
anything approaching that degree of error in the grading of qualifications,
such as GCSEs and A-levels. The OECD has examined the matter at
some length and has concluded that we have the most carefully
and appropriately regulated exam system in the world.[71]

[…] I can say to you without a shadow of
a doubt, I am absolutely convinced, that there is nothing
like a 30% error rate in GCSEs and A-levels.[72]

52. We suspect that the strength of this denial stemmed
from a misunderstanding of the argument made by Black et al. In
their argument, they make the assumptions that tests are competently
developed and that marking errors are minimal.[73]
The inherent unreliability of the tests stems from the limited
knowledge and skills tested by the assessment instrument and variations
in individuals' performance on the day of the test.[74]
This does not impugn the work of the regulator or the test development
agencies and very little can be done to enhance reliability whilst
maintaining a manageable and affordable system. The NFER gave
similar evidence that the current Key Stage tests:

[…] have good to high levels of internal
consistency (a measure of reliability) and parallel form reliability
(the correlation between two tests). Some aspects are less reliable,
such as the marking of writing, where there are many appeals/reviews.
However, even here the levels of marker reliability are as high
as those achieved in any other written tests where extended writing
is judged by human (or computer) graders. The reliability of the
writing tests could be increased, but only by reducing their validity.
This type of trade-off is common in assessment systems, with validity,
reliability and manageability all in tension.[75]

53. Black et al identify a number of ways in which reliability
could, in theory, be enhanced:

Narrowing the range of question
types, topics and skills tested; but the result would be less
valid, and misleading in the sense that users of that information
would have only a very limited estimate of candidates' attainments.

Increasing the testing time to augment the sample
of topics and skills tested; however, reliability increases only
marginally with test length (a standard formalisation of this
relationship is sketched after this list).[76]
For example, to reduce the proportion of pupils wrongly classified
in a Key Stage 2 test to within 10%, it is estimated that 30 hours
of testing would be required. (The NFER expressed the view that
the present tests provide as reliable a measurement of individuals
as is possible in a limited amount of testing time.[77])

Collating and using information that teachers
have about their pupils. Teachers have evidence of performance
on a range of tasks, in many different topics and skills and on
many different occasions.
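
The diminishing return from longer tests, mentioned in the second
point above, is conventionally captured by the Spearman-Brown formula;
we quote it as a standard psychometric result rather than as part of
the submitted evidence. If a test of reliability \rho is lengthened
by a factor k with comparable material, the lengthened test has
reliability

\rho_k = \frac{k\rho}{1 + (k-1)\rho}

For example, a test with a (hypothetical) reliability of \rho = 0.85
reaches only about 0.92 when doubled in length (k = 2) and about 0.96
when quadrupled (k = 4), which is consistent with the estimate above
that something approaching 30 hours of testing would be needed to
bring misclassification within 10%.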

54. Black et al conclude this part of their argument
by stating that, when results for a group of pupils are aggregated,
the result for the group will be closer to the 'true score' because
random errors for individualswhich may result in either
higher or lower scores than their individual 'true score'will
average out to a certain extent.[78]
The NFER went further, stating that aggregated results over large
groups such as reasonably large classes and schools give an "extremely
high" level of reliability at the school level.[79]
Nevertheless, Black et al argue that not enough is known about
the margins of error in the national testing system. Professor
Black wrote to the QCA to enquire whether there was any research
on reliability of the tests which it develops:

The reply was that "there is little research
into this aspect of the examining process", and [the QCA]
drew attention only to the use of borderline reviews and to the
reviews arising from the appeals system. We cannot see how these
procedures can be of defensible scope if the range of the probable
error is not known, and the evidence suggests that if it were
known the volume of reviews needed would be insupportable.[80]
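
The statistical point at the start of paragraph 54 can be made precise
with a standard result, which we set out as an illustration with
hypothetical figures rather than as evidence received. If the random
error attached to an individual pupil's result has standard deviation
\sigma_E, then the error attached to the mean result of n pupils,
assuming the errors are independent, has standard deviation

\sigma_{\bar{E}} = \frac{\sigma_E}{\sqrt{n}}

For a year group of, say, 64 pupils, the error in the school-level
average is therefore roughly an eighth of the error in any individual
result, which is why aggregated results can show very high reliability
at the school level even while individual results carry appreciable
error.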

55. Black et al go on to argue that it is profoundly
unsatisfactory that a measure of the error inherent in our testing
system is not available, since important decisions are made on
the basis of test results, decisions which will be ill-judged
if it is assumed that these measures are without error. In particular,
they argue that current policy is based on the idea that test
results are reliable and teachers' assessments are unreliable.
They consider that reliability could, in fact, be considerably
enhanced by combining the two effectively and that work leading
in this direction should be prioritised.[81]
Black et al conclude that:

[…] the above is not an argument against
the use of formal tests. It is an argument that they should be
used with understanding of their limitations, an understanding
which would both inform their appropriate role in an overall policy
for assessment, and which would ensure that those using the results
may do so with well-informed judgement.[82]

56. Some witnesses have emphasised what they see
as a tension between validity and consistency in results. The
argument is that, over time, national tests have been narrowed
in scope and marking schemes specified in an extremely detailed
manner in order to maximise the consistency of the tests. In other
words, candidates displaying the same level of achievement in
the test are more likely to be awarded the same grade since there
is less room for the discretion of the examiner. However, it is
argued further that this comes at the expense of validity, in
the sense that the scope of the tests is narrowed so much that
they test very little of either the curriculum or the candidate's
wider skills.[83] Sue
Hackman, Chief Adviser on School Standards at the DCSF, recognised
this trade-off. However, she also told us that in relation to
Key Stage tests the Department, together with the QCA, has tried
to include a range of questions in test papers, some very narrow
and others rather wider. In this way, she considered that a compromise
has been reached between "atomistic and reliable questions,
and wide questions that allow pupils with flair and ability to
show what they can do more widely".[84]

57. Many witnesses have called for greater emphasis
on teacher assessment in order to enhance both the validity and
the reliability of the testing system.[85]
A move towards a better balance between regular, formative teacher
assessment and summative assessments (the latter drawn from
a national bank of tests, to be externally moderated) would
provide a more rounded view of children's achievements, and many
have criticised the reliance on a 'snapshot' examination at a
single point in time.[86]

58. We consider that the over-emphasis on the
importance of national tests, which address only a limited part
of the National Curriculum and a limited range of children's skills
and knowledge, has resulted in teachers narrowing their focus.
Teachers who feel compelled to focus on that part of the curriculum
which is likely to be tested may feel less able to use the full
range of their creative abilities in the classroom and find it
more difficult to explore the curriculum in an interesting and
motivational way. We are concerned that the professional abilities
of teachers are, therefore, under-used and that some children
may suffer as a result of a limited educational diet focussed
on testing. We feel that teacher assessment should form a significant
part of a national assessment regime. As the Chartered Institute
of Educational Assessors states, "A system of external testing
alone is not ideal and government's recent policy initiatives
in progress checks and diplomas have made some move towards addressing
an imbalance between external testing and internal judgements
made by those closest to the students, i.e. the teachers, in line
with other European countries".[87]

[…] is now a complex system, which has developed
many different purposes over the years and now meets each to a
greater or lesser extent. It is a tenet of current government
policy that accountability is a necessary part of publicly provided
systems. We accept that accountability must be available within
the education system and that the assessment system should provide
it. However, the levels of accountability and the information
to be provided are open to considerable variation of opinion.
It is often the view taken of these issues which determines the
nature of the assessment system advocated, rather than the technical
quality of the assessments themselves.[89]

60. Cambridge Assessment criticised agencies, departments
and Government for exaggerating the technical rigour of national
assessment. It continued:

[…] any attempts to more accurately describe
its technical character run the risk of undermining both the departments
and ministers; '[…] if you're saying this now, how is it
that you said that, two years ago […]'. This prevents rational
debate of problems and scientifically-founded development of arrangements.[90]

Cambridge Assessment stated further that international
best practice dictates that information on the measurement error
intrinsic to any testing system should be published alongside
test data, and it argued that this best practice should be adopted
by the Government.[91]
Professor Peter Tymms of Durham University similarly argued that:

[…] it would certainly be worth trying providing
more information. I think that the Royal Statistical Society's
recommendation not to give out numbers unless we include the uncertainties
around them is a very proper thing to do, but it is probably a
bit late.[92]

61. We are concerned about the Government's stance
on the merits of the current testing system. We remain unconvinced
by the Government's assumption that one set of national tests
can serve a range of purposes at the national, local, institutional
and individual levels. We recommend that the Government sets out
clearly the purposes of national testing in order of priority
and, for each purpose, gives an accurate assessment of the fitness
of the relevant test instrument for that purpose, taking into
account the issues of validity and reliability.

62. We recommend further that estimates of statistical
measurement error be published alongside test data and statistics
derived from those data to allow users of that information to
interpret it in a more informed manner. We urge the Government
to consider further the evidence of Dr Ken Boston, that multiple
test instruments, each serving fewer purposes, would be a more
valid approach to national testing.
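
To illustrate what such publication might look like in practice (our
sketch, using hypothetical figures rather than any published
statistics): if a school's reported average point score is 27.3 and
the standard error of measurement attached to that figure is 0.8, then
an approximate 95% uncertainty interval is

27.3 \pm 1.96 \times 0.8 \approx [25.7,\ 28.9]

Reported in this form, the figure would make clear to parents and
policy-makers that, for example, two schools published at 27.3 and
28.1 may be statistically indistinguishable.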