Assessing speaking in Japanese junior high schools: Issues for the senior high school entrance examinations

Tomoyasu Akiyama
(Dept. of Linguistics & Applied Linguistics, The University of Melbourne)

This paper has three purposes. First, it discusses three assessment contexts in relation to the
notion of "usefulness" proposed by Bachman and Palmer (1996). Those contexts
are (1) the 2001 Tokyo senior high school entrance examination, (2) a proposal to include speaking tests in that examination, and
(3) a proposal to assess speaking skills in Tokyo junior high schools.
Second, it identifies some concerns of Japanese junior high school EFL teachers
and students through a questionnaire survey and Rasch analyses.
Finally, it argues for the need to build up a "task bank," as
suggested by Brindley (2001), for the speaking components used in senior high school entrance examinations.

Evaluation of Usefulness of 3 Assessment Contexts

Let us begin by considering three assessment contexts.

1) The 2001 Tokyo Metropolitan Senior High School Entrance Examination

The 2001 Tokyo Metropolitan Senior High School English Entrance Examination [Toukyou-tou Koutou Gakkou Eigo Nyuugaku Shiken] focused on reading skills and grammatical knowledge, and nearly 80% of the test had a multiple-choice format. Figure 1 indicates the way that the four skills are covered in this test. A point of concern is that any high school entrance examination that does not include the assessment of speaking skills could be said to lack construct validity in
light of the Ministry of Education, Culture, Sports, Science and Technology's 1998 revised guidelines. The current entrance examination also appears to lack authenticity,
since the ministry's recent high school English curriculum guidelines seek to develop speaking and writing skills as well as reading and grammar.

For the same reason, the current English test (which does not assess speaking skills) could be said to lack authenticity. "Indirect" speaking tests are low on interactiveness
because examinees are only required to select the English sentence that most appropriately fits a given scenario.
In terms of impact, a survey of junior high school teachers reported in this paper suggests that including speaking tests in the entrance examination may have some positive influence in junior high
schools. In terms of practicality, the current English examination rates well.
Overall, the English section of the 2001 Tokyo Metropolitan Senior High School Entrance Examination rates well in terms of reliability
and practicality. Its main problems involve construct validity, impact, and authenticity, as well as a lack of interactiveness.

2) What impact would the introduction of speaking tests in entrance examinations have on teaching?

If speaking tests became a component of high school entrance examinations in Tokyo, what would happen?
Such a move might result in lower reliability, because speaking tests inherently involve many variables, such as rater behavior
and variation among interlocutors (McNamara, 1996). The inclusion of speaking tests would, however, increase authenticity,
because the test would better reflect the curriculum content. Moreover, speaking tests could engage students in completing tasks interactively, so such tests would be more interactive than the current examination.
Introducing speaking tests in the entrance examination would also have a great impact on teachers and students, as several studies
(e.g., Shohamy, Donitsa-Schmidt, and Ferman, 1996; Cheng, 1997) suggest. As speaking tests require many resources, such as administrators and raters, their inclusion might present problems in terms of practicality.

3) Assessment of speaking skills in junior high schools

How should speaking skills be assessed in Japanese junior high schools? Brindley (1999) points out that the reliability of school-based assessments tends to be low. The construct validity could potentially be high, as Hamp-Lyons (1996) claims when arguing that portfolio assessment is much more valid than traditional one-shot tests. Authenticity and interactiveness could also be high, because school-based assessment provides ample opportunity to conduct speaking tests. However, these judgments need to be made with caution because they also involve issues about preferred teaching styles. Since entrance exams significantly determine how and what many junior high school students study, the impact of in-school speaking assessments would probably be lower than that of including speaking tests in the senior high school entrance examinations.
Practicality may also be a problem, because the revised curriculum has decreased English instruction time from 4 to 3 hours per week.

When these three assessment contexts are examined in detail, many issues emerge that need to be considered to maximize the usefulness of any proposed speaking tests.

Research questions

Based on the discussion of the three assessment contexts above, five questions are addressed in this paper.
The first two involve a standard survey analysis and the remaining three involve Rasch analyses.

(1) How do public junior high school teachers in Tokyo assess their students' speaking skills?

(2) What impact would the introduction of speaking tests in entrance examinations have on teaching?

(3) To what extent do tasks (speech, role-play, description and interview) differ in terms of perceived difficulty?

(4) To what extent do the previous items fit Rasch measurement?

(5) To what extent do students' performances as measured by four tasks fit Rasch measurement?

Methodology

Instrument 1

Please refer to Appendix 1 for an abridged copy of the questionnaire survey.
This survey was designed to address research questions 1 and 2.
Approximately 600 questionnaires were distributed to public
junior high school English teachers in Tokyo. The questionnaire was
completed by 199 junior high school teachers (a response rate of 33%).

Instrument 2

Four of the five most popular tasks according to the survey in Appendix 1 were used for a test trial
(speech, role-play, description, and oral interview).
Information gap tasks were not used because they were difficult to administer. Each task lasted 5
minutes, including explanations of the test procedures.

Test-takers and interlocutors

The test-takers were all Japanese junior high school students, ranging in
age from 14 (second-year students) to 15 (third-year students).
In total, 219 students at twelve schools participated in the test trial. Each
student undertook two of the four tasks (438 performances in total).

The 13 interlocutors (12 Japanese English teachers at the participants'
schools and the researcher) administered different tasks to the students.

Raters and scoring criteria

Five independent Japanese senior high school English teachers, each with more
than 10 years' teaching experience, rated students' performances from
tape recordings. The scoring criteria consisted of 5 items (fluency,
vocabulary, grammar, intelligibility and overall task fulfillment). Each
item was rated on a 0-to-5-point scale according to the levels
of performance described for that item.

Results

Questionnaire survey

Research Question 1 asked how English teachers
assessed students' speaking ability and whether they used direct speaking tests. Those
who said they conducted direct speaking tests amounted to 57.3% of the sample (n = 114), while
42.7% (n = 85) said they did not administer speaking
tests. However, further analysis shows that combinations of other
assessment methods, such as class observation and pencil-and-paper
tests, were frequently used. Results revealed that the majority of
English teachers assessed students' speaking skills based on classroom
observation combined with pencil-and-paper tests and speaking tests.
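
The percentages reported for the survey can be checked with simple arithmetic. The following is a minimal sketch that uses only the counts given in this paper; the variable and function names are illustrative, and the figure of 600 distributed questionnaires is approximate.

```python
# Counts as reported in the paper; names are illustrative only.
distributed = 600   # approximate number of questionnaires sent out
returned = 199      # questionnaires completed by teachers
direct_tests = 114  # teachers who conduct direct speaking tests
no_tests = 85       # teachers who do not administer speaking tests


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


response_rate = pct(returned, distributed)  # roughly one third
share_direct = pct(direct_tests, returned)
share_none = pct(no_tests, returned)

print(response_rate, share_direct, share_none)
```

The two groups of teachers add up to the 199 respondents, so the two shares together account for the whole sample.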

Research Question 2 investigated what impact the introduction of
speaking tests would have on Japanese English teachers. Figure 2
suggests that more than 75% of the teachers reported that speaking
tests would impact the way they teach, while 20% stated that little
or no impact would occur in terms of their teaching. Responses to this question
showed that the introduction of speaking tests in entrance examinations
would likely have a positive impact on teachers and their teaching activities,
in that the majority of teachers would change their teaching styles
to focus more on improving students' communicative skills.

Rasch analysis of the student test scores

Difficulty of items and tasks

Research question 3 investigated the difficulty of the tasks and of the items within
each task. As indicated in the fifth column of Table 1, the description task
was the most difficult and the interview task the easiest. The difference
between the most difficult and the easiest tasks was approximately 1.5
logits.
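
To give a sense of what a gap of about 1.5 logits means, the sketch below uses the dichotomous Rasch model as a simplification. This is an assumption made for illustration: the trial used 0 to 5 ratings, so the actual analysis presumably relied on a polytomous model. The ability value and the two task difficulties are hypothetical and were chosen only so that they differ by 1.5 logits.

```python
import math


def rasch_probability(ability, difficulty):
    """Dichotomous Rasch model: probability of success given
    person ability and item difficulty, both in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


gap = 1.5  # approximate difficulty gap between hardest and easiest task

# A difference of `gap` logits multiplies the odds of success by exp(gap).
odds_ratio = math.exp(gap)

# Hypothetical student of average ability (0 logits) facing two
# hypothetical tasks whose difficulties differ by 1.5 logits.
p_easy = rasch_probability(0.0, -0.75)
p_hard = rasch_probability(0.0, 0.75)

print(f"odds ratio: {odds_ratio:.2f}")
print(f"P(success) easier task: {p_easy:.2f}, harder task: {p_hard:.2f}")
```

Under these assumptions the odds of success on the easier task are roughly four and a half times those on the harder task, which suggests why the choice of task could matter for a student's result.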

Research question 4 examined the quality of the items, that is, the extent to
which the data patterns predicted by the Rasch model differ from those of
the actual data. Items with unexpected response patterns are
called either "misfit" or "overfit" items. The acceptable range of the infit mean square (IMS)
here is from 0.70 to 1.30. As can be seen in the sixth and seventh columns, only Item 15 is identified as
a "misfit," with an IMS above the acceptable range. This shows that the actual data patterns from
Item 15 (Description: task fulfillment) varied unacceptably in comparison
with the data patterns estimated by Rasch measurement. Thus the items on the
four tasks appeared to produce relatively similar response patterns,
suggesting that the items across tasks assessed a similar construct.
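
The fit check described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: the mean square is computed here for dichotomous (0/1) responses with already known model probabilities, whereas the trial data were 0 to 5 ratings analysed with Rasch software. The function names and the example data are hypothetical; only the 0.70 to 1.30 range comes from the text.

```python
def infit_mean_square(responses, expected):
    """Information-weighted mean square for dichotomous responses.

    responses: observed scores (0 or 1)
    expected:  model probabilities of a score of 1
    """
    squared_residuals = sum((x - p) ** 2 for x, p in zip(responses, expected))
    information = sum(p * (1 - p) for p in expected)
    return squared_residuals / information


def classify_fit(ims, lower=0.70, upper=1.30):
    """Label a mean square using the acceptable range given in the text."""
    if ims > upper:
        return "misfit"    # more unexpected variation than the model predicts
    if ims < lower:
        return "overfit"   # less variation than the model predicts
    return "acceptable"


# Hypothetical data: responses in line with the model expectations.
typical = infit_mean_square([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
# Hypothetical data: a likely success failed and an unlikely one passed.
surprising = infit_mean_square([0, 1], [0.9, 0.1])

print(classify_fit(typical), classify_fit(surprising))
```

A value near 1.0 means the observed variation matches what the model predicts, which is why the acceptable range is centred on that value.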

Person fit indexes

The last question focuses on students' scores across the four tasks.
This is particularly important, since it bears on
accountability to students. As can be seen in Table 2, 5.4% of the
students were identified as misfitting. This
exceeds the acceptable
percentage of misfitting students, so it is important to investigate why this happened.

Figure 3 shows which combinations of tasks tended to produce misfitting
students. Two combinations that seemed to produce misfitting students were
(1) speeches and interviews (S/I) and (2) descriptions and interviews (D/I).
Other task combinations
produced fewer misfit students than these two. One
possible explanation is that differences in task difficulty within a
combination might increase the number of misfit students.
Figure 4 shows that speaking skills are assessed in
considerably different ways by junior high school teachers in Tokyo. Over 22%
of the nearly two hundred teachers responding to this survey indicated that
they relied on a combination of speech analysis (SP), class observation
(OB) and pencil-and-paper tests (PE) to assess speaking skills. However,
it is worth noting that over 17% of the teachers relied solely on classroom
observations to assess speaking skills.

Discussion

Results of the questionnaire survey revealed that teachers' assessment
methods varied, suggesting that it would be difficult to compare
students' speaking ability across schools. The introduction of speaking
tests would have a positive impact on approximately 80% of public
junior high school English teachers in Tokyo, and most teachers
maintained that they would change to a more communicative style of
teaching. Thus, it can be argued that the inclusion of speaking
tests has the potential to help bridge the gap between
skills taught in classes and skills tested in entrance examinations, and
between the goals of the guidelines and assessment policy.

Results from the test trials undertaken by junior high school students
showed that all items except one fit Rasch measurement, indicating
that the items on each task were effective in assessing the target
construct. However, results also showed that the four tasks frequently
used by English teachers differed in difficulty. This
means that students who undertake tasks of varying difficulty
might not be assessed appropriately. Given that variables such as
rater behavior and interlocutors are inherent in performance tests,
the difficulty of tasks needs to be relatively equal in order to reduce
the number of variables. The concept of task banks, presented by Brindley (2001), and
of item banks, presented by Ikeda (2000), could have important implications for the
introduction of formal speaking tests in entrance examinations.

Conclusion

The implications of this study are that speaking tasks used in classrooms
need to be trialed and investigated with Rasch measurement, given
that school-based assessment represents half of the selection
procedure for students who wish to enter senior high schools. In junior high school
contexts, a task bank of role plays, covering situations such as shopping, inviting
friends to a party, or giving directions to a stranger, could be
developed. Both to administer speaking tests in a high-stakes
context and to make teacher-implemented assessment
comparable across schools, it would be necessary to investigate tasks
with Rasch techniques, based on empirical data, and to build up a task
bank of relatively consistent quality.

References

Akiyama, T. (2001). The application of G-theory and IRT in the analysis of data from speaking tests administered in a classroom context.
Melbourne Papers in Language Testing, 10(1), 1-22.