Editorial
Advances in translating and adapting educational and psychological tests

Ronald K. Hambleton, University of Massachusetts, Amherst, USA
John H.A.L. de Jong, Language Testing Services, The Netherlands
I Introduction

Today, there is substantial evidence indicating the great need for multi-language educational and psychological tests (see, for example, Hambleton, 2002; van de Vijver and Hambleton, in press). For example, more than 40 countries participated recently in the Third and Fourth International Mathematics and Science Studies (TIMSS and TIMSS-R), and achievement tests and questionnaires were prepared in over 30 languages. In the OECD/PISA project to assess the yield of national education systems, 15-year-old students were tested in reading, mathematics, and science literacy in 32 languages and cultures in the first cycle (2000), and in the second cycle (2003) well over 40 countries are participating. A number of popular personality and intelligence tests, such as Spielberger's State-Trait Anxiety Inventory and the Wechsler Intelligence Scale for Children, are now available in more than 50 languages each, and many other achievement tests, credentialing examinations, personality measures, and quality-of-life measures have been adapted into 10 languages or more. The list of translated and adapted tests is long and growing rapidly. Substantially more test translations and adaptations can be expected in the future because:
· international exchanges of tests have become more common;
· credentialing exams produced by companies such as Novell and Microsoft are being made available in many countries; and
· interest in cross-cultural psychology and international comparative studies of achievement has grown.
The purposes of this special issue are:
· to focus attention on some of the emerging methodological developments for conducting studies to increase the quality of test translations and test adaptations; and
· to highlight several practical studies of test translation and adaptation.
It should be recognized at the outset that translated and adapted tests and instruments are nearly always going to be different from the source-language versions. The goal then should be to minimize the differences to acceptable semantic, psychometric, linguistic, and psychological levels. The articles in this special issue are all about finding those acceptable levels of difference to ensure valid uses of the tests and instruments. Contributors to the special issue are from seven countries: Australia, Belgium, Canada, Israel, Korea, The Netherlands, and the USA.

In the remainder of this introduction, two topics are addressed. First, guidelines developed by the International Test Commission (ITC) for translating and adapting tests are introduced. In the ITC's work, test translation is viewed as part of the test adaptation process, but it is only one of several steps that must be carefully carried out to produce a test or instrument that is equally valid in two or more languages and cultures (for more on these steps, see Hambleton and Patsula, 1999). Secondly, remarks about the articles in this special issue are provided in order to set a context for readers.

II International Test Commission guidelines for translating and adapting tests

In 1992 the International Test Commission (ITC) began a project to prepare guidelines for translating and adapting tests and psychological instruments, and for establishing score equivalence across language and/or cultural groups. Several organizations assisted the ITC in preparing the guidelines:
· the European Association of Psychological Assessment;
· the European Test Publishers Group;
· the International Association for Cross-Cultural Psychology;
· the International Association of Applied Psychology;
· the International Association for the Evaluation of Educational Achievement;
· the International Language Testing Association; and
· the International Union of Psychological Science.
A committee of 12 representatives from these organizations worked for several years to prepare 22 guidelines, which were then field-tested (see, for example, Hambleton et al., 1999; Tanzer and Sim, 1999; Hambleton, 2001; Hambleton et al., in press). The guidelines were subsequently approved by the ITC for distribution to national psychological societies, test publishers, and researchers. The guidelines, organized into four categories, appear below:

1 Context

C.1 Effects of cultural differences which are not relevant or important to the main purposes of the study should be minimized to the extent possible.
C.2 The amount of overlap in the construct measured by the test or instrument in the populations of interest should be assessed.
2 Test development and adaptation

D.1 Test-developers/publishers should ensure that the adaptation process takes full account of linguistic and cultural differences among the populations for whom adapted versions of the test or instrument are intended.
D.2 Test-developers/publishers should provide evidence that the language used in the directions, rubrics, and items themselves, as well as in the handbook, is appropriate for all cultural and language populations for whom the test or instrument is intended.
D.3 Test-developers/publishers should provide evidence that the choice of testing techniques, item formats, test conventions, and procedures is familiar to all intended populations.
D.4 Test-developers/publishers should provide evidence that item content and stimulus materials are familiar to all intended populations.
D.5 Test-developers/publishers should implement systematic judgmental evidence – both linguistic and psychological – to improve the accuracy of the adaptation process and compile evidence on the equivalence of all language versions.
D.6 Test-developers/publishers should ensure that the data collection design permits the use of appropriate statistical techniques to establish item equivalence between the different language versions of the test or instrument.
D.7 Test-developers/publishers should apply appropriate statistical techniques to (1) establish the equivalence of the different versions of the test or instrument, and (2) identify problematic components or aspects of the test or instrument which may be inadequate for one or more of the intended populations.
D.8 Test-developers/publishers should provide information on the evaluation of validity in all target populations for whom the adapted versions are intended.
D.9 Test-developers/publishers should provide statistical evidence of the equivalence of questions for all intended populations.
D.10 Non-equivalent questions between versions intended for different populations should not be used in preparing a common scale or in comparing these populations. However, they may be useful in enhancing the content validity of scores reported for each population separately.
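Guidelines D.6, D.7, and D.9 deliberately leave the choice of statistical technique open. Purely by way of illustration (neither the ITC guidelines nor this issue prescribe any single method), the sketch below shows one widely used approach, the Mantel-Haenszel differential item functioning (DIF) check for a dichotomously scored item. The function name and the examinees-by-items 0/1 matrix layout are assumptions of this example, not part of the guidelines.

```python
import numpy as np

def mantel_haenszel_dif(ref, foc, item):
    """Mantel-Haenszel DIF index for one dichotomous item.

    ref, foc : 2-D arrays of 0/1 item scores (examinees x items) for the
    reference (source-language) and focal (target-language) groups.
    Examinees are matched on total test score, the usual observable
    proxy for ability.  Returns the common odds ratio and the ETS
    delta-scale index (MH D-DIF)."""
    strata = np.union1d(ref.sum(axis=1), foc.sum(axis=1))
    num = den = 0.0
    for k in strata:                             # one stratum per total score
        r = ref[ref.sum(axis=1) == k, item]      # reference examinees at score k
        f = foc[foc.sum(axis=1) == k, item]      # focal examinees at score k
        if len(r) == 0 or len(f) == 0:
            continue                             # stratum carries no information
        n = len(r) + len(f)
        a, b = r.sum(), len(r) - r.sum()         # reference: right, wrong
        c, d = f.sum(), len(f) - f.sum()         # focal: right, wrong
        num += a * d / n
        den += b * c / n
    if den == 0:
        return np.nan, np.nan                    # no comparable examinees
    alpha = num / den                            # common odds ratio across strata
    return alpha, -2.35 * np.log(alpha)          # MH D-DIF on the ETS delta scale
```

Applied to each item in turn, items with an absolute MH D-DIF of roughly 1.5 or more (ETS category C, when the effect is also statistically significant) would typically be routed to bilingual reviewers to explain the flaw, in the spirit of the Sireci and Allalouf article in this issue.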
3 Administration

A.1 Test-developers and administrators should try to anticipate the types of problems that can be expected, and to take appropriate actions to remedy these problems through the preparation of appropriate materials and instructions.
A.2 Test administrators should be sensitive to a number of factors related to the stimulus materials, administration procedures, and response modes that can moderate the validity of the inferences drawn from the scores.
A.3 Those aspects of the environment that influence the administration of a test or instrument should be made as similar as possible across populations of interest.
A.4 Test administration instructions should be in the source and target languages to minimize the influence of unwanted sources of variation across populations.
A.5 The test manual should specify all aspects of the administration that require scrutiny in a new cultural context.
A.6 The administrator should be unobtrusive and the administrator–examinee interaction should be minimized. Explicit rules that are described in the manual for administration should be followed.
4 Documentation/score interpretations

I.1 When a test or instrument is adapted for use in another population, documentation of the changes should be provided, along with evidence of the equivalence.
I.2 Score differences among samples of populations administered the test or instrument should not be taken at face value. The researcher has the responsibility to substantiate the differences with other empirical evidence.
I.3 Comparisons across populations can only be made at the level of invariance that has been established for the scale on which scores are reported.
I.4 The test-developer should provide specific information on the ways in which the socio-cultural and ecological contexts of the populations might affect performance, and should suggest procedures to account for these effects in the interpretation of results.
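Guideline I.3 leaves "level of invariance" undefined. One standard way to make it concrete, drawn from the factor-analytic literature rather than from the ITC text itself, is the measurement-invariance hierarchy sketched below; the grouping subscript g and the symbols are the conventional ones, not the ITC's.

```latex
% Common factor model for observed scores x in language group g,
% with intercepts \tau, loadings \Lambda, latent trait \xi, errors \delta:
\[
  x_g = \tau_g + \Lambda_g \xi_g + \delta_g
\]
% Nested levels of invariance, each licensing stronger comparisons:
\begin{itemize}
  \item configural: the same pattern of (non-)zero loadings in every group;
  \item metric: $\Lambda_1 = \Lambda_2 = \cdots = \Lambda_G$
        (relations with other variables become comparable);
  \item scalar: metric plus $\tau_1 = \tau_2 = \cdots = \tau_G$
        (observed mean differences become comparable across groups).
\end{itemize}
```

On this reading, comparing mean scores across language versions presupposes at least scalar invariance, while metric invariance alone supports only comparisons of relationships; Zumbo's article in this issue shows, further, that even an equivalent factor structure does not rule out item-level bias.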
The guidelines and suggestions for implementing them can be found in van de Vijver and Hambleton (1996), Muniz and Hambleton (1997), van de Vijver and Tanzer (1997), and Hambleton et al. (in press). These guidelines have become a frame of reference for many psychologists working in the test translation and adaptation area, and more general adoption of the guidelines can be expected in the coming years as the guidelines are more widely disseminated and the standards for translating and adapting tests are raised. Particularly noteworthy about the guidelines is their emphasis on the importance of compiling both judgmental and empirical evidence to support the validity of a test or instrument translation and adaptation. No amount of care in the selection of translators or in the design for using translators can compensate for the value of first-hand empirical evidence from persons for whom the translated and adapted test or instrument is intended. The seven articles that follow highlight the role of both judgmental and empirical procedures in compiling validity evidence to support the use of a test or instrument in a second language and culture.

From a practical point of view, two major contexts can be distinguished for applying the ITC guidelines:
· the translation/adaptation of existing tests and instruments; and
· the development of new tests and instruments for international use.
The first context refers to the situation where tests and instruments that were originally developed in a particular language for use in some national context are to be made appropriate for use in one or more other languages and/or national contexts. Often in such cases the aim of the translation/adaptation process is to produce a test or instrument with psychometric qualities comparable to those of the original. Even for nonverbal tests, adaptations are necessary, not only of the accompanying verbal materials for administration and score interpretation but also of graphic materials in the test proper, to avoid cultural bias (see van de Vijver and Hambleton, in press). Growing recognition of multiculturalism has raised awareness of the need to provide multiple language versions of tests and instruments intended for use within a single national context.

The second context refers to the development of tests and instruments that from their conception are intended for international comparisons. The advantage here is that versions for use in different languages and/or different national contexts can be developed in parallel; that is, there is no need to maintain a pre-existing set of psychometric qualities. The problem here often lies in the sheer size of the operation: the large number of versions that need to be developed and the many people involved in the development process.
III Overview

Two articles that advance test adaptation methodology appear first in the special issue. Bruno Zumbo from the University of British Columbia in Canada demonstrates the inadequacy of judging the suitability of a test adaptation by looking only at the factor structure of the test in two or more language versions. It has been common to assume that if the factor structure of a test remains the same in a second language version, then the test adaptation was successful. Zumbo provides convincing evidence that item-level bias can still be present when structural equation modeling of the test in two languages reveals an equivalent factorial structure. Since it is the scores from a test or instrument that are ultimately used to achieve the intended purpose, the scores may be contaminated by item-level bias and, ultimately, valid inferences from the test scores become problematic.

Stephen Sireci from the University of Massachusetts in the USA and Avi Allalouf from the National Institute for Testing and Evaluation in Israel continue the theme introduced by Zumbo. They describe several straightforward methods to detect item-level flaws that can arise in the test adaptation process. They illustrate the methods by describing their recent research to detect flaws in a college admissions language test adapted from Hebrew into Russian. Especially important in Sireci and Allalouf's work are the steps they take to explain the flaws detected using item bias review procedures. If the causes can be identified, it may be possible to reduce the problems in the future, either by informing item writers about what content, item formats, and test conditions to avoid, or by looking specifically for similar problems in other tests at the test adaptation stage.

Jan Lokan and Marianne Fleming from the Australian Council for Educational Research provide an interesting example of the issues and approaches involved in adapting materials from one country to another in the same language. They describe their efforts to adapt a computer-assisted career guidance system developed by the Educational Testing Service in the USA for use in Australia. It is a fascinating study of the issues that can arise and how they were resolved. Even when language translations are not involved, the issues of adapting a test from one culture to another can be complex. And, in this instance, the cultures on the surface appear to be relatively similar.

Sunhee Chae from the Korea Institute of Curriculum and Evaluation extends the examples of test adaptation work to assessments involving pictures, and to tests administered to pre-school children.
Her article extends the information about translation/adaptation to populations where very little research has been carried out.

Charles Stansfield from Second Language Testing, Inc. in the USA describes some of his own first-hand experiences in translating and adapting tests. His article deals with a wide range of problems that arise in practice, such as issues concerning legal requirements, deciding whether or not to translate a test, political implications, and even typesetting. Stansfield illustrates these issues with examples from his survey work on the use of translated and adapted tests in the USA. One additional feature of this article is its focus on test publishers and their perspective on the process.

Joy McQueen and Juliette Mendelovits from the Australian Council for Educational Research describe the steps implemented in the OECD/PISA/2000 Project to adapt the reading literacy assessment into 32 languages and cultures. McQueen and Mendelovits present the perspective of test-developers and provide an overview of all the procedures implemented in the PISA/OECD project to ensure the test materials would be appropriate for international student comparisons. Their account starts with the initial steps of collecting materials, continues with many intermediate steps of international exchange on test development and cultural reviews, and concludes with the final evaluation of field trial data. They evaluate the procedures as they were followed against the ITC guidelines for test adaptation. The article presents a number of interesting examples of problems that emerged in the practice of their translation/adaptation work.

Aletta Grisay, recently retired from the University of Liège, Belgium, considers in more detail the procedures and standards implemented in the OECD/PISA/2000 Project for translation/adaptation and reports on their effectiveness. Grisay builds on her experience in previous surveys of educational achievement, such as TIMSS and other studies. In the PISA test adaptation procedures, the first step is to produce near-equivalent French and English versions and then to produce additional language and cultural adaptations from both the French and English versions. Further steps include having adaptations prepared by separate translators, whose products are then judged by other translators, ultimately producing a single adaptation that represents the best possible translation from multiple translators. Because several participating countries deviated at the national level from the prescribed ideal procedures, Grisay was able to analyse and compare the effectiveness of different procedures. Grisay's work highlights the level of commitment to the test adaptation process that is essential when national governments intend to make policy decisions based on the findings from international studies.
We hope this collection of seven articles furthers the advancement of sound test translation and adaptation practices around the world. While language testing is not the focus of each article, the methodological advances and the practical work that is described will surely be relevant and interesting to researchers working in the language testing field. Moreover, as Stansfield suggests in his contribution to this special issue of Language Testing, it may be highly appropriate for the specific expertise of language testers to be involved to some degree in any test translation/adaptation activity.

IV References

Hambleton, R.K. 2001: The next generation of the ITC test translation and adaptation guidelines. European Journal of Psychological Assessment 17, 164–72.
—— 2002: Adapting achievement tests into multiple languages for international assessments. In Porter, A. and Gamoran, A., editors, Methodological advances in large-scale cross-national education surveys. Washington, DC: National Academy of Sciences, 58–79.
Hambleton, R.K., Merenda, P. and Spielberger, C., editors, in press: Adapting educational and psychological tests for cross-cultural assessment. Hillsdale, NJ: Lawrence Erlbaum.
Hambleton, R.K. and Patsula, L. 1999: Increasing the validity of adapted tests: myths to be avoided and guidelines for improving test adaptation practices. Journal of Applied Testing Technology 1, 1–16.
Hambleton, R.K., Yu, J. and Slater, S.C. 1999: Field-test of the ITC guidelines for adapting psychological tests. European Journal of Psychological Assessment 15, 270–76.
Muniz, J. and Hambleton, R.K. 1997: Directions for the translation and adaptation of tests. Papeles del Psicologo, August, 63–70.
Tanzer, N.K. and Sim, C.O.E. 1999: Adapting instruments for use in multiple languages and cultures: a review of the ITC guidelines for test adaptations. European Journal of Psychological Assessment 15, 258–69.
van de Vijver, F.J.R. and Hambleton, R.K. 1996: Translating tests: some practical guidelines. European Psychologist 1, 89–99.
—— in press: Adapting educational tests for multicultural assessment. Educational Measurement: Issues and Practice.
van de Vijver, F.J.R. and Tanzer, N.K. 1997: Bias and equivalence in cross-cultural assessment: an overview. European Review of Applied Psychology 47, 263–79.