1 William Molnar, Vincent Gordon, Kindra Jones, and Maria Newton-Tabon
The following response is a collaborative effort of Vincent Gordon, Kindra Jones, William Molnar, and Maria Newton-Tabon
IEA’s Trends in International Mathematics and Science Study (TIMSS) provides extensive information about students’ science and mathematics achievement in an international framework. TIMSS tests students in grades four and eight and also gathers a wide range of data from their schools and teachers about curriculum and instruction in mathematics and science. Many countries around the world have used TIMSS findings in their efforts to develop better methods of teaching science and mathematics. Involving more than 60 countries, “TIMSS 2007 is the most recent in the four-year cycle of studies to measure trends in students’ mathematics and science achievement”. The first TIMSS took place in 1995 in 41 countries and the second in 1999 in 38 countries. TIMSS 2003 involved more than 50 countries. The majority of countries participating in TIMSS 2007 will have data going back to 1995. TIMSS Advanced assesses school-leaving students who have had preparation in advanced physics and mathematics. Since the 1995 assessment, however, TIMSS has not assessed students who are nearing the end of high school. Recognizing the strong link between scientific competence and economic productivity, and given the relatively long time since the 1995 assessments, countries around the world have expressed interest in participating in TIMSS Advanced. They want internationally comparative data about the achievement of their students enrolled in advanced courses designed to lead into science-oriented programs in university. By joining TIMSS Advanced, “countries that participated in 1995 can determine whether the achievement
of students having taken advanced coursework has changed over time. Countries participating in TIMSS Advanced for the first time can assess the comparative standing in mathematics and physics in an international context”.

Introduction

TIMSS uses the curriculum, broadly defined, as “the major organizing concept in considering how educational opportunities are provided to students, and the factors that influence how students use these opportunities”. To begin the process of defining the topics to be assessed in TIMSS Advanced, the framework document built on the composition of the 1995 Advanced Mathematics and Physics assessments to draft a Framework for TIMSS Advanced 2008. The description of cognitive domains also benefited from the TIMSS developmental project, funded by a number of countries, to enable reporting TIMSS 2007 results according to cognitive domains. The first draft of the framework was thoroughly reviewed by participating countries and updated accordingly. Countries provided comments about the topics included in their advanced mathematics and physics courses and made recommendations about the desirability and suitability of assessing particular topics. TIMSS, including TIMSS Advanced, is a major undertaking of the IEA, which has taken full responsibility for the management of the project. The “TIMSS International Study Center” works with the IEA Secretariat in Amsterdam on translation, the “IEA Data Processing Center” in Germany on construction of the documents for the database, “Statistics Canada” on sampling, and “Educational Testing Service” in New Jersey on the psychometric scaling of the data.
The question that needs to be answered is this: is it appropriate to use TIMSS to evaluate the 56 differing state or state-like education entities in the United States? This
author chooses to focus on the pros of validity and reliability with regard to this particular assessment. Based on research, this assessment is reliable and valid because of the content domains, which “define the specific mathematics subject matter covered by the assessment, and the cognitive domains which define the sets of behaviors expected of students as they engage with the mathematics content. The cognitive domains of mathematics and science are defined by the same three sets of expected behaviors: knowing, applying, and reasoning”. In other words, although other variables could bear on why this particular test may not be valid or reliable, the common factors that span continents and are shared by the age groups tested (fourth and eighth graders) are “the mathematical topics or content that students are expected to learn and the cognitive skills that students are expected to have developed” (2007). The IEA developed TIMSS to compare educational achievement around the globe. TIMSS began in the 1990s out of a desire to conduct international studies of students within the same age or grade bracket. It was believed that effective mathematics and science education would be important for economic development in the technological world of the future. The break-up of the Soviet Union brought about new countries wanting to participate in the study so that its data could guide their educational systems. TIMSS combined a measurement of science and mathematics achievement with questionnaires for students and teachers. The measurement covered the science and mathematics topics that students should have been taught by grades 4 and 8. The questionnaires collected information on the background of students, their attitudes, and their belief system
about school and the learning process. The school and teacher questionnaires looked at the class scheduling of science and mathematics coverage, the policies of the school, and the educational backgrounds of the teachers and their preparation. A summated rating scale, a common practice in the social sciences, was used to measure each construct. The summated rating scale showed validity and reliability on the sample on which it was used. Summated rating scales were derived for each construct. One construct pertained to students’ self-interest in mathematics, reflecting the belief that motivation plays a vital role in predicting a student’s present and future achievement. The opportunity for students in both the fourth and eighth grades across continents to do well is fair and consistently assessed. Scores by subject and grade are comparable over time (2007). When TIMSS was created, its scale was set to a mean of 500 based on the countries that took part in the testing. Testing of both grades four and eight began in 1995 and continued through 2007. “Successive TIMSS assessments since 1995, 1999, 2003, and 2007 have scaled the achievement data so that scores are equivalent from assessment to assessment” (2007). The foundation of the “Trends in International Mathematics and Science Study (TIMSS)” assessment is the cognitive domains. In addition to assessing proficiency in science and mathematics, TIMSS collected data related to the teachers, the students, and the schools. This information was vital in helping researchers understand the achievement of students in their own country. A summated rating scale was developed because reporting the information item by item is almost impossible. TIMSS called this scale a multi-item indicator.
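To make the idea of a summated rating scale concrete, the short Python sketch below sums each student's item ratings into one scale score and then estimates score reliability with Cronbach's alpha. This is only an illustration under stated assumptions: the ratings are invented, the function names are hypothetical, and Cronbach's alpha is used here as one common reliability statistic, not necessarily the one reported by TIMSS.

```python
from statistics import pvariance


def summated_scores(responses):
    """Sum each respondent's item ratings into a single scale score."""
    return [sum(row) for row in responses]


def cronbach_alpha(responses):
    """Cronbach's alpha for a table with one row per respondent and one column per item."""
    k = len(responses[0])  # number of items in the scale
    if k < 2:
        raise ValueError("need at least two items")
    items = list(zip(*responses))  # transpose rows into per-item columns
    item_variance = sum(pvariance(column) for column in items)
    total_variance = pvariance(summated_scores(responses))
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return (k / (k - 1)) * (1 - item_variance / total_variance)


# Hypothetical data (an assumption, not TIMSS data):
# five students, four attitude items rated from 1 (disagree) to 4 (agree).
ratings = [
    [4, 4, 3, 4],
    [2, 2, 1, 2],
    [3, 3, 3, 4],
    [1, 2, 1, 1],
    [4, 3, 4, 4],
]

scores = summated_scores(ratings)
alpha = cronbach_alpha(ratings)
print(scores)
print(round(alpha, 2))
```

An alpha close to 1 suggests that the items move together and that the summed score is a reasonably consistent indicator of the underlying construct for this sample; a low alpha would argue against reporting the items as a single scale.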
Combining several items into one variable has advantages such as “improved measurement precision, scope, and validity”. When a large amount of information is being reported, a summated rating scale reduces the amount of data and makes it easier for the public to digest. “For the advantages of summated rating scales over single item measures to hold, it is important that the scale can show reasonable score reliability and validity as an indicator of a latent variable for the sample on which the scale is used”. TIMSS was created through a collaboration of participating countries. These countries, drawing on curriculum, measurement, and educational experts, created the item pools, assessment frameworks, and questionnaires. TIMSS is based on school curricula and is designed to examine how educational opportunities are provided to students. TIMSS investigates two levels: “the intended curriculum and the implemented curriculum”. The “intended curriculum” is the science and mathematics that society expects students to learn, along with how the educational system is organized to reach that goal. The “implemented curriculum” is the content taught in class, how it is taught, and who teaches it. The 2003 assessment of student achievement in mathematics and science had ambitious coverage goals, reporting not only overall science and mathematics achievement scores but also scores in important content areas in these subjects. Examples of the mathematical content domains (as TIMSS calls them) covered in the fourth grade are “numbers, geometric shapes, measures, and data display. In the eighth grade, the content domains are numbers, algebra, geometry, data and chance. The cognitive domains in each grade are knowing, applying, and reasoning” (2007). The five domains in science are “life science, chemistry, physics, earth science, and
environmental science that defined the specific science subject matter covered by the assessment. The cognitive domains, four in mathematics (knowing facts and procedures, using concepts, solving routine problems, and reasoning) and three in science (factual knowledge, conceptual understanding, and reasoning and analysis) defined the sets of behaviors expected of students as they engaged with the mathematics and science content”. Student achievement was reported in terms of performance in each content area as well as in science and mathematics overall. To give the US the opportunity to measure itself against the high-performing students of other countries, the “International Study Center at Boston College”, “The National Science Foundation” and “The Center for Education Statistics” established the TIMSS 1999 benchmarking research. The TIMSS achievement tests were given to students in spring 1999, at the same time as TIMSS was administered in other countries. “Participation in TIMSS benchmarking was intended to help states and districts understand their comparative educational standing, assess the rigor and effectiveness of their own mathematics and science programs in an international context, and improve the teaching and learning of mathematics and science”. With regard to test reliability, statistically “the median reliabilities ranged from 0.62 in Morocco to 0.86 in Singapore. The international median, 0.80 is the median of the reliability coefficients for all countries. Reliability coefficients among benchmarking participants were generally close to the international median ranging from 0.82 to 0.86 across states, and from 0.77 to 0.85 across districts”. An example of the validity and reliability of TIMSS is the method of assessment for both fourth and eighth grades. This method of assessment is the same across continents. “TIMSS provides an overall
mathematics scale score as well as content and cognitive domain scores at each grade level. The TIMSS mathematics scale is from 0-1,000 and the international mean score is set at 500, with a standard deviation of 100” (2007). Regarding test validity, TIMSS posed the following questions about comparative validity: “Is curriculum coverage, instrument translations, and target populations comparable? Was sampling of populations and scoring of constructed-response items conducted correctly? Were the achievement tests administered appropriately?” That evaluative judgments vary between teachers is an assertion that is beyond question. Another assumption concerns validity, defined as “the accuracy of assessment-based interpretations”. What standardized measures lose in validity they gain in reliability. “Our concern for the reliability of teachers’ evaluative judgments must be tempered by the realization that particular classroom assessment tasks are vital to provide a comprehensive picture of student and school success”. Comparative validity is a hallmark of TIMSS and is the main reason that international data have become acceptable as an instrument of educational policy analysis. The question often raised in relation to international testing is whether these results have meaning. International testing programs such as TIMSS have been criticized for two reasons: first, “other nations have not tested as large a percentage of their student population causing their scores to be inflated; and second, our best students are among the world’s best, with our average being brought down by a large cohort of low-achievers”. The validity of score-based inferences depends on the qualities of the test: “The degree to which the test items adequately represent the construct and whether the number of items administered is enough to provide scores with sufficient reliability. In addition, the scores should be reasonably free from construct-irrelevant variance”. In simpler terms, observed scores should reflect the construct being measured and should not be influenced by factors that are irrelevant to it. “One important potential threat to the validity of score-based inferences is the degree of effort devoted by examinees to the test”. When an individual takes a test, the examiner assumes the individual will want to get items correct. But there are instances when the test taker does not try his best. This leads to an underestimate of what the test taker can do. Low effort results in a negatively biased estimate of the individual’s proficiency. “Whenever test-taking effort varies across examinees, there will be a differential biasing effect, which will introduce construct-irrelevant variance into the test score data”. Low test scores can be caused by the test taker having low proficiency, or by a test taker with higher proficiency not trying his best on the test. If personal consequences such as grades depended on the test, then low effort would not be a major validity threat, because few examinees would choose to withhold effort and score-based conclusions would be in little danger. Many assessments exist where scores have an impact on test givers but no impact on test takers. An example of this is TIMSS. There are two sides to everything. The sword is double edged, and the TIMSS assessment is not exempt from having cons alongside its pros. Prais (2007) explains that “when international tests were first introduced nearly two generations ago,
it was widely understood that their main objective was not - as it seems to have become today - to produce international league table of countries’ schooling attainments, but to provide broader insight into the diverse factors leading to success in learning”. Countries commit large expenditures of money and time to participate in international assessments. Participation raises the educational profile of a country and shows its level of commitment to improved global education; the other side of the sword is that these large sums of money and time could instead be used to improve the country’s educational systems (Robertson, 2005). Much research has been completed that presents important points to consider when determining the reliability and validity of the TIMSS assessment. Robertson (2005) states, “overall findings from past international studies have been accused of overshadowing particular aspects of each country’s performance and clouding rather than clarifying issues relevant to the implementation of policies”. We are going to look at some of the cons of the TIMSS assessment and at the extent to which implementing such an assessment in the 56 United States educational systems could be affected by the same factors that affected the European countries. These factors include student age, baseline data, motivation, and curriculum mismatch, including cultural differences and translation of the assessment. One factor that has been found to create problems in interpreting scores validly is the age at which students start school. It is important to know how old students were when they began school. Starting ages vary within and between countries, which can make interpreting the assessment scores difficult. As well as knowing the age of the students, Tymms, Merrell, & Jones (2004) maintain that baseline data are needed to be able to
interpret the data properly. The TIMSS assessment assesses students only in the middle and at the end of their school career. Such testing practices measure only the student’s level at the time of the assessment rather than student progress over time. The same consideration applies to a nationwide implementation of a TIMSS-like assessment in the United States. As it stands now, the age at which a child can start school differs across the country. Many states implement a Head Start program and pre-K program to get children ready for school. Since these are not mandated, attendance is voluntary and not all students gain the benefits of attending. This leaves students entering kindergarten at varying levels of proficiency. What could be considered the primary factor in the validity, reliability, and differences in assessment scores is curriculum mismatch. This mismatch includes, but is not limited to, curriculum subject matter, translation or academic vocabulary, unfamiliar or cultural context, and item formats. Curriculum match has been determined to be the most serious concern for the validity of international testing. How well an assessment matches the curriculum of a country will determine how successful that country is on the assessment. The translation of the TIMSS test has led to poor assessment scores for some students. Even though the test goes through rigorous translation practices, the vocabulary used in the context of the test questions proved difficult for some students. Item formatting has also posed a problem. While the format might have been easily recognized by some students, it was not comprehensible to others. Cultural differences also
created problems with assessments. When comparing different countries, it is important to understand their differences. It was found that the cultural emphasis placed on education correlated directly with students’ success. These differences also carry into test item interpretation and successful answering. The same assessment considerations apply in America. While the United States’ historical data could offer some homogeneous research findings, our current classroom demographics and research findings vary widely. U.S. schools are filled with students of different socio-economic and cultural backgrounds. These differences create a wide range of possibilities and limitations for students when we implement a blanket test without considering them. They include language background, cultural background, life experiences, and the educational background of second language learners coming to the U.S. The United States could be compared to a small world: people from all over the world are represented in our population. So while specific countries have their own issues with TIMSS, the U.S. faces all of these problems within its own borders. The admirable goal of universal success, which is implicit in the No Child Left Behind requirements, is simply not realistic (Holland, 2009). Lastly, motivation and low-stakes testing go hand in hand in affecting assessment results. Low-stakes testing is testing implemented for the purpose of collecting data. The students do not get any feedback on their performance on the test, and their scores have no impact on their educational experience. The low-stakes nature of TIMSS causes underachievement among its test takers. This lack of motivation and resulting low achievement could create a biased
estimate of the student’s skill, creating a risk to the validity of the outcome of the test (Eklof, 2007). This is an area that lacks research and has no place in the TIMSS test battery.
Motivation is a human characteristic that does not have national boundaries. No matter where or what kind of assessment is administered, the test taker’s level of motivation will have an effect on their score. If we were to implement a low-stakes test in the U.S. at this stage, it would provide some interesting results. With our students now being exposed to high-stakes testing, a low-stakes test would probably create the same anxieties as the state competency tests do. The nature of the test and its ramifications would need to be fully explained to the students.
In the present literature, the TIMSS project is most often valued for its “rich comparative data about educational systems, curriculums, school characteristics, and instructional practices. One strength of TIMSS is its attempt to link student performance to school improvement” (Rotberg, 1998). However, Rotberg raises several concerns about the validity of how the scores are ranked “because countries differ substantially in such factors as student selectivity, curriculum emphasis, and the proportion of low-income students in the test-taking population” (p. 1030). “Evidence has been produced to justify concerns at the secondary level (Bracey, 2000; Rotberg, 1998). At the primary and middle school level researchers monitored the interpretation of TIMSS results”. Wang (2001) noted that “one researcher believed that since TIMSS was not a controlled scientific study and did not measure the effectiveness of one teaching method against another”, the findings could not support certain reforms at a local school. Another concern is that TIMSS failed to examine the results of diverse populations.
America actually compares favorably with other nations Caucasians, especially considering that 25% of the population is of under-performing African and Latino decent. The top nations are all East Asian. This study does not break down Americans by race, if they did, Asian Americans would likely score high as Asians in their home countries, and Whites would near top of the European nations. (Hu, 2000. p. 8)
Several technical problems became apparent in the TIMSS database. Wang (2001) found additional technical problems that can skew the comparative results and undercut the reliability of the TIMSS benchmarking. Wang (2001) identifies four disadvantages of TIMSS: (1) “The Format of TIMSS Test Items Is Not Consistent With the Goals of Some Education Reforms”, (2) “The TIMSS Test Booklets Have Discrepant Structures”, (3) “Because of Grade-Level Differences and Content-Differences Among Countries, the TIMSS Tests Might Not Align With What Students Have Learned”, and (4) “Several Problematic Age Outliers in the TIMSS Database Are Not Adequately Explained”. The formats of test items have been found to be inconsistent with the goals of various school reforms. “TIMSS test measures mostly lower learning outcomes by means of predominantly multiple choice format (Lange, 1997, p. 3). The test items are arranged by 429 multiple-choice, 43 short-responses, and 29 extended-response items (Lange, 1997)”. Students were tested on subsets of questions from this pool, which would not reflect outcomes that came from any reform initiative. Gonzalez & Smith (1997) described how rotating booklets arranged in various clusters could produce invalid results for students. Booklet eight and booklets one through seven were structured by different clusters. Booklet eight focused on the breadth cluster, which could only
benefit students who had an in-depth study of the content, whereas booklets one through seven contained more focused clusters. Because of this structural discrepancy, systematic errors can occur across test booklets. TIMSS is usually administered to students in adjacent grades, such as third and fourth graders in the United States, which are considered primary grades. The adjacent grades resulted in grade gaps based on school experience. Each grade level comes with different learning levels and school experience. Because of these differences in learning experience, Wang (2001) suggested that it is unclear whether any testing instrument could measure what students have learned in any grade level. Martin and Kelly (1997) pondered “whether the meaning of student population should be defined by chronological age or grade level as it relates to student achievement”. The authors believed that “age outliers in the TIMSS database were not adequately explained”. For example, the student population ranged from a 49.9-year-old seventh grader from a foreign country to a 10-year-old eighth grader from the United States. Since a student’s age is a factor in cognitive development, the age outliers are essential to consider when analyzing data in studies such as TIMSS. Wang (2001) maintains that, when interpreting TIMSS results, the test score component comes with problems of a technical nature. In conclusion, TIMSS is a vital examination that gathers important information, makes it readily available to researchers worldwide, and may impact educational systems throughout the nation. It is important that the science and mathematics assessments measure what they are designed to measure and that the questionnaires are selected with care, scored, studied closely, and interpreted meaningfully. It is
here that researchers who use TIMSS have a significant task to pursue. Therefore, due to its content domains, cognitive domains, and assessment method, the “Trends in International Mathematics and Science Study (TIMSS)” has proven to be an assessment that is both valid and reliable among international educational assessment systems, though it does have its drawbacks.

References

Eklof, H. (2007). Test-taking motivation and mathematics performance in TIMSS 2003. International Journal of Testing, 7, 311-326.
Gonzalez, E. J., & Smith, T. A. (1997). Users guide for the TIMSS international database. Chestnut Hill, MA: TIMSS International Study Center.
Gonzales, P., Williams, T., & Jocelyn, L. (2008). Highlights from TIMSS 2007: Mathematics and science achievement of U.S. fourth and eighth grade students in an international context. National Center for Education Statistics.
Hu, A. (2000). TIMSS: Arthur Hu’s index. www.leconsulting.com/arthurhu/index/timss.htm
Hussein, M. G. (1992). What does Kuwait want to learn from TIMSS? Prospects (22), 275-277.
International Association for the Evaluation of Educational Achievement (IEA) (2007). Trends in International Mathematics and Science Study (TIMSS).
Lange, J. D. (1997). Looking through the TIMSS mirror from a teaching angle. http://www.enc.org/topics/timss/additional/documents
Martin, M. O., & Kelly, D. L. (1997). Technical report volume II: Implementation and analysis. Chestnut Hill, MA: TIMSS International Study Center.
Martin, M. O. (Ed.) (2003). TIMSS 2003 User Guide for the International Database. TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
Martin, M. O., & Mullis, I. V. S. (2006). TIMSS in perspective: Lessons learned from IEA’s four decades of international mathematics assessments. TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
Mullis, I. V. S., Martin, M. O., Gonzalez, E. J., Gregory, K. D., Garden, R. A., O’Connor, K. M., Chrostowski, S. J., & Smith, T. A. (2000). TIMSS 1999 International Mathematics Report: Findings from IEA’s repeat of the Third International Mathematics and Science Study at the eighth grade. Chestnut Hill, MA: Boston College.
Rose, L. C. (1998). Who cares? And so what? Responses to the results of the third international mathematics and science study. Phi Delta Kappan, 79(10), 722.
Rotberg, I. (1998). Interpretation of international test scores comparisons. Science, 280, 1030-1031.
Wang, J. (2001). TIMSS primary and middle school data: Some technical concerns. Educational Researcher, 30, 17-21.
www.minniscoms.com.au/educationtoday/articles.php?articleid=150
www.mackinac.org/article.asps?ID=6998
http://timss.bc.edu/timss1999b/sciencebench_report/t99bscience_A.html
http://nces.ed.gov/tmiss/Results03.asp
http://timss.bc.edu/timss2003.html
http://timss.bc.edu/TIMSS2007/about.html
www.iea.nl/timss2007.html
www.asanet.org/footnotes/jan05/fn10.html
www.ed.gov/inits/Math/silver.html